WO2020073874A1 - Distribution system and method for machine learning operations - Google Patents
Distribution system and method for machine learning operations
- Publication number
- WO2020073874A1, PCT/CN2019/109552
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- terminal
- cloud
- machine learning
- instruction
- computing
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
Definitions
- the invention relates to the field of information processing technology, and in particular to a system and method for distributing machine learning operations.
- Machine learning has made major breakthroughs in recent years. For example, in machine learning technology, neural network models trained with deep learning algorithms have achieved remarkable results in image recognition, speech processing, intelligent robotics and other applications.
- a deep neural network builds a model that simulates the neural connection structure of the human brain;
- data features are described hierarchically through multiple transformation stages.
- however, machine learning techniques still face many problems in practical applications, such as high resource occupation, slow operation speed, and high energy consumption.
- a distribution system for machine learning operations including: a terminal server and a cloud server;
- the terminal server is used to generate a corresponding computing task according to the demand information, to select a first machine learning algorithm to run on the terminal server according to the computing task and the hardware performance parameters of the terminal server, and to select a second machine learning algorithm to run on the cloud server according to the computing task and the hardware performance parameters of the cloud server;
- a terminal server control instruction is generated according to the first machine learning algorithm and the operation task, and a cloud server control instruction is generated according to the second machine learning algorithm and the operation task.
- a method for distributing machine learning operations including:
- a terminal server control instruction is generated according to the first machine learning algorithm and the operation task, and a cloud server control instruction is generated according to the second machine learning algorithm and the operation task.
- in the above machine learning operation distribution system and method, when a computing task needs to be completed according to the user's demand information, the computing task is executed on the terminal server and the cloud server respectively, so that the same computing task is completed using different machine learning algorithms and calculation results with different degrees of accuracy can be obtained.
- based on the different machine learning algorithms, the terminal server generates both the terminal server control instructions executed on the terminal server and the cloud server control instructions executed on the cloud server.
- because the terminal operation result can be output first, the user does not need to wait a long time, processing efficiency is improved, and the computing resources of the terminal server and the cloud server are fully utilized, so that the same computing task can be executed on both the terminal server and the cloud server.
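- the dual-execution flow described above, in which the same task is dispatched to both servers and the faster, lower-accuracy terminal result is delivered first, can be sketched as follows (the model functions and their return values are hypothetical illustrations, not part of the patent):

```python
from concurrent.futures import ThreadPoolExecutor

def run_terminal_model(task):
    # Hypothetical fast, low-accuracy model on the terminal server.
    return {"label": "cat", "accuracy": "low"}

def run_cloud_model(task):
    # Hypothetical slow, high-accuracy model on the cloud server.
    return {"label": "cat", "breed": "tabby", "accuracy": "high"}

def distribute(task):
    """Run the same task on both servers; yield the terminal
    result first, then the more accurate cloud result."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        cloud_future = pool.submit(run_cloud_model, task)
        yield run_terminal_model(task)   # available quickly
        yield cloud_future.result()      # available later

results = list(distribute({"image": "photo.jpg"}))
```

- here the terminal result is yielded before the cloud future completes, mirroring the early-output behavior the patent describes.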
- a computing device including:
- the computing device is used to perform network model calculations, and the computing device is used to perform neural network operations;
- the computing device includes: an arithmetic unit, a controller unit, and a storage unit;
- the storage unit is used to store weights and input neurons, and the weights include important bits and non-important bits;
- the controller unit is used to obtain the important bits and non-important bits of the weights and the input neurons, and to transfer the important bits, the non-important bits, and the input neurons to the arithmetic unit;
- the operation unit is configured to operate on the input neuron and the important bits to obtain a first operation result of the output neuron;
- the input neuron and the non-important bits are operated on to obtain a second operation result, and the sum of the first operation result and the second operation result is used as the output neuron.
- a machine learning computing device includes one or more computing devices according to the first aspect, and is used to acquire input data and control information to be computed from other processing devices, perform the specified machine learning operations, and pass the execution results to the other processing devices through an I/O interface;
- the machine learning computing device includes a plurality of the computing devices
- the plurality of the computing devices can be connected and transmit data through a specific structure
- the multiple computing devices may interconnect and transmit data through a PCIe (Peripheral Component Interconnect Express) bus to support larger-scale machine learning operations; the multiple computing devices may share the same control system or have their own control systems, and may share memory or have their own memories; the interconnection of the multiple computing devices may follow any interconnection topology.
- a combined processing device includes the machine learning computing device according to the second aspect, a universal interconnection interface, and other processing devices;
- the machine learning operation device interacts with the other processing device to jointly complete the calculation operation specified by the user.
- an embodiment of the present application provides a neural network chip.
- the neural network chip includes the machine learning computing device according to the second aspect or the combined processing device according to the fifth aspect.
- an embodiment of the present application provides an electronic device, where the electronic device includes the chip according to the sixth aspect.
- an embodiment of the present application provides a board card, wherein the board card includes: the neural network chip described in the sixth aspect, a storage device, an interface device, and a control device;
- the neural network chip is respectively connected to the storage device, the control device and the interface device;
- the storage device is used for storing data
- the interface device is used to realize data transmission between the chip and an external device
- the control device is used for monitoring the state of the chip.
- an embodiment of the present application provides a calculation method, including:
- if the first operation result is greater than the preset threshold, an operation is performed between the input neuron and the non-important bits to obtain a second operation result, and the sum of the first operation result and the second operation result is used as the output neuron.
- an embodiment of the present application provides a computer-readable storage medium, wherein the computer-readable storage medium stores a computer program for electronic data exchange, and the computer program causes a computer to execute part or all of the steps described in the ninth aspect.
- an embodiment of the present application provides a computer program product, wherein the computer program product includes a non-transitory computer-readable storage medium that stores a computer program, and the computer program is operable to cause a computer to execute part or all of the steps described in the ninth aspect.
- the computer program product may be a software installation package.
- the computing device obtains the important bits and non-important bits of the weights and the input neurons, and operates on the input neurons and the important bits to obtain the first operation result of the output neuron. If the first operation result is less than or equal to the preset threshold, the operation of the current output neuron is skipped. If the first operation result is greater than the preset threshold, the input neurons and the non-important bits are operated on to obtain the second operation result, and the sum of the first operation result and the second operation result is used as the output neuron. In this way, if the prediction indicates that an output neuron does not require further operation, the operation process of that output neuron is skipped.
- the computing device thus integrates prediction into the computing method to skip output neurons that do not need to be computed, thereby reducing the calculation time and energy consumption of the neural network.
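- as a rough illustration of this predict-and-skip scheme, the sketch below splits each unsigned integer weight into a high, important part and a low, non-important part; the 4-bit split, the zero threshold, and the function names are assumptions for illustration only:

```python
def split_weight(w, low_bits=4):
    """Split an unsigned integer weight into important and
    non-important parts so that w == important + non_important."""
    mask = (1 << low_bits) - 1
    return w & ~mask, w & mask  # (important high bits, non-important low bits)

def output_neuron(inputs, weights, threshold=0):
    """Compute one output neuron with the predict-and-skip scheme."""
    parts = [split_weight(w) for w in weights]
    # First pass: use only the important bits of each weight.
    first = sum(x * hi for x, (hi, _) in zip(inputs, parts))
    if first <= threshold:
        return 0  # prediction: neuron inactive, skip the second pass
    # Second pass: add the contribution of the non-important bits.
    second = sum(x * lo for x, (_, lo) in zip(inputs, parts))
    return first + second
```

- when the first pass exceeds the threshold, the two passes together reproduce the exact full-precision dot product, since the two weight parts sum to the original weight.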
- FIG. 1-1 is a schematic structural diagram of a machine learning operation distribution system according to an embodiment;
- FIG. 1-2 is a schematic structural diagram of a machine learning operation distribution system according to another embodiment;
- FIG. 1-3 is a schematic structural diagram of a machine learning operation distribution system according to yet another embodiment;
- FIG. 1-4 is a diagram of an operation-storage-communication working mode according to an embodiment;
- FIG. 1-5A is a schematic structural diagram of a computing device according to an embodiment;
- FIG. 1-5B is a structural diagram of a computing device according to an embodiment;
- FIG. 1-5C is a structural diagram of a computing device according to another embodiment;
- FIG. 1-5D is a structural diagram of a main processing circuit according to an embodiment;
- FIG. 1-5E is a structural diagram of another computing device according to an embodiment;
- FIG. 1-5F is a schematic structural diagram of a tree module according to an embodiment;
- FIG. 1-5G is a structural diagram of yet another computing device according to an embodiment;
- FIG. 1-5H is a structural diagram of still another computing device according to an embodiment;
- FIG. 1-5I is a schematic structural diagram of a computing device according to an embodiment;
- FIG. 1-6 is a flowchart of a machine learning operation distribution method according to an embodiment.
- FIG. 2-1A is a schematic structural diagram of a computing device according to an embodiment of the present invention.
- FIG. 2-1B is a schematic structural diagram of a layered storage device according to an embodiment of the present application.
- FIG. 2-1C is a schematic structural diagram of a 3T SRAM memory cell provided by an embodiment of the present application.
- FIG. 2-1D is a schematic structural diagram of a data processing device according to an embodiment of the present application.
- FIG. 2-1E is a schematic structural diagram of another data processing apparatus provided by an embodiment of the present application.
- FIG. 2-2 is a schematic structural diagram of a computing device provided by an embodiment of the present application.
- FIG. 2-3 is a structural diagram of a computing device provided by an embodiment of the present application.
- FIG. 2-4 is a structural diagram of a computing device provided by another embodiment of the present application.
- FIG. 2-5 is a structural diagram of a main processing circuit provided by an embodiment of the present application.
- FIG. 2-6 is a structural diagram of another computing device provided by an embodiment of the present application.
- FIG. 2-7 is a schematic structural diagram of a tree module provided by an embodiment of the present application.
- FIG. 2-8 is a structural diagram of yet another computing device provided by an embodiment of the present application.
- FIG. 2-9 is a structural diagram of still another computing device provided by an embodiment of the present application.
- FIG. 2-10 is a structural diagram of a combined processing device provided by an embodiment of the present application.
- FIG. 2-11 is a schematic structural diagram of a computing device provided by an embodiment of the present application.
- FIG. 2-12 is a structural diagram of another combined processing device provided by an embodiment of the present application.
- FIG. 2-13 is a schematic structural diagram of a board card provided by an embodiment of the present application.
- FIG. 2-14 is a schematic flowchart of a calculation method provided by an embodiment of the present invention.
- a machine learning computing distribution system includes: a cloud server 10 and a terminal server 20.
- the user inputs corresponding demand information through a terminal device according to his actual needs.
- the terminal device includes an input acquisition unit with a control function; the input acquisition unit may take a form selected by the user, such as an APP or an API interface of another program.
- the demand information input by the user mainly covers three aspects: function demand information, accuracy demand information, and memory demand information.
- the computing tasks include functional requirements tasks, accuracy requirements tasks and memory requirements tasks. It needs to be clear that the computing task of the first machine learning algorithm and the computing task of the second machine learning algorithm are the same computing task.
- Hardware performance parameters include but are not limited to computing power, energy consumption, accuracy and speed.
- machine learning algorithms include but are not limited to neural network algorithms and deep learning algorithms.
- the machine learning algorithm has obvious stage-by-stage characteristics, such as the operation of each layer of neural network, each iteration of the clustering algorithm, and so on.
- the machine learning algorithm is divided into multiple stages.
- for example, when the machine learning algorithm is a multi-layer neural network algorithm, the multiple stages correspond to the multiple layers;
- when the machine learning algorithm is a clustering algorithm, the multiple stages correspond to the multiple iterations. In each stage of the calculation, both the terminal server 20 and the cloud server 10 can be used for computation.
- since the computing power of the terminal server is low, the computing performance of the corresponding first machine learning algorithm is also low;
- since the computing power of the cloud server is high, the computing performance of the corresponding second machine learning algorithm is also high.
- computing the task of the first machine learning algorithm at each stage in the terminal server 20 yields a terminal operation result with lower accuracy more quickly;
- although computing the task of the second machine learning algorithm at each stage in the cloud server 10 takes longer, it yields a cloud operation result with higher accuracy. Therefore, although the terminal operation result can be obtained faster than the cloud operation result, the cloud operation result is more accurate.
- for example, the terminal server 20 may obtain the result that the animal in an image is a cat faster than the cloud server 10, while the cloud server 10 may obtain more accurate calculation results, such as the cat's breed.
- in the above machine learning operation distribution system and method, when a computing task needs to be completed according to the user's demand information, the computing task is executed on the terminal server and the cloud server respectively, so that the same computing task is completed using different machine learning algorithms and calculation results with different degrees of accuracy can be obtained.
- based on the different machine learning algorithms, the terminal server generates both the terminal server control instructions executed on the terminal server and the cloud server control instructions executed on the cloud server.
- because the terminal operation result can be output first, the user does not need to wait a long time, processing efficiency is improved, and the computing resources of the terminal server and the cloud server are fully utilized, so that the same computing task can be executed on both the terminal server and the cloud server.
- the terminal server 20 is further used to parse the terminal server control instruction to obtain a terminal control signal, to calculate the computing task of the corresponding first machine learning algorithm at each stage according to the terminal control signal to obtain the terminal operation result, and to send the cloud server control instruction to the cloud server 10.
- the cloud server 10 is configured to receive the cloud server control instruction, parse it to obtain a cloud control signal, and calculate the computing task of the corresponding second machine learning algorithm at each stage according to the cloud control signal to obtain the cloud operation result.
- the hardware performance parameter includes computing capability
- the terminal server 20 is specifically used to obtain the computing capability of the terminal server 20 and the computing capability of the cloud server 10, to select the first machine learning algorithm according to the computing task and the computing capability of the terminal server, and to select the second machine learning algorithm according to the computing task and the computing capability of the cloud server.
- the hardware performance parameters of the terminal server 20 include the computing capabilities of the terminal server 20
- the hardware performance parameters of the cloud server 10 include the computing capabilities of the cloud server 10.
- the computing capability can be obtained from the configuration information preset by the computing module.
- the computing power of a server affects its computing speed; based on the computing power of the computing module, a more suitable machine learning algorithm can be selected more accurately.
- the first machine learning algorithm includes a first neural network model
- the second machine learning algorithm includes a second neural network model.
- a neural network model is used as an example to specifically describe that the machine learning operation distribution system is specifically applied to the distribution of neural network operations, and the distribution system includes:
- the terminal server 20 is used to obtain the demand information, the hardware performance parameters of the terminal server 20, and the hardware performance parameters of the cloud server 10; to generate a corresponding computing task according to the demand information; to select the first neural network model to run on the terminal server 20 according to the computing task and the hardware performance parameters of the terminal server 20, and to select the second neural network model to run on the cloud server 10 according to the computing task and the hardware performance parameters of the cloud server 10; and to generate terminal server control instructions based on the selected first neural network model and the computing task, and cloud server control instructions based on the selected second neural network model and the computing task;
- the terminal server control instruction is parsed to obtain a terminal control signal, the computing task of the corresponding first neural network model is calculated according to the terminal control signal to obtain a terminal operation result, and the cloud server control instruction is sent to the cloud server 10.
- the cloud server 10 is used to receive the cloud server control instruction, parse it to obtain a cloud control signal, and calculate the computing task of the corresponding second neural network model according to the cloud control signal to obtain the cloud operation result.
- when the computing task needs to be completed according to the user's demand information, the computing task is executed on the terminal server and the cloud server respectively, so that the same computing task is completed using different neural network models and calculation results with different degrees of accuracy can be obtained.
- based on the different neural network models, the terminal server generates both the terminal server control instructions executed on the terminal server and the cloud server control instructions executed on the cloud server.
- because the terminal operation result can be output first, the user does not need to wait a long time, processing efficiency is improved, and the computing resources of the terminal server and the cloud server are fully utilized, so that the same computing task can be executed on both the terminal server and the cloud server.
- the terminal server 20 is further configured to, after outputting the terminal operation result and upon receiving a stop operation instruction, send the stop operation instruction to the cloud server 10 to terminate the computing work of the cloud server 10.
- in this way, the user first obtains an operation result with lower accuracy. If the user wants a more accurate result, the user can wait for the cloud server 10 to complete its calculation and output the cloud operation result through the terminal server 20, thereby obtaining both a lower-accuracy and a higher-accuracy operation result. However, if the lower-accuracy result already meets the user's needs and the higher-accuracy result is no longer wanted, the user can input a stop operation instruction through the user terminal. After receiving the stop operation instruction, the distribution system terminates the computing work of the cloud server 10; that is, the higher-accuracy operation result is either never completed or, even if completed, no longer output.
- the user can thus choose to obtain only the lower-accuracy operation result, which saves the user's time, preserves the operation performance of the machine learning operation distribution system, and avoids wasting computing resources.
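- the stop-instruction flow can be sketched with a cancellation flag that the staged cloud computation checks between stages; the stage count, timings, and names below are illustrative assumptions:

```python
import threading
import time

stop_event = threading.Event()  # set when the user sends a stop instruction
results = {}

def cloud_computation(result_holder):
    """Hypothetical staged cloud computation that checks for the
    stop instruction between stages, as the distribution system does."""
    for stage in range(5):
        if stop_event.is_set():
            return            # terminate the cloud computing work early
        time.sleep(0.01)      # stand-in for one stage of computation
    result_holder["cloud"] = "high-accuracy result"

worker = threading.Thread(target=cloud_computation, args=(results,))
worker.start()
# The terminal result arrived and satisfied the user: send the stop instruction.
stop_event.set()
worker.join()
# The high-accuracy result is never produced.
```

- checking the flag at stage boundaries matches the staged structure of the algorithms described earlier, where each layer or iteration is a natural cancellation point.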
- the terminal server 20 includes a terminal controller unit 210, a terminal arithmetic unit 220, and a terminal communication unit 230; the terminal controller unit 210 is connected to the terminal arithmetic unit 220 and the terminal communication unit 230, respectively.
- the terminal controller unit 210 is used to obtain the demand information, the hardware performance parameters of the terminal server 20, and the hardware performance parameters of the cloud server 10; to generate a corresponding computing task according to the demand information; to select the first machine learning algorithm to run on the terminal server 20 according to the computing task and the hardware performance parameters of the terminal server 20, and to select the second machine learning algorithm to run on the cloud server 10 according to the computing task and the hardware performance parameters of the cloud server 10; to generate terminal server control instructions based on the first machine learning algorithm and the computing task, and cloud server control instructions based on the second machine learning algorithm and the computing task; and
- to parse the terminal server control instruction to obtain a terminal control signal.
- the terminal operation unit 220 is used to calculate the computing task of the corresponding first machine learning algorithm according to the terminal control signal to obtain a terminal operation result; the terminal communication unit 230 is used to send the cloud server control instruction to the cloud server 10.
- the terminal controller unit 210 obtains the demand information input by the user, generates the corresponding computing task, and performs an evaluation according to the hardware performance parameters of the terminal server 20 and the cloud server 10, such as computing capability, energy consumption, accuracy, and speed. It then selects a suitable first machine learning algorithm for the terminal server and a suitable second machine learning algorithm for the cloud server based on the demand information and the evaluation results, and generates different control instructions according to the computing requirements of these different machine learning algorithms.
- an instruction set containing the control instructions is pre-stored in the terminal server 20 and the cloud server 10, and the terminal controller unit 210 generates the terminal server control instructions for the terminal server 20 and the cloud server control instructions for the cloud server 10 according to the input demand information.
- the following mathematical model may be selected as an embodiment.
- the index is the maximum number of floating-point/fixed-point operations per second, recorded as the parameter C; the computing needs are then analyzed, first by judging the macro neural network model function g(x), that is, whether to choose a CNN, an RNN, or a DNN, etc.
- CNN and DNN are used more in the field of image vision, while RNN is used more in the fields of text and audio; through this basic filtering, the suitable neural network type can be judged more quickly, and the candidates are then filtered based on energy consumption W, accuracy R, and speed S.
- the terminal controller unit 210 evaluates by establishing a mathematical model of parameters such as energy consumption, speed, and accuracy, and then selects the machine learning algorithm most suitable for the terminal server 20 and the cloud server 10, and performs training or inference.
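- one possible form of such a model, purely as a sketch: filter candidates by the macro network type g(x) for the task domain, then score the remainder by accuracy R, speed S (derived from the peak operation rate C), and energy W. The candidate table, scoring weights, and all numbers are invented for illustration:

```python
CANDIDATES = [
    # (name,       type,  domain,   W in J, R in [0,1], ops per inference)
    ("small-cnn", "CNN", "vision", 0.5,    0.80,        1e8),
    ("large-cnn", "CNN", "vision", 5.0,    0.95,        5e9),
    ("small-rnn", "RNN", "text",   0.4,    0.78,        8e7),
]

def select_model(domain, C, alpha=1.0, beta=1.0, gamma=1.0):
    """Pick the candidate maximizing a weighted score: higher
    accuracy R and speed S = C / ops, lower energy consumption W."""
    pool = [c for c in CANDIDATES if c[2] == domain]  # g(x): domain filter
    def score(candidate):
        name, kind, dom, W, R, ops = candidate
        S = C / ops  # inferences per second on hardware with peak rate C
        return alpha * R + beta * S - gamma * W
    return max(pool, key=score)[0]
```

- with a low-capability terminal (small C), the speed and energy terms dominate and a small model wins; tuning alpha, beta, and gamma lets the controller weight accuracy more heavily for the cloud server.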
- the hardware configuration of the terminal server 20 can be obtained directly through the system, for example via Android/iOS system calls; the hardware configuration of the cloud server 10 is obtained by the terminal server 20 sending a request to the cloud server 10 through the terminal communication unit 230 and receiving the returned configuration information.
- the terminal controller unit 210 also parses the terminal server control instruction to obtain a terminal control signal, and the terminal controller unit 210 sends the terminal control signal to the terminal arithmetic unit 220 and the terminal communication unit 230.
- the terminal operation unit 220 receives the corresponding terminal control signal, and calculates the operation task of the corresponding first machine learning algorithm according to the terminal control signal to obtain the terminal operation result.
- the terminal communication unit 230 is used to send the cloud server control instruction to the cloud server 10.
- the above-mentioned first machine learning algorithm includes a first neural network model.
- the cloud server 10 includes a cloud controller unit 110, a cloud computing unit 120, and a cloud communication unit 130; the cloud controller unit 110 is connected to the cloud computing unit 120 and the cloud communication unit 130, respectively, and the cloud communication unit 130 is connected to the terminal communication unit 230 for data interaction between the cloud server 10 and the terminal server 20.
- the cloud communication unit 130 is used to receive the cloud server control instruction, send the cloud server control instruction to the cloud controller unit 110, and obtain the cloud computing result and send it to the terminal server 20;
- the cloud controller unit 110 is used to receive the cloud server control instruction, and parse the cloud server control instruction to obtain a cloud control signal;
- the cloud computing unit 120 is used to calculate the computing task of the corresponding second machine learning algorithm according to the cloud control signal to obtain the cloud computing result, which is sent to the terminal server 20 through the cloud communication unit 130.
- the terminal controller unit 210 sends the generated cloud server control instruction to the cloud server 10 through the terminal communication unit 230.
- the cloud communication unit 130 receives the cloud server control instruction and sends it to the cloud controller unit 110.
- the cloud controller unit 110 parses the cloud server control instruction to obtain the cloud control signal and sends it to the cloud computing unit 120 and the cloud communication unit 130.
- the cloud computing unit 120 receives the corresponding cloud control signal, calculates the computing task of the corresponding second machine learning algorithm according to the cloud control signal, and obtains the cloud computing result.
- the above second machine learning algorithm includes a second neural network model.
- data communication between the cloud server 10 and the terminal server 20 proceeds alongside the respective calculation processes of the cloud server 10 and the terminal server 20.
- the terminal communication unit 230 sends data to the cloud communication unit 130 according to the corresponding terminal control signal; in turn, the cloud communication unit 130 sends data to the terminal communication unit 230 according to the corresponding cloud control signal. Since the terminal server 20 only obtains a low-accuracy operation result, its operation takes a short time; after the operation of the terminal server 20 is completed, the terminal operation result is sent to the user's terminal device first.
- after the operation of the cloud server 10 is completed, the cloud communication unit 130 sends the cloud computing result to the terminal communication unit 230, and the terminal server 20 then sends the cloud computing result to the user's terminal device.
- the terminal communication unit 230 and the cloud communication unit 130 respectively perform data transmission between the terminal server 20 and the cloud server 10 through a communication protocol.
- the terminal server 20 further includes a terminal storage unit 240.
- the terminal storage unit 240 is connected to the terminal arithmetic unit 220 and the terminal controller unit 210, respectively.
- the terminal storage unit 240 is used to receive the input data of the terminal server 20 and to perform terminal data storage.
- the terminal storage unit 240 may determine the terminal input data according to the terminal server control instruction generated by the terminal instruction generation circuit 210b, store that data, and store the intermediate data of the terminal operation process.
- the stored data format may be a floating point number or a quantized fixed point number.
- the terminal storage unit 240 may be any device or storage space capable of storing data, such as SRAM or DRAM, and is used to store the terminal's data and the terminal's instructions.
- the data includes but is not limited to at least one of input neurons, output neurons, weights, images, and vectors.
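- a minimal sketch of what "quantized fixed-point" storage could look like, assuming a signed Q-format with 8 fractional bits (the format choice is an assumption, not specified by the patent):

```python
def to_fixed(x, frac_bits=8):
    """Quantize a float to a signed fixed-point integer with
    frac_bits fractional bits (round to nearest)."""
    return int(round(x * (1 << frac_bits)))

def to_float(q, frac_bits=8):
    """Recover the approximate float value from the stored integer."""
    return q / (1 << frac_bits)

weights = [0.7071, -1.25, 0.0039]
stored = [to_fixed(w) for w in weights]    # compact integers to store
recovered = [to_float(q) for q in stored]  # approximately the originals
```

- the round-trip error is bounded by half the quantization step, here 1/512, which is the usual trade-off for the smaller storage footprint.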
- the terminal operation unit 220 and the terminal storage unit 240 are two separate components. After the operation of the terminal operation unit 220 is completed, the terminal operation result is first transferred to the terminal storage unit 240, and then Then, the terminal storage unit 240 and the terminal communication unit 230 encode and transmit the terminal operation result, and in the process of encoding and transmission communication, the terminal arithmetic unit 220 has already started the next round of operation. Using this working mode will not cause excessive waiting delay.
- the equivalent computing time of each round is the actual computing time plus the dump time. Since the dump time is much shorter than the encoding and transmission time, this method can fully mobilize the computing power of the terminal operation unit 220, so that the terminal operation unit 220 is kept as busy as possible.
- the corresponding terminal server control command can be generated in the terminal command generation circuit 210b according to the above-mentioned working mode.
- this part may be implemented entirely by an algorithm, using the CPU device of the terminal server 20 itself.
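As a rough illustration of the timing argument above, the following Python sketch compares the serial mode (each round waits for encoding and transmission to finish) with the overlapped mode, in which only the dump time is exposed per round. The function names and time units are illustrative, not part of the disclosure.

```python
def serial_total(rounds, compute, dump, transmit):
    # every round pays compute + dump + encode/transmit in full
    return rounds * (compute + dump + transmit)

def overlapped_total(rounds, compute, dump, transmit):
    # the next round's computation overlaps the previous round's
    # encoding/transmission, so the equivalent per-round time is
    # compute + dump; only the final transmission remains exposed
    # (valid when transmit <= compute + dump)
    return rounds * (compute + dump) + transmit
```

With compute = 5, dump = 1, and transmit = 3 over 10 rounds, the serial mode takes 90 time units while the overlapped mode takes 63, matching the claim that the equivalent computing time per round reduces to the actual computing time plus the dump time.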
- the cloud server 10 further includes a cloud storage unit 140.
- the cloud storage unit 140 is connected to the cloud computing unit 120 and the cloud controller unit 110 respectively.
- the cloud storage unit 140 is used to receive cloud input data and perform cloud data storage.
- the cloud storage unit 140 may determine the cloud input data, store that data, and store the cloud computing process according to the cloud server control instruction.
- the stored data format may be a floating point number or a quantized fixed point number.
- the cloud storage unit 140 may be a device or storage space capable of storing data, such as SRAM, DRAM, etc., for storing data in the cloud and instructions in the cloud.
- the data includes but is not limited to at least one of input neurons, output neurons, weights, images, and vectors.
- the cloud computing unit 120 and the cloud storage unit 140 are two separate components. After the cloud computing unit 120 completes the operation, the cloud computing result is first transferred to the cloud storage unit 140; the cloud storage unit 140 and the cloud communication unit 130 then encode and transmit the cloud operation result, and during the encoding and transmission, the cloud computing unit 120 has already started the next round of calculation. This working mode does not cause excessive waiting delay.
- the equivalent computing time of each round is the actual computing time plus the dump time. Since the dump time is much shorter than the encoding and transmission time, this method can fully mobilize the computing capability of the cloud computing unit 120, so that the cloud computing unit 120 is kept as busy as possible. It should be noted that the corresponding cloud server control command can be generated in the terminal command generation circuit 210b according to the above working mode.
- the terminal controller unit 210 includes a terminal evaluation circuit 210a, a terminal instruction generation circuit 210b, and a terminal instruction analysis circuit 210c; the terminal instruction generation circuit 210b is connected to the terminal evaluation circuit 210a and the terminal instruction analysis circuit 210c.
- the terminal evaluation circuit 210a, the terminal instruction generation circuit 210b, and the terminal instruction analysis circuit 210c are connected to the terminal operation unit 220, the terminal storage unit 240, and the terminal communication unit 230, respectively.
- the terminal evaluation circuit 210a is used to obtain demand information, hardware performance parameters of the terminal server 20, and hardware performance parameters of the cloud server 10; generate a corresponding computing task according to the demand information; select the first machine learning algorithm to run on the terminal server 20 according to the computing task and the hardware performance parameters of the terminal server 20; and select the second machine learning algorithm to run on the cloud server 10 according to the computing task and the hardware performance parameters of the cloud server 10.
- the terminal instruction generation circuit 210b is used to generate terminal server control instructions based on the first machine learning algorithm and the computing task, and to generate cloud server control instructions according to the second machine learning algorithm and the computing task;
- the terminal instruction analysis circuit 210c is used to analyze the terminal server control instruction to obtain a terminal control signal.
- the terminal evaluation circuit 210a obtains the demand information input by the user and, based on the demand information and the hardware performance parameters of the terminal server 20 and the cloud server 10, respectively selects a first machine learning algorithm with low computing power for the terminal and a second machine learning algorithm with higher computing power for the cloud.
- after the selection is completed, the terminal instruction generation circuit 210b generates the corresponding terminal server control instructions and cloud server control instructions according to the low computing power of the first machine learning algorithm for the terminal server 20 and the high computing power of the second machine learning algorithm for the cloud server 10, respectively.
- the control instructions in the terminal server control instructions and the cloud server control instructions can include operation allocation instructions, memory access instructions and data communication instructions, respectively.
- the terminal server control instruction is used for control in the terminal server 20.
- the cloud server control instruction is sent to the cloud communication unit 130 through the terminal communication unit 230, and then sent to the cloud controller unit 110 by the cloud communication unit 130 to be stored in the cloud server 10.
- the terminal instruction analysis circuit 210c is used to analyze the terminal server control instruction to obtain a terminal control signal, and to cause the terminal operation unit 220, the terminal storage unit 240, and the terminal communication unit 230 to operate, according to the terminal control signal, as specified by the terminal server control instruction.
- the allocation method used by the operation allocation scheme may be: the same computing task is allocated according to the different computing capabilities, precision, speed, and energy consumption of the machine learning algorithms; that is, different machine learning algorithms complete the same computing task.
- the terminal server 20 and the cloud server 10 can compute the same computing task at the same time, or compute the same computing task at different times, or one of them can be selected according to the user's needs to perform the computation.
- the AlexNet neural network model has low computing power, but its space-time cost is minimal.
- the higher computing power of the ResNet neural network model comes at the cost of more energy consumption.
- the neural network model with low computing power can give a less accurate operation result, and that result can be within the range accepted by the user.
- the neural network model with low computing power requires lower power consumption and a moderate inference time. Therefore, given the lower performance of the terminal server 20 compared to the cloud server 10, the first neural network model with lower computing power can be selected to complete its calculation in the terminal server 20, while the second neural network model with high computing power completes its calculation in the cloud server 10. It is then up to the user to decide whether to further obtain the high-precision classification result. In this way, the user can first be provided with a low-accuracy calculation result, avoiding a long waiting time, while also being offered a choice of scenarios.
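The two-tier result flow described above can be sketched as follows; `terminal_model` and `cloud_model` are hypothetical stand-ins for a low-capability and a high-capability network, and the user flag decides whether the high-precision cloud result is produced at all.

```python
def terminal_model(sample):
    # stand-in for a small, fast, low-accuracy model on the terminal
    return round(sum(sample))

def cloud_model(sample):
    # stand-in for a large, slow, high-accuracy model in the cloud
    return sum(sample)

def classify(sample, want_high_precision):
    # the terminal answers first, avoiding a long waiting time;
    # the cloud result is produced only if the user asks for it
    results = [("terminal", terminal_model(sample))]
    if want_high_precision:
        results.append(("cloud", cloud_model(sample)))
    return results
```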
- the memory access instruction is a memory management instruction based on calculation allocation, and is used to control the terminal storage unit 240 or the cloud storage unit 140 to perform data storage.
- the data communication instruction is a data interaction instruction to the cloud server 10 and the terminal server 20, and is used to control the data communication between the terminal communication unit 230 and the cloud communication unit 130.
- system-level scheduling of multiple terminal servers 20 and one cloud server 10 can be performed, and multiple terminal servers 20 and one cloud server 10 jointly complete a system-level task with high complexity.
- the cloud controller unit 110 includes a cloud command parsing circuit 110a, and the cloud command parsing circuit 110a is connected to the cloud computing unit 120, the cloud storage unit 140, and the cloud communication unit 130, respectively.
- the cloud instruction parsing circuit 110a is used to receive the cloud server control instruction, and parse the cloud server control instruction to obtain the cloud control signal, and enable the cloud computing unit 120 and the cloud storage unit according to the cloud control signal 140 and the cloud communication unit 130 operate according to the cloud server control instructions.
- the operation principles of the cloud computing unit 120, the cloud storage unit 140, and the cloud communication unit 130 are the same as those of the terminal operation unit 220, the terminal storage unit 240, and the terminal communication unit 230 described above, and are not repeated here.
- the cloud command parsing circuit 110a obtains the cloud control signal by parsing the cloud server control command and sends the cloud control signal to the other components of the cloud server 10, so that the cloud server 10 can complete the calculation of the cloud neural network in an orderly manner, greatly speeding up the computing speed of the cloud neural network.
- the terminal arithmetic unit 220 is connected to the terminal communication unit 230, and the terminal storage unit 240 is connected to the terminal communication unit 230.
- the terminal communication unit 230 may encode and send the output data of the terminal operation unit 220 and the terminal storage unit 240 to the cloud communication unit 130. Conversely, the terminal communication unit 230 may also receive the data sent by the cloud communication unit 130, decode the data, and send it to the terminal arithmetic unit 220 and the terminal storage unit 240 again.
- the task amount of the terminal controller unit 210 can be reduced, so that the terminal controller unit 210 can complete the generation process of the control instruction in more detail.
- the cloud computing unit 120 is connected to the cloud communication unit 130
- the cloud storage unit 140 is connected to the cloud communication unit 130.
- the cloud communication unit 130 may encode the output data of the cloud computing unit 120 and the cloud storage unit 140 and send it to the terminal communication unit 230. Conversely, the cloud communication unit 130 may also receive the data sent by the terminal communication unit 230 and decode the data and send it to the cloud computing unit 120 and the cloud storage unit 140 again.
- the terminal computing unit 220 may be a computing component of the terminal server 20 itself, and the cloud computing unit 120 may be a computing component of the cloud server 10 itself.
- the computing component can be a CPU, a GPU, or a neural network chip.
- the terminal operation unit 220 and the cloud operation unit 120 may be operation units in the data processing unit of the artificial neural network chip, which are used to perform the corresponding operation on data according to control instructions stored in the storage unit (terminal storage unit 240 or cloud storage unit 140).
- the cloud controller unit 110 and the terminal controller unit 210 each correspond to the controller unit 311
- the cloud computing unit 120 and the terminal computing unit 220 each correspond to the computing unit 312
- the arithmetic unit 312 includes: a master processing circuit 3101 and a plurality of slave processing circuits 3102;
- the controller unit 311 is used to obtain input data and calculation instructions; in an optional solution, the input data and calculation instructions may be obtained through a data input and output unit, which may specifically be one or more data I/O interfaces or I/O pins.
- the above calculation instructions include but are not limited to: forward operation instructions or reverse training instructions, or other neural network operation instructions, etc., such as convolution operation instructions.
- the specific implementation of the present application does not limit the specific expression form of the above calculation instructions.
- the controller unit 311 is further configured to parse the calculation instruction to obtain a plurality of calculation instructions, and send the plurality of calculation instructions and the input data to the main processing circuit;
- the main processing circuit 3101 is configured to perform pre-processing on the input data and transfer data and operation instructions with the multiple slave processing circuits;
- a plurality of slave processing circuits 3102 configured to execute intermediate operations in parallel based on data transmitted from the master processing circuit and operation instructions to obtain a plurality of intermediate results, and transmit the plurality of intermediate results to the master processing circuit;
- the main processing circuit 3101 is configured to perform subsequent processing on the plurality of intermediate results to obtain the calculation result of the calculation instruction.
- the technical solution provided in this application sets the computing unit to a master multi-slave structure.
- it can split the data according to the calculation instructions of the forward operation, so that multiple slave processing circuits perform parallel operations on the part with a large amount of calculation, thereby increasing the operation speed, saving operation time, and further reducing power consumption.
- the above machine learning calculation may specifically include: artificial neural network operation
- the above input data may specifically include: input neuron data and weight data.
- the above calculation result may specifically be: the output neuron data resulting from the artificial neural network operation.
- the operation in the neural network may be one layer of operation in the neural network.
- the implementation process is as follows: in the forward operation, when the previous layer of the artificial neural network is completed, the operation instruction of the next layer will use the output neuron calculated in the operation unit as the input neuron of the next layer (or perform some operations on the output neuron and then use it as the input neuron of the next layer), and at the same time replace the weights with the weights of the next layer; in the reverse operation, when the reverse operation of the previous layer of the artificial neural network is completed, the operation instruction of the next layer will use the input neuron gradient calculated in the operation unit as the output neuron gradient of the next layer (or perform some operation on the input neuron gradient and then use it as the output neuron gradient of the next layer), and likewise replace the weights with the weights of the next layer.
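The layer-to-layer chaining of the forward operation can be sketched as follows, assuming a plain fully connected network with a ReLU between layers (the activation choice is an assumption for illustration, not taken from the disclosure): the output neurons of each layer become the input neurons of the next, and the weights are replaced by the next layer's weights at each step.

```python
def forward(x, layers):
    # layers: list of (weights, biases) pairs; weights is a list of rows.
    # The output of layer K is fed as the input of layer K+1, and the
    # weights are swapped for the next layer's weights on each iteration.
    for weights, biases in layers:
        x = [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
             for row, b in zip(weights, biases)]
    return x
```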
- the above machine learning calculation may also include support vector machine operation, k-nearest neighbor (k-nn) operation, k-mean (k-means) operation, principal component analysis operation and so on.
- the input neurons and output neurons of the multi-layer operations do not refer to the neurons in the input layer and the output layer of the entire neural network; rather, for any two adjacent layers in the network, the neurons in the lower layer of the network forward operation are the input neurons, and the neurons in the upper layer of the network forward operation are the output neurons.
- for a neural network with L layers, K = 1, 2, ..., L-1.
- for the Kth layer and the (K+1)th layer, the Kth layer is called the input layer, whose neurons are the input neurons, and the (K+1)th layer is called the output layer, whose neurons are the output neurons.
- that is, except for the top layer, each layer can be used as an input layer, and the next layer is the corresponding output layer.
- the controller unit includes: an instruction cache unit 3110, an instruction processing unit 3111, and a storage queue unit 3113;
- the instruction cache unit 3110 is used to store the calculation instructions associated with the artificial neural network operation
- the instruction processing unit 3111 is configured to parse the calculation instruction to obtain multiple operation instructions
- the storage queue unit 3113 is used to store an instruction queue.
- the instruction queue includes a plurality of operation instructions or calculation instructions to be executed in the order of the queue.
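The in-order behavior of the storage queue unit amounts to a FIFO; a minimal sketch follows (the class and method names are illustrative, not from the disclosure):

```python
from collections import deque

class StorageQueue:
    """FIFO of operation instructions, executed strictly in queue order."""
    def __init__(self):
        self._queue = deque()

    def enqueue(self, instruction):
        self._queue.append(instruction)

    def next_instruction(self):
        # instructions leave the queue in arrival order; None when empty
        return self._queue.popleft() if self._queue else None
```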
- the main operation processing circuit may also include a controller unit, and the controller unit may include a main instruction processing unit, which is specifically used to decode instructions into microinstructions.
- the slave operation processing circuit may also include another controller unit, and the other controller unit includes a slave instruction processing unit, which is specifically used to receive and process microinstructions.
- the above microinstruction may be the next level instruction of the instruction.
- the microinstruction can be obtained by splitting or decoding the instruction, and can be further decoded into control signals of each component, each unit, or each processing circuit.
- the structure of the calculation instruction may be as shown in the following table.
- the calculation instruction may include: one or more operation fields and an operation code.
- the calculation instruction may include a neural network operation instruction. Taking neural network operation instructions as an example, as shown in Table 1, register number 0, register number 1, register number 2, register number 3, and register number 4 may be operation domains, where each of them may be the number of one or more registers.
- the above register may be an off-chip memory. Of course, in actual application, it may also be an on-chip memory for storing data.
- the controller unit may further include:
- the dependency processing unit 3112 is configured to determine, when there are multiple operation instructions, whether there is an association relationship between the first operation instruction and the zeroth operation instruction preceding it; if the first operation instruction and the zeroth operation instruction have an association relationship, the first operation instruction is cached in the instruction storage unit, and after the execution of the zeroth operation instruction is completed, the first operation instruction is extracted from the instruction storage unit and transmitted to the operation unit;
- the determining whether there is an association relationship between the first operation instruction and the zeroth operation instruction before the first operation instruction includes:
- extracting, according to the first operation instruction, the first storage address interval of the data (such as a matrix) required by the first operation instruction, and extracting, according to the zeroth operation instruction, the zeroth storage address interval of the matrix required by the zeroth operation instruction; if the first storage address interval and the zeroth storage address interval have an overlapping area, it is determined that the first operation instruction and the zeroth operation instruction have an association relationship; if the first storage address interval and the zeroth storage address interval do not overlap, it is determined that the first operation instruction and the zeroth operation instruction do not have an association relationship.
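The overlap test used by the dependency processing unit can be written down directly; representing the intervals as half-open [start, end) pairs is an assumption made here for illustration, not specified in the disclosure.

```python
def has_dependency(first_interval, zeroth_interval):
    # the first instruction depends on the zeroth instruction iff their
    # storage address intervals overlap (half-open [start, end) intervals)
    f_start, f_end = first_interval
    z_start, z_end = zeroth_interval
    return f_start < z_end and z_start < f_end
```

For example, intervals (0, 100) and (50, 150) overlap, so the first instruction would be held in the instruction storage unit until the zeroth completes; (0, 100) and (100, 200) do not overlap, so the instructions can proceed independently.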
- the arithmetic unit 312 may include a master processing circuit 3101 and multiple slave processing circuits 3102.
- multiple slave processing circuits are distributed in an array; each slave processing circuit is connected to the adjacent other slave processing circuits, and the master processing circuit is connected to K slave processing circuits among the multiple slave processing circuits, the K slave processing circuits being: the n slave processing circuits in the first row, the n slave processing circuits in the mth row, and the m slave processing circuits in the first column. It should be noted that the K slave processing circuits shown in FIG.
- the slave processing circuit is a slave processing circuit directly connected to the master processing circuit among the plurality of slave processing circuits.
- K slave processing circuits are used to transfer data and instructions between the master processing circuit and the plurality of slave processing circuits.
- the main processing circuit 3101 may further include one or any combination of a conversion processing circuit 3101a, an activation processing circuit 3101b, and an addition processing circuit 3101c;
- the conversion processing circuit 3101a is used to perform the exchange between the first data structure and the second data structure (such as conversion between continuous data and discrete data) on the data block or intermediate result received by the main processing circuit; or to perform the exchange between the first data type and the second data type (for example, conversion between the fixed-point type and the floating-point type) on the data block or intermediate result received by the main processing circuit;
- the activation processing circuit 3101b is used to execute the activation operation of the data in the main processing circuit
- the addition processing circuit 3101c is used to perform addition operation or accumulation operation.
- the main processing circuit is used to determine that the input neuron is broadcast data and the weight value is distribution data, to split the distribution data into multiple data blocks, and to send at least one of the multiple data blocks and at least one of multiple operation instructions to the slave processing circuits;
- the plurality of slave processing circuits are configured to perform an operation on the received data block according to the operation instruction to obtain an intermediate result, and transmit the operation result to the master processing circuit;
- the main processing circuit is configured to process a plurality of intermediate results sent from the processing circuit to obtain the result of the calculation instruction, and send the result of the calculation instruction to the controller unit.
- the slave processing circuit includes: a multiplication processing circuit
- the multiplication processing circuit is configured to perform a product operation on the received data block to obtain a product result
- the forwarding processing circuit (optional) is used to forward the received data block or product result.
- An accumulation processing circuit is configured to perform an accumulation operation on the product result to obtain the intermediate result.
- the operation instruction may be a matrix-multiply-matrix instruction, an accumulation instruction, an activation instruction, or another calculation instruction.
- the actual formula that needs to be executed may be: s = s(∑ᵢ wᵢ·xᵢ + b), where the weight w is multiplied by the input data xᵢ, the products are summed, the offset b is then added, and the activation operation s(h) is performed to obtain the final output result s.
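In code the formula reads as follows, with a sigmoid chosen as the activation s(h) purely for illustration (the disclosure does not fix a particular activation function):

```python
import math

def neuron_output(w, x, b):
    # h = sum_i(w_i * x_i) + b, then s = activation(h)
    h = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-h))  # sigmoid stand-in for s(h)
```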
- the arithmetic unit includes: a tree module 340, and the tree module 340 includes: a root port 3401 and a plurality of branch ports 3402; the root port of the tree module is connected to the main processing circuit, and each of the multiple branch ports of the tree module is connected to one of the multiple slave processing circuits;
- the tree module has a sending and receiving function; in one case the tree module performs the sending function, and in another case the tree module performs the receiving function.
- the tree module is used to forward data blocks, weights, and operation instructions between the master processing circuit and the multiple slave processing circuits.
- the tree module is an optional structure of the computing device, and it may include at least one layer of nodes.
- each node is a line structure with a forwarding function, and the node itself may not have a computing function. If the tree module has zero layers of nodes, the tree module is not required.
- the tree module may have an n-ary tree structure, for example, the binary tree structure shown in FIGS. 1-5F; of course, it may also be a ternary tree structure, where n may be an integer greater than or equal to 2.
- the specific implementation of the present application does not limit the specific value of the above-mentioned n.
- the above-mentioned number of layers may also be 2.
- the slave processing circuits may be connected to nodes of layers other than the penultimate layer, for example, the nodes in the last layer as shown in FIGS. 1-5F.
- the above operation unit may carry a separate cache, as shown in FIGS. 1-5G, and may include: a neuron cache unit 363, which caches the input neuron vector data and the output neuron vector data of the slave processing circuit.
- the operation unit may further include: a weight buffer unit 364 for buffering weight data required by the slave processing circuit in the calculation process.
- the operation unit 312 is shown in FIG. 1-5B, and may include a branch processing circuit 3103; its specific connection structure is shown in FIG. 1-5B, where,
- the main processing circuit 3101 is connected to the branch processing circuit 3103 (one or more), and the branch processing circuit 3103 is connected to one or more slave processing circuits 3102;
- the branch processing circuit 3103 is used to perform forwarding of data or instructions between the main processing circuit 3101 and the slave processing circuit 3102.
- the controller unit obtains the input neuron matrix x, the weight matrix w and the fully connected operation instruction from the storage unit, and transmits the input neuron matrix x, the weight matrix w and the fully connected operation instruction to the main processing circuit;
- the main processing circuit determines the input neuron matrix x as broadcast data and the weight matrix w as distribution data, splits the weight matrix w into 8 sub-matrices, distributes the 8 sub-matrices to 8 slave processing circuits through the tree module, and broadcasts the input neuron matrix x to the 8 slave processing circuits;
- the slave processing circuit executes the multiplication and accumulation operations of 8 sub-matrices and input neuron matrix x in parallel to obtain 8 intermediate results, and sends the 8 intermediate results to the main processing circuit;
- the main processing circuit is used to sort the 8 intermediate results to obtain the operation result of wx, perform the offset b operation followed by the activation operation to obtain the final result y, and send the final result y to the controller unit;
- the final result y is output or stored in the storage unit.
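The split/broadcast/combine pattern of the fully connected example can be sketched as follows; the "sorting" of intermediate results reduces here to concatenating the row blocks back in order, and all names are illustrative rather than from the disclosure.

```python
def split_rows(w, parts):
    # master: split the weight matrix row-wise into `parts` sub-matrices
    size, extra = divmod(len(w), parts)
    out, start = [], 0
    for p in range(parts):
        end = start + size + (1 if p < extra else 0)
        out.append(w[start:end])
        start = end
    return out

def slave_matvec(sub_w, x):
    # slave: multiply-accumulate its sub-matrix with the broadcast input x
    return [sum(w * v for w, v in zip(row, x)) for row in sub_w]

def master_combine(intermediates):
    # master: reassemble the intermediate results into the full w·x vector
    return [value for block in intermediates for value in block]
```

For the 8-way split of the example, the master would call `split_rows(w, 8)`, broadcast x to all 8 slave circuits, and concatenate their 8 intermediate results before applying the offset and activation.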
- the method for the computing device shown in FIG. 1-5A to execute the neural network forward operation instruction may specifically be:
- the controller unit extracts the neural network forward operation instruction, the operation domain corresponding to the neural network operation instruction, and at least one operation code from the instruction storage unit; the controller unit transmits the operation domain to the data access unit and sends the at least one operation code to the operation unit.
- the controller unit extracts the weight w and the offset b corresponding to the operation domain from the storage unit (when b is 0, the offset b need not be extracted) and transmits the weight w and the offset b to the main processing circuit of the operation unit; the controller unit also extracts the input data Xi from the storage unit and sends the input data Xi to the main processing circuit.
- the main processing circuit determines the multiplication operation according to the at least one operation code, determines the input data Xi as broadcast data, determines the weight data as distribution data, and splits the weight w into n data blocks;
- the instruction processing unit of the controller unit determines a multiplication instruction, an offset instruction, and an accumulation instruction according to the at least one operation code, and sends them to the main processing circuit. The main processing circuit broadcasts the multiplication instruction and the input data Xi to the multiple slave processing circuits and distributes the n data blocks among them (for example, with n slave processing circuits, each slave processing circuit receives one data block). The multiple slave processing circuits perform the multiplication operation on the input data Xi and the received data block according to the multiplication instruction to obtain intermediate results and send the intermediate results to the main processing circuit. The main processing circuit performs an accumulation operation on the intermediate results sent by the multiple slave processing circuits according to the accumulation instruction to obtain an accumulation result, adds the offset b to the accumulation result according to the offset instruction to obtain the final result, and sends the final result to the controller unit.
- the technical solution provided by this application realizes the multiplication and offset operations of the neural network through one instruction, that is, the neural network operation instruction; the intermediate results of the neural network calculation do not need to be stored or extracted, reducing the storage and extraction operations of intermediate data. Therefore, it has the advantages of reducing the corresponding operation steps and improving the calculation efficiency of the neural network.
- This application also discloses a machine learning computing device, which includes one or more computing devices mentioned in this application, for obtaining data to be calculated and control information from other processing devices, performing specified machine learning operations, and transferring the execution result to peripheral devices through the I/O interface.
- Peripheral equipment includes, for example, a camera, monitor, mouse, keyboard, network card, WiFi interface, or server.
- the computing devices can link and transmit data through a specific structure, for example, interconnect and transmit data through the PCIE bus to support larger-scale machine learning operations.
- the interconnection method can be any interconnection topology.
- the machine learning computing device has high compatibility, and can be connected with various types of servers through the PCIE interface.
- the distribution method includes the following steps:
- S702 Obtain demand information, hardware performance parameters of the terminal server, and hardware performance parameters of the cloud server.
- the user inputs his own demand through the terminal device, and the terminal server obtains the demand information input by the user.
- the demand information input by the user is mainly determined by three aspects: function demand information, accuracy demand information, and memory demand information.
- regarding function demand information, for example, the data set required to identify all animals and the data set required only to identify cats have an inclusive relationship. If the user only has the functional requirements of a vertical field, then only the user's demand is input through the input acquisition unit of the control part, and the corresponding data set is selected according to the size of the user's own memory and the required precision.
- the terminal server obtains demand information, hardware performance parameters of the terminal server and hardware performance parameters of the cloud server.
- the hardware performance parameters may include computing power, energy consumption, speed, and accuracy.
- S704 Generate a corresponding computing task according to the demand information; select a first machine learning algorithm to run on the terminal server according to the computing task and the hardware performance parameters of the terminal server, and select a second machine learning algorithm to run on the cloud server according to the computing task and the hardware performance parameters of the cloud server.
- the terminal controller unit in the terminal server generates a corresponding calculation task according to the demand information.
- the terminal evaluation circuit in the terminal controller unit evaluates the computing capacity, energy consumption, speed, and accuracy of the terminal server and the cloud server to establish a mathematical model, then selects the most suitable machine learning algorithm for each of the terminal server and the cloud server, which subsequently trains or performs inference.
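As a concrete illustration of this evaluation step, the sketch below scores hypothetical candidate algorithms against a server's computing capacity and picks the most accurate one that fits. All names, fields, and numbers are illustrative assumptions, not taken from this application.

```python
# Hypothetical sketch of the evaluation step: pick, for a given server,
# the most accurate machine learning algorithm whose compute demand fits
# that server's capacity. Fields and values are illustrative only.

def select_algorithm(candidates, compute_capacity):
    """Return the most accurate candidate whose compute demand fits."""
    feasible = [c for c in candidates if c["flops_required"] <= compute_capacity]
    if not feasible:
        raise ValueError("no algorithm fits this server")
    return max(feasible, key=lambda c: c["accuracy"])

candidates = [
    {"name": "small-cnn", "flops_required": 1e9,  "accuracy": 0.80},
    {"name": "resnet-50", "flops_required": 8e9,  "accuracy": 0.92},
    {"name": "ensemble",  "flops_required": 5e10, "accuracy": 0.95},
]

# Terminal server: low computing power; cloud server: high computing power.
first_algorithm = select_algorithm(candidates, compute_capacity=2e9)
second_algorithm = select_algorithm(candidates, compute_capacity=1e11)
```

With the terminal-class capacity only the lightweight model is feasible, while the cloud-class capacity admits the heavier, more accurate one, matching the first/second algorithm split described above.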
- S706 Generate a terminal server control instruction according to the first machine learning algorithm and the operation task, and generate a cloud server control instruction according to the second machine learning algorithm and the operation task.
- the terminal controller unit allocates the computing task to the terminal server according to the scale and computing power of the first machine learning algorithm, and to the cloud server according to the scale and computing power of the second machine learning algorithm, so that the terminal server and the cloud server each complete the same computing task.
- the terminal instruction generation circuit generates the corresponding terminal server control instructions and cloud server control instructions based on the user's needs, the selected data sets, and the computing power of the different machine learning algorithms.
- the terminal communication unit and the cloud communication unit transmit control instructions between the terminal server and the cloud server. Specifically, after the control instructions are generated, they are transmitted between the terminal server and the cloud server by the terminal communication unit and the cloud communication unit, respectively, through a communication protocol.
- with the above machine learning operation distribution method, when an operation task must be completed according to the user's demand information, the task is executed in both the terminal server and the cloud server, so that the same operation task serves different purposes through different machine learning algorithms and yields calculation results of different accuracies. Specifically, the hardware performance parameters of the terminal server and the cloud server are first evaluated; a first machine learning algorithm with low computing power is selected to run on the terminal server, and a second machine learning algorithm with high computing power is selected to run on the cloud server. Based on these different machine learning algorithms, the terminal server generates terminal server control instructions, which direct execution on the terminal server, and cloud server control instructions, which direct execution on the cloud server.
- in this way, the terminal computing results can be output first, which avoids long waits for the user, improves processing efficiency, and makes full use of the computing resources of the terminal server and the cloud server, so that the same computing task can be performed on both the terminal server and the cloud server.
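The dual-execution flow above can be sketched with ordinary threads standing in for the terminal and cloud servers; the function names and timings below are illustrative assumptions only.

```python
# Illustrative sketch: the same task is dispatched to a fast, low-accuracy
# "terminal" model and a slow, high-accuracy "cloud" model; results are
# collected in completion order, so the terminal result is available first.
import concurrent.futures
import time

def terminal_compute(task):
    time.sleep(0.01)                 # low-power algorithm: fast, coarser
    return ("terminal", task * 2)

def cloud_compute(task):
    time.sleep(0.05)                 # high-power algorithm: slow, precise
    return ("cloud", task * 2)

with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(terminal_compute, 21), pool.submit(cloud_compute, 21)]
    # as_completed yields futures in completion order: terminal first.
    results = [f.result() for f in concurrent.futures.as_completed(futures)]
```

The user sees `results[0]` (the terminal result) immediately, while the refined cloud result arrives later, mirroring the output-terminal-result-first behavior described above.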
- the method further includes the following steps:
- S708 Parse the terminal server control instruction and the cloud server control instruction separately, obtain a terminal control signal according to the terminal server control instruction, and obtain a cloud control signal according to the cloud server control instruction.
- the cloud instruction parsing circuit in the cloud controller unit parses the received cloud server control instruction to obtain the cloud control signal,
- and the terminal instruction parsing circuit parses the terminal server control instruction to obtain the terminal control signal.
- S710 Extract terminal data to be processed according to the terminal control signal, and extract cloud data to be processed according to the cloud control signal.
- the data to be processed includes one or more of training data or test data.
- the cloud controller unit extracts the corresponding cloud training data or cloud test data according to the cloud control signal and sends it to the buffer of the cloud computing unit.
- a certain amount of memory space can be pre-allocated for exchanging the data of intermediate computation stages.
- the terminal controller unit extracts the corresponding terminal training data or terminal test data according to the terminal control signal and sends it to the buffer of the terminal computing unit.
- here too, a certain amount of memory space can be pre-allocated for exchanging the data of intermediate computation stages.
- S712 Compute, according to the terminal data to be processed, the operation task of the first machine learning algorithm at each stage corresponding to the terminal server to obtain the terminal operation result, and/or compute, according to the cloud data to be processed, the operation task of the second machine learning algorithm at each stage corresponding to the cloud server to obtain the cloud operation result.
- the terminal controller unit sends the terminal data to be processed to the terminal computing unit, and the terminal computing unit computes, according to the transmitted data, the operation task of the first machine learning algorithm corresponding to each stage in the terminal server.
- the cloud controller unit sends the cloud data to be processed to the cloud computing unit, and the cloud computing unit computes, according to the transmitted data, the operation task of the second machine learning algorithm corresponding to each stage in the cloud server.
- the terminal communication unit sends data to the cloud communication unit according to the corresponding terminal control signal,
- and the cloud communication unit likewise sends data to the terminal communication unit according to the corresponding cloud control signal; the terminal operation result and the cloud operation result are sent to the user's terminal device through the terminal server.
- S704 includes:
- S7044 Select a first machine learning algorithm based on the computing task and the computing power of the terminal server, and select a second machine learning algorithm based on the computing task and the computing power of the cloud server.
- the computing power of the terminal server is weaker than that of the cloud server. Therefore, correspondingly, a first machine learning algorithm with low computing power is selected according to the computing power of the terminal server, and a second machine learning algorithm with high computing power is selected according to the computing power of the cloud server.
- the level of computing power affects the calculation time and calculation accuracy. For example, the second machine learning algorithm with higher computing power can obtain a more accurate calculation result, but the calculation time may be longer.
- the distribution method further includes:
- the user can first obtain an operation result of lower accuracy. If the user wants a more accurate operation result, the user can wait for the cloud server operation to complete and then output the cloud operation result through the terminal server; in this case the user obtains both a less accurate and a more accurate operation result. However, if, after obtaining the lower-accuracy result, the user does not want the more accurate result, the user terminal inputs a stop operation instruction; the distribution system receives the stop operation instruction and terminates the cloud server operation, so that the higher-accuracy result is either never completed or, even if completed, is no longer output.
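A minimal sketch of this stop-operation path, using a threading event as a stand-in for the stop operation instruction (all names are hypothetical, not the patent's API):

```python
# Illustrative sketch: after the terminal result is delivered, the user may
# send a stop instruction; the long-running cloud computation then terminates
# without ever producing output, as described above.
import threading
import time

stop_requested = threading.Event()
cloud_result = []

def cloud_compute():
    for _ in range(100):             # long-running high-accuracy operation
        if stop_requested.is_set():
            return                   # terminated: no result is output
        time.sleep(0.001)
    cloud_result.append("high-accuracy result")

worker = threading.Thread(target=cloud_compute)
worker.start()
stop_requested.set()                 # user inputs the stop operation instruction
worker.join()
# cloud_result stays empty: the higher-accuracy result is never output.
```
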
- S708 specifically includes:
- the terminal server is used to analyze the terminal server control instruction to obtain a terminal control signal
- S7084 Extract corresponding terminal training data or terminal test data according to the terminal control signal.
- the terminal instruction parsing circuit is used to parse the terminal server control instruction to obtain a terminal control signal, and extract corresponding terminal training data or terminal test data according to the terminal control signal.
- the data includes images, audio, text, etc. Images include still pictures, the frames that make up videos, or videos; audio includes vocal audio, music, noise, etc.; text includes structured text, text characters in various languages, etc.
- S708 also includes:
- the cloud server is used to parse the cloud server control instruction to obtain a cloud control signal
- the cloud instruction parsing circuit is used to parse the cloud server control instruction to obtain a cloud control signal, and extract corresponding cloud training data or cloud test data according to the cloud control signal.
- S712 specifically includes:
- S7124 Use the cloud server to compute, according to the cloud training data or cloud test data, the computing task of the second machine learning algorithm at each stage corresponding to the cloud server, to obtain the cloud computing result.
- the cloud computing unit executes the operation of the corresponding second machine learning algorithm at each stage according to the cloud training data or the cloud test data to obtain the cloud computing result.
- the terminal operation unit executes the operation of the corresponding first machine learning algorithm at each stage according to the terminal training data or the terminal test data to obtain the terminal operation result.
- the data communication between the terminal server and the cloud server is completed through the cloud communication unit and the terminal communication unit.
- the data communication between the computing part and the storage part of the cloud server and of the terminal server is forwarded through the cloud controller unit and the terminal controller unit, respectively, and finally exchanged jointly by the cloud communication unit and the terminal communication unit.
- since a neural network with low computing power is used in the terminal server to perform the above computing tasks, a lower-accuracy computing result can be obtained first; then, based on the user's further demand information, a neural network with high computing power in the cloud server can further obtain a highly accurate computing result.
- although the steps in FIGS. 1-6 are displayed in order according to the arrows, they are not necessarily executed in that order. Unless clearly stated herein, the execution of these steps is not strictly limited in order, and they may be executed in other orders. Moreover, at least some of the steps in FIGS. 1-6 may include multiple sub-steps or multiple stages. These sub-steps or stages are not necessarily executed at the same time, but may be executed at different times, and their execution order is not necessarily sequential; they may be executed in turn or alternately with at least part of the other steps, or of the sub-steps or stages of other steps.
- in the field of data processing, neural networks have been applied very successfully, but large-scale neural network operations require a large amount of computing time and energy, which poses serious challenges to the processing platform. Therefore, reducing the computation time and energy consumption of neural networks has become an urgent problem to be solved.
- FIG. 2-1A is a schematic structural diagram of a computing device according to an embodiment of the present invention.
- the computing device 100 includes:
- the storage unit 1019 is configured to store weights and input neurons, and the weights include important bits and non-important bits;
- the controller unit 1029 is configured to obtain the important bits and non-important bits of the weight as well as the input neuron, and transmit the important bits and non-important bits of the weight and the input neuron to the arithmetic unit 1039;
- the operation unit 1039 is configured to perform operation on the input neuron and the important bit to obtain a first operation result of the output neuron;
- the input neuron and the non-important bits are operated on to obtain a second operation result, and the sum of the first operation result and the second operation result is used as the output neuron.
- the data stored in the storage unit 1019 (input neurons or weights) includes floating-point data and fixed-point data. For floating-point data, the sign bit and exponent part are designated as important bits and the mantissa part as non-important bits; for fixed-point data, the sign bit and the first x bits of the numerical part are designated as important bits and the remaining bits of the numerical part as non-important bits, where x is an integer with 0 ≤ x < m and m is the total number of bits of the fixed-point data.
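The fixed-point split can be sketched as follows, assuming illustrative widths m = 8 and x = 3; the helper name is hypothetical.

```python
# Sketch of the fixed-point split described above: of the m-1 value bits,
# the top x bits (plus the sign) are "important" and the low bits are
# "non-important". Bit widths here are illustrative assumptions.

def split_fixed_point(value, m=8, x=3):
    """Split a fixed-point magnitude into important and non-important parts."""
    sign = -1 if value < 0 else 1
    mag = abs(value)
    low_bits = (m - 1) - x                     # width of the non-important part
    important = (mag >> low_bits) << low_bits  # top x bits of the value part
    non_important = mag & ((1 << low_bits) - 1)
    return sign * important, sign * non_important

n1, n2 = split_fixed_point(0b1101101)
# The two parts recompose the original value: N_in = N1_in + N2_in.
assert n1 + n2 == 0b1101101
```
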
- the above-mentioned preset threshold can be set by the user or the system default.
- the preset threshold can be 0, or can also be other integers, or decimals.
- the input neuron is represented by N_in and includes n bits, of which n1 are important bits and n2 are non-important bits;
- the value corresponding to the n1 important bits is represented by N1_in,
- and the value corresponding to the n2 non-important bits is represented by N2_in,
- where n1 + n2 = n
- and N_in = N1_in + N2_in;
- n is a positive integer,
- and n1 is a natural number less than n.
- the positions of the n1 important bits are continuous or discontinuous.
- when there are multiple input neurons, the operation unit 1039 includes multiple multipliers and at least one adder;
- the plurality of multipliers and the at least one adder are used to calculate the output neuron; consistent with the symbol definitions that follow, the computation has the form N_out = Σ_{i=1}^{T} N_in(i)·W(i) = Σ_{i=1}^{T} (N1_in(i) + N2_in(i))·(W1(i) + W2(i)).
- the operation unit 1039 includes a plurality of multipliers and at least one adder, and completes the above operation through them.
- where T is the number of input neurons;
- N_out is the output neuron;
- N1_in(i) is the important-bit value of the i-th input neuron;
- N2_in(i) is the non-important-bit value of the i-th input neuron;
- W1(i) is the important-bit value of the i-th weight;
- W2(i) is the non-important-bit value of the i-th weight;
- N_in(i) represents the value of the i-th input neuron;
- W(i) represents the value of the i-th weight.
- the above operation applies, for example, to a fully connected layer, convolution layer, or LSTM layer of the neural network model.
- the operation unit 1039 further includes a comparator, and the operation unit 1039 is specifically configured to: when the comparator finds that the first operation result is less than or equal to a preset threshold, skip the operation of the output neuron; if the first operation result is greater than the preset threshold, operate on the input neuron and the non-important bits to obtain a second operation result, and use the sum of the first operation result and the second operation result as the output neuron.
- the above arithmetic unit 1039 thus includes a comparator mainly used for comparison operations: if the first operation result is less than or equal to the preset threshold, the operation of the current input neuron is skipped and the inner-product operation of the next input neuron is executed.
- the final output neuron N_out is thus the sum of the first and second operation results when the first operation result exceeds the preset threshold; otherwise the operation of that output neuron is skipped.
- in this scheme, the important bits and non-important bits of the weight are obtained, as well as the input neurons, and the input neurons and the important bits are operated on to obtain the first operation result of the output neuron.
- if the first operation result is less than or equal to the preset threshold, the operation of the current output neuron is skipped; if the first operation result is greater than the preset threshold, the input neurons and the non-important bits are operated on to obtain
- the second operation result, and the sum of the first operation result and the second operation result is used as the output neuron.
- when the prediction indicates that an output neuron does not require an operation, the operation process of that output neuron is skipped.
- the new computing device thus integrates a computing method that predicts and skips output neurons that do not need to be computed, thereby reducing the calculation time and energy consumption of the neural network.
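A schematic of this predict-and-skip rule, with the weights pre-split into important-bit and non-important-bit parts; the function is a simplified stand-in for the hardware comparator path, and its names are illustrative.

```python
# Hedged sketch of predict-and-skip: compute a first result from the
# important weight bits only; if it is at or below the threshold, skip
# the non-important computation for this output neuron entirely.

def output_neuron(inputs, weights_hi, weights_lo, threshold=0.0):
    # First operation result: input neurons against the important weight bits.
    first = sum(n * w for n, w in zip(inputs, weights_hi))
    if first <= threshold:
        return None  # comparator predicts the neuron is inactive: skip it
    # Second operation result: input neurons against the non-important bits.
    second = sum(n * w for n, w in zip(inputs, weights_lo))
    return first + second  # N_out = first result + second result
```

When the cheap important-bit pass already falls below the threshold, the expensive refinement pass is never run, which is where the time and energy savings described above come from.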
- the arithmetic unit includes: a master processing circuit and a plurality of slave processing circuits;
- the master processing circuit is used to split the input neurons into multiple data blocks, broadcast the important bits of the weights to the multiple slave processing circuits, and distribute the multiple data blocks to the multiple slave processing circuits;
- the slave processing circuit is used to calculate the received data block and the important bits of the weight to obtain a partial result, and send the partial result to the master processing circuit;
- the main processing circuit is also specifically used to splice all the received partial results to obtain the first operation result.
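The split/broadcast/splice flow might be sketched as follows, with plain Python functions standing in for the master and slave processing circuits; block sizes and names are illustrative assumptions.

```python
# Illustrative master/slave flow: the master splits the input neurons into
# blocks, the broadcast weights reach every slave, each slave computes a
# partial result on its block, and the master splices the parts together.

def slave_process(block, offset, weights_hi):
    # Each slave multiplies its data block by the broadcast important bits.
    return [n * weights_hi[offset + j] for j, n in enumerate(block)]

def master_process(input_neurons, weights_hi, n_slaves=2):
    size = (len(input_neurons) + n_slaves - 1) // n_slaves
    partials = []
    for s in range(n_slaves):
        block = input_neurons[s * size:(s + 1) * size]  # distribute data blocks
        partials.append(slave_process(block, s * size, weights_hi))
    # Splice all received partial results into the first operation result.
    return [v for part in partials for v in part]
```
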
- the arithmetic unit further includes one or more branch processing circuits, each branch processing circuit is connected to at least one slave processing circuit,
- the branch processing circuit is configured to forward data blocks, broadcast data, and operation instructions between the main processing circuit and the plurality of slave processing circuits.
- the plurality of slave processing circuits are distributed in an array; each slave processing circuit is connected to adjacent slave processing circuits, and the master processing circuit is connected to k of the plurality of
- slave processing circuits, the k slave processing circuits being: the p slave processing circuits in the first row, the p slave processing circuits in the q-th row, and the q slave processing circuits in the first column;
- the k slave processing circuits are used for forwarding data and instructions between the master processing circuit and the plurality of slave processing circuits;
- the main processing circuit is used to determine that the input neurons are distribution data and the important bits of the weights are broadcast data, to divide the distribution data into multiple data blocks, and to send at least one of the multiple data blocks and at least one of a plurality of operation instructions to the k slave processing circuits;
- the k slave processing circuits are used to forward data between the master processing circuit and the plurality of slave processing circuits.
- the main processing circuit includes one or any combination of an activation processing circuit and an addition processing circuit.
- the slave processing circuit includes: a multiplication processing circuit
- the multiplication processing circuit is configured to perform a product operation on the received data block to obtain a product result.
- the slave processing circuit further includes an accumulation processing circuit configured to perform an accumulation operation on the product result.
- FIG. 2-1B is a schematic structural diagram of a layered storage device provided by an embodiment of the present application.
- the device includes an accurate storage unit and an inaccurate storage unit; the accurate storage unit is used to store the important bits in data, and the inaccurate storage unit is used to store the non-important bits in data.
- the accurate storage unit uses error-checking-and-correcting (ECC) memory, and the inaccurate storage unit uses non-ECC memory.
- the data stored in the hierarchical storage device are neural network parameters, including input neurons, weights, and output neurons; the accurate storage unit stores the important bits of the input neurons, output neurons, and weights, and the inaccurate storage unit stores their non-important bits.
- the data stored in the hierarchical storage device includes floating-point data and fixed-point data. The sign bit and exponent part of floating-point data are designated as important bits and the mantissa part as non-important bits; the sign bit of
- fixed-point data and the first x bits of its value part are designated as important bits and the remaining bits of the value part as non-important bits, where x is an integer with 0 ≤ x < m and m is the
- total number of bits of the fixed-point data. The important bits are stored in ECC memory for accurate storage, and the non-important bits are stored in non-ECC memory for inaccurate storage.
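A toy model of this hierarchical store, with two dicts standing in for the ECC and non-ECC memories; the class name and bit widths are assumptions for illustration.

```python
# Minimal sketch of the hierarchical (approximate) storage scheme: the top
# x value bits go to a dict standing in for ECC-protected memory, and the
# remaining low bits to one standing in for cheaper non-ECC memory.

class HierarchicalStore:
    def __init__(self, m=8, x=3):
        self.low_bits = (m - 1) - x
        self.ecc = {}        # accurate storage: important (high) bits
        self.non_ecc = {}    # inaccurate storage: non-important (low) bits

    def write(self, key, value):
        self.ecc[key] = (value >> self.low_bits) << self.low_bits
        self.non_ecc[key] = value & ((1 << self.low_bits) - 1)

    def read(self, key):
        # Splice important and non-important bits back together on read.
        return self.ecc[key] | self.non_ecc[key]

store = HierarchicalStore()
store.write("w0", 0b1101101)
```

Even if the non-ECC half suffers bit flips, only the low-order (non-important) part of the value is perturbed, which is the fault tolerance the scheme exploits.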
- the ECC memory includes dynamic random access memory (DRAM) with ECC check and static random-access memory (SRAM) with ECC check; the SRAM with ECC check may use 3T SRAM.
- the non-ECC memory includes DRAM without ECC check and SRAM without ECC check; the SRAM without ECC check may likewise use 3T SRAM.
- each stored bit cell in 3T SRAM consists of three MOS transistors.
- FIG. 2-1C is a schematic structural diagram of a 3T SRAM memory cell provided by an embodiment of the present application.
- the 3T SRAM memory cell is composed of three MOS transistors: M1 (the first MOS transistor), M2 (the second MOS transistor), and M3 (the third MOS transistor). M1 is used for gating, and M2 and M3 are used for storage.
- the gate of M1 is electrically connected to the word line (WL) and its source is electrically connected to the bit line (BL); the gate of M2 is connected to the source of M3 and, through resistor R2, to the operating voltage Vdd, and the drain of M2 is grounded; the gate of M3 is connected to the source of M2 and the drain of M1 and, through resistor R1, to the operating voltage Vdd, and the drain of M3 is grounded.
- WL is used to control gated access to the memory cell, and BL is used to read and write the memory cell. For a read operation, WL is pulled high and the bit is read from BL. For a write operation, WL is pulled high and BL is pulled high or low; since the driving capability of BL is stronger than that of the memory cell, the original state is forcibly overwritten.
- the storage device of the present application uses approximate storage technology, which fully exploits the fault tolerance of the neural network by storing the neural network parameters approximately:
- the important bits in the parameters are stored accurately while the non-important bits are stored inaccurately, thereby reducing storage overhead and memory-access energy consumption.
- FIG. 2-1D is a schematic structural diagram of a data processing device according to an embodiment of the present application
- the data processing device includes: an inaccurate arithmetic unit, an instruction control unit and the above-mentioned hierarchical storage device.
- the hierarchical storage device receives instructions and operation parameters, and stores important bits and instructions in the operation parameters in an accurate storage unit, and stores non-important bits in the operation parameters in an inaccurate storage unit.
- the instruction control unit receives the instructions in the hierarchical storage device, and decodes the instructions to generate control information to control the inexact computing unit to perform calculation operations.
- the non-precision calculation unit receives the calculation parameters in the layered storage device, performs calculation according to the control information, and transmits the calculation result to the layered storage device for storage or output.
- the non-precision computing unit is a neural network processor.
- the above operation parameters are neural network parameters
- the hierarchical storage device is used to store the neurons, weights and instructions of the neural network, and store the important bits of the neurons, the important bits of the weights and the instructions in the precise storage unit
- the non-important bits of the neuron and the non-important bits of the weight are stored in the non-precision storage unit.
- the non-precision computing unit receives the input neurons and weights in the layered storage device, completes the neural network operation according to the control information to obtain output neurons, and retransmits the output neurons to the layered storage device for storage or output.
- the non-precision arithmetic unit can have two calculation modes: (1) it directly receives the important bits of the input neurons and the important bits of the weights from the precise storage unit of the layered storage device and calculates on them; (2) it receives input neurons and weights composed of both important and non-important bits, where the important and non-important bits are spliced together when read from the storage units, and completes the calculation on the full values.
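The two modes can be contrasted in a few lines, using integer bit masks as stand-ins for the precise and imprecise storage units; the bit width is an illustrative assumption.

```python
# Sketch of the two computation modes described above. Mode (1) operates on
# important bits only; mode (2) splices important and non-important bits on
# read and operates on the full values. LOW_BITS is an assumed width.

LOW_BITS = 4  # non-important bits per value (illustrative)

def important(v):
    return (v >> LOW_BITS) << LOW_BITS

def non_important(v):
    return v & ((1 << LOW_BITS) - 1)

def mode1(neurons, weights):
    # Mode (1): read only the precise storage unit, compute approximately.
    return sum(important(n) * important(w) for n, w in zip(neurons, weights))

def mode2(neurons, weights):
    # Mode (2): splice both halves on read, compute the exact inner product.
    return sum((important(n) | non_important(n)) *
               (important(w) | non_important(w))
               for n, w in zip(neurons, weights))
```

Mode (1) trades a small approximation error for fewer memory reads, while mode (2) recovers the exact values; for non-negative fixed-point values the mode (1) result never exceeds the mode (2) result.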
- the data processing device further includes a preprocessing module for preprocessing the original input data and transmitting it to the storage device.
- the preprocessing includes segmentation, Gaussian filtering, binarization, regularization, normalization, etc.
- the data processing device further includes an instruction cache, an input neuron hierarchical cache, a weight hierarchical cache, and an output neuron hierarchical cache. The instruction cache is provided between the hierarchical storage device and the instruction control unit to store dedicated instructions. The input neuron hierarchical cache is set between the storage device and the non-precision arithmetic unit to cache the input neurons;
- it includes an input neuron precise cache and an input neuron imprecise cache, which cache the important bits and non-important bits of the input neurons, respectively. The weight hierarchical cache is set between the storage device and the non-precision arithmetic unit to cache weight data; it includes a weight precise cache and a weight imprecise cache, which cache the important bits and non-important bits of the weights, respectively. The output neuron hierarchical cache is set between the storage device and the non-precision arithmetic unit to cache the output neurons;
- it includes an output neuron precise cache and an output neuron imprecise cache, which cache the important bits and non-important bits of the output neurons, respectively.
- the data processing device further includes a direct memory access (DMA) unit, used to read or write data or instructions in the storage device, the instruction cache, the weight hierarchical cache, the input neuron hierarchical cache, and the output neuron hierarchical cache.
- the inexact operation unit includes, but is not limited to, three parts: the first part is a multiplier, the second part is an addition tree, and the third part is an activation function unit.
- the input data (in1) is accumulated step by step through the addition tree and added to the input data (in2) to obtain the output data (out).
- the non-precision computing unit may also include a pooling unit, which performs a pooling operation on the input data (in) to obtain the output data (out).
- the process is out = pool(in), where pool is the pooling operation, including but not limited to average pooling, maximum pooling, and median pooling; the input data in belongs to a pooling kernel related to the output out.
- the operation performed by the non-precision arithmetic unit includes several parts.
- the first part multiplies the input data 1 and the input data 2 to obtain the multiplied data;
- the second part performs the addition-tree operation, adding the input data 1 step by step through the addition tree, or adding the input data 1 through the addition tree and then adding the input data 2, to obtain the output data;
- the third part performs the activation function operation, obtaining the output data through the activation (active) function.
- the operations of the above parts can be freely combined to achieve various functions.
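The freely combinable parts named above (multiplier, addition tree, activation function unit, and the optional pooling unit) can be sketched as plain functions; their composition below is one illustrative combination, not the hardware's fixed pipeline.

```python
# Pure-Python stand-ins for the inexact operation unit's parts: elementwise
# multiply, a pairwise (tree) reduction, an activation function, and pooling.

def multiply(in1, weights):
    return [a * b for a, b in zip(in1, weights)]

def addition_tree(values):
    # Pairwise tree reduction rather than a serial left-to-right sum.
    while len(values) > 1:
        values = [values[i] + values[i + 1] if i + 1 < len(values) else values[i]
                  for i in range(0, len(values), 2)]
    return values[0]

def relu(x):
    return x if x > 0 else 0.0

def max_pool(window):
    return max(window)  # out = pool(in) with maximum pooling

# One possible combination: multiply, reduce through the tree, activate.
out = relu(addition_tree(multiply([1.0, -2.0, 3.0, 4.0], [0.5, 0.5, 0.5, 0.5])))
```
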
- the data processing device of the present application can make full use of the approximate storage technology, and fully exploit the fault tolerance of the neural network to reduce the calculation amount of the neural network and the memory access amount of the neural network, thereby reducing the calculation energy consumption and memory access energy consumption.
- Through the use of dedicated SIMD instructions and customized computing units for multi-layer artificial neural network operations, the problems of insufficient CPU and GPU computing performance and large front-end decoding overhead are solved, and support for multi-layer artificial neural network computing algorithms is effectively improved;
- Through the use of a dedicated on-chip cache for multi-layer artificial neural network algorithms, the reuse of input neuron and weight data is fully exploited, avoiding repeatedly reading these data from memory, reducing memory-access bandwidth, and preventing memory bandwidth from becoming a bottleneck for the performance of multi-layer artificial neural network operations and training algorithms.
- the data processing device may include a non-neural-network processor, for example a general-purpose arithmetic processor; general-purpose arithmetic has corresponding general-purpose arithmetic instructions and data, for example scalar arithmetic operations, scalar logic operations, etc.
- a general-purpose operation processor includes, for example but not limited to, one or more multipliers, one or more adders, and performs basic operations such as addition and multiplication.
- the computing device 100 is presented in the form of a module.
- A module here may refer to an application-specific integrated circuit (ASIC), a processor and memory executing one or more software or firmware programs, an integrated logic circuit, and/or other devices that can provide the above functions.
- the above controller unit 1029 and arithmetic unit 1039 may be implemented by the devices shown in FIGS. 2-2 to 2-13.
- a computing device for performing machine learning calculations.
- the computing device includes: a controller unit 11 and an arithmetic unit 12, wherein the controller unit 11 is connected to the arithmetic unit 12;
- the arithmetic unit 12 includes: a master processing circuit and a plurality of slave processing circuits;
- the controller unit 11 is used to obtain input data and calculation instructions; in an optional solution, specifically, the input data and calculation instructions may be obtained through a data input and output unit, and the data input and output unit may specifically be one or more data I/O interfaces or I/O pins.
- the above calculation instructions include but are not limited to: forward operation instructions or reverse training instructions, or other neural network operation instructions, etc., such as convolution operation instructions.
- the specific implementation of the present application does not limit the specific expression form of the above calculation instructions.
- the controller unit 11 is further configured to parse the calculation instruction to obtain a plurality of calculation instructions, and send the plurality of calculation instructions and the input data to the main processing circuit;
- the main processing circuit 101 is configured to perform pre-processing on the input data and transfer data and operation instructions with the multiple slave processing circuits;
- a plurality of slave processing circuits 102 configured to execute intermediate operations in parallel based on data transmitted from the master processing circuit and operation instructions to obtain a plurality of intermediate results, and transmit the plurality of intermediate results to the master processing circuit;
- the main processing circuit 101 is configured to perform subsequent processing on the plurality of intermediate results to obtain the calculation result of the calculation instruction.
- the technical solution provided in this application sets the computing unit to a master multi-slave structure.
- it can split the data according to the calculation instructions of the forward operation, so that multiple slave processing circuits can perform parallel operations on the computation-intensive parts, thereby increasing the operation speed, saving operation time, and further reducing power consumption.
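As an illustration of this master-multi-slave split (a pure-software sketch under assumed names, not the hardware itself), the "master" below distributes weight rows among several "slave" workers, each slave computes its partial result against the broadcast input, and the master reassembles the intermediate results:

```python
# Sketch of the master-multi-slave structure: the master splits the distribution
# data (weight rows) among slave workers and recombines their intermediate results.
from concurrent.futures import ThreadPoolExecutor

def slave_matvec(weight_rows, x):
    # each slave multiplies its block of weight rows by the broadcast input x
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight_rows]

def master_forward(W, x, num_slaves=4):
    # split the weight matrix into row blocks (the "distribution data")
    blocks = [W[i::num_slaves] for i in range(num_slaves)]
    order = [list(range(i, len(W), num_slaves)) for i in range(num_slaves)]
    with ThreadPoolExecutor(max_workers=num_slaves) as pool:
        partials = list(pool.map(slave_matvec, blocks, [x] * num_slaves))
    # master reassembles the intermediate results into output order
    y = [0] * len(W)
    for idx_list, part in zip(order, partials):
        for idx, val in zip(idx_list, part):
            y[idx] = val
    return y

W = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
print(master_forward(W, [1, 1]))  # same as a direct matrix-vector product
```

The split shown here is by rows purely for illustration; the point is that the computation-heavy part runs on the slaves while the master only distributes and recombines.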
- the above machine learning calculation may specifically include: artificial neural network operation
- the above input data may specifically include: input neuron data and weight data.
- the above calculation result may specifically be: the result of the operation of the artificial neural network outputs the neuron data.
- the operation in the neural network may be one layer of operation in the neural network.
- the implementation process is as follows: in the forward operation, when the previous layer of the artificial neural network has completed, the operation instruction of the next layer will use the output neurons calculated in the arithmetic unit as the input neurons of the next layer (or perform some operations on the output neurons and then use them as the input neurons of the next layer), and at the same time the weights are replaced with the weights of the next layer; in the reverse operation, when the reverse operation of the previous layer of the artificial neural network has completed, the operation instruction of the next layer will use the input neuron gradients calculated in the arithmetic unit as the output neuron gradients of the next layer (or perform some operations on the input neuron gradients and then use them as the output neuron gradients of the next layer), and likewise replace the weights with the weights of the next layer.
- the above machine learning calculation may also include support vector machine operation, k-nearest neighbor (k-nn) operation, k-mean (k-means) operation, principal component analysis operation and so on.
- the input neurons and output neurons of the multi-layer operations do not refer to the neurons in the input layer and the output layer of the entire neural network; rather, for any two adjacent layers in the network, the neurons in the lower layer of the network forward operation are the input neurons, and the neurons in the upper layer of the network forward operation are the output neurons.
- Taking a network with L layers, K = 1, 2, ..., L-1: for the Kth layer and the (K+1)th layer, the Kth layer is called the input layer, where the neurons are the input neurons, and the (K+1)th layer is called the output layer, where the neurons are the output neurons.
- each layer can be used as an input layer, and the next layer is the corresponding output layer.
- the above computing device may further include the storage unit 10 and the direct memory access unit 50.
- the storage unit 10 may include one or any combination of registers and caches. Specifically, the cache is used to store the calculation instruction; the register is used to store the input data and scalars; the cache is a high-speed temporary storage cache.
- the direct memory access unit 50 is used to read or store data from the storage unit 10.
- the controller unit includes: an instruction storage unit 110, an instruction processing unit 111, and a storage queue unit 113;
- the instruction storage unit 110 is used to store calculation instructions associated with the artificial neural network operation
- the instruction processing unit 111 is configured to parse the calculation instruction to obtain multiple operation instructions
- the storage queue unit 113 is configured to store an instruction queue, and the instruction queue includes a plurality of operation instructions or calculation instructions to be executed in the order of the queue.
- the main operation processing circuit may also include a controller unit, and the controller unit may include a main instruction processing unit, which is specifically used to decode instructions into microinstructions.
- the slave operation processing circuit may also include another controller unit, and the other controller unit includes a slave instruction processing unit, which is specifically used to receive and process microinstructions.
- the above microinstruction may be the next level instruction of the instruction.
- the microinstruction can be obtained by splitting or decoding the instruction, and can be further decoded into control signals of each component, each unit, or each processing circuit.
- the structure of the calculation instruction may be as shown in the following table.
- the calculation instruction may include: one or more operation fields and an operation code.
- the calculation instruction may include a neural network operation instruction. Taking neural network operation instructions as an example, as shown in Table 1, register number 0, register number 1, register number 2, register number 3, and register number 4 may be operation domains, and each of these register numbers may be the number of one or more registers.
- the above register may be an off-chip memory. Of course, in practical applications, it may also be an on-chip memory for storing data.
- controller unit may further include:
- the dependency processing unit 108 is configured to determine, when there are multiple operation instructions, whether the first operation instruction is associated with the zeroth operation instruction preceding it; if the first operation instruction is associated with the zeroth operation instruction, the first operation instruction is cached in the instruction storage unit, and after the execution of the zeroth operation instruction is completed, the first operation instruction is extracted from the instruction storage unit and transmitted to the arithmetic unit;
- the determining whether there is an association relationship between the first operation instruction and the zeroth operation instruction before the first operation instruction includes:
- Extracting, according to the first operation instruction, the first storage address interval of the data (such as a matrix) required by the first operation instruction, and extracting, according to the zeroth operation instruction, the zeroth storage address interval of the data required by the zeroth operation instruction; if the first storage address interval overlaps the zeroth storage address interval, it is determined that the first operation instruction and the zeroth operation instruction are associated; if the first storage address interval does not overlap the zeroth storage address interval, it is determined that the first operation instruction and the zeroth operation instruction are not associated.
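The interval-overlap test described above can be sketched as follows (representing an instruction as a list of operand address intervals is an assumption made purely for illustration):

```python
# Dependency check: two instructions are associated when the storage address
# intervals of their operands overlap, so the later one must wait.
def intervals_overlap(first, zeroth):
    # each interval is (start_address, end_address), end exclusive
    (s1, e1), (s0, e0) = first, zeroth
    return s1 < e0 and s0 < e1

def has_dependency(first_instr, zeroth_instr):
    # an instruction here is just a list of operand address intervals
    return any(intervals_overlap(a, b) for a in first_instr for b in zeroth_instr)

# overlapping operand ranges -> the first instruction must be cached until
# the zeroth instruction finishes
assert has_dependency([(100, 200)], [(150, 250)])
# disjoint ranges -> the instructions can be issued independently
assert not has_dependency([(0, 100)], [(100, 200)])
```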
- the arithmetic unit 12 may include a master processing circuit 101 and multiple slave processing circuits 102.
- a plurality of slave processing circuits are distributed in an array; each slave processing circuit is connected to the adjacent slave processing circuits, and the master processing circuit is connected to K of the plurality of slave processing circuits, where the K slave processing circuits are: the p slave processing circuits in the first row, the p slave processing circuits in the qth row, and the q slave processing circuits in the first column.
- the K slave processing circuits are the slave processing circuits directly connected to the master processing circuit among the plurality of slave processing circuits.
- K slave processing circuits are used to transfer data and instructions between the master processing circuit and the plurality of slave processing circuits.
- the main processing circuit may further include one or any combination of a conversion processing circuit 114, an activation processing circuit 115, and an addition processing circuit 116;
- the conversion processing circuit 114 is used to perform the interchange between a first data structure and a second data structure (such as conversion between continuous data and discrete data) on the data block or intermediate result received by the main processing circuit, or to perform the interchange between a first data type and a second data type (such as conversion between fixed-point and floating-point types) on the data block or intermediate result received by the main processing circuit;
- the activation processing circuit 115 is used to execute the activation operation of the data in the main processing circuit
- the addition processing circuit 116 is used to perform an addition operation or an accumulation operation.
- the main processing circuit is used to determine that the input neurons are broadcast data and the weights are distribution data, to split the distribution data into multiple data blocks, and to send at least one of the multiple data blocks and at least one of the multiple operation instructions to the slave processing circuits;
- the plurality of slave processing circuits are configured to perform an operation on the received data block according to the operation instruction to obtain an intermediate result, and transmit the operation result to the master processing circuit;
- the main processing circuit is configured to process a plurality of intermediate results sent from the processing circuit to obtain the result of the calculation instruction, and send the result of the calculation instruction to the controller unit.
- the slave processing circuit includes: a multiplication processing circuit
- the multiplication processing circuit is configured to perform a product operation on the received data block to obtain a product result
- the forwarding processing circuit (optional) is used to forward the received data block or product result.
- An accumulation processing circuit is configured to perform an accumulation operation on the product result to obtain the intermediate result.
- the operation instruction may be, for example, a matrix-multiply-matrix instruction, an accumulation instruction, an activation instruction, or another calculation instruction.
- the actual formula that needs to be executed can be: s = s(sum_i(w_i * x_i) + b), where the weights w_i are multiplied by the input data x_i and summed, then the offset b is added, and the activation operation s(h) is performed to obtain the final output result s.
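A minimal numeric sketch of this formula, using a sigmoid as the example activation s(h) (the document does not fix a particular activation; sigmoid is assumed here for illustration):

```python
# s = s(sum_i(w_i * x_i) + b): multiply weights by inputs, sum, add the
# offset b, then apply the activation s(h).
import math

def neuron(weights, x, b):
    h = sum(w * xi for w, xi in zip(weights, x)) + b
    return 1.0 / (1.0 + math.exp(-h))  # sigmoid activation, as an example

print(neuron([0.5, -0.25], [2.0, 4.0], 0.0))  # s(0.5*2 - 0.25*4) = s(0) = 0.5
```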
- the operation unit includes: a tree module 40, and the tree module includes: a root port 401 and a plurality of branch ports 404; the root port of the tree module is connected to the master processing circuit, and the multiple branch ports of the tree module are respectively connected to one of the slave processing circuits among the plurality of slave processing circuits;
- the tree module has sending and receiving functions: in one case the tree module performs the sending function, and in another case the tree module performs the receiving function.
- the tree module is used to forward data blocks, weights, and operation instructions between the master processing circuit and the multiple slave processing circuits.
- the tree module is an optional component of the computing device, and it may include at least one layer of nodes.
- the node has a line structure with a forwarding function, and the node itself may not have a computing function. If the tree module has zero-level nodes, the tree module is not required.
- the tree module may be a p-ary tree structure, for example, a binary tree structure as shown in FIGS. 2-7, or of course a ternary tree structure; p may be an integer greater than or equal to 2.
- the specific implementation of the present application does not limit the specific value of the above-mentioned p.
- the above-mentioned number of layers may also be 2.
- the slave processing circuits may also be connected to nodes of layers other than the nodes of the penultimate layer, for example, the nodes of the last layer shown.
- the operation unit may carry a separate buffer, as shown in FIGS. 2-8, and may include: a neuron buffer unit, where the neuron buffer unit 63 buffers the input neuron vector data and the output neuron value data of the slave processing circuit.
- the operation unit may further include a weight buffer unit 64 for buffering weight data required by the slave processing circuit in the calculation process.
- the arithmetic unit 12 may include a branch processing circuit 103; its specific connection structure is shown in FIG. 2-3, where,
- the main processing circuit 101 is connected to the branch processing circuit 103 (one or more), and the branch processing circuit 103 is connected to one or more slave processing circuits 102;
- the branch processing circuit 103 is used to perform forwarding of data or instructions between the main processing circuit 101 and the slave processing circuit 102.
- the controller unit obtains the input neuron matrix x, the weight matrix w and the fully connected operation instruction from the storage unit, and transmits the input neuron matrix x, the weight matrix w and the fully connected operation instruction to the main processing circuit;
- the main processing circuit determines the input neuron matrix x as broadcast data, determines the weight matrix w as distribution data, splits the weight matrix w into 8 sub-matrices, then distributes the 8 sub-matrices to the 8 slave processing circuits through the tree module, and broadcasts the input neuron matrix x to the 8 slave processing circuits,
- the slave processing circuit executes the multiplication and accumulation operations of 8 sub-matrices and input neuron matrix x in parallel to obtain 8 intermediate results, and sends the 8 intermediate results to the main processing circuit;
- the main processing circuit is used to sort the 8 intermediate results to obtain the operation result of wx, perform the offset-b operation and then the activation operation on this result to obtain the final result y, and send the final result y to the controller unit
- the final result y is output or stored in the storage unit.
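The fully connected flow above can be sketched in software as follows (a sketch under assumptions: ReLU is used purely as an example activation, and the 8 "slaves" are modeled as independent block computations rather than real hardware circuits):

```python
# Fully connected layer in the master/slave style: split w into 8 row blocks
# (sub-matrices), compute each block times the broadcast input x independently,
# then reassemble, add the offset b, and apply the activation.
def fc_forward(w, x, b):
    n = len(w)
    blocks = [w[i * n // 8:(i + 1) * n // 8] for i in range(8)]  # 8 sub-matrices
    intermediates = [
        [sum(wij * xj for wij, xj in zip(row, x)) for row in blk]
        for blk in blocks
    ]  # 8 intermediate results (in hardware: computed in parallel by the slaves)
    wx = [v for part in intermediates for v in part]  # reassemble in order
    return [max(0.0, v + bi) for v, bi in zip(wx, b)]  # offset + ReLU (assumed)

w = [[1] * 4 for _ in range(8)]  # 8 output neurons, 4 inputs
x = [1, 2, 3, 4]
b = [-20] * 8
print(fc_forward(w, x, b))  # each row of wx is 10; 10 - 20 -> ReLU -> 0.0
```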
- the method for the computing device shown in Figure 2-2 to execute the neural network forward operation instruction may specifically be:
- the controller unit extracts the neural network forward operation instruction, the operation domain corresponding to the neural network operation instruction and at least one operation code from the instruction storage unit, and the controller unit transmits the operation domain to the data access unit and sends the at least one operation code To the arithmetic unit.
- the controller unit extracts the weight w and offset b corresponding to the operation domain from the storage unit (when b is 0, there is no need to extract the offset b), and transmits the weight w and offset b to the main processing of the arithmetic unit Circuit, the controller unit extracts the input data Xi from the storage unit and sends the input data Xi to the main processing circuit.
- the main processing circuit determines the multiplication operation according to the at least one operation code, determines the input data Xi as broadcast data, determines the weight data as distribution data, and splits the weight w into p data blocks;
- the instruction processing unit of the controller unit determines the multiplication instruction, the offset instruction, and the accumulation instruction according to the at least one operation code, and sends them to the main processing circuit; the main processing circuit broadcasts the multiplication instruction and the input data Xi to the multiple slave processing circuits, and distributes the p data blocks to the multiple slave processing circuits (for example, with p slave processing circuits, each slave processing circuit receives one data block); the multiple slave processing circuits are used to perform multiplication on the input data Xi and the received data block according to the multiplication instruction to obtain intermediate results, and send the intermediate results to the main processing circuit;
- the main processing circuit performs an accumulation operation on the intermediate results sent by the multiple slave processing circuits according to the accumulation instruction to obtain an accumulation result, adds the offset b to the accumulation result according to the offset instruction to obtain the final result, and sends the final result to the controller unit.
- the technical solution provided by this application realizes the multiplication and offset operations of the neural network through one instruction, that is, the neural network operation instruction; the intermediate results of the neural network calculation do not need to be stored or extracted, reducing the storage and extraction operations on intermediate data. It therefore has the advantages of reducing the corresponding operation steps and improving the calculation efficiency of the neural network.
- This application also discloses a machine learning computing device, which includes one or more of the computing devices mentioned in this application, for obtaining data to be operated on and control information from other processing devices, performing the specified machine learning operations, and transferring the execution result to peripheral devices through the I/O interface.
- Peripheral equipment such as camera, monitor, mouse, keyboard, network card, wifi interface, server.
- the computing devices can link and transmit data through a specific structure, for example, interconnect and transmit data through the PCIE bus to support larger-scale machine learning operations.
- the interconnection method can be any interconnection topology.
- the machine learning computing device has high compatibility, and can be connected with various types of servers through the PCIE interface.
- the present application also discloses a combined processing device, which includes the above-mentioned machine learning computing device, a universal interconnection interface, and other processing devices.
- the machine learning computing device interacts with other processing devices to complete the operation specified by the user.
- FIGS. 2-10 are schematic diagrams of the combined processing device.
- Other processing devices include one or more types of general-purpose/special-purpose processors, such as a central processing unit (CPU), a graphics processing unit (GPU), or a neural network processor.
- the number of processors included in other processing devices is not limited.
- Other processing devices serve as an interface between the machine learning computing device and external data and control, including data handling, to complete the basic control of starting and stopping the machine learning computing device; other processing devices can also cooperate with the machine learning computing device to complete the computing task.
- a universal interconnection interface is used to transfer data and control instructions between the machine learning computing device and other processing devices.
- the machine learning computing device obtains the required input data from other processing devices and writes them into the on-chip storage device of the machine learning computing device; it can obtain control instructions from other processing devices and write them into the control cache of the machine learning computing device; also The data in the storage module of the machine learning computing device can be read and transmitted to other processing devices.
- the structure may further include a storage device, which is respectively connected to the machine learning operation device and the other processing device.
- the storage device is used to store data stored in the machine learning computing device and the other processing device, and is particularly suitable for data that cannot be saved in the internal storage of the machine learning computing device or other processing device.
- the combined processing device can be used as an SOC on-chip system for mobile phones, robots, drones, video surveillance equipment, etc., effectively reducing the core area of the control part, increasing processing speed, and reducing overall power consumption.
- the general interconnection interface of the combined processing device is connected to some components of the device. Some components such as camera, monitor, mouse, keyboard, network card, wifi interface.
- a chip is also disclosed, which includes the aforementioned machine learning computing device or combined processing device.
- a chip packaging structure is disclosed, which includes the above chip.
- a board card is disclosed, which includes the above chip packaging structure.
- FIG. 2-13 provides a board card.
- the board card may also include other supporting components.
- the supporting components include but are not limited to: a storage device 390 and an interface device 391. And control device 392;
- the storage device 390 is connected to the chip in the chip packaging structure through a bus, and is used to store data.
- the storage device may include multiple sets of storage units 393. Each group of the storage unit and the chip are connected by a bus. It can be understood that each group of memory cells may be DDR SDRAM (English: Double Data Rate SDRAM, double rate synchronous dynamic random access memory).
- DDR can double the speed of SDRAM without increasing the clock frequency. DDR allows data to be read on the rising and falling edges of the clock pulse. DDR is twice as fast as standard SDRAM.
- the storage device may include 4 groups of the storage unit. Each group of the memory cells may include multiple DDR4 granules (chips). In one embodiment, the chip may include four 72-bit DDR4 controllers; among the 72 bits of each controller, 64 bits are used for data transmission and 8 bits are used for ECC checking. It can be understood that when DDR4-3200 granules are used in each group of the memory cells, the theoretical bandwidth of data transmission can reach 25600 MB/s.
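The quoted 25600 MB/s figure follows directly from the transfer rate and the data-bus width:

```python
# Theoretical DDR4-3200 bandwidth per 72-bit controller: 3200 million transfers
# per second, with 64 data bits (8 bytes) moved per transfer (8 bits are ECC).
transfers_per_second = 3200 * 10**6  # DDR4-3200: 3200 MT/s
data_bits = 64                       # 72-bit controller: 64 data + 8 ECC bits
bandwidth_mb_s = transfers_per_second * (data_bits // 8) // 10**6
print(bandwidth_mb_s)  # 25600 MB/s
```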
- each group of the storage units includes multiple double-rate synchronous dynamic random access memories arranged in parallel.
- DDR can transfer data twice in one clock cycle.
- a controller for controlling DDR is provided in the chip for controlling data transmission and data storage of each storage unit.
- the interface device is electrically connected to the chip in the chip packaging structure.
- the interface device is used to realize data transmission between the chip and an external device (such as a server or a computer).
- the interface device may be a standard PCIE interface.
- the data to be processed is transferred from the server to the chip through a standard PCIE interface to implement data transfer.
- the interface device may also be other interfaces.
- the present application does not limit the specific expressions of the other interfaces described above, and the interface unit may implement the transfer function.
- the calculation result of the chip is still transmitted back to an external device (such as a server) by the interface device.
- the control device is electrically connected to the chip.
- the control device is used to monitor the state of the chip.
- the chip and the control device may be electrically connected through an SPI interface.
- the control device may include a microcontroller (Micro Controller Unit, MCU).
- the chip may include multiple processing chips, multiple processing cores, or multiple processing circuits, and may drive multiple loads. Therefore, the chip may be in different working states such as heavy load and light load.
- the control device can regulate the working states of the multiple processing chips, multiple processing cores, and/or multiple processing circuits in the chip.
- an electronic device is disclosed, which includes the above-mentioned board card.
- Electronic equipment includes data processing devices, robots, computers, printers, scanners, tablets, smart terminals, mobile phones, driving recorders, navigators, sensors, webcams, servers, cloud servers, cameras, video cameras, projectors, watches, headphones, mobile storage, wearable devices, vehicles, household appliances, and/or medical devices.
- the vehicles include airplanes, ships, and / or vehicles;
- the household appliances include televisions, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lights, gas stoves, and range hoods; and
- the medical equipment includes nuclear magnetic resonance instruments, B-ultrasound and / or electrocardiograph.
- In the master + interconnection module + slave architecture, accumulation can also be performed in the interconnection module (for example, a K-ary tree, as shown in FIGS. 2-7).
- the multiplier in the slave operation module may be a parallel multiplier or a serial multiplier. Because this patent divides bits into important bits and non-important bits, the bit width of the important bits is variable. For example, with a total of 16 bits, the important bits may be 3, 5, or 8 bits. Therefore, using a parallel multiplier requires a full 16x16 multiplication, which is wasteful. Conversely, if a serial multiplier is used, 3-, 5-, or 8-bit multiplications can be performed with only part of the multiplier hardware, and the power consumption is more favorable.
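A shift-add (serial) multiplier of the kind described can be sketched as follows; the point is that the number of add/shift cycles tracks the important-bit width (3, 5, or 8) rather than the full 16-bit width:

```python
# Serial (shift-add) multiplication: the multiplier is consumed one bit per
# cycle, so a weight whose important part is only w1 bits wide needs just w1
# add/shift steps instead of a full 16x16 parallel array.
def serial_multiply(a, b, bit_width):
    acc = 0
    for i in range(bit_width):  # one cycle per multiplier bit
        if (b >> i) & 1:
            acc += a << i       # add the shifted multiplicand
    return acc

# a 5-bit "important" weight only needs 5 serial steps
assert serial_multiply(300, 0b10110, 5) == 300 * 0b10110
```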
- FIG. 2-14 is a schematic flowchart of a calculation method provided by an embodiment of the present invention. As shown in Figure 2-14, the method includes:
- the positions of the n1 important bits are continuous or discontinuous.
- the weight is represented by W
- the weight includes w bits, where w1 bits are important bits and w2 bits are non-important bits
- w1 + w2 = w
- W = W1 + W2
- w is a positive integer
- w1 is a natural number and less than w.
- the positions of the n1 important bits are continuous or discontinuous.
- N_out is the output neuron
- N1_in(i) is the important bits of the i-th input neuron
- N2_in(i) is the non-important bits of the i-th input neuron
- W1(i) is the important bits of the i-th weight
- W2(i) is the non-important bits of the i-th weight
- N_in(i) represents the value of the i-th input neuron
- W(i) represents the value of the i-th weight
- N_in(i) = N1_in(i) + N2_in(i)
- W(i) = W1(i) + W2(i);
- if the first operation result is not greater than the preset threshold, the operation of the output neuron is skipped; if the first operation result is greater than the preset threshold, the input neurons and the non-important bits are operated on to obtain a second operation result, and the sum of the first operation result and the second operation result is used as the output neuron; this may include several sub-steps.
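The thresholded skipping scheme can be sketched as follows. The decomposition of the second result into cross terms is one consistent choice, assumed here for illustration so that the sum of the two partial results equals the exact product; the threshold value and function names are likewise illustrative:

```python
# Skip scheme: first compute with only the important bits of inputs and
# weights; if that partial result does not exceed the threshold, skip the
# rest of the work for this output neuron.
def output_neuron(n1_in, n2_in, w1, w2, threshold):
    # first operation result: important bits of neurons x important bits of weights
    first = sum(n1 * wi1 for n1, wi1 in zip(n1_in, w1))
    if first <= threshold:
        return 0.0  # operation skipped
    # second operation result: the remaining cross terms, chosen so that
    # first + second == sum_i(N_in(i) * W(i)) exactly
    second = sum(n1 * wi2 + n2 * (wi1 + wi2)
                 for n1, n2, wi1, wi2 in zip(n1_in, n2_in, w1, w2))
    return first + second

# with a low threshold the exact product is recovered: (2+1)*7 + (4+1)*9 = 66
assert output_neuron([2, 4], [1, 1], [4, 8], [3, 1], threshold=0) == 66
# the first partial result (40) does not exceed a high threshold, so skip
assert output_neuron([2, 4], [1, 1], [4, 8], [3, 1], threshold=100) == 0.0
```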
- An embodiment of the present invention also provides a computer storage medium, wherein the computer storage medium stores a computer program for electronic data exchange, which causes a computer to perform some or all of the steps of any method described in the above method embodiments.
- the above computer includes electronic devices.
- An embodiment of the present invention also provides a computer program product, where the computer program product includes a non-transitory computer-readable storage medium that stores a computer program, and the computer program enables a computer to execute some or all of the steps of any method described in the above method embodiments.
- the computer program product may be a software installation package, and the computer includes an electronic device.
- the disclosed device may be implemented in other ways.
- the device embodiments described above are merely illustrative.
- the division of the unit is only a logical function division.
- in actual implementation there may be another division manner; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
- the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical or other forms.
- the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
- each functional unit in each embodiment of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
- the above integrated unit can be implemented in the form of hardware.
opcode | register or immediate | register/immediate | ... |
Claims (29)
- A distribution system for machine learning operations, characterized by comprising: a terminal server and a cloud server; the terminal server is configured to generate a corresponding operation task according to demand information, to select, according to the operation task and hardware performance parameters of the terminal server, a first machine learning algorithm to run on the terminal server, and to select, according to the operation task and hardware performance parameters of the cloud server, a second machine learning algorithm to run on the cloud server; and to generate a terminal server control instruction according to the first machine learning algorithm and the operation task, and generate a cloud server control instruction according to the second machine learning algorithm and the operation task.
- 根据权利要求1所述的机器学习运算的分配系统,其特征在于,所述终端服务器还用于对所述终端服务器控制指令进行解析得到终端控制信号,并根据所述终端控制信号计算对应的每个阶段的第一机器学习算法的运算任务以得到终端运算结果,以及将所述云端服务器控制指令发送至所述云端服务器。
- 根据权利要求1所述的机器学习运算的分配系统,其特征在于,所述云端服务器用于接收所述云端服务器控制指令,对所述云端服务器控制指令进行解析得到云端控制信号,并根据所述云端控制信号计算对应的每个阶段的第二机器学习算法的运算任务以得到云端运算结果。
- 根据权利要求1所述的机器学习运算的分配系统,其特征在于,所述硬件性能参数包括运算能力,所述终端服务器根据所述运算任务和终端服务器的硬件性能参数选取在所述终端服务器运行的第一机器学习算法,以及根据所述运算任务和云端服务器的硬件性能参数选取在所述云端服务器运行的第二机器学习算法,包括:获取所述终端服务器的运算能力和所述云端服务器的运算能力;根据所述运算任务和所述终端服务器的运算能力选取第一机器学习算法,以及根据所述运算任务和所述云端服务器的运算能力选取第二机器学习算法。
- 根据权利要求1所述的机器学习运算的分配系统,其特征在于,所述第一机器学习算法包括第一神经网络模型,所述第二机器学习算法包括第二神经网络模型。
- 根据权利要求1-5任一所述的机器学习运算的分配系统,其特征在于,所述终端服务器还用于将所述终端运算结果输出后,在接收到停止运算指令时,发送所述停止运算指令至所述云端服务器,以终止所述云端服务器的运算工作。
- 根据权利要求1-5任一所述的机器学习运算的分配系统,其特征在于,所述终端服务器包括终端控制器单元、终端运算单元和终端通信单元;所述终端控制器单元分别与所述终端运算单元和所述终端通信单元连接;其中,所述终端控制器单元用于获取需求信息、所述终端服务器的硬件性能参数和所述云端服务器的硬件性能参数;根据所述需求信息生成对应的运算任务,并根据所述运算任务和所述终端服务器的硬件性能参数选取在所述终端服务器运行的第一机器学习算法,以及根据所述运算任务和所述云端服务器的硬件性能参数选取在所述云端服务器运行的第二机器学习算法;根据所述第一机器学习算法和所述运算任务生成终端服务器控制指令,以及根据所述第二机器学习算法和所述运算任务生成云端服务器控制指令,并对所述终端服务器控制指令进行解析得到终端控制信号;所述终端运算单元用于根据所述终端控制信号计算对应的第一机器学习算法的运算任务 以得到终端运算结果;所述终端通信单元用于将所述云端服务器控制指令发送至所述云端服务器。
- 根据权利要求7所述的机器学习运算的分配系统,其特征在于,所述云端服务器包括云端控制器单元、云端运算单元和云端通信单元;所述云端控制器单元分别与所述云端运算单元和所述云端通信单元连接,所述云端通信单元与所述终端通信单元通信连接,用于在所述云端服务器与所述终端服务器之间进行数据交互;其中,所述云端通信单元用于接收所述云端服务器控制指令,并将所述云端服务器控制指令发送至所述云端控制器单元,以及获取云端运算结果并发送至所述终端服务器;所述云端控制器单元用于接收所述云端服务器控制指令,对所述云端服务器控制指令进行解析得到云端控制信号;所述云端运算单元用于根据所述云端控制信号计算对应的第二机器学习算法的运算任务以得到云端运算结果,并将所述云端运算结果通过所述云端通信单元发送至所述终端服务器。
- 根据权利要求8所述的机器学习运算的分配系统,其特征在于,所述终端运算单元或所述云端运算单元包括:一个主处理电路和多个从处理电路;所述终端控制器单元或所述云端控制器单元,用于获取输入数据以及计算指令;所述终端控制器单元或所述云端控制器单元,还用于解析该计算指令得到多个运算指令,将该多个运算指令以及所述输入数据发送给所述主处理电路;所述主处理电路,用于对所述输入数据执行前序处理以及与所述多个从处理电路之间传输数据和运算指令;所述多个从处理电路,用于依据从所述主处理电路传输的数据以及运算指令并行执行中间运算得到多个中间结果,并将多个中间结果传输给所述主处理电路;所述主处理电路,用于对所述多个中间结果执行后续处理得到所述计算指令的计算结果。
- 根据权利要求9所述的机器学习运算的分配系统,其特征在于,所述主处理电路包括:依赖关系处理单元;所述依赖关系处理单元,用于确定第一运算指令与所述第一运算指令之前的第零运算指令是否存在关联关系,如所述第一运算指令与所述第零运算指令存在关联关系,将所述第一运算指令缓存在所述指令存储单元内,在所述第零运算指令执行完毕后,从所述指令存储单元提取所述第一运算指令传输至所述运算单元;所述确定该第一运算指令与第一运算指令之前的第零运算指令是否存在关联关系包括:依据所述第一运算指令提取所述第一运算指令中所需数据的第一存储地址区间,依据所述第零运算指令提取所述第零运算指令中所需数据的第零存储地址区间,如所述第一存储地址区间与所述第零存储地址区间具有重叠的区域,确定所述第一运算指令与所述第零运算指令具有关联关系,如所述第一存储地址区间与所述第零存储地址区间不具有重叠的区域,确定所述第一运算指令与所述第零运算指令不具有关联关系。
- 根据权利要求8所述的机器学习运算的分配系统,其特征在于,所述终端运算单元或所述云端运算单元还包括:树型模块,所述树型模块包括:一个根端口和多个支端口,所述树型模块的根端口连接所述主处理电路,所述树型模块的多个支端口分别连接多个从处理电路中的一个从处理电路;所述树型模块,用于转发所述主处理电路与所述多个从处理电路之间的数据块、权值以及运算指令。
- 根据权利要求9所述的机器学习运算的分配系统,其特征在于,所述多个从处理电路呈阵列分布;每个从处理电路与相邻的其他从处理电路连接,所述主处理电路连接所述多个从处理电路中的k个从处理电路,所述k个基础电路为:第1行的n个从处理电路、第m行的n个从处理电路以及第1列的m个从处理电路;所述K个从处理电路,用于在所述主处理电路以及多个从处理电路之间的数据以及指令的转发;所述主处理电路,用于确定输入神经元为广播数据,权值为分发数据,将一个输入数据分发数据分配成多个数据块,将所述多个数据块中的至少一个数据块以及多个运算指令中的至少一个运算指令发送给所述K个从处理电路;所述K个从处理电路,用于转换所述主处理电路与所述多个从处理电路之间的数据;所述多个从处理电路,用于依据该运算指令对接收到的数据块执行运算得到中间结果,并将运算结果传输给所述K个从处理电路;所述主处理电路,用于将所述K个从处理电路发送的中间结果进行后续处理得到该计算指令的结果,将该计算指令的结果发送给所述控制器单元。
- 根据权利要求7所述的机器学习运算的分配系统,其特征在于,所述终端服务器还包括终端存储单元;所述终端存储单元分别与所述终端控制器单元、所述终端运算单元连接,用于接收所述终端服务器的输入数据并存储。
- 根据权利要求8所述的机器学习运算的分配系统,其特征在于,所述云端服务器还包括云端存储单元;所述云端存储单元分别与所述云端控制器单元、所述云端运算单元连接,用于接收所述云端服务器的输入数据并存储。
- 根据权利要求13所述的机器学习运算的分配系统,其特征在于,所述终端控制器单元包括终端评估电路、终端指令生成电路和终端指令解析电路;所述终端指令生成电路分别与所述终端评估电路和所述终端指令解析电路连接,所述终端评估电路、所述终端指令生成电路和所述终端指令解析电路分别与所述终端运算单元、所述终端存储单元和所述终端通信单元连接;所述终端评估电路用于获取需求信息、所述终端服务器的硬件性能参数和所述云端服务器的硬件性能参数;根据所述需求信息生成对应的运算任务,并根据所述运算任务和所述终端服务器的硬件性能参数选取在所述终端服务器运行的第一机器学习算法,以及根据所述运算任务和所述云端服务器的硬件性能参数选取在所述云端服务器运行的第二机器学习算法;所述终端指令生成电路用于根据所述第一机器学习算法和所述运算任务生成终端服务器控制指令,以及根据所述第二机器学习算法和所述运算任务生成云端服务器控制指令;所述终端指令解析电路用于对所述终端服务器控制指令进行解析得到终端控制信号。
- 根据权利要求13所述的机器学习运算的分配系统,其特征在于,所述终端运算单元与所述终端通信单元连接,且所述终端存储单元与所述终端通信单元连接。
- 根据权利要求14所述的机器学习运算的分配系统,其特征在于,所述云端控制器单元包括云端指令解析电路;所述云端指令解析电路分别与所述云端运算单元、所述云端存储单元和所述云端通信单元连接。
- 根据权利要求14所述的机器学习运算的分配系统,其特征在于,所述云端运算单元与所述云端通信单元连接,且所述云端存储单元与所述云端通信单元连接。
- 一种机器学习运算的分配方法,其特征在于,包括:获取需求信息、终端服务器的硬件性能参数和云端服务器的硬件性能参数;根据所述需求信息生成对应的运算任务,并根据所述运算任务和所述终端服务器的硬件性能参数选取在所述终端服务器运行的第一机器学习算法,以及根据所述运算任务和所述云端服务器的硬件性能参数选取在所述云端服务器运行的第二机器学习算法;根据所述第一机器学习算法和所述运算任务生成终端服务器控制指令,以及根据所述第二机器学习算法和所述运算任务生成云端服务器控制指令。
- 根据权利要求19所述的机器学习运算的分配方法,其特征在于,还包括:分别对所述终端服务器控制指令和所述云端服务器控制指令进行解析,根据所述终端服务器控制指令获得终端控制信号,以及根据所述云端服务器控制指令获得云端控制信号;根据所述终端控制信号提取终端待处理数据,以及根据所述云端控制信号提取云端待处理数据;根据所述终端待处理数据计算所述终端服务器中对应的每个阶段的第一机器学习算法的运算任务以得到终端运算结果,和/或根据所述云端待处理数据计算所述云端服务器中对应的每个阶段的第二机器学习算法的运算任务以得到云端运算结果。
- 根据权利要求19所述的机器学习运算的分配方法,其特征在于,所述根据所述运算任务和所述终端服务器的硬件性能参数选取在所述终端服务器运行的第一机器学习算法,以及根据所述运算任务和所述云端服务器的硬件性能参数在所述云端服务器运行的第二机器学习算法,包括:获取所述终端服务器的运算能力和所述云端服务器的运算能力;根据所述运算任务、所述终端服务器的运算能力选取第一机器学习算法,以及根据所述运算任务、所述云端服务器的运算能力选取第二机器学习算法。
- 根据权利要求19所述的机器学习运算的分配方法,其特征在于,所述第一机器学习算法包括第一神经网络模型,所述第二机器学习算法包括第二神经网络模型。
- 根据权利要求19-22任一所述的机器学习运算的分配方法,其特征在于,还包括:将所述终端运算结果输出后,在接收到停止运算指令时,终止所述云端服务器的运算工作。
- 根据权利要求20或22所述的机器学习运算的分配方法,其特征在于,所述分别对所述终端服务器控制指令和所述云端服务器控制指令进行解析,根据所述终端服务器控制指令获得终端控制信号,以及根据所述云端服务器控制指令获得云端控制信号,包括:利用终端服务器对所述终端服务器控制指令进行解析,获得终端控制信号;根据所述终端控制信号提取相对应的终端训练数据或者终端测试数据。
- 根据权利要求20或22所述的机器学习运算的分配方法,其特征在于,所述分别对所述终端服务器控制指令和所述云端服务器控制指令进行解析,根据所述终端服务器控制指令获得终端控制信号,以及根据所述云端服务器控制指令获得云端控制信号,还包括:利用云端服务器对所述云端服务器控制指令进行解析,获得云端控制信号;根据所述云端控制信号提取相对应的云端训练数据或者云端测试数据。
- 根据权利要求24所述的机器学习运算的分配方法,其特征在于,所述根据所述终端待处理数据计算所述终端服务器中对应的每个阶段的第一机器学习算 法的运算任务以得到终端运算结果,包括:利用终端服务器并根据所述终端训练数据或者终端测试数据,计算所述终端服务器中对应的每个阶段的第一机器学习算法的运算任务以得到终端运算结果。
- 根据权利要求25所述的机器学习运算的分配方法,其特征在于,所述根据所述云端待处理数据计算所述云端服务器中对应的每个阶段的第二机器学习算法的运算任务以得到云端运算结果,包括:利用云端服务器并根据所述云端训练数据或者云端测试数据,计算所述云端服务器中对应的每个阶段的第二机器学习算法的运算任务以得到云端运算结果。
- 根据权利要求19所述的机器学习运算的分配方法,其特征在于,多个从处理电路呈阵列分布;每个从处理电路与相邻的其他从处理电路连接,主处理电路连接所述多个从处理电路中的k个从处理电路,k个基础电路为:第1行的n个从处理电路、第m行的n个从处理电路以及第1列的m个从处理电路;所述K个从处理电路在所述主处理电路以及多个从处理电路之间的数据以及指令的转发;所述主处理电路确定所述输入神经元为广播数据,权值为分发数据,将一个输入数据分发数据分配成多个数据块,将所述多个数据块中的至少一个数据块以及多个运算指令中的至少一个运算指令发送给所述K个从处理电路;所述K个从处理电路转换所述主处理电路与所述多个从处理电路之间的数据;所述多个从处理电路依据该运算指令对接收到的数据块执行运算得到中间结果,并将运算结果传输给所述K个从处理电路;所述主处理电路将所述K个从处理电路发送的中间结果进行后续处理得到该计算指令的结果,将该计算指令的结果发送给所述控制器单元。
- 根据权利要求19所述的机器学习运算的分配方法,其特征在于,运算单元还包括一个或多个分支处理电路,每个分支处理电路连接至少一个从处理电路;主处理电路确定输入神经元为广播数据,权值为分发数据,将一个输入神经元分发数据分配成多个数据块,将所述多个数据块中的至少一个数据块、权值广播数据以及多个运算指令中的至少一个运算指令发送给所述分支处理电路;所述分支处理电路转发所述主处理电路与所述多个从处理电路之间的数据块、广播数据权值以及运算指令;所述多个从处理电路依据该运算指令对接收到的数据块以及广播数据权值执行运算得到中间结果,并将中间结果传输给所述分支处理电路;所述主处理电路将分支处理电路发送的中间结果进行后续处理得到该计算指令的结果,将该计算指令的结果发送给所述控制器单元。
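The dependency test recited in claim 10 — a first operation instruction is associated with a preceding zeroth instruction exactly when the storage address intervals of the data they require overlap — can be sketched as follows. The half-open `(start, end)` interval representation and the function names are illustrative assumptions, not part of the claimed implementation:

```python
# Sketch of the claim-10 dependency test: a first operation instruction
# depends on a preceding zeroth instruction iff the storage address
# intervals of their required data overlap. Half-open (start, end)
# intervals are an assumed representation for illustration.

def intervals_overlap(a, b):
    """True iff half-open address intervals a = (start, end) and b overlap."""
    return a[0] < b[1] and b[0] < a[1]

def has_dependency(first_intervals, zeroth_intervals):
    """Check every address interval required by the first instruction
    against every interval required by the zeroth instruction."""
    return any(intervals_overlap(f, z)
               for f in first_intervals
               for z in zeroth_intervals)

# The zeroth instruction touches [0x100, 0x200); the first touches
# [0x180, 0x1C0): the regions overlap, so per claim 10 the first
# instruction would be cached until the zeroth completes.
```

When `has_dependency` returns `True`, the dependency processing unit of claim 10 would buffer the first instruction in the instruction storage unit and release it to the computing unit only after the zeroth instruction has finished executing.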
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811190161.6A CN111047045B (zh) | 2018-10-12 | 2018-10-12 | 机器学习运算的分配系统及方法 |
CN201811190161.6 | 2018-10-12 | ||
CN201811424173.0A CN111222632B (zh) | 2018-11-27 | 2018-11-27 | 计算装置、计算方法及相关产品 |
CN201811424173.0 | 2018-11-27 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2020073874A1 true WO2020073874A1 (zh) | 2020-04-16 |
Family
ID=70163774
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2019/109552 WO2020073874A1 (zh) | 2018-10-12 | 2019-09-30 | 机器学习运算的分配系统及方法 |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2020073874A1 (zh) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101784060A (zh) * | 2009-01-19 | 2010-07-21 | 华为技术有限公司 | 参数处理方法、网络诊断方法、终端、服务器和系统 |
CN103945545A (zh) * | 2014-04-15 | 2014-07-23 | 南京邮电大学 | 一种异构网络资源优化方法 |
CN104767833A (zh) * | 2015-05-04 | 2015-07-08 | 厦门大学 | 一种移动终端的计算任务的云端转移方法 |
US20160092794A1 (en) * | 2013-06-29 | 2016-03-31 | Emc Corporation | General framework for cross-validation of machine learning algorithms using sql on distributed systems |
CN106816057A (zh) * | 2017-01-25 | 2017-06-09 | 公安部上海消防研究所 | 一种虚拟消防训练系统 |
CN107943463A (zh) * | 2017-12-15 | 2018-04-20 | 清华大学 | 交互式自动化大数据分析应用开发系统 |
- 2019-09-30: WO application PCT/CN2019/109552 filed (published as WO2020073874A1); status: active, Application Filing
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109543832B (zh) | Computing device and board card | |
CN109522052B (zh) | Computing device and board card | |
WO2020073211A1 (zh) | Operation accelerator, processing method and related device | |
WO2019218896A1 (zh) | Computing method and related products | |
TWI795519B (zh) | Computing device, machine learning computing device, combined processing device, neural network chip, electronic apparatus, board card, and method for performing machine learning computation | |
US20210065005A1 (en) | Systems and methods for providing vector-wise sparsity in a neural network | |
WO2019127838A1 (zh) | Convolutional neural network implementation method and apparatus, terminal, and storage medium | |
US20210241095A1 (en) | Deep learning processing apparatus and method, device and storage medium | |
CN110163357B (zh) | Computing device and method | |
CN111105023B (zh) | Data stream reconstruction method and reconfigurable data stream processor | |
CN110059797B (zh) | Computing device and related products | |
CN111047045B (zh) | Distribution system and method for machine learning operations | |
CN111860773B (zh) | Processing device and method for information processing | |
CN111047022A (zh) | Computing device and related products | |
US11775808B2 (en) | Neural network computation device and method | |
CN111353591A (zh) | Computing device and related products | |
CN111930681A (zh) | Computing device and related products | |
WO2021082725A1 (zh) | Winograd convolution operation method and related products | |
CN109740730B (zh) | Operation method, apparatus and related products | |
CN109711538B (zh) | Operation method, apparatus and related products | |
WO2020073874A1 (zh) | Distribution system and method for machine learning operations | |
Gonçalves et al. | Exploring data size to run convolutional neural networks in low density fpgas | |
WO2021082746A1 (zh) | Computing device and related products | |
Bai et al. | An OpenCL-based FPGA accelerator with the Winograd’s minimal filtering algorithm for convolution neuron networks | |
CN111078625B (zh) | Network-on-chip processing system and network-on-chip data processing method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 19871156 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 19871156 Country of ref document: EP Kind code of ref document: A1 |
|
32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established |
Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 270921) |
|