WO2021036893A1 - Data processing method, apparatus, computer device and storage medium - Google Patents
Data processing method, apparatus, computer device and storage medium
- Publication number
- WO2021036893A1 (PCT/CN2020/110144)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- neural network
- network model
- data
- operators
- operator
- Prior art date
Classifications
- G—PHYSICS; G06—COMPUTING; CALCULATING OR COUNTING; G06F—ELECTRIC DIGITAL DATA PROCESSING; G06F8/00—Arrangements for software engineering; G06F8/40—Transformation of program code
- G06F8/41—Compilation
- G06F8/443—Optimisation
- G06F8/447—Target code generation
- G—PHYSICS; G06—COMPUTING; CALCULATING OR COUNTING; G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS; G06N3/00—Computing arrangements based on biological models; G06N3/02—Neural networks
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons, using electronic means
- G06N3/08—Learning methods
Definitions
- the present disclosure relates to the field of artificial intelligence technology, and in particular to a data processing method, device, computer equipment, and storage medium.
- At present, deep learning is implemented with online models.
- When an online model is used for deep learning, during the training of the neural network model each operator in the network model is compiled in turn into binary instructions on the artificial intelligence processor and run, and the calculation result and status are returned.
- This method of training the neural network model occupies too many CPU resources, and the power consumption of the CPU is high.
- a data processing method including:
- obtaining an execution file, the execution file including binary instructions obtained after compiling a neural network model and used for execution on an artificial intelligence processor;
- inputting the execution file into an artificial intelligence processor, so that the artificial intelligence processor trains the neural network model according to the input data and the execution file, and obtains the training result of the neural network model.
- a data processing device including:
- An obtaining module for obtaining an execution file including a binary instruction obtained by compiling a neural network model and used for execution on an artificial intelligence processor;
- an execution module, configured to input the execution file into an artificial intelligence processor, so that the artificial intelligence processor trains the neural network model according to the input data and the execution file, and obtains the training result of the neural network model.
- an artificial intelligence chip is provided, and the chip includes the data processing device described in any one of the foregoing.
- an electronic device including the aforementioned artificial intelligence chip.
- a board card includes: a storage device, an interface device, a control device, and the aforementioned artificial intelligence chip;
- the artificial intelligence chip is connected to the storage device, the control device, and the interface device respectively;
- the storage device is used to store data
- the interface device is used to implement data transmission between the artificial intelligence chip and external equipment
- the control device is used to monitor the state of the artificial intelligence chip.
- an electronic device including:
- a memory for storing processor executable instructions
- the processor is configured to call instructions stored in the memory to execute any one of the aforementioned data processing methods.
- a computer-readable storage medium having computer program instructions stored thereon, wherein when the computer program instructions are executed by a processor, the data processing method described in any one of the foregoing is implemented.
- the neural network model can be pre-compiled into an execution file and saved; when the neural network model is to be trained, the execution file can be input directly into the artificial intelligence processor,
- so that the artificial intelligence processor trains the neural network model according to the input data and the execution file.
- In this way, the online compilation of each operator in the neural network model by the processor is reduced, which can reduce the occupation of CPU resources during training and lower the power consumption of the CPU.
- Fig. 1 shows a schematic diagram of an application scenario of a data processing method provided by an embodiment of the present disclosure;
- Fig. 2 shows a schematic diagram of an artificial intelligence processor of a data processing method according to an embodiment of the present disclosure;
- Fig. 3 shows a flowchart of a data processing method according to an embodiment of the present disclosure;
- Fig. 4 shows a schematic diagram of an exemplary data processing method of the present disclosure;
- Fig. 5 shows a block diagram of a data processing device according to an embodiment of the present disclosure;
- Fig. 6 shows a structural block diagram of a board card according to an embodiment of the present disclosure;
- Fig. 7 shows a block diagram of an electronic device 800 according to an embodiment of the present disclosure;
- Fig. 8 shows a block diagram of an electronic device 1900 according to an embodiment of the present disclosure.
- the term “if” can be interpreted as “when” or “once” or “in response to determination” or “in response to detection” depending on the context.
- the phrase "if determined" or "if [the described condition or event] is detected" can be interpreted, depending on the context, as "once determined", "in response to determining", "once [the described condition or event] is detected", or "in response to detecting [the described condition or event]".
- Fig. 1 shows a schematic diagram of an application scenario of a data processing method provided by an embodiment of the present disclosure.
- the embodiment of the present disclosure can be applied to the computer system shown in Fig. 1.
- the computer system includes a memory and a processor; the memory can be used to store a computer program, and the processor can be used to perform data processing operations, for example, to perform the data processing method provided by the embodiments of the present disclosure.
- the CPU resources occupied in the training process of the neural network can be reduced, and the power consumption of the CPU can be reduced.
- the artificial intelligence processor involved in the embodiments of the present disclosure may be an artificial intelligence processor for performing artificial intelligence operations.
- Artificial intelligence operations can include machine learning operations, brain-like operations, and so on.
- machine learning operations include neural network operations, k-means operations, support vector machine operations, and so on.
- the artificial intelligence processor may include, for example, one or a combination of a GPU (Graphics Processing Unit), an NPU (Neural-Network Processing Unit), an MLU (Machine Learning Unit), a DSP (Digital Signal Processing unit), and a Field-Programmable Gate Array (FPGA) chip.
- the present disclosure does not limit the specific types of artificial intelligence processors.
- the artificial intelligence processor mentioned in the present disclosure may include multiple processing units, and each processing unit can independently run the various tasks assigned to it, such as convolution operation tasks, pooling tasks, or fully connected tasks.
- the present disclosure does not limit the processing unit and the tasks run by the processing unit.
- Fig. 2 shows a schematic diagram of an artificial intelligence processor of a data processing method according to an embodiment of the present disclosure.
- the artificial intelligence processor 100 includes multiple processing units 101 and a storage unit 102.
- the multiple processing units 101 are used to execute instruction sequences, and the storage unit 102 is used to store data; it may include random access memory (RAM) and a register file.
- The multiple processing units 101 in the processor 100 can share part of the storage space, for example sharing part of the RAM storage space and the register file, and can also have their own storage spaces at the same time. One or more of the multiple processing units 101 may be used to execute the data processing method provided in the embodiments of the present disclosure.
- First, in related techniques for online training of neural network models, the general-purpose processor compiles the operators of the acquired model one by one throughout the training process. It does not obtain the full graph information of the neural network model and cannot optimize the calculation process from the perspective of the full graph, so the whole calculation process incurs a lot of unnecessary computational overhead.
- Second, after the general-purpose processor compiles each operator in the calculation process of the neural network model into binary instructions, it controls the artificial intelligence processor to execute the binary instructions and return the calculation result and status to the general-purpose processor. The online training process thus includes the operator compilation process, which reduces the execution efficiency of training.
- Third, the general-purpose processor controls the artificial intelligence processor to perform the calculation of each operator according to each operator's calculation result and status; that is, the artificial intelligence processor must exchange data with the general-purpose processor for every operator it executes, so the I/O (Input/Output) overhead is considerable.
- Fig. 3 shows a flowchart of a data processing method according to an embodiment of the present disclosure. As shown in Fig. 3, the method is applied to a processor, and the method includes:
- In step S31: an execution file is obtained, the execution file including the binary instructions obtained after compiling the neural network model and used for execution on the artificial intelligence processor.
- a compiler may be a program used to translate a program written in one language (source language) into an equivalent program written in another language (target language).
- Neural network programming frameworks such as caffe (Convolutional Architecture for Fast Feature Embedding) and tensorflow can usually be used to construct neural network models.
- the language bound to the framework is usually a high-level language.
- the language bound to the tensorflow framework can be high-level languages such as JavaScript, C++, Java, Go, and Swift.
- the artificial intelligence processor cannot directly process high-level languages. It is necessary to compile the neural network model constructed with the high-level language through a compiler to obtain a language that can be processed by the artificial intelligence processor.
- the neural network model may be compiled to obtain a corresponding execution file, and the execution file includes the binary instructions obtained by compiling the neural network model for execution on the artificial intelligence processor.
- Binary instructions can be machine instructions corresponding to operators in the neural network model, and can be directly run on the artificial intelligence processor to achieve the purpose of using the artificial intelligence processor to run the neural network model.
- the executable file can be saved to the corresponding storage area of the memory, and when the neural network model is to be trained, the executable file can be obtained from the storage area.
- the memory may be a memory in the processor or a memory other than the processor.
- the above-mentioned execution file may include the address of the weight of the neural network model, and the binary instructions of each operator in the neural network model and the connection relationship between the operators.
- the address of the weights of the neural network model, the binary instructions of each operator in the neural network model, and the connection relationships between the operators can be stored directly in the execution file, or a weight file and a model file can be created in the execution file: the weight file includes the address of the weights of the neural network model, and the model file includes the binary instructions of the operators in the neural network model and the connection relationships between the operators.
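- To make the layout concrete, below is a minimal sketch of one way the execution file described above could be organized. The container and field names are illustrative assumptions, not structures defined by the disclosure; the only requirements carried over are the weight address, the per-operator binary instructions, and the operator connection relationships.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class ModelFile:
    # binary (machine) instructions for each operator, keyed by operator id
    instructions: Dict[int, bytes]
    # directed edges describing the connection relationship between operators
    connections: List[Tuple[int, int]] = field(default_factory=list)

@dataclass
class WeightFile:
    # device-memory address of each weight blob; only the address is stored,
    # so the weights themselves can be updated in place during training
    weight_addresses: Dict[str, int] = field(default_factory=dict)

@dataclass
class ExecutionFile:
    model: ModelFile
    weights: WeightFile
```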
- In step S32: the execution file is input to the artificial intelligence processor, so that the artificial intelligence processor trains the neural network model according to the input data and the execution file to obtain the training result of the neural network model.
- After the execution file is obtained, it can be input into the artificial intelligence processor. The execution file includes the address of the weights of the neural network model, the binary instructions of the operators in the neural network model, and the connection relationships between the operators.
- After the artificial intelligence processor reads in the input data (which can be the sample data used to train the neural network model), it can perform the corresponding training operations directly according to the binary instructions in the execution file and the input data, finally obtaining the training result of the neural network model; further operations such as parameter adjustment and iteration can then be performed according to the training result to complete the training of the neural network model.
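- The overall flow of step S32 can be sketched as follows. `AIProcessor` and its methods are hypothetical stand-ins for an accelerator runtime (no real API is implied); the point being illustrated is that the execution file is loaded once, with no per-operator compilation inside the training loop.

```python
class AIProcessor:
    """Hypothetical stand-in for the artificial intelligence processor runtime."""
    def load(self, execution_file):
        self.execution_file = execution_file   # one-time load of binary instructions
    def run(self, batch):
        return {"loss": 0.0}                   # placeholder: forward + backward on-device
    def update_weights(self, result):
        pass                                   # in-place update at the weight addresses
    def training_result(self):
        return {}

def train(execution_file, dataset, epochs=1):
    dev = AIProcessor()
    dev.load(execution_file)                   # no online operator compilation afterwards
    for _ in range(epochs):
        for batch in dataset:                  # batch = sample input data
            result = dev.run(batch)
            dev.update_weights(result)         # parameter adjustment / iteration
    return dev.training_result()
```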
- It should be noted that since the present disclosure compiles the neural network model into an execution file in advance, the general-purpose processor has already obtained the full graph information of the neural network model before executing the calculations of the neural network model, and can optimize the calculation process of the neural network model from the perspective of the full graph, which can reduce unnecessary computational overhead in the entire calculation process.
- During the forward propagation of neural network training, the corresponding weight data can be obtained according to the address of the weights of the neural network model in the execution file, so as to perform the corresponding operations of the neural network model and complete one forward pass; during the back propagation of neural network training, the weight data stored at the weight address can be updated according to the address of the weights of the neural network model in the execution file, completing one backward pass.
- In this way, the neural network model can be pre-compiled into an execution file and saved; when the neural network model is to be trained, the execution file is input directly into the artificial intelligence processor, so that the artificial intelligence processor trains the neural network model according to the read-in input data and the execution file. This reduces the online compilation of each operator in the neural network model by the processor, which can reduce the occupation of CPU resources during training and lower the power consumption of the CPU.
- In a possible implementation, the execution file may include a weight file and a model file, where the weight file includes the address of the weights of the neural network model, and the model file includes the binary instructions of each operator in the neural network model and the connection relationships between the operators.
- In the process of training the neural network model, the artificial intelligence processor can obtain the weights of the neural network model according to the weight address in order to train the model, and when the weights are updated according to the training result of the neural network model, they can be updated directly at the specified location;
- no update of the execution file corresponding to the neural network model is required.
- the execution file includes the address of the weight of the neural network model, and the binary instructions of each operator in the neural network model and the connection relationship between the operators.
- For example, an execution file can be created; the binary instructions of each operator in the neural network model and the connection relationships between the operators are written into the execution file, the weights of the neural network model are written to a specified location, and that specified location is written into the execution file as the address of the weights. In this way, during training of the neural network model by the artificial intelligence processor, the weights of the neural network model are obtained according to the weight address and used to train the model, and when the weights are updated according to the training result of the neural network model, they can be updated directly at the specified location; the execution file corresponding to the neural network model does not need to be updated.
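- Below is a minimal sketch of this in-place update idea, assuming device memory is modeled as a flat array and the execution file records an (offset, length) pair per weight — both assumptions for illustration only:

```python
import numpy as np

device_memory = np.zeros(1024, dtype=np.float32)   # stand-in for device memory
weight_addresses = {"conv1.weight": (0, 64)}       # name -> (offset, length)

def read_weight(name):
    off, n = weight_addresses[name]
    return device_memory[off:off + n]              # forward pass reads via the address

def update_weight(name, grad, lr=0.01):
    off, n = weight_addresses[name]
    # the new value is written back to the same specified location, so the
    # execution file, which stores only the address, never needs to change
    device_memory[off:off + n] -= lr * grad

update_weight("conv1.weight", np.ones(64, dtype=np.float32))
```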
- the foregoing method may further include:
- performing optimization processing on the neural network model, the optimization processing including at least one of operator fusion, data multiplexing, and redundant operator removal;
- the optimized neural network model is compiled to obtain the executable file.
- Before compiling the neural network model, it can be optimized based on the full graph information of the neural network model, with at least one optimization process such as operator fusion, data reuse, or redundant operator removal, and the optimized neural network model is then compiled to obtain the execution file corresponding to the neural network model.
- In this way, the calculation process of the neural network model can be optimized and some unnecessary calculations eliminated, which can reduce the computational overhead of the artificial intelligence processor during the training of the neural network model and lower the power consumption of the artificial intelligence processor.
- the aforementioned operator fusion may include:
- At least two operators to be fused are determined from the operators corresponding to the neural network model, wherein the at least two operators to be fused are at least two single operators that have a connection relationship in the network model, and the output data of the former of the at least two single operators is the only input data of the latter;
- fusion processing is performed on the at least two operators to be fused to obtain the fused operator.
- For example, at least two single operators can be determined from the operators of the neural network model.
- If the at least two single operators perform their operations on the same artificial intelligence processor, have a connection relationship with each other,
- and the output data of the former single operator is the only input data of the latter single operator, then the at least two single operators can be determined to be the operators to be fused, and fusing the at least two operators to be fused yields the fused operator.
- In a possible implementation, the at least two operators to be fused can be spliced, and the single operator obtained after the splicing process can be processed to obtain the fused operator, where the fused operator consists of two parts: an operation part related to the input data and an operation part unrelated to the input data.
- For example, the resulting fused operator can be (a/var*filter)*x1 + (bias-mean)/var*a + b, where:
- (a/var*filter)*x1 is the operation part related to the input data x1;
- (bias-mean)/var*a + b is the operation part unrelated to the input data x1.
- In this way, the three operators that were originally processed serially can be processed in parallel after the fusion optimization, and since the fused operator is split into the two parts above (the input-independent part can be computed once in advance), the operation process of the operator is also simplified; this improves the computing speed of the artificial intelligence processor and reduces the power consumption of the artificial intelligence processor.
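- The split can be checked numerically. The sketch below assumes the three serial operators have the elementwise forms x1*filter + bias, (y - mean)/var, and a*z + b — an assumption consistent with, but not stated by, the formula above:

```python
import numpy as np

filter_, bias, mean, var, a, b = 0.5, 0.1, 0.2, 1.5, 2.0, 0.3
x1 = np.array([1.0, 2.0, 3.0])

# serial execution: three operators launched one after another
serial = a * ((x1 * filter_ + bias - mean) / var) + b

# fused execution: an input-dependent part and a constant part
# that can be computed once, ahead of time
k = a / var * filter_                 # coefficient of the input-dependent part
c = (bias - mean) / var * a + b       # input-independent part
fused = k * x1 + c

assert np.allclose(serial, fused)     # the fusion preserves the result
```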
- the data used for data multiplexing includes at least one of weight data, input neuron data, output neuron data, bias, and gradient.
- For example, a first operator and at least one second operator can be determined from the operators corresponding to the neural network model, where the second operator multiplexes the data of the first operator; the data block address of the first operator can then be linked to the second operator, so that when multiplexed data is involved in the operation of the second operator, the artificial intelligence processor can directly obtain the corresponding data of the first operator according to the data block address and perform the corresponding operation of the second operator.
- In this way, the artificial intelligence processor can reuse data in its calculations during the training of the neural network model, which eliminates some computations, increases the computing speed of the artificial intelligence processor, and reduces the power consumption of the artificial intelligence processor.
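- A toy sketch of this address linking: the second operator's input is pointed at the first operator's output block rather than receiving a copy. The buffer table and the operator bodies are illustrative only:

```python
buffers = {}                           # data block address -> data

def run_op1(x):
    buffers[0x1000] = x * 2            # op1 writes its output block once
    return 0x1000                      # the block's address is published

def run_op2(addr):
    shared = buffers[addr]             # op2 reads the linked block directly,
    return shared + 1                  # with no copy and no recomputation

addr = run_op1(10)
print(run_op2(addr))                   # -> 21
```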
- In a possible implementation, the foregoing method may further include:
- determining the division granularity of the neural network model;
- dividing the neural network model into a plurality of subgraphs according to the division granularity, at least one subgraph of the plurality of subgraphs including more than two operators;
- compiling the neural network model according to each subgraph to obtain the execution file, the execution file including each subgraph identifier, where the subgraph identifier is used to instruct the artificial intelligence processor to return the operation result of a subgraph after completing the operations of all the operators in that subgraph.
- the above-mentioned division granularity can be determined during the construction of the neural network model, or during the compilation of the neural network model; it is used to indicate how the neural network model is divided into multiple subgraphs, including the operators contained in each subgraph.
- the computer system can divide the operators of the neural network model into multiple subgraphs according to the division granularity. During the division, the computer system can determine whether the multiple operators intended for the same subgraph perform their operations on the same artificial intelligence processor; if so, the multiple operators are divided into one subgraph; otherwise, the division into that subgraph is not performed.
- the neural network model in the embodiment of the present disclosure may be the aforementioned neural network model after performing optimization processing.
- After the subgraphs are divided, the neural network model can be compiled subgraph by subgraph.
- The instructions corresponding to each subgraph include the binary instructions of the operators in the subgraph and the subgraph identifier, so that the artificial intelligence processor can perform operation processing in units of subgraphs: the operations of each operator in the subgraph are executed in sequence, and, in response to the subgraph identifier, the operation result of the subgraph is returned after the operations of all the operators in the subgraph are completed.
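- A minimal sketch of the same-device grouping rule, matching the example discussed below (operators 1-2 on one processor, operators 3-6 on the MLU). The operator list and device tags are illustrative assumptions:

```python
from itertools import groupby

operators = [("op1", "CPU"), ("op2", "CPU"),
             ("op3", "MLU"), ("op4", "MLU"),
             ("op5", "MLU"), ("op6", "MLU")]

# consecutive operators that run on the same processor form one subgraph
subgraphs = []
for i, (device, group) in enumerate(groupby(operators, key=lambda t: t[1]), 1):
    subgraphs.append({"id": i, "device": device,
                      "ops": [name for name, _ in group]})

print(subgraphs)
# [{'id': 1, 'device': 'CPU', 'ops': ['op1', 'op2']},
#  {'id': 2, 'device': 'MLU', 'ops': ['op3', 'op4', 'op5', 'op6']}]
```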
- the foregoing method may further include:
- the neural network model is compiled according to each of the sub-graphs to obtain an execution file corresponding to each of the sub-graphs.
- For example, when compiling the neural network model according to the subgraphs, an execution file corresponding to each subgraph can be generated.
- The execution file corresponding to each subgraph can contain the binary instructions of each operator in the subgraph, the connection relationships between the operators, and the address of the weights; that is, the execution file of the neural network model includes the execution file corresponding to each subgraph.
- For example, operator 1 and operator 2 correspond to subgraph 1, and operator 3, operator 4, operator 5, and operator 6 correspond to subgraph 2.
- Since operator 1 and operator 2 both perform their operations on the same processor, operator 1 and operator 2 are divided into subgraph 1; since operator 3, operator 4, operator 5, and operator 6 all perform their operations on the MLU, operator 3, operator 4, operator 5, and operator 6 are divided into subgraph 2.
- The artificial intelligence processor can perform the operations of subgraph 1 and the operations of subgraph 2: after performing the operations of subgraph 1, it returns the operation result of subgraph 1 and then executes the operations of subgraph 2 according to the operation result of subgraph 1.
- In this way, since it is not necessary to return an operation result every time the operation of a single operator is completed, the I/O (Input/Output) overhead can be effectively reduced.
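- A short sketch of why this cuts I/O: the host receives one result per subgraph instead of one per operator. `execute` is a placeholder for running a single operator's binary instructions on the device:

```python
def execute(op, data):
    return data                        # placeholder for one operator's on-device run

def run_subgraph(subgraph, inputs):
    data = inputs
    for op in subgraph["ops"]:
        data = execute(op, data)       # intermediate results stay on the device
    return data                        # a single result crosses back to the host
```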
- In a possible implementation, the operation result of the above subgraph includes: the operation result of the final operator in the subgraph, and/or the operation result of each operator in the subgraph.
- For example, after completing the operations of a subgraph, the artificial intelligence processor may return the operation result of that subgraph.
- The operation result may include the operation result of the final operator in the subgraph (for example, for subgraph 2 in the above example, the operation result can include the operation result of operator 6); or the artificial intelligence processor may cache the operation result of each operator after performing that operator's operation and, after completing the operations of the subgraph, return the operation result of each operator. Alternatively, the operation result of the subgraph can be configured as needed, for example: all subgraphs are set to return the operation result of the final operator, or all subgraphs are set to return the operation result of each operator in the subgraph, or, as required, some subgraphs return the result of each operator while the other subgraphs return the result of the final operator.
- Fig. 5 shows a block diagram of a data processing device according to an embodiment of the present disclosure. As shown in Fig. 5, the device may include:
- the obtaining module 51 may be used to obtain an execution file, where the execution file includes a binary instruction obtained after compiling a neural network model and used for execution on an artificial intelligence processor;
- the execution module 52 may be used to input the execution file into an artificial intelligence processor, so that the artificial intelligence processor trains the neural network model according to the input data and the execution file to obtain the training result of the neural network model.
- In this way, the neural network model can be pre-compiled into an execution file and saved; when the neural network model is to be trained, the execution file is input directly into the artificial intelligence processor, so that the artificial intelligence processor trains the neural network model according to the read-in input data and the execution file.
- This avoids the online compilation of each operator in the neural network model, which can reduce the occupation of CPU resources and lower the power consumption of the CPU.
- the device may further include:
- an optimization module, configured to perform optimization processing on the neural network model, the optimization processing including at least one of operator fusion, data multiplexing, and redundant operator removal;
- the first compiling module is used to compile the optimized neural network model to obtain the executable file.
- the optimization module may also be used to:
- determine at least two operators to be fused from the operators corresponding to the neural network model, wherein the at least two operators to be fused are at least two single operators that have a connection relationship in the network model, and the output data of the former of the at least two single operators is the only input data of the latter; and
- perform fusion processing on the at least two operators to be fused to obtain the fused operator.
- the data used for data multiplexing includes at least one of weight data, input neuron data, output neuron data, bias, and gradient.
- the device further includes:
- a dividing module, configured to divide the neural network model into a plurality of subgraphs according to the division granularity, at least one subgraph of the plurality of subgraphs including more than two operators;
- a second compiling module, configured to compile the neural network model according to each subgraph to obtain the execution file, the execution file including each subgraph identifier, where the subgraph identifier is used to instruct the artificial intelligence processor to return the operation result of a subgraph after completing the operations of all the operators in that subgraph.
- the device further includes:
- the third compiling module is configured to compile the neural network model according to each sub-graph to obtain an execution file corresponding to each sub-graph.
- the operation result of the subgraph includes: the operation result of the final operator in the subgraph, and/or the operation result of each operator in the subgraph.
- the execution file includes a weight file and a model file, where:
- the weight file includes the address of the weight of the neural network model
- the model file includes the binary instructions of each operator in the neural network model and the connection relationship between the operators.
- the execution file includes the address of the weight of the neural network model, and the binary instructions of each operator in the neural network model and the connection relationship between the operators.
- The functional units/modules in the various embodiments of the present disclosure may be integrated into one unit/module, each unit/module may exist alone physically, or two or more units/modules may be integrated together.
- the above-mentioned integrated unit/module can be implemented in the form of hardware or software program module.
- the hardware may be a digital circuit, an analog circuit, and so on.
- the physical realization of the hardware structure includes but is not limited to transistors, memristors and so on.
- the artificial intelligence processor may be any appropriate hardware processor, such as CPU, GPU, FPGA, DSP, ASIC, and so on.
- the storage unit may be any suitable magnetic storage medium or magneto-optical storage medium, such as RRAM (Resistive Random Access Memory), DRAM (Dynamic Random Access Memory), SRAM (Static Random-Access Memory), EDRAM (Enhanced Dynamic Random Access Memory), HBM (High-Bandwidth Memory), HMC (Hybrid Memory Cube), and so on.
- If the integrated unit/module is implemented in the form of a software program module and sold or used as an independent product, it can be stored in a computer-readable memory.
- Based on this understanding, the essence of the technical solution of the present disclosure, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product: the computer software product is stored in a memory and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present disclosure.
- the aforementioned memory includes various media that can store program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
- an artificial intelligence chip is also disclosed, which includes the above-mentioned data processing device.
- In some embodiments, a board card is disclosed, which includes a storage device, an interface device, a control device, and the above-mentioned artificial intelligence chip,
- wherein the artificial intelligence chip is connected to the storage device, the control device, and the interface device respectively; the storage device is used to store data; the interface device is used to implement data transmission between the artificial intelligence chip and external equipment; and the control device is used to monitor the state of the artificial intelligence chip.
- Fig. 6 shows a structural block diagram of a board card according to an embodiment of the present disclosure.
- the board card may include other supporting components in addition to the chip 389 described above.
- the supporting components include, but are not limited to: a storage device 390, an interface device 391, and a control device 392;
- the storage device 390 is connected to the artificial intelligence chip through a bus for storing data.
- the storage device may include multiple groups of storage units 393, each group of storage units being connected to the artificial intelligence chip through a bus. It can be understood that each group of storage units may be DDR SDRAM (Double Data Rate Synchronous Dynamic Random Access Memory).
- In an embodiment, the storage device may include 4 groups of storage units, and each group of storage units may include multiple DDR4 chips.
- The artificial intelligence chip may include four 72-bit DDR4 controllers; in each 72-bit DDR4 controller, 64 bits are used for data transmission and 8 bits are used for ECC checking. It can be understood that when DDR4-3200 chips are used in each group of storage units, the theoretical bandwidth of data transmission can reach 25600 MB/s.
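- As a back-of-envelope check on the quoted figure, the theoretical bandwidth follows from the transfer rate and the payload bus width (the 8 ECC bits carry no payload):

```python
transfers_per_second = 3200 * 10**6      # DDR4-3200: 3200 MT/s
payload_bits = 64                        # 64 of the 72 bits carry data; 8 are ECC
bytes_per_transfer = payload_bits // 8
bandwidth = transfers_per_second * bytes_per_transfer   # bytes per second
print(bandwidth / 10**6)                 # -> 25600 MB/s, matching the figure above
```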
- each group of the storage unit includes a plurality of double-rate synchronous dynamic random access memories arranged in parallel.
- DDR can transmit data twice in one clock cycle.
- a controller for controlling the DDR is provided in the chip, which is used to control the data transmission and data storage of each storage unit.
- the interface device is electrically connected with the artificial intelligence chip.
- the interface device is used to implement data transmission between the artificial intelligence chip and an external device (such as a server or a computer).
- the interface device may be a standard PCIE interface.
- the data to be processed is transferred from the server to the chip through a standard PCIE interface to realize data transfer.
- the interface device may also be another interface; the present disclosure does not limit the specific form of the other interface, as long as the interface unit can implement the transfer function.
- the calculation result of the artificial intelligence chip is still transmitted by the interface device back to an external device (such as a server).
- the control device is electrically connected with the artificial intelligence chip.
- the control device is used to monitor the state of the artificial intelligence chip.
- the artificial intelligence chip and the control device may be electrically connected through an SPI interface.
- In an embodiment, the control device may include a microcontroller (Micro Controller Unit, MCU).
- the artificial intelligence chip may include multiple processing chips, multiple processing cores, or multiple processing circuits, and can drive multiple loads. Therefore, the artificial intelligence chip can be in different working states such as multi-load and light-load.
- the control device can regulate and control the working states of the multiple processing chips, multiple processing cores, and/or multiple processing circuits in the artificial intelligence chip.
- an electronic device which includes the aforementioned artificial intelligence chip.
- Electronic equipment includes data processing devices, robots, computers, printers, scanners, tablets, smart terminals, mobile phones, driving recorders, navigators, sensors, webcams, servers, cloud servers, cameras, video cameras, projectors, watches, headsets, mobile storage, wearable devices, vehicles, household appliances, and/or medical equipment.
- the transportation means include airplanes, ships, and/or vehicles;
- the household appliances include TVs, air conditioners, microwave ovens, refrigerators, rice cookers, humidifiers, washing machines, electric lights, gas stoves, and range hoods;
- the medical equipment includes nuclear magnetic resonance instruments, B-mode ultrasound scanners, and/or electrocardiographs.
- the embodiments of the present disclosure also provide a computer-readable storage medium on which computer program instructions are stored, and the computer program instructions implement the above-mentioned method when executed by a processor.
- the computer-readable storage medium may be a non-volatile computer-readable storage medium.
- An embodiment of the present disclosure also provides an electronic device, including: a processor; a memory for storing executable instructions of the processor; wherein the processor is configured to call the instructions stored in the memory to execute the above method.
- FIG. 7 shows a block diagram of an electronic device 800 according to an embodiment of the present disclosure.
- the electronic device 800 may be a mobile phone, a computer, a digital broadcasting terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, and other terminals.
- the electronic device 800 may include one or more of the following components: a processing component 802, a memory 804, a power supply component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
- the processing component 802 generally controls the overall operations of the electronic device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations.
- the processing component 802 may include one or more processors 820 to execute instructions to complete all or part of the steps of the foregoing method.
- the processing component 802 may include one or more modules to facilitate the interaction between the processing component 802 and other components.
- the processing component 802 may include a multimedia module to facilitate the interaction between the multimedia component 808 and the processing component 802.
- the memory 804 is configured to store various types of data to support operations in the electronic device 800. Examples of these data include instructions for any application or method operating on the electronic device 800, contact data, phone book data, messages, pictures, videos, etc.
- the memory 804 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disk.
- the power supply component 806 provides power for various components of the electronic device 800.
- the power supply component 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic device 800.
- the multimedia component 808 includes a screen that provides an output interface between the electronic device 800 and the user.
- the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user.
- the touch panel includes one or more touch sensors to sense touch, sliding, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure related to the touch or slide operation.
- the multimedia component 808 includes a front camera and/or a rear camera. When the electronic device 800 is in an operation mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front camera and rear camera can be a fixed optical lens system or have focal length and optical zoom capabilities.
- the audio component 810 is configured to output and/or input audio signals.
- the audio component 810 includes a microphone (MIC), and when the electronic device 800 is in an operation mode, such as a call mode, a recording mode, and a voice recognition mode, the microphone is configured to receive an external audio signal.
- the received audio signal may be further stored in the memory 804 or transmitted via the communication component 816.
- the audio component 810 further includes a speaker for outputting audio signals.
- the I/O interface 812 provides an interface between the processing component 802 and a peripheral interface module.
- the above-mentioned peripheral interface module may be a keyboard, a click wheel, a button, and the like. These buttons may include, but are not limited to: home button, volume button, start button, and lock button.
- the sensor component 814 includes one or more sensors for providing the electronic device 800 with various aspects of status evaluation.
- For example, the sensor component 814 can detect the on/off status of the electronic device 800 and the relative positioning of components (for example, the display and keypad of the electronic device 800), and can also detect a change in the position of the electronic device 800 or of a component of the electronic device 800, the presence or absence of contact between the user and the electronic device 800, the orientation or acceleration/deceleration of the electronic device 800, and temperature changes of the electronic device 800.
- the sensor component 814 may include a proximity sensor configured to detect the presence of nearby objects when there is no physical contact.
- the sensor component 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications.
- the sensor component 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
- the communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices.
- the electronic device 800 can access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof.
- the communication component 816 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel.
- the communication component 816 further includes a near field communication (NFC) module to facilitate short-range communication.
- the NFC module can be implemented based on radio frequency identification (RFID) technology, infrared data association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology and other technologies.
- In an exemplary embodiment, the electronic device 800 may be implemented by one or more application-specific integrated circuits (ASIC), digital signal processors (DSP), digital signal processing devices (DSPD), programmable logic devices (PLD), field-programmable gate arrays (FPGA), controllers, microcontrollers, microprocessors, or other electronic components, for performing the above methods.
- a non-volatile computer-readable storage medium such as the memory 804 including computer program instructions, which can be executed by the processor 820 of the electronic device 800 to complete the foregoing method.
- FIG. 8 shows a block diagram of an electronic device 1900 according to an embodiment of the present disclosure.
- the electronic device 1900 may be provided as a server.
- the electronic device 1900 includes a processing component 1922, which further includes one or more processors, and a memory resource represented by a memory 1932 for storing instructions executable by the processing component 1922, such as application programs.
- the application program stored in the memory 1932 may include one or more modules each corresponding to a set of instructions.
- the processing component 1922 is configured to execute instructions to perform the above-described methods.
- the electronic device 1900 may also include a power supply component 1926 configured to perform power management of the electronic device 1900, a wired or wireless network interface 1950 configured to connect the electronic device 1900 to a network, and an input/output (I/O) interface 1958.
- the electronic device 1900 can operate based on an operating system stored in the memory 1932, such as Windows ServerTM, Mac OS XTM, UnixTM, LinuxTM, FreeBSDTM or the like.
- a non-volatile computer-readable storage medium is also provided, such as the memory 1932 including computer program instructions, which can be executed by the processing component 1922 of the electronic device 1900 to complete the foregoing method.
- Clause A1: a data processing method, the method including: obtaining an execution file, the execution file including binary instructions obtained after compiling a neural network model and used for execution on an artificial intelligence processor; and
- inputting the execution file into the artificial intelligence processor, so that the artificial intelligence processor trains the neural network model according to the input data and the execution file, and obtains the training result of the neural network model.
- Clause A2: the method according to clause A1, the method further including: performing optimization processing on the neural network model, the optimization processing including at least one of operator fusion, data multiplexing, and redundant operator removal; and compiling the optimized neural network model to obtain the execution file.
- Clause A3: the method according to clause A2, where the operator fusion includes: determining at least two operators to be fused from the operators corresponding to the neural network model, wherein the at least two operators to be fused are at least two single operators that have a connection relationship in the network model, and the output data of the former of the at least two single operators is the only input data of the latter; and
- performing fusion processing on the at least two operators to be fused to obtain the fused operator.
- Clause A4: the data used for data multiplexing includes at least one of weight data, input neuron data, output neuron data, bias, and gradient.
- Clause A5: the method further including: determining a division granularity of the neural network model; dividing the neural network model into a plurality of subgraphs according to the division granularity, at least one subgraph of the plurality of subgraphs including more than two operators; and compiling the neural network model according to each subgraph to obtain the execution file, the execution file including the identifier of each subgraph, where the subgraph identifier is used to instruct the artificial intelligence processor to return the operation result of a subgraph after completing the operations of all operators in the subgraph.
- Clause A6: the method according to clause A5, the method further including: compiling the neural network model according to each of the subgraphs to obtain an execution file corresponding to each subgraph.
- Clause A7: the operation result of the subgraph includes: the operation result of the final operator in the subgraph, and/or the operation result of each operator in the subgraph.
- Clause A8: the execution file includes a weight file and a model file, wherein the weight file includes the address of the weights of the neural network model, and the model file includes the binary instructions of each operator in the neural network model and the connection relationships between the operators.
- Clause A9: the execution file includes the address of the weights of the neural network model, the binary instructions of each operator in the neural network model, and the connection relationships between the operators.
- Clause A10: a data processing device, including:
- an obtaining module, configured to obtain an execution file, the execution file including binary instructions obtained after compiling a neural network model and used for execution on an artificial intelligence processor; and
- an execution module, configured to input the execution file into an artificial intelligence processor, so that the artificial intelligence processor trains the neural network model according to the input data and the execution file, and obtains the training result of the neural network model.
- Clause A11: the device further including:
- an optimization module, configured to perform optimization processing on the neural network model, the optimization processing including at least one of operator fusion, data multiplexing, and redundant operator removal; and
- a first compiling module, used to compile the optimized neural network model to obtain the execution file.
- Clause A12: the optimization module is further used to:
- determine at least two operators to be fused from the operators corresponding to the neural network model, wherein the at least two operators to be fused are at least two single operators that have a connection relationship in the network model, and the output data of the former of the at least two single operators is the only input data of the latter; and
- perform fusion processing on the at least two operators to be fused to obtain the fused operator.
- Clause A13: the device according to clause A11, where the data used for data multiplexing includes at least one of weight data, input neuron data, output neuron data, bias, and gradient.
- Clause A14: the device further including:
- a dividing module, configured to divide the neural network model into a plurality of subgraphs according to the division granularity, at least one subgraph of the plurality of subgraphs including more than two operators; and
- a second compiling module, configured to compile the neural network model according to each subgraph to obtain the execution file, the execution file including each subgraph identifier, where the subgraph identifier is used to instruct the artificial intelligence processor to return the operation result of a subgraph after completing the operations of all operators in the subgraph.
- Clause A15: the device further including: a third compiling module, configured to compile the neural network model according to each subgraph to obtain an execution file corresponding to each subgraph.
- Clause A16: the operation result of the subgraph includes: the operation result of the final operator in the subgraph, and/or the operation result of each operator in the subgraph.
- Clause A17: the execution file includes a weight file and a model file, wherein the weight file includes the address of the weights of the neural network model, and the model file includes the binary instructions of each operator in the neural network model and the connection relationships between the operators.
- Clause A18: the execution file includes the address of the weights of the neural network model, the binary instructions of each operator in the neural network model, and the connection relationships between the operators.
- Clause A19: an artificial intelligence chip, the chip comprising the data processing device as described in any one of the aforementioned clauses A10 to A18.
- Clause A20: an electronic device including the artificial intelligence chip as described in clause A19.
- Clause A21: a board card, comprising: a storage device, an interface device, a control device, and the artificial intelligence chip as described in clause A19; wherein the artificial intelligence chip is connected to the storage device, the control device, and the interface device respectively; the storage device is used to store data; the interface device is used to implement data transmission between the artificial intelligence chip and external equipment; and the control device is used to monitor the state of the artificial intelligence chip.
- Clause A22: the board card according to clause A21, wherein the storage device includes multiple groups of storage units, each group of storage units connected to the artificial intelligence chip through a bus, the storage units being DDR SDRAM;
- the chip includes a DDR controller, used to control the data transmission and data storage of each storage unit; and
- the interface device is a standard PCIE interface.
- Clause A23: an electronic device, comprising: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to call the instructions stored in the memory to execute the method described in any one of clauses A1 to A9.
- Clause A24: a computer-readable storage medium with computer program instructions stored thereon, wherein when the computer program instructions are executed by a processor, the method described in any one of clauses A1 to A9 is implemented.
Abstract
The present disclosure relates to a data processing method, apparatus, computer device, and storage medium. The computer device includes a control module, and the control module includes an instruction cache unit, an instruction processing unit, and a storage queue unit. The instruction cache unit is used to store calculation instructions associated with the artificial neural network operation; the instruction processing unit is used to parse the calculation instructions to obtain multiple operation instructions; the storage queue unit is used to store an instruction queue, the instruction queue including multiple operation instructions or calculation instructions to be executed in the order of the queue. Through the above method, the present disclosure can improve the operation efficiency of related products when training a neural network model.
Description
The present disclosure claims priority to the Chinese patent application No. 201910786452.X filed with the Chinese Patent Office on August 23, 2019, the entire contents of which are incorporated into the present disclosure by reference.
The present disclosure relates to the field of artificial intelligence technology, and in particular to a data processing method, apparatus, computer device, and storage medium.

With the development of artificial intelligence technology, deep learning has emerged. At present, deep learning is implemented with online models. When an online model is used for deep learning, during the training of the neural network model each operator in the neural network model is compiled in turn into binary instructions on the artificial intelligence processor and run, and the calculation results and status are returned. This method of training a neural network model occupies too many CPU resources, and the power consumption of the CPU is high.

Summary of the Invention

Based on this, in view of the above technical problem, it is necessary to provide a data processing method, apparatus, computer device, and storage medium that can reduce the occupation of CPU resources during the training of a neural network model and lower the power consumption of the CPU.

According to one aspect of the present disclosure, a data processing method is provided, the method including:

obtaining an execution file, the execution file including binary instructions obtained after compiling a neural network model and used for execution on an artificial intelligence processor;

inputting the execution file into an artificial intelligence processor, so that the artificial intelligence processor trains the neural network model according to input data and the execution file, and obtains the training result of the neural network model.

According to another aspect of the present disclosure, a data processing apparatus is provided, including:

an obtaining module, configured to obtain an execution file, the execution file including binary instructions obtained after compiling a neural network model and used for execution on an artificial intelligence processor;

an execution module, configured to input the execution file into an artificial intelligence processor, so that the artificial intelligence processor trains the neural network model according to input data and the execution file and obtains the training result of the neural network model.

According to another aspect of the present disclosure, an artificial intelligence chip is provided, the chip including the data processing apparatus described in any one of the foregoing.

According to another aspect of the present disclosure, an electronic device is provided, the electronic device including the aforementioned artificial intelligence chip.

According to another aspect of the present disclosure, a board card is provided, the board card including: a storage device, an interface device, a control device, and the aforementioned artificial intelligence chip;

wherein the artificial intelligence chip is connected to the storage device, the control device, and the interface device respectively;

the storage device is used to store data;

the interface device is used to implement data transmission between the artificial intelligence chip and external equipment;

the control device is used to monitor the state of the artificial intelligence chip.

According to another aspect of the present disclosure, an electronic device is provided, including:

a processor;

a memory for storing instructions executable by the processor;

wherein the processor is configured to call the instructions stored in the memory to execute any one of the aforementioned data processing methods.

According to another aspect of the present disclosure, a computer-readable storage medium is provided, on which computer program instructions are stored, wherein when the computer program instructions are executed by a processor, the data processing method described in any one of the foregoing is implemented.

In this way, according to the data processing method, apparatus, computer device, and storage medium provided by the embodiments of the present disclosure, the neural network model can be pre-compiled into an execution file and saved; when the neural network model is to be trained, the execution file is input directly into the artificial intelligence processor, so that the artificial intelligence processor trains the neural network model according to the read-in input data and the execution file. This reduces the online compilation of each operator in the neural network model by the processor, which can reduce the occupation of CPU resources during training and lower the power consumption of the CPU.

By deduction from the technical features in the claims, beneficial effects corresponding to the technical problems in the background art can be achieved. Other features and aspects of the present disclosure will become clear from the following detailed description of exemplary embodiments with reference to the accompanying drawings.
包含在说明书中并且构成说明书的一部分的附图与说明书一起示出了本公开的示例性实施例、特征和方面,并且用于解释本公开的原理。
Fig. 1 shows a schematic diagram of an application scenario of the data processing method provided by an embodiment of the present disclosure;
Fig. 2 shows a schematic diagram of an artificial intelligence processor for the data processing method according to an embodiment of the present disclosure;
Fig. 3 shows a flowchart of a data processing method according to an embodiment of the present disclosure;
Fig. 4 shows a schematic diagram of an exemplary data processing method of the present disclosure;
Fig. 5 shows a block diagram of a data processing device according to an embodiment of the present disclosure;
Fig. 6 shows a structural block diagram of a board card according to an embodiment of the present disclosure;
Fig. 7 shows a block diagram of an electronic device 800 according to an embodiment of the present disclosure;
Fig. 8 shows a block diagram of an electronic device 1900 according to an embodiment of the present disclosure.
The technical solutions in the embodiments of the present disclosure will be described clearly and completely below with reference to the accompanying drawings in the embodiments of the present disclosure. Obviously, the described embodiments are only some rather than all of the embodiments of the present disclosure. All other embodiments obtained by those skilled in the art based on the embodiments of the present disclosure without creative effort fall within the protection scope of the present disclosure.
It should be understood that the terms "first", "second" and the like in the claims, specification and drawings of the present disclosure are used to distinguish different objects rather than to describe a specific order. The terms "include" and "comprise" used in the specification and claims of the present disclosure indicate the presence of the described features, wholes, steps, operations, elements and/or components, but do not exclude the presence or addition of one or more other features, wholes, steps, operations, elements, components and/or collections thereof.
It should also be understood that the terminology used in this specification of the present disclosure is for the purpose of describing particular embodiments only and is not intended to limit the present disclosure. As used in the specification and claims of the present disclosure, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" used in the specification and claims of the present disclosure refers to and includes any and all possible combinations of one or more of the associated listed items.
As used in this specification and the claims, the term "if" may be interpreted as "when", "once", "in response to determining" or "in response to detecting", depending on the context. Similarly, the phrases "if it is determined" or "if [the described condition or event] is detected" may be interpreted as meaning "once it is determined", "in response to determining", "once [the described condition or event] is detected" or "in response to detecting [the described condition or event]", depending on the context.
Fig. 1 shows a schematic diagram of an application scenario of the data processing method provided by an embodiment of the present disclosure. The embodiments of the present disclosure can be applied to the computer system shown in Fig. 1, which includes a memory and a processor. The memory can be used to store a computer program, and the processor can be used to perform data processing operations, for example, to execute the data processing method provided by the embodiments of the present disclosure. With the data processing method provided by the embodiments of the present disclosure, the CPU resources occupied during the training of a neural network can be reduced, and the power consumption of the CPU can be lowered.
The artificial intelligence processor involved in the embodiments of the present disclosure may be an artificial intelligence processor for performing artificial intelligence operations. Artificial intelligence operations may include machine learning operations, brain-like operations and so on, where machine learning operations include neural network operations, k-means operations, support vector machine operations and the like. The artificial intelligence processor may include, for example, one or a combination of a GPU (Graphics Processing Unit), an NPU (Neural-Network Processing Unit), an MLU (Machine Learning Unit), a DSP (Digital Signal Processing unit), and an FPGA (Field-Programmable Gate Array) chip. The present disclosure does not limit the specific type of the artificial intelligence processor.
In a possible implementation, the artificial intelligence processor mentioned in the present disclosure may include multiple processing units, and each processing unit can independently run the various tasks assigned to it, such as convolution operation tasks, pooling tasks or fully-connected tasks. The present disclosure does not limit the processing units or the tasks run by the processing units.
Fig. 2 shows a schematic diagram of an artificial intelligence processor for the data processing method according to an embodiment of the present disclosure. As shown in Fig. 2, the artificial intelligence processor 100 includes multiple processing units 101 and a storage unit 102. The multiple processing units 101 are used to execute instruction sequences, and the storage unit 102 is used to store data and may include a random access memory (RAM) and a register file. The multiple processing units 101 in the processor 100 can either share part of the storage space, for example share part of the RAM storage space and the register file, or have their own storage spaces at the same time. One or more of the multiple processing units 101 can be used to execute the data processing method provided by the embodiments of the present disclosure.
First, in the related art of online training of neural network models, the general-purpose processor compiles the operators of the obtained model one by one throughout the training process; it does not obtain the full-graph information of the neural network model and cannot optimize the computation process from the perspective of the full graph, so the whole computation process incurs a great deal of unnecessary computation overhead.
Second, in the process of online training of a neural network model, after compiling each operator in the computation process of the neural network model into binary instructions, the general-purpose processor controls the artificial intelligence processor to execute the binary instructions and return the computation result and status to the general-purpose processor. Because the online training process includes operator compilation, the execution efficiency of training is reduced.
Third, the general-purpose processor controls the artificial intelligence processor to execute the computation of each operator according to the computation results and statuses of the operators; that is, the artificial intelligence processor needs to exchange data with the general-purpose processor for the computation of every operator, so the I/O (Input/Output) overhead is relatively large.
Fig. 3 shows a flowchart of a data processing method according to an embodiment of the present disclosure. As shown in Fig. 3, the method is applied to a processor and includes:
Step S31: obtaining an execution file, the execution file including binary instructions that are obtained by compiling a neural network model and are used for execution on an artificial intelligence processor.
For example, a compiler may be a program for translating a program written in one language (the source language) into an equivalent program written in another language (the target language). Neural network models are usually built with neural network programming frameworks such as Caffe (Convolutional Architecture for Fast Feature Embedding) and TensorFlow, and the languages bound to such frameworks are usually high-level languages. For example, the languages bound to the TensorFlow framework can be high-level languages such as JavaScript, C++, Java, Go and Swift. An artificial intelligence processor cannot process high-level languages directly, so a neural network model built with a high-level language needs to be compiled by a compiler to obtain a language that the artificial intelligence processor can process.
After the neural network model is built, compilation processing can be performed on it to obtain a corresponding execution file, which includes the binary instructions obtained by compiling the neural network model for execution on the artificial intelligence processor. The binary instructions can be machine instructions corresponding to the operators in the neural network model and can be run directly on the artificial intelligence processor, so as to run the neural network model with the artificial intelligence processor. The execution file can be saved to a corresponding storage area of a memory and fetched from that storage area when the neural network model is to be trained. The memory can be a memory inside the processor or a memory outside the processor.
The execution file may include the address of the weights of the neural network model, as well as the binary instructions of the operators in the neural network model and the connection relationships between the operators. It should be noted that the address of the weights of the neural network model, the binary instructions of the operators and the connection relationships between the operators can be stored directly in the execution file; alternatively, a weight file and a model file can be created in the execution file, where the weight file includes the address of the weights of the neural network model, and the model file includes the binary instructions of the operators in the neural network model and the connection relationships between the operators.
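To make the description above concrete, the following is a minimal sketch of one possible in-memory layout of such an execution file. The disclosure does not define a concrete data structure, so all field names here are illustrative assumptions.

```python
# Hedged sketch of an execution-file layout; field names are assumptions.
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class ExecutionFile:
    # Address of the weight storage area: forward passes read from it,
    # backward passes update the data at the same address in place.
    weight_address: int
    # Operator name -> binary instructions compiled for the AI processor.
    operator_binaries: Dict[str, bytes] = field(default_factory=dict)
    # Directed edges (producer, consumer) recording operator connections.
    connections: List[Tuple[str, str]] = field(default_factory=list)
```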
Step S32: inputting the execution file into an artificial intelligence processor, so that the artificial intelligence processor trains the neural network model according to input data and the execution file, and obtains a training result of the neural network model.
For example, after the execution file is obtained, it can be input into the artificial intelligence processor, where the execution file includes the address of the weights of the neural network model, the binary instructions of the operators in the neural network model and the connection relationships between the operators. After reading in the input data (which can be the sample data used to train the neural network model), the artificial intelligence processor can directly perform the corresponding training operations according to the binary instructions corresponding to the neural network model in the execution file and the input data, finally obtain the training result of the neural network model, and can further perform operations such as parameter tuning and iteration according to the training result to complete the training of the neural network model.
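As a rough illustration of the training step just described, the sketch below drives a processor with a previously compiled execution file; `AIProcessor`, `load`, `run`, `update_weights` and `read_training_result` are hypothetical stand-ins, not an API defined by this disclosure.

```python
# Hedged sketch: training from a precompiled execution file. The key
# point is that no operator is compiled inside the loop.
def train(processor, exec_file, sample_batches, epochs=1):
    processor.load(exec_file)                 # binary instructions only
    for _ in range(epochs):
        for batch in sample_batches:
            result = processor.run(batch)     # forward + backward on-chip
            processor.update_weights(result)  # tune parameters, iterate
    return processor.read_training_result()
```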
It should be noted that, since in the present disclosure the neural network model is compiled into an execution file in advance, the general-purpose processor has already obtained the full-graph information of the neural network model before the operations of the neural network model are executed, so it can optimize the computation process of the neural network model from the perspective of the full graph and reduce the unnecessary computation overhead incurred in the whole computation process.
In the forward propagation process of neural network training, the corresponding weight data can be obtained according to the address of the weights of the neural network model in the execution file, so as to perform the corresponding operations of the neural network model and complete one forward pass; in the backward propagation process of neural network training, the weight data saved at the weight address can be updated according to the address of the weights of the neural network model in the execution file, so as to complete one backward pass.
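The point that the execution file itself never changes can be made concrete with a small sketch: weights live at a fixed address, forward passes read them, and backward passes write the updated values back to the same place. The flat address-table representation is an assumption for illustration only.

```python
# Illustrative only: a flat address space mapping addresses to tensors.
memory = {}

def forward_read_weights(exec_file):
    # Forward propagation fetches weight data via the stored address.
    return memory[exec_file.weight_address]

def backward_update_weights(exec_file, new_weights):
    # Backward propagation overwrites the data at the same address,
    # so the execution file needs no modification at all.
    memory[exec_file.weight_address] = new_weights
```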
In this way, according to the data processing method provided by the embodiments of the present disclosure, a neural network model can be compiled into an execution file in advance and saved, and when the neural network model is to be trained, the execution file is directly input into the artificial intelligence processor, so that the artificial intelligence processor trains the neural network model according to the read input data and the execution file. This reduces the online compilation of each operator in the neural network model by the processor, which can reduce the occupation of CPU resources during training and lower the power consumption of the CPU.
In a possible implementation, the execution file may include a weight file and a model file, where the weight file includes the address of the weights of the neural network model, and the model file includes the binary instructions of the operators in the neural network model and the connection relationships between the operators.
For example, when compiling the neural network model, a weight file and a model file can be created in the execution file; the binary instructions of the operators in the neural network model and the connection relationships between the operators can be written into the model file, the weights of the neural network model can be written to a specified location, and the specified location can be written into the weight file as the address of the weights. In this way, during the training of the neural network model by the artificial intelligence processor, the weights of the neural network model can be obtained according to the weight address to train the neural network model, and when the weights are updated according to the training result of the neural network model, they are simply updated at the specified location, and the execution file corresponding to the neural network model does not need any update operation.
In a possible implementation, the execution file includes the address of the weights of the neural network model, as well as the binary instructions of the operators in the neural network model and the connection relationships between the operators.
For example, when compiling the neural network model, an execution file can be created; the binary instructions of the operators in the neural network model and the connection relationships between the operators can be written into the execution file, the weights of the neural network model can be written to a specified location, and the specified location can be written into the execution file as the address of the weights. In this way, during the training of the neural network model by the artificial intelligence processor, the weights of the neural network model are obtained according to the weight address to train the neural network model, and when the weights are updated according to the training result of the neural network model, they are simply updated at the specified location, and the execution file corresponding to the neural network model does not need any update operation.
In a possible implementation, the above method may further include:
performing optimization processing on the neural network model, the optimization processing including at least one of operator fusion, data reuse and removal of redundant operators;
compiling the optimized neural network model to obtain the execution file.
For example, before compiling the neural network model, the embodiments of the present disclosure can first optimize the neural network model based on the full-graph information of the neural network model, for example, perform at least one optimization among operator fusion, data reuse and removal of redundant operators, and then compile the optimized neural network model to obtain the execution file corresponding to the neural network model. In this way, the computation process of the neural network model can be optimized and some unnecessary computation can be removed, which can reduce the computation overhead of the artificial intelligence processor during the training of the neural network model and lower the power consumption of the artificial intelligence processor.
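A minimal sketch of this optimize-then-compile pipeline follows; the pass names mirror the three optimizations listed above, and every function here is a hypothetical placeholder rather than an API of this disclosure.

```python
# Hypothetical pass pipeline: optimize on the full graph, then compile.
def build_execution_file(graph, compile_backend):
    for optimization in (fuse_operators, reuse_data, remove_redundant_ops):
        graph = optimization(graph)     # whole-graph optimization passes
    return compile_backend(graph)       # emit binary instructions
```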
In a possible implementation, the operator fusion may include:
determining at least two operators to be fused from the operators corresponding to the neural network model, where the at least two operators to be fused are at least two single operators having a connection relationship in the network model, and the output data of the former single operator of the at least two single operators is the only input data of the latter single operator;
performing fusion processing on the at least two operators to be fused to obtain a fused operator.
For example, at least two single operators can be determined from the operators of the neural network model, where the at least two single operators perform their operations on the same artificial intelligence processor, have connection relationships with one another, and the output data of the former single operator is the only input data of the latter single operator; the at least two single operators can then be determined as the operators to be fused, and fusion processing can be performed on them to obtain one fused operator.
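One way to find such candidates, sketched under the assumption that the graph is encoded as successor lists with consumer counts (an encoding not specified by this disclosure), is a simple chain walk: extend the chain while the current operator's output feeds exactly one successor and that successor has no other input.

```python
# Hedged sketch: detect a fusible chain where each operator's output is
# the sole input of exactly one successor. Graph encoding is assumed.
def find_fusion_chain(successors, in_degree, start):
    """successors: op -> list of consumer ops; in_degree: op -> int."""
    chain = [start]
    node = start
    while len(successors.get(node, [])) == 1:
        nxt = successors[node][0]
        if in_degree.get(nxt, 0) != 1:   # consumer has other inputs: stop
            break
        chain.append(nxt)
        node = nxt
    return chain if len(chain) >= 2 else None
```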
For example, the at least two operators to be fused can be spliced according to the connection relationships between them, and the operator obtained after splicing can be processed to obtain the fused operator, where the fused operator consists of two parts: one part is the computation related to the input data, and the other part is the computation unrelated to the input data.
As an example, suppose three operators to be fused are currently determined:
Operator 1: y1 = filter * x1 + bias, where filter represents the weight, bias represents the bias value, x1 represents the input data of operator 1, and y1 represents the output data of operator 1;
Operator 2: y2 = (y1 - mean) / var, where mean represents the mean value, var represents the variance, and y2 represents the output data of operator 2;
Operator 3: y3 = y2 * a + b, where a and b represent two parameters, and y3 represents the output data of operator 3.
After splicing the above three operators to be fused (substituting the output of the former operator into the latter operator as its input data), we obtain:
y3 = ((filter * x1) + bias - mean) / var * a + b;
After further processing the spliced operator, the fused operator obtained can be:
y3 = (a / var * filter) * x1 + (bias - mean) / var * a + b;
where (a / var * filter) * x1 is the computation related to the input data x1, and (bias - mean) / var * a + b is the computation unrelated to the input data x1.
In this way, after the three originally serially processed operators undergo fusion optimization, they can be processed in parallel; moreover, since the fused operator is processed into two parts, the computation process of the operator is also simplified, which improves the computation speed of the artificial intelligence processor and lowers its power consumption.
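The algebra above can be checked numerically: the sketch below evaluates the three chained operators and the fused two-part form on the same inputs and confirms they agree. The numeric values are arbitrary test numbers, not data from the disclosure.

```python
# Verify that the fused operator matches the three chained operators.
filter_, bias, mean, var, a, b, x1 = 2.0, 0.5, 0.1, 4.0, 3.0, 1.0, 1.5

y1 = filter_ * x1 + bias          # operator 1
y2 = (y1 - mean) / var            # operator 2
y3 = y2 * a + b                   # operator 3

# Fused form: an input-dependent part plus an input-independent part.
input_part = (a / var * filter_) * x1
const_part = (bias - mean) / var * a + b
assert abs(y3 - (input_part + const_part)) < 1e-9
```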
In a possible implementation, the data for data reuse includes at least one of weight data, input neuron data, output neuron data, bias and gradient.
For example, a first operator and at least one second operator can be determined from the operators corresponding to the neural network model, where the second operator reuses the data of the first operator; the data block address corresponding to the first operator can then be linked into the second operator, so that when data reuse is involved while the artificial intelligence processor executes the operation of the second operator, it can directly obtain the corresponding data of the first operator according to the data block address and execute the corresponding operation of the second operator.
In this way, the artificial intelligence processor can reuse data in the computation during the training of the neural network model, which can save some operations, improve the computation speed of the artificial intelligence processor and lower its power consumption.
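A rough sketch of the address-linking idea: instead of copying the first operator's data block, the second operator's descriptor stores the same address, so the processor fetches the shared block directly. The descriptor layout and address values are assumptions for illustration.

```python
# Hedged sketch of data reuse by address linking (layout is assumed).
data_blocks = {0x1000: ["weights-of-op1"]}     # address -> data block

op1 = {"name": "op1", "data_addr": 0x1000}
# op2 reuses op1's data: link the address instead of duplicating the block.
op2 = {"name": "op2", "data_addr": op1["data_addr"]}

assert data_blocks[op2["data_addr"]] is data_blocks[op1["data_addr"]]
```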
For example, when optimizing the neural network model, redundant operators can also be removed. For instance, there may be an operator in the neural network model that randomly shuffles the input data, but in fact the input data provided by the user can, in theory, also be regarded as data that has already been ordered, so this operator can be determined as a redundant operator and removed. In this way, the artificial intelligence processor can save some operations during the training of the neural network model, which improves its computation speed and lowers its power consumption.
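As a toy illustration of this idea, the pass below deletes a shuffle node judged redundant and reconnects its neighbours; the edge-list graph encoding and operator names are again assumptions, not part of the disclosure.

```python
# Toy redundant-operator elimination on an edge-list graph (assumed form).
def remove_redundant(edges, redundant):
    preds = [src for src, dst in edges if dst == redundant]
    succs = [dst for src, dst in edges if src == redundant]
    kept = [(s, d) for s, d in edges if redundant not in (s, d)]
    # Bypass the removed node by wiring predecessors to successors.
    kept += [(p, s) for p in preds for s in succs]
    return kept

edges = [("input", "shuffle"), ("shuffle", "conv1"), ("conv1", "fc")]
print(remove_redundant(edges, "shuffle"))
# [('conv1', 'fc'), ('input', 'conv1')]
```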
In a possible implementation, the above method may further include:
determining a partition granularity for the neural network model;
dividing the neural network model into multiple subgraphs according to the partition granularity, at least one of the multiple subgraphs including two or more operators;
compiling the neural network model according to each subgraph to obtain the execution file, the execution file including an identifier for each subgraph, where the subgraph identifier is used to instruct the artificial intelligence processor to return the operation result of the subgraph after completing the operations of all operators in the subgraph.
For example, the partition granularity can be determined during the construction of the neural network model or during its compilation; it indicates how the neural network model is divided into multiple subgraphs, including which operators each subgraph contains. The computer system can divide the operators of the neural network model into multiple subgraphs according to the partition granularity; during the division, the computer system can judge whether the multiple operators in the same subgraph perform their operations on the same artificial intelligence processor, and if so, divide the multiple operators into one subgraph; otherwise, the division of this subgraph is not performed. It should be noted that the neural network model in the embodiments of the present disclosure can be the neural network model after the aforementioned optimization processing.
After the subgraphs are divided, the neural network model can be compiled according to the subgraphs. After compilation, the instructions corresponding to each subgraph include the binary instructions corresponding to the operators in the subgraph and the subgraph identifier, so that the artificial intelligence processor can perform computation in units of subgraphs, execute the operations of the operators in a subgraph one by one and, in response to the subgraph identifier, return the operation result of the subgraph after completing the operations of all operators in the subgraph.
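A minimal sketch of the partitioning step: group operators by the given granularity, then accept a group as a subgraph only if all of its operators run on the same processor, mirroring the check described above. All names and the device mapping are illustrative (they anticipate the Fig. 4 example discussed below).

```python
# Hedged sketch: partition operators into subgraphs by granularity,
# accepting a group only when all of its operators share one device.
def partition(op_device, granularity):
    subgraphs = []
    for group in granularity:                  # e.g. [["op1", "op2"], ...]
        devices = {op_device[op] for op in group}
        if len(devices) == 1:                  # same processor: accept
            subgraphs.append({"id": len(subgraphs), "ops": group})
    return subgraphs

op_device = {"op1": "CPU", "op2": "CPU",
             "op3": "MLU", "op4": "MLU", "op5": "MLU", "op6": "MLU"}
print(partition(op_device, [["op1", "op2"], ["op3", "op4", "op5", "op6"]]))
```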
In a possible implementation, the above method may further include:
compiling the neural network model according to each subgraph to obtain an execution file corresponding to each subgraph.
In the process of compiling the neural network model, an execution file corresponding to each subgraph can be generated. The execution file corresponding to a subgraph can contain the binary instructions of the operators in the subgraph, the connection relationships between the operators and the address of the weights; that is, the execution file of the neural network model includes the execution files corresponding to the subgraphs.
As an example, as shown in Fig. 4, suppose the neural network model includes six operators, and the partition granularity of the neural network model is: operator 1 and operator 2 correspond to subgraph 1, and operator 3, operator 4, operator 5 and operator 6 correspond to subgraph 2. Suppose the computer system determines that operator 1 and operator 2 both perform their operations on the CPU; then operator 1 and operator 2 are divided into subgraph 1. Operator 3, operator 4, operator 5 and operator 6 all perform their operations on the MLU; then operator 3, operator 4, operator 5 and operator 6 are divided into subgraph 2.
In this way, the artificial intelligence processor can execute the operations of subgraph 1 and subgraph 2, return the operation result of subgraph 1 after executing its operations, and execute the operations of subgraph 2 according to the operation result of subgraph 1. Since it is not necessary to return an operation result every time the operation of one operator is completed, the I/O (Input/Output) overhead can be effectively reduced.
In a possible implementation, the operation result of a subgraph includes:
the operation result of the final operator in the subgraph, and/or
the operation results of the operators in the subgraph.
For example, the artificial intelligence processor can return the operation result of a subgraph every time it completes the operations of the subgraph, and the operation result can include the operation result of the final operator in the subgraph (for example, in the above example, the operation result of subgraph 2 can include the operation result of operator 6). Alternatively, the artificial intelligence processor can cache the operation result of each operator after executing its operation in the subgraph, and return the operation results of the operators after completing the operations of the subgraph. Alternatively, the operation result of a subgraph can be set as needed: for example, all subgraphs can be set to return the operation result of the final operator, or all subgraphs can be set to return the operation results of the operators in the subgraph, or, as required, some subgraphs can be set to return the operation results of their operators while the other subgraphs return the operation result of the final operator.
It should be noted that, for the sake of simple description, the foregoing method embodiments are all expressed as a series of action combinations, but those skilled in the art should know that the present disclosure is not limited by the described order of actions, because according to the present disclosure, some steps can be performed in other orders or simultaneously. Furthermore, those skilled in the art should also know that the embodiments described in the specification are all optional embodiments, and the actions and modules involved are not necessarily required by the present disclosure.
It should be further noted that although the steps in the flowcharts of Figs. X-Y are shown in sequence as indicated by the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, the execution of these steps is not strictly limited in order, and they can be executed in other orders. Moreover, at least some of the steps in Figs. X-Y can include multiple sub-steps or multiple stages, which are not necessarily completed at the same moment but can be executed at different moments, and whose execution order is not necessarily sequential; they can be executed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
Fig. 5 shows a block diagram of a data processing device according to an embodiment of the present disclosure. As shown in Fig. 5, the device may include:
an obtaining module 51, which can be used to obtain an execution file, the execution file including binary instructions that are obtained by compiling a neural network model and are used for execution on an artificial intelligence processor;
an execution module 52, which can be used to input the execution file into an artificial intelligence processor, so that the artificial intelligence processor trains the neural network model according to input data and the execution file, and obtains a training result of the neural network model.
In this way, according to the data processing device provided by the embodiments of the present disclosure, a neural network model can be compiled into an execution file in advance and saved, and when the neural network model is to be trained, the execution file is directly input into the artificial intelligence processor, so that the artificial intelligence processor trains the neural network model according to the read input data and the execution file. This avoids the online compilation of each operator in the neural network model, which can reduce the occupation of CPU resources and lower the power consumption of the CPU.
In a possible implementation, the device may further include:
an optimization module, configured to perform optimization processing on the neural network model, the optimization processing including at least one of operator fusion, data reuse and removal of redundant operators;
a first compilation module, configured to compile the optimized neural network model to obtain the execution file.
In a possible implementation, the optimization module can further be used to:
determine at least two operators to be fused from the operators corresponding to the neural network model, where the at least two operators to be fused are at least two single operators having a connection relationship in the network model, and the output data of the former single operator of the at least two single operators is the only input data of the latter single operator;
perform fusion processing on the at least two operators to be fused to obtain a fused operator.
In a possible implementation, the data for data reuse includes at least one of weight data, input neuron data, output neuron data, bias and gradient.
In a possible implementation, the device further includes:
a determining module, configured to determine a partition granularity for the neural network model;
a division module, configured to divide the neural network model into multiple subgraphs according to the partition granularity, at least one of the multiple subgraphs including two or more operators;
a second compilation module, configured to compile the neural network model according to each subgraph to obtain the execution file, the execution file including an identifier for each subgraph, where the subgraph identifier is used to instruct the artificial intelligence processor to return the operation result of the subgraph after completing the operations of all operators in the subgraph.
In a possible implementation, the device further includes:
a third compilation module, configured to compile the neural network model according to each subgraph to obtain an execution file corresponding to each subgraph.
In a possible implementation, the operation result of a subgraph includes:
the operation result of the final operator in the subgraph, and/or
the operation results of the operators in the subgraph.
In a possible implementation, the execution file includes a weight file and a model file, where
the weight file includes the address of the weights of the neural network model, and
the model file includes the binary instructions of the operators in the neural network model and the connection relationships between the operators.
In a possible implementation, the execution file includes the address of the weights of the neural network model, as well as the binary instructions of the operators in the neural network model and the connection relationships between the operators.
It should be understood that the above device embodiments are only illustrative, and the device of the present disclosure can also be implemented in other ways. For example, the division of units/modules in the above embodiments is only a logical function division, and there can be other division methods in actual implementation. For example, multiple units, modules or components can be combined or integrated into another system, or some features can be ignored or not executed.
In addition, unless otherwise specified, the functional units/modules in the embodiments of the present disclosure can be integrated into one unit/module, each unit/module can exist physically alone, or two or more units/modules can be integrated together. The above integrated units/modules can be implemented either in the form of hardware or in the form of software program modules.
If the integrated unit/module is implemented in the form of hardware, the hardware can be a digital circuit, an analog circuit and so on. Physical implementations of the hardware structure include but are not limited to transistors, memristors and the like. Unless otherwise specified, the artificial intelligence processor can be any appropriate hardware processor, such as a CPU, GPU, FPGA, DSP, ASIC and so on. Unless otherwise specified, the storage unit can be any appropriate magnetic or magneto-optical storage medium, such as resistive random access memory (RRAM), dynamic random access memory (DRAM), static random access memory (SRAM), enhanced dynamic random access memory (EDRAM), high-bandwidth memory (HBM), hybrid memory cube (HMC) and so on.
If the integrated unit/module is implemented in the form of a software program module and sold or used as an independent product, it can be stored in a computer-readable memory. Based on this understanding, the technical solution of the present disclosure, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a memory and includes several instructions to cause a computer device (which can be a personal computer, a server, a network device or the like) to execute all or part of the steps of the methods described in the embodiments of the present disclosure. The aforementioned memory includes various media that can store program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk or an optical disc.
In a possible implementation, an artificial intelligence chip is also disclosed, which includes the above data processing device.
In a possible implementation, a board card is also disclosed, which includes a storage device, an interface device, a control device and the above artificial intelligence chip, where the artificial intelligence chip is connected to the storage device, the control device and the interface device respectively; the storage device is used to store data; the interface device is used to implement data transmission between the artificial intelligence chip and external equipment; and the control device is used to monitor the state of the artificial intelligence chip.
Fig. 6 shows a structural block diagram of a board card according to an embodiment of the present disclosure. Referring to Fig. 6, in addition to the above chip 389, the board card may include other supporting components, including but not limited to: a storage device 390, an interface device 391 and a control device 392.
The storage device 390 is connected to the artificial intelligence chip through a bus and is used to store data. The storage device may include multiple groups of storage units 393. Each group of storage units is connected to the artificial intelligence chip through a bus. It can be understood that each group of storage units can be a DDR SDRAM (Double Data Rate Synchronous Dynamic Random Access Memory).
DDR can double the speed of SDRAM without increasing the clock frequency. DDR allows data to be read out on both the rising edge and the falling edge of the clock pulse. The speed of DDR is twice that of standard SDRAM. In one embodiment, the storage device may include four groups of storage units. Each group of storage units may include multiple DDR4 granules (chips). In one embodiment, the artificial intelligence chip may internally include four 72-bit DDR4 controllers, of which 64 bits are used for data transmission and 8 bits for ECC checking. It can be understood that when DDR4-3200 granules are used in each group of storage units, the theoretical bandwidth of data transmission can reach 25600 MB/s.
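The 25600 MB/s figure follows from a transfer rate of 3200 MT/s over the 64-bit data path of one controller; a one-line check of the arithmetic (the interpretation of the numbers is our assumption):

```python
# DDR4-3200: 3200 million transfers/s over a 64-bit (8-byte) data path.
print(3200 * 64 // 8)   # 25600 (MB/s, theoretical)
```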
In one embodiment, each group of storage units includes multiple double data rate synchronous dynamic random access memories arranged in parallel. DDR can transmit data twice within one clock cycle. A controller for controlling the DDR is provided in the chip and is used to control the data transmission and data storage of each storage unit.
The interface device is electrically connected to the artificial intelligence chip. The interface device is used to implement data transmission between the artificial intelligence chip and external equipment (such as a server or a computer). For example, in one embodiment, the interface device can be a standard PCIE interface; for instance, the data to be processed is transferred from the server to the chip through the standard PCIE interface to realize data transfer. Preferably, when a PCIE 3.0 x16 interface is used for transmission, the theoretical bandwidth can reach 16000 MB/s. In another embodiment, the interface device can also be another interface; the present disclosure does not limit the specific form of the other interfaces, as long as the interface unit can realize the transfer function. In addition, the computation results of the artificial intelligence chip are still transmitted back to the external equipment (such as a server) by the interface device.
The control device is electrically connected to the artificial intelligence chip. The control device is used to monitor the state of the artificial intelligence chip. Specifically, the artificial intelligence chip and the control device can be electrically connected through an SPI interface. The control device may include a micro controller unit (MCU). The artificial intelligence chip may include multiple processing chips, multiple processing cores or multiple processing circuits and can drive multiple loads; therefore, the artificial intelligence chip can be in different working states such as multi-load and light-load. Through the control device, the working states of the multiple processing chips, multiple processing cores and/or multiple processing circuits in the artificial intelligence chip can be regulated.
In a possible implementation, an electronic device is disclosed, which includes the above artificial intelligence chip. The electronic device includes a data processing device, a robot, a computer, a printer, a scanner, a tablet computer, a smart terminal, a mobile phone, a driving recorder, a navigator, a sensor, a webcam, a server, a cloud server, a camera, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a vehicle, a household appliance and/or a medical device. The vehicle includes an airplane, a ship and/or a car; the household appliance includes a television, an air conditioner, a microwave oven, a refrigerator, a rice cooker, a humidifier, a washing machine, an electric lamp, a gas stove and a range hood; and the medical device includes a nuclear magnetic resonance instrument, a B-ultrasound instrument and/or an electrocardiograph.
The embodiments of the present disclosure also provide a computer-readable storage medium on which computer program instructions are stored, and when the computer program instructions are executed by a processor, the above method is implemented. The computer-readable storage medium can be a non-volatile computer-readable storage medium.
The embodiments of the present disclosure also provide an electronic device, including: a processor; and a memory for storing instructions executable by the processor; where the processor is configured to call the instructions stored in the memory to execute the above method.
Fig. 7 shows a block diagram of an electronic device 800 according to an embodiment of the present disclosure. For example, the electronic device 800 can be a terminal such as a mobile phone, a computer, a digital broadcasting terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device or a personal digital assistant.
Referring to Fig. 7, the electronic device 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814 and a communication component 816.
The processing component 802 generally controls the overall operations of the electronic device 800, such as operations associated with display, telephone calls, data communication, camera operations and recording operations. The processing component 802 may include one or more processors 820 to execute instructions to complete all or part of the steps of the above method. In addition, the processing component 802 may include one or more modules to facilitate the interaction between the processing component 802 and other components. For example, the processing component 802 may include a multimedia module to facilitate the interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support the operation of the electronic device 800. Examples of such data include instructions for any application or method operating on the electronic device 800, contact data, phone book data, messages, pictures, videos and so on. The memory 804 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic disk or optical disc.
The power component 806 provides power for the various components of the electronic device 800. The power component 806 may include a power management system, one or more power supplies, and other components associated with generating, managing and distributing power for the electronic device 800.
The multimedia component 808 includes a screen that provides an output interface between the electronic device 800 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen can be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, slides and gestures on the touch panel. The touch sensors can not only sense the boundary of a touch or slide action, but also detect the duration and pressure related to the touch or slide operation. In some embodiments, the multimedia component 808 includes a front camera and/or a rear camera. When the electronic device 800 is in an operating mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front camera and rear camera can be a fixed optical lens system or have focal length and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a microphone (MIC), and when the electronic device 800 is in an operating mode, such as a call mode, a recording mode or a voice recognition mode, the microphone is configured to receive external audio signals. The received audio signals can be further stored in the memory 804 or sent via the communication component 816. In some embodiments, the audio component 810 further includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which can be a keyboard, a click wheel, buttons and the like. These buttons may include but are not limited to: a home button, a volume button, a start button and a lock button.
The sensor component 814 includes one or more sensors for providing the electronic device 800 with state evaluation in various aspects. For example, the sensor component 814 can detect the on/off state of the electronic device 800 and the relative positioning of components (for example, the display and keypad of the electronic device 800); the sensor component 814 can also detect the position change of the electronic device 800 or one of its components, the presence or absence of user contact with the electronic device 800, the orientation or acceleration/deceleration of the electronic device 800, and the temperature change of the electronic device 800. The sensor component 814 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor component 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices. The electronic device 800 can access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination of them. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a near-field communication (NFC) module to facilitate short-range communication. For example, the NFC module can be implemented based on radio frequency identification (RFID) technology, infrared data association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology and other technologies.
In an exemplary embodiment, the electronic device 800 can be implemented by one or more application-specific integrated circuits (ASIC), digital signal processors (DSP), digital signal processing devices (DSPD), programmable logic devices (PLD), field-programmable gate arrays (FPGA), controllers, microcontrollers, microprocessors or other electronic components to execute the above method.
In an exemplary embodiment, a non-volatile computer-readable storage medium is also provided, such as the memory 804 including computer program instructions, which can be executed by the processor 820 of the electronic device 800 to complete the above method.
Fig. 8 shows a block diagram of an electronic device 1900 according to an embodiment of the present disclosure. For example, the electronic device 1900 can be provided as a server. Referring to Fig. 8, the electronic device 1900 includes a processing component 1922, which further includes one or more processors, and memory resources represented by a memory 1932 for storing instructions executable by the processing component 1922, such as applications. The application stored in the memory 1932 may include one or more modules, each corresponding to a set of instructions. In addition, the processing component 1922 is configured to execute instructions to perform the above method.
The electronic device 1900 may also include a power component 1926 configured to perform power management of the electronic device 1900, a wired or wireless network interface 1950 configured to connect the electronic device 1900 to a network, and an input/output (I/O) interface 1958. The electronic device 1900 can operate based on an operating system stored in the memory 1932, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™ or the like.
In an exemplary embodiment, a non-volatile computer-readable storage medium is also provided, such as the memory 1932 including computer program instructions, which can be executed by the processing component 1922 of the electronic device 1900 to complete the above method.
In the above embodiments, the description of each embodiment has its own focus. For a part that is not described in detail in a certain embodiment, reference can be made to the related descriptions of other embodiments. The technical features of the above embodiments can be combined arbitrarily. For the sake of conciseness, not all possible combinations of the technical features in the above embodiments are described; however, as long as there is no contradiction in the combinations of these technical features, they should all be considered within the scope of this specification.
The foregoing can be better understood in light of the following clauses:
Clause A1, a data processing method, the method including: obtaining an execution file, the execution file including binary instructions that are obtained by compiling a neural network model and are used for execution on an artificial intelligence processor; and inputting the execution file into an artificial intelligence processor, so that the artificial intelligence processor trains the neural network model according to input data and the execution file, and obtains a training result of the neural network model.
Clause A2, the method according to clause A1, the method further including: performing optimization processing on the neural network model, the optimization processing including at least one of operator fusion, data reuse and removal of redundant operators; and compiling the optimized neural network model to obtain the execution file.
Clause A3, the method according to clause A2, where the operator fusion includes: determining at least two operators to be fused from the operators corresponding to the neural network model, where the at least two operators to be fused are at least two single operators having a connection relationship in the network model, and the output data of the former single operator of the at least two single operators is the only input data of the latter single operator; and performing fusion processing on the at least two operators to be fused to obtain a fused operator.
Clause A4, the method according to clause A2, where the data for data reuse includes at least one of weight data, input neuron data, output neuron data, bias and gradient.
Clause A5, the method according to clause A1 or A2, the method further including: determining a partition granularity for the neural network model; dividing the neural network model into multiple subgraphs according to the partition granularity, at least one of the multiple subgraphs including two or more operators; and compiling the neural network model according to each subgraph to obtain the execution file, the execution file including an identifier for each subgraph, where the subgraph identifier is used to instruct the artificial intelligence processor to return the operation result of the subgraph after completing the operations of all operators in the subgraph.
Clause A6, the method according to clause A5, the method further including: compiling the neural network model according to each subgraph to obtain an execution file corresponding to each subgraph.
Clause A7, the method according to clause A5 or A6, where the operation result of a subgraph includes: the operation result of the final operator in the subgraph, and/or the operation results of the operators in the subgraph.
Clause A8, the method according to any one of clauses A1 to A7, where the execution file includes a weight file and a model file, the weight file including the address of the weights of the neural network model, and the model file including the binary instructions of the operators in the neural network model and the connection relationships between the operators.
Clause A9, the method according to any one of clauses A1 to A7, where the execution file includes the address of the weights of the neural network model, as well as the binary instructions of the operators in the neural network model and the connection relationships between the operators.
Clause A10, a data processing device, the device including:
an obtaining module, configured to obtain an execution file, the execution file including binary instructions that are obtained by compiling a neural network model and are used for execution on an artificial intelligence processor;
an execution module, configured to input the execution file into an artificial intelligence processor, so that the artificial intelligence processor trains the neural network model according to input data and the execution file, and obtains a training result of the neural network model.
Clause A11, the device according to clause A10, the device further including:
an optimization module, configured to perform optimization processing on the neural network model, the optimization processing including at least one of operator fusion, data reuse and removal of redundant operators;
a first compilation module, configured to compile the optimized neural network model to obtain the execution file.
Clause A12, the device according to clause A11, where the optimization module is further configured to:
determine at least two operators to be fused from the operators corresponding to the neural network model, where the at least two operators to be fused are at least two single operators having a connection relationship in the network model, and the output data of the former single operator of the at least two single operators is the only input data of the latter single operator;
perform fusion processing on the at least two operators to be fused to obtain a fused operator.
Clause A13, the device according to clause A11, where the data for data reuse includes at least one of weight data, input neuron data, output neuron data, bias and gradient.
Clause A14, the device according to clause A10 or A11, the device further including:
a determining module, configured to determine a partition granularity for the neural network model;
a division module, configured to divide the neural network model into multiple subgraphs according to the partition granularity, at least one of the multiple subgraphs including two or more operators;
a second compilation module, configured to compile the neural network model according to each subgraph to obtain the execution file, the execution file including an identifier for each subgraph, where the subgraph identifier is used to instruct the artificial intelligence processor to return the operation result of the subgraph after completing the operations of all operators in the subgraph.
Clause A15, the device according to clause A14, the device further including:
a third compilation module, configured to compile the neural network model according to each subgraph to obtain an execution file corresponding to each subgraph.
Clause A16, the device according to clause A14 or A15, where the operation result of a subgraph includes:
the operation result of the final operator in the subgraph, and/or
the operation results of the operators in the subgraph.
Clause A17, the device according to any one of clauses A10 to A16, where the execution file includes a weight file and a model file, the weight file including the address of the weights of the neural network model, and the model file including the binary instructions of the operators in the neural network model and the connection relationships between the operators.
Clause A18, the device according to any one of clauses A10 to A16, where the execution file includes the address of the weights of the neural network model, as well as the binary instructions of the operators in the neural network model and the connection relationships between the operators.
Clause A19, an artificial intelligence chip, the chip including the data processing device described in any one of clauses A10 to A18.
Clause A20, an electronic device, the electronic device including the artificial intelligence chip as described in clause A19.
Clause A21, a board card, the board card including: a storage device, an interface device, a control device, and the artificial intelligence chip as described in clause A19; where the artificial intelligence chip is connected to the storage device, the control device and the interface device respectively; the storage device is used to store data; the interface device is used to implement data transmission between the artificial intelligence chip and external equipment; and the control device is used to monitor the state of the artificial intelligence chip.
Clause A22, the board card according to clause A21, where the storage device includes multiple groups of storage units, each group of storage units being connected to the artificial intelligence chip through a bus, and each storage unit being a DDR SDRAM; the chip includes a DDR controller for controlling the data transmission and data storage of each storage unit; and the interface device is a standard PCIE interface.
Clause A23, an electronic device, including: a processor; and a memory for storing instructions executable by the processor; where the processor is configured to call the instructions stored in the memory to execute the method described in any one of clauses A1 to A9.
Clause A24, a computer-readable storage medium on which computer program instructions are stored, where when the computer program instructions are executed by a processor, the method described in any one of clauses A1 to A9 is implemented.
The embodiments of the present disclosure have been introduced in detail above. Specific examples are used herein to illustrate the principles and implementations of the present disclosure, and the descriptions of the above embodiments are only used to help understand the method of the present disclosure and its core idea. Meanwhile, changes or modifications made by those skilled in the art, based on the idea of the present disclosure, to the specific implementations and application scope of the present disclosure all fall within the protection scope of the present disclosure. In summary, the content of this specification should not be construed as limiting the present disclosure.
Claims (24)
- A data processing method, characterized in that the method comprises: obtaining an execution file, the execution file comprising binary instructions that are obtained by compiling a neural network model and are used for execution on an artificial intelligence processor; and inputting the execution file into an artificial intelligence processor, so that the artificial intelligence processor trains the neural network model according to input data and the execution file, and obtains a training result of the neural network model.
- The method according to claim 1, characterized in that the method further comprises: performing optimization processing on the neural network model, the optimization processing comprising at least one of operator fusion, data reuse and removal of redundant operators; and compiling the optimized neural network model to obtain the execution file.
- The method according to claim 2, characterized in that the operator fusion comprises: determining at least two operators to be fused from the operators corresponding to the neural network model, wherein the at least two operators to be fused are at least two single operators having a connection relationship in the network model, and the output data of the former single operator of the at least two single operators is the only input data of the latter single operator; and performing fusion processing on the at least two operators to be fused to obtain a fused operator.
- The method according to claim 2, characterized in that the data for data reuse comprises at least one of weight data, input neuron data, output neuron data, bias and gradient.
- The method according to claim 1 or 2, characterized in that the method further comprises: determining a partition granularity for the neural network model; dividing the neural network model into multiple subgraphs according to the partition granularity, at least one of the multiple subgraphs comprising two or more operators; and compiling the neural network model according to each subgraph to obtain the execution file, the execution file comprising an identifier for each subgraph, wherein the subgraph identifier is used to instruct the artificial intelligence processor to return the operation result of the subgraph after completing the operations of all operators in the subgraph.
- The method according to claim 5, characterized in that the method further comprises: compiling the neural network model according to each subgraph to obtain an execution file corresponding to each subgraph.
- The method according to claim 5 or 6, characterized in that the operation result of a subgraph comprises: the operation result of the final operator in the subgraph, and/or the operation results of the operators in the subgraph.
- The method according to any one of claims 1 to 7, characterized in that the execution file comprises a weight file and a model file, wherein the weight file comprises the address of the weights of the neural network model, and the model file comprises the binary instructions of the operators in the neural network model and the connection relationships between the operators.
- The method according to any one of claims 1 to 7, characterized in that the execution file comprises the address of the weights of the neural network model, as well as the binary instructions of the operators in the neural network model and the connection relationships between the operators.
- A data processing device, characterized by comprising: an obtaining module, configured to obtain an execution file, the execution file comprising binary instructions that are obtained by compiling a neural network model and are used for execution on an artificial intelligence processor; and an execution module, configured to input the execution file into an artificial intelligence processor, so that the artificial intelligence processor trains the neural network model according to input data and the execution file, and obtains a training result of the neural network model.
- The device according to claim 10, characterized in that the device further comprises: an optimization module, configured to perform optimization processing on the neural network model, the optimization processing comprising at least one of operator fusion, data reuse and removal of redundant operators; and a first compilation module, configured to compile the optimized neural network model to obtain the execution file.
- The device according to claim 11, characterized in that the optimization module is further configured to: determine at least two operators to be fused from the operators corresponding to the neural network model, wherein the at least two operators to be fused are at least two single operators having a connection relationship in the network model, and the output data of the former single operator of the at least two single operators is the only input data of the latter single operator; and perform fusion processing on the at least two operators to be fused to obtain a fused operator.
- The device according to claim 11, characterized in that the data for data reuse comprises at least one of weight data, input neuron data, output neuron data, bias and gradient.
- The device according to claim 10 or 11, characterized in that the device further comprises: a determining module, configured to determine a partition granularity for the neural network model; a division module, configured to divide the neural network model into multiple subgraphs according to the partition granularity, at least one of the multiple subgraphs comprising two or more operators; and a second compilation module, configured to compile the neural network model according to each subgraph to obtain the execution file, the execution file comprising an identifier for each subgraph, wherein the subgraph identifier is used to instruct the artificial intelligence processor to return the operation result of the subgraph after completing the operations of all operators in the subgraph.
- The device according to claim 14, characterized in that the device further comprises: a third compilation module, configured to compile the neural network model according to each subgraph to obtain an execution file corresponding to each subgraph.
- The device according to claim 14 or 15, characterized in that the operation result of a subgraph comprises: the operation result of the final operator in the subgraph, and/or the operation results of the operators in the subgraph.
- The device according to any one of claims 10 to 16, characterized in that the execution file comprises a weight file and a model file, wherein the weight file comprises the address of the weights of the neural network model, and the model file comprises the binary instructions of the operators in the neural network model and the connection relationships between the operators.
- The device according to any one of claims 10 to 16, characterized in that the execution file comprises the address of the weights of the neural network model, as well as the binary instructions of the operators in the neural network model and the connection relationships between the operators.
- An artificial intelligence chip, characterized in that the chip comprises the data processing device according to any one of claims 10 to 18.
- An electronic device, characterized in that the electronic device comprises the artificial intelligence chip according to claim 19.
- A board card, characterized in that the board card comprises: a storage device, an interface device, a control device, and the artificial intelligence chip according to claim 19; wherein the artificial intelligence chip is connected to the storage device, the control device and the interface device respectively; the storage device is used to store data; the interface device is used to implement data transmission between the artificial intelligence chip and external equipment; and the control device is used to monitor the state of the artificial intelligence chip.
- The board card according to claim 21, characterized in that the storage device comprises multiple groups of storage units, each group of storage units being connected to the artificial intelligence chip through a bus, and each storage unit being a DDR SDRAM; the chip comprises a DDR controller for controlling the data transmission and data storage of each storage unit; and the interface device is a standard PCIE interface.
- An electronic device, characterized by comprising: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to call the instructions stored in the memory to execute the method according to any one of claims 1 to 9.
- A computer-readable storage medium on which computer program instructions are stored, characterized in that when the computer program instructions are executed by a processor, the method according to any one of claims 1 to 9 is implemented.
Applications Claiming Priority (2)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN201910786452.X | 2019-08-23 | |
CN201910786452.XA (CN112416352A) | 2019-08-23 | 2019-08-23 | Data processing method, device, computer equipment and storage medium
Publications (1)

Publication Number | Publication Date
---|---
WO2021036893A1 | 2021-03-04
Family

ID=74685165

Family Applications (1)

Application Number | Title
---|---
PCT/CN2020/110144 (WO2021036893A1) | Data processing method, device, computer equipment and storage medium

Country Status (2)

Country | Link
---|---
CN (1) | CN112416352A
WO (1) | WO2021036893A1
Cited By (1)

Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
US11977475B1 * | 2021-08-06 | 2024-05-07 | Marvell Asia Pte Ltd | Method and apparatus for compiler and low-level instruction validation of machine learning operations on hardware
Families Citing this family (3)

Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
CN114936631B * | 2021-04-26 | 2023-06-09 | Huawei Technologies Co., Ltd. | Model processing method and device
CN114691577B * | 2022-03-11 | 2024-03-29 | Army Academy of Armored Forces, PLA | Equipment maintenance training device
CN114339994B * | 2022-03-17 | 2022-05-27 | Hangzhou Youzhilian Technology Co., Ltd. | UWB chip and method for executing a machine learning algorithm on chip
- 2019-08-23: CN application CN201910786452.XA filed (published as CN112416352A, pending)
- 2020-08-20: PCT application PCT/CN2020/110144 filed (published as WO2021036893A1)
Patent Citations (3)

Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
WO2018213499A1 * | 2017-05-16 | 2018-11-22 | Google Llc | Stop code tolerant image compression neural networks
CN110018831A * | 2019-04-04 | 2019-07-16 | Beijing Zhongke Cambricon Technology Co., Ltd. | Program processing method and device and related products
CN110119806A * | 2019-05-23 | 2019-08-13 | Beijing Institute of Environmental Features | Method and device for implementing an artificial neural network based on an FPGA
Also Published As

Publication number | Publication date
---|---
CN112416352A | 2021-02-26
Similar Documents

Publication | Title
---|---
WO2021036893A1 | Data processing method, device, computer equipment and storage medium
US11113226B2 | Firmware burning apparatus and system
CN107832845A | Information processing method and related products
EP3333733B1 | Method and device for use in parallel execution of terminal database
CN111443917B | Neural network operation optimization method and device and related products
Golkarifard et al. | Dandelion: A unified code offloading system for wearable computing
CN110851787B | Merged instruction processing method, device, electronic equipment and storage medium
WO2021114904A1 | Data processing method, device, computer equipment and storage medium
CN109711540B | Computing device and board card
WO2021114903A1 | Data processing method, device, computer equipment and storage medium
CN109725943A | Program jump method, device, electronic equipment and storage medium
CN113297128B | Data processing method, device, computer equipment and storage medium
CN115098262B | Multi-neural-network task processing method and device
WO2021017546A1 | Neural network quantization method, device, chip, electronic equipment and board card
WO2021083097A1 | Data processing method, device, computer equipment and storage medium
CN111258732A | Data processing method, data processing device and electronic equipment
WO2020192587A1 | Artificial intelligence computing device and related products
CN111339060B | Operation method, device, computer equipment and storage medium
CN113469365B | Inference and compilation method based on a neural network model and related products
WO2021082654A1 | Data processing method, device, computer equipment and storage medium
CN113298223B | Data processing method, device, computer equipment and storage medium
CN111338694B | Operation method, device, computer equipment and storage medium
WO2021083100A1 | Data processing method, device, computer equipment and storage medium
CN111210011B | Data processing device and related products
CN115545180A | Compilation method for optimizing a neural network model running on an artificial intelligence chip and related products
Legal Events

Code | Title | Details
---|---|---
121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 20858133; Country of ref document: EP; Kind code of ref document: A1
NENP | Non-entry into the national phase | Ref country code: DE
122 | Ep: pct application non-entry in european phase | Ref document number: 20858133; Country of ref document: EP; Kind code of ref document: A1