WO2022126316A1 - Method and apparatus for developing an artificial intelligence AI model - Google Patents

Method and apparatus for developing an artificial intelligence AI model

Info

Publication number
WO2022126316A1
Authority
WO
WIPO (PCT)
Prior art keywords
sub
model
models
split
running
Prior art date
Application number
PCT/CN2020/136119
Other languages
English (en)
French (fr)
Inventor
连朔
王晨曦
昌晶
孙方轩
梁雪
周君
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority to CN202080107168.6A (published as CN116472533A)
Priority to PCT/CN2020/136119
Publication of WO2022126316A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Definitions

  • the present application relates to the field of artificial intelligence, and more particularly, to a method and apparatus for developing an artificial intelligence AI model.
  • the deployment strategy of the AI model on the device can be designed to reduce the above running overhead.
  • an AI model development method is urgently needed.
  • the deployment efficiency of the AI model on the device can be improved and the running cost of the AI model on the device can be reduced.
  • the present application provides a method and apparatus for developing an artificial intelligence AI model.
  • when the AI model is deployed on the device according to the results obtained by the development method, the deployment efficiency of the AI model on the device can be improved and the running cost of the AI model on the device can be reduced.
  • a method for developing an artificial intelligence AI model is provided. The method includes: splitting the AI model to obtain multiple split results, where each of the multiple split results includes a plurality of first sub-models, each first sub-model corresponds to at least one of M processors, M is a positive integer greater than 1, and each first sub-model can run on the corresponding at least one processor, so that each split result has a running cost of running the plurality of first sub-models; determining a first split result among the multiple split results, where the first running cost of the first split result is less than the second running cost of one or more second split results among the multiple split results; and outputting the first split result.
  • the device is a terminal device, and the above-mentioned M processors may be understood as processors included in the terminal device.
  • a computer program corresponding to the first split result may also be output.
  • the computer program is used to describe the running sequence and communication process of the N first sub-models included in the first split result on the M processors.
  • the first split result is determined as the result of splitting the AI model by comparing various running costs of the various split results corresponding to the AI model.
  • without changing the structure or parameters of the AI model, the above method can fully consider the running overhead of each of the plurality of first sub-models running on the M processors, thereby splitting the AI model.
  • the deployment efficiency of the AI model on the device can be effectively improved and the running overhead when the AI model is executed on the device can be reduced.
  • in a possible implementation, among the running costs of the multiple split results, the first running cost is the smallest.
  • in a possible implementation, the first split result includes N first sub-models, where N is a positive integer greater than 2; the method further includes: merging at least two of the N first sub-models to obtain a third split result, where the at least two first sub-models are adjacent in execution order, the third split result includes X second sub-models, X is a positive integer greater than 1 and less than N, each of the X second sub-models corresponds to one or more of the M processors, and each second sub-model can run on the corresponding one or more processors, so that the third split result has a third running cost of running the X second sub-models, the third running cost being smaller than the first running cost.
  • by merging at least two of the N first sub-models, the number of first sub-models obtained by splitting the AI model is reduced, which further improves the deployment efficiency of the AI model on the device and reduces the running overhead when the AI model is executed on the device.
  • the AI model includes L first operators (operators), where L is a positive integer greater than 2, and each of the plurality of first sub-models includes some of the L first operators.
  • in this technical solution, each first sub-model may include more than one first operator; when multiple first operators correspond to the same processor, the multiple first operators can be merged into one first sub-model, thereby reducing the communication overhead between the multiple first operators.
  • the splitting of the AI model includes: splitting the AI model according to input information, where the input information includes at least one of the following: the execution order of the L first operators in the AI model, or the attribute information of each of the L first operators, where L is a positive integer greater than 2.
  • the running overhead includes: running overhead of each first sub-model, communication overhead between two adjacent first sub-models in the execution order, and scheduling overhead for scheduling each first sub-model to the corresponding at least one processor.
  • the running cost of a split result of the AI model includes not only the execution cost of each first sub-model on its processor, but also the communication overhead and scheduling overhead between two adjacent first sub-models in the AI model, which makes the split result of the AI model determined according to the running cost more accurate.
  • the M processors include at least two of the following processors: a central processing unit (CPU), a neural network processing unit (NPU), a graphics processing unit (GPU), a digital signal processor (DSP), a deep learning processing unit (DPU), or a tensor processing unit (TPU).
  • an apparatus for deploying an artificial intelligence AI model is provided. The apparatus includes: a splitting unit, configured to split the AI model to obtain multiple split results, where each of the multiple split results includes a plurality of first sub-models, each first sub-model corresponds to at least one of M processors, M is a positive integer greater than 1, and each first sub-model can run on the corresponding at least one processor, so that each split result has a running cost of running the plurality of first sub-models; a determining unit, configured to determine a first split result among the multiple split results, where the first running cost of the first split result is less than the second running cost of one or more second split results among the multiple split results; and an output unit, configured to output the first split result.
  • in a possible implementation, among the running costs of the multiple split results, the first running cost is the smallest.
  • in a possible implementation, the first split result includes N first sub-models, where N is a positive integer greater than 2; the apparatus further includes a merging unit, configured to merge at least two of the N first sub-models to obtain a third split result, where the at least two first sub-models are adjacent in execution order; the third split result includes X second sub-models, where X is a positive integer greater than 1 and less than N, each of the X second sub-models corresponds to one or more of the M processors, and each second sub-model can run on the corresponding one or more processors, so that the third split result has a third running cost of running the X second sub-models, the third running cost being less than the first running cost.
  • in a possible implementation, the AI model includes L first operators, where L is a positive integer greater than 2, and each of the plurality of first sub-models includes some of the L first operators.
  • the splitting unit is specifically configured to split the AI model according to input information, where the input information includes at least one of the following: the execution order of the L first operators in the AI model, or the attribute information of each of the L first operators, where L is a positive integer greater than 2.
  • the running overhead includes: running overhead of each first sub-model, communication overhead between two adjacent first sub-models in the execution order, and scheduling overhead for scheduling each first sub-model to the corresponding at least one processor.
  • the M processors include at least two of the following processors: a central processing unit (CPU), a neural network processing unit (NPU), a graphics processing unit (GPU), a digital signal processor (DSP), a deep learning processing unit (DPU), or a tensor processing unit (TPU).
  • an apparatus for developing an artificial intelligence AI model includes a memory and a processor; the memory is used for storing instructions, and the processor is used for reading the instructions stored in the memory, so that the apparatus executes the method in the first aspect and any possible implementation manner of the first aspect.
  • a processor including: an input circuit, an output circuit, and a processing circuit.
  • the processing circuit is configured to receive a signal through the input circuit and output a signal through the output circuit, so that the method in the first aspect and any possible implementation manner of the first aspect is implemented.
  • in a specific implementation process, the above processor may be a chip, the input circuit may be an input pin, the output circuit may be an output pin, and the processing circuit may be a transistor, a gate circuit, a flip-flop, or various logic circuits.
  • the input circuit and the output circuit may be the same circuit, which is used as the input circuit and the output circuit at different times.
  • the embodiments of the present application do not limit the specific implementations of the processor and the circuits.
  • a processing apparatus including a processor and a memory.
  • the processor is configured to read the instructions stored in the memory, and can receive signals through a receiver and output signals through an output device, so as to execute the method in the first aspect and any possible implementation manner of the first aspect.
  • optionally, there are one or more processors and one or more memories.
  • the memory may be integrated with the processor, or the memory may be provided separately from the processor.
  • the memory may be a non-transitory memory, such as a read-only memory (ROM), which may be integrated with the processor on the same chip or may be provided separately on different chips; the embodiment of the present application does not limit the type of the memory or the manner in which the memory and the processor are arranged.
  • it should be understood that a related data interaction process, for example, sending indication information, may be a process of outputting the indication information from the processor, and receiving capability information may be a process of the processor receiving the input capability information.
  • specifically, the data output by the processor may be output to the output device, and the input data received by the processor may come from the receiver.
  • a computer-readable storage medium for storing a computer program, the computer program comprising instructions for executing the method in the above-mentioned first aspect and any possible implementation manner of the above-mentioned first aspect.
  • a computer program product comprising instructions that, when run on a computer, cause the computer to execute the method in the above-mentioned first aspect and any possible implementation manner of the above-mentioned first aspect.
  • a chip is provided, including at least one processor and an interface; the at least one processor is configured to call and run a computer program, so that the chip executes the method in the first aspect and any possible implementation manner of the first aspect.
  • a system including the apparatus for developing an artificial intelligence AI model according to the second aspect or the third aspect.
  • FIG. 1 is a schematic diagram of a system architecture 100 suitable for an embodiment of the present application.
  • FIG. 2 is a schematic flowchart of a method 200 for developing an AI model provided by an embodiment of the present application.
  • FIG. 3 is a schematic flowchart of a method 300 for developing an AI model provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of a split result of an AI model provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of a split result of an AI model provided by an embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of an apparatus 600 for developing an AI model provided by an embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of a device 700 for developing an AI model provided by an embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of a system 800 provided by an embodiment of the present application.
  • references in this specification to "one embodiment” or “some embodiments” and the like mean that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application.
  • the phrases “in one embodiment”, “in some embodiments”, “in other embodiments”, “in still other embodiments”, etc. appearing in various places in this specification do not necessarily all refer to the same embodiment, but mean “one or more but not all embodiments”, unless specifically emphasized otherwise.
  • the terms “include”, “comprise”, “have” and their variants mean “including but not limited to”, unless specifically emphasized otherwise.
  • “at least one” means one or more, and “plurality” means two or more.
  • “and/or” describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may indicate that A exists alone, that both A and B exist, or that B exists alone, where A and B may be singular or plural.
  • the character “/” generally indicates an “or” relationship between the associated objects.
  • “at least one of the following items” or similar expressions refer to any combination of these items, including any combination of single or plural items.
  • for example, at least one of a, b, or c may represent: a, b, c, a-b, a-c, b-c, or a-b-c, where a, b, and c may be singular or plural.
  • CPU Central processing unit
  • the CPU, as the computing and control core of a computer system, is the final execution unit for information processing and program execution.
  • the central processing unit mainly includes two parts, namely the controller and the arithmetic unit, and also includes a cache memory and the data and control buses that connect these components.
  • NPU Neural network processing unit
  • NPU often refers to a processor that is specially designed to accelerate the computation of neural networks, such as processors running convolutional neural networks.
  • the NPU can adopt a "data-driven parallel computing" architecture, and is particularly good at processing massive multimedia data such as videos and images.
  • GPU Graphics processing unit
  • the GPU is also known as the display core, visual processor, or display chip.
  • a GPU is a microprocessor specialized in graphics and graphics-related operations on personal computers, workstations, game consoles, and some mobile devices (eg, tablets, smartphones, etc.).
  • the main manufacturers of GPUs are NVIDIA and ATI.
  • DSP Digital signal processor
  • a DSP chip is a special-purpose microprocessor, a device that processes large amounts of information using digital signals. Its working principle is to receive an analog signal, convert it into a digital signal of 0s and 1s, modify, delete, or strengthen the digital signal, and then interpret the digital data back into analog data or a real-world format in other system chips.
  • TPU Tensor processing unit
  • compared to GPUs, TPUs employ low-precision (8-bit) computation to reduce the number of transistors used per operation. Reducing the precision has little effect on the accuracy of deep learning but can greatly reduce power consumption and speed up operations.
  • the TPU uses a systolic array design to optimize matrix multiplication and convolution operations and reduce input/output (I/O) operations.
  • the TPU uses larger on-chip memory to reduce access to dynamic random access memory (DRAM), thereby maximizing performance.
  • TPU and NPU sometimes refer to the same component in artificial intelligence processing, that is, the component that runs neural network operations.
  • SOC, also known as system on chip.
  • an SOC is the chip-level integration at the core of an information system, integrating the key components of the system on a single chip; in a broad sense, an SOC is a miniature system: if the CPU is the brain, then the SOC is the system that includes the brain, heart, eyes, and hands.
  • SOCs are usually custom-made or standard products for a specific purpose.
  • an SOC can be a chip that integrates a series of components such as CPU, GPU, and DSP.
  • when an AI model is deployed on a device (for example, a terminal device), the AI model is usually deployed on a single processor among the multiple processors included in the device, resulting in a high running cost.
  • in an existing method, the operators with key functions included in the AI model are manually optimized and packaged into a library.
  • when the AI model is executed on the device, only those operators with key functions are scheduled and executed across the multiple processors included in the device.
  • for example, the device includes a CPU, a GPU, and an NPU.
  • the AI model includes operator 1, operator 2, and operator 3, and operator 3 is an operator with key functions. Based on this, operator 1 and operator 2 of the AI model can be deployed on the CPU, and operator 3 can be deployed on the NPU.
  • this method has problems of poor flexibility and low deployment efficiency.
  • the present application provides a method and apparatus for developing an artificial intelligence AI model.
  • when the AI model is deployed on the device according to the results obtained by the development method, the deployment efficiency of the AI model on the device can be improved and the running cost of the AI model on the terminal device can be reduced.
  • the results obtained according to the development method of the AI model provided in this application can be used in different application scenarios, which are not specifically limited.
  • the results can be used in scenarios including terminal devices according to user requirements.
  • the results can be used in scenarios including network devices according to user requirements.
  • an application scenario including a terminal device is taken as an example to introduce a system architecture applicable to the development method of the AI model provided by the embodiment of the present application.
  • FIG. 1 is a schematic diagram of a system architecture 100 suitable for an embodiment of the present application.
  • the system architecture 100 includes: an AI model 110 , a development device 120 , and a terminal device 130 .
  • the terminal device 130 includes M processors, which are respectively a processor 1301, a processor 1302, ..., a processor 130M, where M is a positive integer greater than 1.
  • the above AI model 110 can be understood as a model input by the user.
  • the AI model 110 is input into the development device 120 for processing, and the development device 120 can split the AI model 110 according to user requirements (for example, minimum running power consumption or minimum running time) to obtain multiple split sub-models.
  • the development device 120 is further configured to output the multiple sub-models after the split of the AI model and a corresponding computer program, the computer program being used to describe the execution sequence, operation schedule and communication relationship of the multiple sub-models after the split of the AI model.
  • the user can then deploy the AI model on the M processors included in the terminal device 130 according to the output result of the development apparatus 120.
  • the type of the AI model is not specifically limited in this embodiment of the present application.
  • the AI model can be, but is not limited to, one of the following types: a regression analysis (RA) model, a logistic regression (LR) model, a Bayesian model, a decision tree model, or a deep neural network model.
  • the terminal device 130 may be a smartphone, a smart watch, a mobile device, a user terminal, a terminal device (for example, a terminal server), a wireless communication device, a handheld device with a wireless communication function, a vehicle-mounted device, a wearable device (for example, a smart bracelet), etc., which is not limited in this embodiment of the present application.
  • the types of the M processors included in the terminal device 130 are not specifically limited.
  • the types of the M processors may include at least two of the following: CPU, NPU, GPU, DSP, deep learning processing unit (DPU), TPU, and the like.
  • for example, when M is 2, the terminal device 130 includes only two processors (the processor 1301 and the processor 1302).
  • the processor 1301 may be a CPU
  • the processor 1302 may be an NPU.
  • the type of the terminal device 130 and the types of the M processors included in the terminal device 130 may be determined according to user requirements.
  • the deployment of the M processors included in the terminal device 130 in the terminal device 130 is not specifically limited.
  • the M processors included in the terminal device 130 may be deployed on one or more hardware devices (eg, SOCs) included in the terminal device 130 .
  • the terminal device 130 includes a CPU, a GPU and a DSP, wherein the CPU, the GPU and the DSP are all deployed on the SOC in the terminal device 130 .
  • the terminal device 130 includes a CPU, a GPU and an NPU, wherein the CPU and the GPU are deployed on one hardware device of the terminal device 130 , and the NPU is deployed on another hardware device of the terminal device 130 .
  • the deployment of the development apparatus 120 in the system architecture 100 is not specifically limited.
  • the development apparatus 120 may be an apparatus on a third-party platform independent of the terminal device 130 .
  • the development apparatus 120 may also be an apparatus included in the terminal device 130 .
  • FIG. 1 is for illustration only, and does not constitute any limitation to the system architecture applicable to the embodiments of the present application.
  • the system architecture 100 may also include a greater number of development devices 120 .
  • the above-mentioned terminal device 130 may also be replaced by a network device.
  • the above-mentioned terminal device 130 may also be understood as a device including the terminal device 130 .
  • FIG. 2 is a schematic flowchart of a method 200 for developing an AI model provided by an embodiment of the present application.
  • the method 200 includes steps 210 to 230, which are described in detail below.
  • the execution body of the method 200 may be the development device 120 described above.
  • Step 210: split the AI model to obtain multiple split results, where each of the multiple split results includes multiple first sub-models, each of the multiple first sub-models corresponds to at least one processor among the M processors, M is a positive integer greater than 1, and each first sub-model can run on the corresponding at least one processor, so that each split result has a running cost of running the multiple first sub-models.
  • splitting the AI model may include: splitting the AI model according to input information, where the input information includes at least one of the following: the execution order of the L first operators in the AI model, or the attribute information of each of the L first operators, where L is a positive integer greater than 2.
  • it can be understood that the attribute information corresponding to different first operators may be different.
  • the attribute information of each first operator included in the AI model is not specifically limited in the embodiments of the present application.
  • the attribute information of the first operator includes the dimension of the input data and the dimension of the output data.
  • the structure of the first sub-model is not specifically limited.
  • the above AI model may include L first operators, where L is a positive integer greater than 2, and each of the plurality of first sub-models includes some of the L first operators. That is, a first sub-model may include one or more of the L first operators. It can be understood that when a first sub-model includes multiple first operators, those first operators are adjacent in execution order.
  • alternatively, each first sub-model may include only one first operator.
  • for example, if the AI model includes 4 first operators and is split into 3 first sub-models, then 2 of the first sub-models each include only one first operator, and 1 first sub-model includes two first operators; the two first operators are adjacent in execution order, and the processors corresponding to the two first operators may be the same or different (a data-structure sketch of such a split follows below).
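  • as a concrete illustration of these structures, the following minimal sketch (illustrative only; the class and function names are not from this application, and the attribute information is reduced to input/output dimensions) models operators, first sub-models as contiguous operator groups assigned to a processor, and a split result as an ordered list of first sub-models:

```python
from dataclasses import dataclass

@dataclass
class Operator:
    """A first operator together with its attribute information."""
    name: str
    in_dims: tuple   # dimension of the input data
    out_dims: tuple  # dimension of the output data

@dataclass
class SubModel:
    """A first sub-model: operators adjacent in execution order, one target processor."""
    operators: list
    processor: str   # e.g. "CPU", "NPU", "GPU", "DSP"

@dataclass
class SplitResult:
    """One split result: first sub-models listed in execution order."""
    sub_models: list

def split_by_boundaries(ops, boundaries, processors):
    """Cut the operator sequence at the given indices and assign each
    contiguous group to one processor; e.g. 4 operators with
    boundaries=[1, 3] yield the 3 first sub-models of the example above."""
    cuts = [0, *boundaries, len(ops)]
    groups = [ops[cuts[i]:cuts[i + 1]] for i in range(len(cuts) - 1)]
    return SplitResult([SubModel(g, p) for g, p in zip(groups, processors)])
```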
  • the above-mentioned M processors can be understood as processors included in a device (for example, a terminal device or a network device).
  • a device for example, a terminal device or a network device.
  • the M processors may be included in an SOC included in the terminal device.
  • the above running overhead includes: the overhead of running each first sub-model, the communication overhead between two first sub-models that are adjacent in execution order, and the scheduling overhead of scheduling each first sub-model to the corresponding at least one processor. It can be understood that the above overhead may be running time or running power consumption, etc., which is not specifically limited in this embodiment of the present application.
  • the above-mentioned M processors are the processors included in the terminal device.
  • for example, assume that after splitting, two first sub-models are obtained, denoted the first sub-model 1 and the first sub-model 2, and each first sub-model includes only one first operator.
  • the first sub-model 1 corresponds to the CPU in the terminal device
  • the first sub-model 2 corresponds to the DSP in the terminal device
  • the first sub-model 1 is executed first and then the first sub-model 2 is executed.
  • the running overhead when the split AI model is deployed and executed on the terminal device then includes: the overhead of running the first sub-model 1 on the CPU, the overhead of running the first sub-model 2 on the DSP, the communication overhead of transferring the output of the first sub-model 1 from the CPU to the DSP where the first sub-model 2 is located, the scheduling overhead of scheduling the first sub-model 1 to the CPU, and the scheduling overhead of scheduling the first sub-model 2 to the DSP (see the cost-function sketch below).
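  • the overhead components enumerated above can be summed by a single cost function; the sketch below is only an illustration (exec_cost, comm_cost, and sched_cost are hypothetical callbacks returning a scalar such as latency or power consumption, and are not defined in this application):

```python
def total_running_cost(split, exec_cost, comm_cost, sched_cost):
    """Running cost of one split result: per-sub-model execution overhead,
    per-sub-model scheduling overhead, and communication overhead between
    sub-models that are adjacent in execution order."""
    cost = 0.0
    for i, sm in enumerate(split.sub_models):
        cost += exec_cost(sm, sm.processor)   # run the sub-model
        cost += sched_cost(sm, sm.processor)  # dispatch it to its processor
        if i + 1 < len(split.sub_models):     # hand output to the next sub-model
            cost += comm_cost(sm, split.sub_models[i + 1])
    return cost

# Step 220 then reduces to picking the cheapest candidate, e.g.:
# best = min(splits, key=lambda s: total_running_cost(s, e, c, d))
```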
  • the M processors include at least two of the following processors: a central processing unit (CPU), a neural network processing unit (NPU), a graphics processing unit (GPU), a digital signal processor (DSP), a deep learning processing unit (DPU), or a tensor processing unit (TPU).
  • the following steps may be further included: acquiring the above-mentioned AI model input by the user; and analyzing the above-mentioned AI model to obtain the above-mentioned input information.
  • Step 220: determine a first split result among the multiple split results, where the first running cost of the first split result is less than the second running cost of one or more second split results among the multiple split results.
  • the first running cost may be smaller than the maximum running cost of the various split results, and greater than the minimum running cost of the various split results.
  • optionally, among the running costs of the multiple split results, the first running cost is the smallest.
  • the first split result includes N first sub-models, where N is a positive integer greater than 2.
  • the following step may be further included: merging at least two of the N first sub-models to obtain a third split result, where the at least two first sub-models are adjacent in execution order; the third split result includes X second sub-models, X is a positive integer greater than 1 and less than N, each of the X second sub-models corresponds to one or more of the M processors, and each second sub-model can run on the corresponding one or more processors, so that the third split result has a third running cost of running the X second sub-models, the third running cost being less than the first running cost.
  • the above-mentioned M processors are the processors included in the terminal device.
  • for example, assume that after splitting, three first sub-models are obtained, denoted the first sub-model 1, the first sub-model 2, and the first sub-model 3.
  • the first sub-model 1 corresponds to the CPU in the terminal device
  • the first sub-model 2 corresponds to the DSP in the terminal device
  • the first sub-model 3 corresponds to the GPU in the terminal device.
  • the execution sequence of the first sub-model is: first sub-model 1 , first sub-model 2 and first sub-model 3 .
  • the first sub-model 1 and the first sub-model 2 can be merged to obtain a second sub-model 1; that is to say, after the merge, the split result of the AI model includes the second sub-model 1 and the first sub-model 3 (a merge sketch follows below).
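  • such a merge can be sketched as follows (illustrative; positions i..j are assumed to be adjacent in execution order, and the merged second sub-model is assigned to a single processor, as in the example above):

```python
def merge_adjacent(split, i, j, processor):
    """Merge the first sub-models at positions i..j into one second
    sub-model on `processor`; per the description, the merged result is
    kept only if its running cost is lower than that of the original."""
    merged_ops = [op for sm in split.sub_models[i:j + 1] for op in sm.operators]
    return SplitResult(split.sub_models[:i]
                       + [SubModel(merged_ops, processor)]
                       + split.sub_models[j + 1:])
```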
  • Step 230: output the first split result.
  • the execution subject of the above step 230 may be the heterogeneous scheduling description module 123 in the development apparatus 120 .
  • a computer program corresponding to the first split result may also be output.
  • the computer program is used to describe the running sequence and communication process of the N first sub-models included in the first split result on the M processors.
  • when the running cost of the third split result is less than the running cost of the first split result and the running cost of the third split result can meet user requirements, the third split result can also be output.
  • the first split result is determined as the result of splitting the AI model by comparing various running costs of the various split results corresponding to the AI model.
  • without changing the structure or parameters of the AI model, the above method can fully consider the running overhead of each of the plurality of first sub-models running on the M processors, thereby splitting the AI model.
  • the deployment efficiency of the AI model on the device can be effectively improved and the running overhead of the AI model executing on the device can be reduced.
  • a computer program can also be output; the computer program can describe the running sequence and communication process, on the M processors, of the N first sub-models obtained by splitting the AI model according to the first split result. Based on this, the user can use the computer program flexibly, for example, by integrating it into other applications (a sketch of one possible rendering follows below).
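  • the exact form of that computer program is not specified here; one hypothetical plain-text rendering of the running sequence and communication process could look like this:

```python
def describe_schedule(split):
    """Describe the running sequence of the sub-models on their processors
    and the communication between consecutive sub-models."""
    lines = []
    for i, sm in enumerate(split.sub_models):
        names = ", ".join(op.name for op in sm.operators)
        lines.append(f"step {i}: run [{names}] on {sm.processor}")
        if i + 1 < len(split.sub_models):
            nxt = split.sub_models[i + 1]
            lines.append(f"         send output {sm.processor} -> {nxt.processor}")
    return "\n".join(lines)
```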
  • each split result includes N first sub-models, and N is a positive integer greater than 2.
  • the following describes the development method of the AI model provided by the embodiment of the present application by taking the example of splitting the AI model to obtain two split results.
  • FIG. 3 is a schematic flowchart of a method 300 for developing an AI model provided by an embodiment of the present application.
  • the method 300 includes steps 310 to 392 , and the steps 310 to 392 will be described in detail below.
  • the execution body of the method 300 may be the development device 120 described above.
  • the following step may also be included: determining the types of the processors included in the terminal device on which the AI model is deployed.
  • the terminal device on which the AI model is deployed according to user requirements includes three processors, which are a CPU, an NPU, and a GPU.
  • Step 310: input the AI model.
  • inputting the AI model can be understood as inputting the AI model into the development device 120 .
  • Step 320: analyze the AI model to obtain the L first operators included in the AI model and the processor corresponding to each first operator, where L is a positive integer greater than 2.
  • the execution body of the above step 320 is the model analysis module 121 in the development device 120 .
  • the AI model includes 5 first operators (i.e., L = 5).
  • the five first operators are respectively recorded as first operator 1, first operator 2, first operator 3, first operator 4 and first operator 5.
  • the first operator 1 corresponds to the CPU in the terminal device. That is to say, the running overhead of the first operator 1 when executed on the CPU is smaller than the running overhead of the first operator 1 when executed on the NPU or GPU.
  • the first operator 2 corresponds to the NPU
  • the first operator 3 corresponds to the NPU
  • the first operator 4 corresponds to the GPU
  • the first operator 5 corresponds to the NPU.
  • Step 330: according to the L first operators and the processor corresponding to each first operator, obtain the split result 1 and the split result 2, as well as the running cost 1 of the split result 1 and the running cost 2 of the split result 2.
  • each split result includes N first sub-models, and N is a positive integer greater than 2.
  • the two split results obtained are denoted as the split result 1 and the split result 2, respectively.
  • the split result 1 includes four first sub-models, which are denoted as the first sub-model 1, the first sub-model 2, the first sub-model 3, and the first sub-model 4, respectively.
  • the first sub-model 1 corresponds to the CPU, and the first sub-model 1 includes the first operator 1.
  • the first sub-model 2 corresponds to the NPU, and the first sub-model 2 includes a first operator 2 and a first operator 3 .
  • the first sub-model 3 corresponds to the GPU, and the first sub-model 3 includes the first operator 4 .
  • the first sub-model 4 corresponds to the NPU, and the first sub-model 4 includes the first operator 5 . That is to say, the split result 1 indicates that the AI model is split into the above-mentioned 4 first sub-models.
  • the running cost 1 of the split result 1 includes: the cost of executing each first operator of each first sub-model on the corresponding processor, the scheduling overhead of scheduling each first sub-model to the corresponding processor, the communication overhead of transferring the output of the first sub-model 1 from the CPU to the NPU where the first sub-model 2 is located, the communication overhead of transferring the output of the first sub-model 2 from the NPU to the GPU where the first sub-model 3 is located, and the communication overhead of transferring the output of the first sub-model 3 from the GPU to the NPU where the first sub-model 4 is located.
  • the split result 2 includes three first sub-models, which are respectively denoted as the first sub-model 1 , the first sub-model 2 and the first sub-model 3 .
  • the first sub-model 1 corresponds to the CPU, and the first sub-model 1 includes the first operator 1.
  • the first sub-model 2 corresponds to the NPU, and the first sub-model 2 includes a first operator 2 and a first operator 3 .
  • the first sub-model 3 corresponds to the GPU, and the first sub-model 3 includes a first operator 4 and a first operator 5 . That is to say, the split result 2 indicates that the AI model is split into the above-mentioned three first sub-models.
  • the running cost 2 of the split result 2 includes: the cost of executing each first operator of each first sub-model on the corresponding processor, the scheduling overhead of scheduling each first sub-model to the corresponding processor, the communication overhead of transferring the output of the first sub-model 1 from the CPU to the NPU where the first sub-model 2 is located, and the communication overhead of transferring the output of the first sub-model 2 from the NPU to the GPU where the first sub-model 3 is located.
  • Step 340: by comparing the running cost 1 and the running cost 2, determine the split result with the smaller running cost as the first split result.
  • by comparing the running cost 1 and the running cost 2, it can be determined that the running cost 1 is smaller; that is, the split result 1 is determined as the first split result.
  • Step 350: determine whether at least two first sub-models among the N first sub-models included in the first split result need to be merged, where the at least two first sub-models are adjacent in execution order.
  • in one case, steps 360 to 380 are performed after step 350; that is, after steps 310 to 350 are performed, steps 360 to 380 may further be performed.
  • in another case, step 391 and step 392 are performed after step 350; that is to say, the running cost of the first split result of the AI model determined according to the above steps 310 to 340 can already meet the user's needs.
  • the first split result determined in step 340 may be determined as the result of splitting the AI model.
  • whether at least two first sub-models that are adjacent in execution order among the N first sub-models included in the first split result need to be merged can be determined according to user requirements or actual application conditions, which is not limited in this embodiment of the present application.
  • the split result 1 is the first split result, that is, the split result described in (b1) in FIG. 4 .
  • steps 360 to 380 are introduced by taking the split result 1 shown in (b1) in FIG. 4 as an example.
  • Step 360: merge at least two first sub-models that are adjacent in execution order among the N first sub-models, to obtain the split result 3 and the split result 4, as well as the running cost 3 of the split result 3 and the running cost 4 of the split result 4.
  • the split result 3 includes a first sub-model and a second sub-model, which are denoted as the first sub-model 1 and the second sub-model 1 respectively.
  • the first sub-model 1 corresponds to the CPU, and the first sub-model 1 includes the first operator 1 .
  • the second sub-model 1 corresponds to the NPU, and the second sub-model 1 includes a first sub-model 2 , a first sub-model 3 and a first sub-model 4 . That is to say, the split result 3 is a result obtained by combining the first sub-model 2 , the first sub-model 3 and the first sub-model 4 in the first split result.
  • the running cost 3 of the split result 3 includes: the cost of executing the first sub-model 1 on the corresponding processor, the cost of executing each first sub-model within the second sub-model 1 on the corresponding processor, the scheduling overhead of scheduling the first sub-model 1 to the corresponding processor, the scheduling overhead of scheduling each first sub-model within the second sub-model 1 to the corresponding processor, and the communication overhead of transferring the output of the first sub-model 1 from the CPU to the NPU where the second sub-model 1 is located.
  • the split result 4 includes two first sub-models and one second sub-model, which are denoted as the first sub-model 1, the first sub-model 2, and the second sub-model 1, respectively.
  • the first sub-model 1 corresponds to the CPU, and the first sub-model 1 includes the first operator 1 .
  • the first sub-model 2 corresponds to the NPU, and the first sub-model 2 includes a first operator 2 and a first operator 3 .
  • the second sub-model 1 corresponds to the GPU, and the second sub-model 1 includes the first sub-model 3 and the first sub-model 4. That is to say, the split result 4 is obtained by merging the first sub-model 3 and the first sub-model 4 in the first split result.
  • the running cost 4 of the split result 4 includes: the cost of executing each first sub-model on the corresponding processor, the cost of executing each first sub-model within the second sub-model 1 on the corresponding processor, the scheduling overhead of scheduling each first sub-model to the corresponding processor, the scheduling overhead of scheduling each first sub-model within the second sub-model 1 to the corresponding processor, the communication overhead of transferring the output of the first sub-model 1 from the CPU to the NPU where the first sub-model 2 is located, and the communication overhead of transferring the output of the first sub-model 2 from the NPU to the GPU where the second sub-model 1 is located.
  • Step 370: determine the third split result by comparing the running cost 1, the running cost 3, and the running cost 4, and determine the third split result as the result of splitting the AI model.
  • by comparing the running cost 1, the running cost 3, and the running cost 4, it can be determined that running cost 3 < running cost 1 < running cost 4; based on this, the split result 3 can be determined as the third split result.
  • Step 380: output the third split result and the corresponding computer program.
  • the third split result is split result 3, that is, the split result shown in (c1) in FIG. 5 .
  • Step 391: determine the first split result as the result of splitting the AI model.
  • the first split result is split result 1, that is, the split result shown in (b1) in FIG. 4 .
  • Step 392: output the first split result and the corresponding computer program.
  • the execution subject of the above steps 330 to 370 and step 391 may be the model heterogeneous decomposition module 122 in the development device 120 .
  • the model heterogeneous decomposition module 122 can use an existing algorithm (e.g., a genetic algorithm or a greedy algorithm) to split or merge the models; a minimal greedy sketch follows below.
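  • as a minimal stand-in for the greedy option (a sketch built on the structures and cost function above; the module's actual algorithm is not disclosed beyond naming the two families), each operator either extends the current first sub-model or opens a new one on whichever processor is locally cheapest:

```python
def greedy_split(ops, processors, exec_cost, comm_cost, sched_cost):
    """Grow a split result operator by operator, keeping the locally
    cheapest (extend-or-start, processor) choice at every step; a genetic
    algorithm would instead search whole boundary/processor assignments."""
    split = SplitResult([])
    for op in ops:
        best = None
        for p in processors:
            last = split.sub_models[-1] if split.sub_models else None
            if last is not None and last.processor == p:
                # extend the current first sub-model on the same processor
                cand = SplitResult(split.sub_models[:-1]
                                   + [SubModel(last.operators + [op], p)])
            else:
                # start a new first sub-model on processor p
                cand = SplitResult(split.sub_models + [SubModel([op], p)])
            c = total_running_cost(cand, exec_cost, comm_cost, sched_cost)
            if best is None or c < best[0]:
                best = (c, cand)
        split = best[1]
    return split
```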
  • the execution subject of the above steps 380 and 392 may be the heterogeneous scheduling description module 123 in the development device 120 .
  • FIGS. 3 to 5 are for illustration only, and do not constitute any limitation to the development method of the AI model provided by the embodiments of the present application.
  • the results obtained according to the methods of the embodiments of the present application may be used for, but not limited to, terminal devices.
  • the model in (a) of FIG. 4 may be split into a greater number (e.g., 4 or 5) of first sub-models.
  • the first sub-models shown in (b1) of FIG. 5 can also be merged into one second sub-model, in which case the running cost of the second sub-model when executed on one processor is less than the running cost 1.
  • FIG. 6 is a schematic structural diagram of an apparatus 600 for developing an AI model provided by an embodiment of the present application.
  • the development apparatus 600 may be the development apparatus 120 described in FIG. 1 above.
  • the development apparatus 600 includes: a splitting unit 601, configured to split the AI model to obtain multiple split results, where each of the multiple split results includes multiple first sub-models, each of the multiple first sub-models corresponds to at least one processor among the M processors, M is a positive integer greater than 1, and each first sub-model can run on the corresponding at least one processor, so that each split result has a running cost of running the multiple first sub-models;
  • a determining unit 602, configured to determine a first split result among the multiple split results, where the first running cost of the first split result is less than the second running cost of one or more second split results among the multiple split results;
  • an output unit 604, configured to output the first split result.
  • optionally, among the running costs of the multiple split results, the first running cost is the smallest.
  • optionally, the first split result includes N first sub-models, where N is a positive integer greater than 2, and the development apparatus 600 further includes a merging unit 603, configured to merge at least two of the N first sub-models to obtain a third split result, where the at least two first sub-models are adjacent in execution order; the third split result includes X second sub-models, where X is a positive integer greater than 1 and less than N, each of the X second sub-models corresponds to one or more of the M processors, and each second sub-model can run on the corresponding one or more processors, so that the third split result has a third running cost of running the X second sub-models, the third running cost being less than the first running cost.
  • optionally, the AI model includes L first operators, where L is a positive integer greater than 2, and each of the plurality of first sub-models includes some of the L first operators.
  • optionally, the splitting unit 601 is specifically configured to split the AI model according to input information, where the input information includes at least one of the following: the execution order of the L first operators in the AI model, or the attribute information of each of the L first operators, where L is a positive integer greater than 2.
  • optionally, the running overhead includes: the overhead of running each first sub-model, the communication overhead between two first sub-models that are adjacent in execution order, and the scheduling overhead of scheduling each first sub-model to the corresponding at least one processor.
  • the running cost may be determined by means of a table look-up or a formula calculation, as sketched below.
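  • the table look-up option can be as simple as profiling each (operator, processor) pair once offline and summing the entries at split time; the table below uses placeholder names and numbers, not measured values:

```python
# Hypothetical profiled execution costs, e.g. latency in milliseconds.
EXEC_COST_TABLE = {
    ("op1", "CPU"): 4.0, ("op1", "NPU"): 1.2,
    ("op2", "CPU"): 0.8, ("op2", "NPU"): 0.5,
}

def exec_cost_lookup(sub_model, processor):
    """Sum the profiled cost of every operator in a first sub-model."""
    return sum(EXEC_COST_TABLE[(op.name, processor)]
               for op in sub_model.operators)
```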
  • the M processors include at least two of the following processors: a central processing unit (CPU), a neural network processing unit (NPU), a graphics processing unit (GPU), a digital signal processor (DSP), a deep learning processing unit (DPU), or a tensor processing unit (TPU).
  • an input unit may further be included before the splitting unit 601, and the input unit is configured to acquire the AI model.
  • the specific application form of the development apparatus 600 is not specifically limited.
  • the development apparatus 600 may be opened to users in the form of a software development kit (SDK).
  • the user can encapsulate the above computer program and the multiple first sub-models into an Android application package (APK) for direct use through simple operations, or modify the above computer program and integrate it into other applications.
  • FIG. 6 is for illustration only, and does not constitute any limitation to the development apparatus 600 provided by the embodiment of the present application.
  • the development apparatus 600 may further include a storage module, and the storage module may be used to store the processing result of the determination unit and the corresponding computer program and the like.
  • the development device of the AI model should include a processor.
  • the device for developing the AI model may further include a memory.
  • an AI model development device includes a processor and a memory.
  • FIG. 7 is a schematic structural diagram of a device 700 for developing an AI model provided by an embodiment of the present application.
  • the development device 700 includes: a processor 701 and a memory 702 .
  • the processor 701 and the memory 702 communicate with each other through an internal connection path to transfer control and/or data signals; the memory 702 is used to store a computer program, and the processor 701 is used to call and run the computer program from the memory 702 to perform the method 200 and/or the method 300 described above.
  • the functions of the processor 701 correspond to the specific functions of the splitting unit 601 , the determining unit 602 , and the merging unit 603 shown in FIG. 6 , and details are not repeated here.
  • the development device 700 may further include a receiver and/or an output device.
  • the receiver may be used to receive the AI model, and the function of the output device corresponds to the function of the output unit 604 in FIG. 6, which is not repeated here.
  • FIG. 8 is a schematic structural diagram of a system 800 provided by an embodiment of the present application. As shown in FIG. 8 , the system 800 includes: an AI model development apparatus 600 or an AI model development device 700 .
  • the embodiment of the present application provides a computer program product; when the computer program product runs on the development apparatus 600 or the development device 700, the development apparatus 600 or the development device 700 executes the method 200 and/or the method 300 in the above method embodiments.
  • the disclosed systems, devices and methods may be implemented in other manners.
  • the apparatus embodiments described above are only illustrative.
  • the division of the unit is only a logical function division.
  • in actual implementation, there may be other division manners; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the shown or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may also be electrical, mechanical or other forms of connection.
  • the unit described as a separate component may or may not be physically separated, and the component displayed as a unit may or may not be a physical unit, that is, it may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solutions of the embodiments of the present application.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units may be implemented in the form of hardware, or may be implemented in the form of software functional units.
  • the integrated unit if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium.
  • the technical solutions of the present application essentially, or the part contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, a server, a network device, etc.) to execute all or part of the steps of the methods in the embodiments of the present application.
  • the aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or other media that can store program code.
  • the above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof.
  • when implemented by software, they may be implemented in whole or in part in the form of a computer program product.
  • the computer program product includes one or more computer program instructions.
  • the computer program instructions When the computer program instructions are loaded and executed on a computer, the procedures or functions according to the embodiments of the present application are generated in whole or in part.
  • the computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable device.
  • the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer program instructions may be transmitted from a website, computer, server, or data center to another website, computer, server, or data center in a wired or wireless manner.
  • the computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device such as a server, data center, etc. that includes one or more available media integrated.
  • the available media may be magnetic media (e.g., floppy disks, hard disks, or magnetic tapes), optical media (e.g., digital video discs (DVDs)), semiconductor media (e.g., solid-state drives), and the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Stored Programmes (AREA)

Abstract

A method and apparatus for developing an artificial intelligence (AI) model. The method includes: splitting the AI model to obtain multiple split results, where each of the multiple split results includes multiple first sub-models, each of the multiple first sub-models corresponds to at least one of M processors, M is a positive integer greater than 1, and each first sub-model can run on the corresponding at least one processor, so that each split result has a running cost of running the multiple first sub-models (210); determining a first split result among the multiple split results, where the first running cost of the first split result is less than the second running cost of one or more second split results among the multiple split results (220); and outputting the first split result (230). When the AI model is deployed on a device including the M processors according to the first split result, the deployment efficiency of the AI model on the device can be improved and the running overhead of the AI model on the device can be reduced.

Description

Method and apparatus for developing an artificial intelligence AI model

Technical field

The present application relates to the field of artificial intelligence, and more specifically, to a method and apparatus for developing an artificial intelligence AI model.

Background

In order to reduce the running overhead (for example, running time or running power consumption) of an artificial intelligence (AI) model running on a device, a deployment strategy of the AI model on the device (for example, a terminal device) can be designed to reduce the above running overhead.

However, with existing AI model deployment methods, deploying and running an AI model on a device suffers from the following problems: low deployment efficiency and high running overhead. Because of these problems, existing AI model deployment methods cannot meet user requirements.

Therefore, an AI model development method is urgently needed such that, when the AI model is deployed on a device according to the results obtained by the development method, the deployment efficiency of the AI model on the device can be improved and the running overhead of the AI model on the device can be reduced.
Summary

The present application provides a method and apparatus for developing an artificial intelligence AI model. When the AI model is deployed on a device according to the results obtained by the development method, the deployment efficiency of the AI model on the device can be improved and the running overhead of the AI model on the device can be reduced.

According to a first aspect, a method for developing an artificial intelligence AI model is provided. The method includes: splitting the AI model to obtain multiple split results, where each of the multiple split results includes a plurality of first sub-models, each first sub-model corresponds to at least one of M processors, M is a positive integer greater than 1, and each first sub-model can run on the corresponding at least one processor, so that each split result has a running cost of running the plurality of first sub-models; determining a first split result among the multiple split results, where the first running cost of the first split result is less than the second running cost of one or more second split results among the multiple split results; and outputting the first split result.

Optionally, in some implementations, the device is a terminal device, and the above M processors may be understood as processors included in the terminal device.

Optionally, in some implementations, a computer program corresponding to the first split result may also be output, where the computer program is used to describe the running sequence and communication process, on the M processors, of the N first sub-models included in the first split result.

In the above technical solution, by comparing the running costs of the multiple split results corresponding to the AI model, the first split result is determined as the result of splitting the AI model. Without changing the structure or parameters of the AI model, the above method can fully consider the running overhead of each of the plurality of first sub-models on the M processors when splitting the AI model. When the AI model is deployed on a device including the above M processors according to the first split result, the deployment efficiency of the AI model on the device can be effectively improved and the running overhead when the AI model is executed on the device can be reduced.

With reference to the first aspect, in a possible implementation, among the running costs of the multiple split results, the first running cost is the smallest.

With reference to the first aspect, in a possible implementation, the first split result includes N first sub-models, where N is a positive integer greater than 2; the method further includes: merging at least two of the N first sub-models to obtain a third split result, where the at least two first sub-models are adjacent in execution order, the third split result includes X second sub-models, X is a positive integer greater than 1 and less than N, each of the X second sub-models corresponds to one or more of the M processors, and each second sub-model can run on the corresponding one or more processors, so that the third split result has a third running cost of running the X second sub-models, the third running cost being less than the first running cost.

In the above technical solution, by merging at least two of the N first sub-models, the number of first sub-models obtained by splitting the AI model is reduced, which can further improve the deployment efficiency of the AI model on the device and reduce the running overhead when the AI model is executed on the device.

With reference to the first aspect, in a possible implementation, the AI model includes L first operators (operators), where L is a positive integer greater than 2, and each of the plurality of first sub-models includes some of the L first operators.

In the above technical solution, each first sub-model may include more than one first operator. When multiple first operators correspond to the same processor, the multiple first operators can be merged into one first sub-model, thereby reducing the communication overhead between the multiple first operators.

With reference to the first aspect, in a possible implementation, splitting the AI model includes: splitting the AI model according to input information, where the input information includes at least one of the following: the execution order of the L first operators in the AI model, or the attribute information of each of the L first operators, where L is a positive integer greater than 2.

With reference to the first aspect, in a possible implementation, the running overhead includes: the overhead of running each first sub-model, the communication overhead between two first sub-models that are adjacent in execution order, and the scheduling overhead of scheduling each first sub-model to the corresponding at least one processor.

In the above technical solution, the running cost of a split result of the AI model includes not only the execution cost of each first sub-model on its processor, but also the communication overhead and scheduling overhead between two adjacent first sub-models in the AI model, which makes the split result of the AI model determined according to the running cost more accurate.

With reference to the first aspect, in a possible implementation, the M processors include at least two of the following: a central processing unit (CPU), a neural network processing unit (NPU), a graphics processing unit (GPU), a digital signal processor (DSP), a deep learning processing unit (DPU), or a tensor processing unit (TPU).
According to a second aspect, an apparatus for developing an artificial intelligence AI model is provided. The apparatus includes: a splitting unit, configured to split the AI model to obtain multiple split results, where each of the multiple split results includes a plurality of first sub-models, each of the plurality of first sub-models corresponds to at least one of M processors, M is a positive integer greater than 1, and each first sub-model can run on the corresponding at least one processor so that each split result has a running cost of running the plurality of first sub-models; a determining unit, configured to determine a first split result among the multiple split results, where a first running cost of the first split result is less than second running costs of one or more second split results among the multiple split results; and an output unit, configured to output the first split result.
With reference to the second aspect, in a possible implementation, the first running cost is the smallest among the running costs of the multiple split results.
With reference to the second aspect, in a possible implementation, the first split result includes N first sub-models, where N is a positive integer greater than 2, and the apparatus further includes a merging unit, configured to merge at least two of the N first sub-models to obtain a third split result, where the at least two first sub-models are adjacent in execution order, the third split result includes X second sub-models, X is a positive integer greater than 1 and less than N, each of the X second sub-models corresponds to one or more of the M processors, and each second sub-model can run on the corresponding one or more processors so that the third split result has a third running cost of running the X second sub-models, the third running cost being less than the first running cost.
With reference to the second aspect, in a possible implementation, the AI model includes L first operators, where L is a positive integer greater than 2, and each of the plurality of first sub-models includes some of the L first operators.
With reference to the second aspect, in a possible implementation, the splitting unit is specifically configured to: split the AI model according to input information, where the input information includes at least one of the following: the execution order of the L first operators in the AI model, or attribute information of each of the L first operators, where L is a positive integer greater than 2.
With reference to the second aspect, in a possible implementation, the running cost includes: the cost of running each first sub-model, the communication cost between two first sub-models adjacent in execution order, and the scheduling cost of dispatching each first sub-model to the corresponding at least one processor.
With reference to the second aspect, in a possible implementation, the M processors include at least two of the following: a central processing unit CPU, a neural network processing unit NPU, a graphics processing unit GPU, a digital signal processor DSP, a deep learning processing unit DPU, or a tensor processing unit TPU.
According to a third aspect, an apparatus for developing an artificial intelligence AI model is provided. The apparatus includes a memory and a processor, where the memory is configured to store instructions and the processor is configured to read the instructions stored in the memory, so that the apparatus performs the method in the first aspect or any possible implementation thereof.
According to a fourth aspect, a processor is provided, including an input circuit, an output circuit, and a processing circuit. The processing circuit is configured to receive a signal through the input circuit and output a signal through the output circuit, so that the method in the first aspect or any possible implementation thereof is performed.
In a specific implementation process, the processor may be a chip, the input circuit may be an input pin, the output circuit may be an output pin, and the processing circuit may be transistors, gate circuits, flip-flops, various logic circuits, and the like. The input circuit and the output circuit may be the same circuit, which serves as the input circuit and the output circuit at different times. The embodiments of the present application do not limit the specific implementations of the processor and the various circuits.
According to a fifth aspect, a processing apparatus is provided, including a processor and a memory. The processor is configured to read the instructions stored in the memory, and may receive signals through a receiver and output signals through an output device, to perform the method in the first aspect or any possible implementation thereof.
Optionally, there are one or more processors and one or more memories.
Optionally, the memory may be integrated with the processor, or the memory and the processor may be arranged separately.
In a specific implementation process, the memory may be a non-transitory memory, for example a read-only memory (ROM), which may be integrated with the processor on the same chip or arranged on separate chips. The embodiments of the present application do not limit the type of the memory or the arrangement of the memory and the processor.
It should be understood that a related data exchange process, for example sending indication information, may be a process of outputting the indication information from the processor, and receiving capability information may be a process of the processor receiving input capability information. Specifically, data output by the processing may be output to the output device, and input data received by the processor may come from the receiver.
According to a sixth aspect, a computer-readable storage medium is provided for storing a computer program, where the computer program includes instructions for performing the method in the first aspect or any possible implementation thereof.
According to a seventh aspect, a computer program product including instructions is provided, which, when run on a computer, causes the computer to perform the method in the first aspect or any possible implementation thereof.
According to an eighth aspect, a chip is provided, including at least one processor and an interface; the at least one processor is configured to call and run a computer program, so that the chip performs the method in the first aspect or any possible implementation thereof.
According to a ninth aspect, a system is provided, including the apparatus for developing an artificial intelligence AI model according to the second aspect or the third aspect.
Brief Description of Drawings
FIG. 1 is a schematic diagram of a system architecture 100 applicable to an embodiment of the present application.
FIG. 2 is a schematic flowchart of an AI model development method 200 provided by an embodiment of the present application.
FIG. 3 is a schematic flowchart of an AI model development method 300 provided by an embodiment of the present application.
FIG. 4 is a schematic diagram of a split result of an AI model provided by an embodiment of the present application.
FIG. 5 is a schematic diagram of a split result of an AI model provided by an embodiment of the present application.
FIG. 6 is a schematic structural diagram of an AI model development apparatus 600 provided by an embodiment of the present application.
FIG. 7 is a schematic structural diagram of an AI model development device 700 provided by an embodiment of the present application.
FIG. 8 is a schematic structural diagram of a system 800 provided by an embodiment of the present application.
Detailed Description
The technical solutions in the present application are described below with reference to the accompanying drawings.
The terms used in the implementation section of the present application are only intended to explain specific embodiments of the present application and are not intended to limit the present application.
In the present application, the terms "first", "second", "third", and the like are used to distinguish identical or similar items whose roles and functions are substantially the same. It should be understood that there is no logical or temporal dependency among "first", "second", and "third", and that no limitation is imposed on quantity or execution order.
The present application presents various aspects, embodiments, or features around a system that may include multiple devices, components, modules, and the like. It should be understood and appreciated that each system may include additional devices, components, modules, and the like, and/or may not include all of the devices, components, modules, and the like discussed in connection with the accompanying drawings. In addition, combinations of these solutions may also be used.
In addition, in the embodiments of the present application, words such as "exemplary" and "for example" are used to indicate an example, illustration, or explanation. Any embodiment or design described as an "example" in the present application should not be construed as more preferred or advantageous than other embodiments or designs. Rather, the word "example" is intended to present a concept in a concrete manner.
In the embodiments of the present application, "corresponding (relevant)" and "corresponding" may sometimes be used interchangeably; it should be noted that, when the distinction is not emphasized, their intended meanings are the same.
In the embodiments of the present application, a subscript such as W 1 may occasionally be miswritten in a non-subscript form such as W1; when the distinction is not emphasized, the intended meanings are the same.
References in this specification to "one embodiment" or "some embodiments" and the like mean that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Thus, the phrases "in one embodiment", "in some embodiments", "in some other embodiments", "in still other embodiments", and the like appearing in different places in this specification do not necessarily all refer to the same embodiment, but rather mean "one or more but not all embodiments", unless otherwise specifically emphasized. The terms "include", "comprise", "have", and their variants all mean "including but not limited to", unless otherwise specifically emphasized.
In the present application, "at least one" means one or more, and "multiple" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may indicate: A alone, both A and B, or B alone, where A and B may be singular or plural. The character "/" generally indicates an "or" relationship between the associated objects. "At least one of the following items" or a similar expression refers to any combination of these items, including any combination of a single item or plural items. For example, at least one of a, b, or c may represent: a, b, c, a-b, a-c, b-c, or a-b-c, where each of a, b, and c may be singular or plural.
For ease of understanding, the related terms involved in the embodiments of the present application are first introduced.
1. Central processing unit (CPU)
As the computing and control core of a computer system, the CPU is the final execution unit for information processing and program running. A CPU mainly includes two parts, namely the controller and the arithmetic unit, and also includes cache memory and the data and control buses that connect them.
2. Neural network processing unit (NPU)
An NPU usually refers to a processor that accelerates neural network computation, for example a processor that runs convolutional neural networks. Optionally, an NPU may adopt a "data-driven parallel computing" architecture and is particularly good at processing massive multimedia data such as video and images.
3. Graphics processing unit (GPU)
A GPU may also be called a display core, a visual processor, or a display chip. It is a microprocessor dedicated to image- and graphics-related computation on personal computers, workstations, game consoles, and some mobile devices (for example, tablet computers and smartphones). The main GPU manufacturers are NVIDIA and ATI.
4. Digital signal processor (DSP)
A DSP chip is a distinctive microprocessor that processes large amounts of information using digital signals. It works by receiving an analog signal, converting it into a digital signal of 0s and 1s, modifying, deleting, or enhancing the digital signal, and then interpreting the digital data back into analog data or a real-world format in other system chips.
5. Tensor processing unit (TPU)
Compared with a GPU, a TPU uses low-precision (8-bit) computation to reduce the number of transistors used per operation. The reduced precision has little effect on deep learning accuracy, but can greatly reduce power consumption and speed up computation. Meanwhile, the TPU uses a systolic array design to optimize matrix multiplication and convolution operations and to reduce input/output (I/O) operations. In addition, the TPU uses a larger on-chip memory to reduce accesses to dynamic random access memory (DRAM), thereby improving performance to a greater extent.
TPU and NPU sometimes refer to the same component, namely the component that performs neural network computation; that is, the two terms may refer to the same component in artificial intelligence processing.
6. System on chip (SOC)
An SOC is also called a system-on-a-chip. In a narrow sense, an SOC is the chip-level integration of the core of an information system, integrating the key components of a system on a single chip; in a broad sense, an SOC is a miniature system: if the CPU is the brain, then the SOC is the system including the brain, heart, eyes, and hands. SOCs are usually customer-customized, or standard products for specific purposes. For example, an SOC may be a chip that integrates a series of components such as a CPU, a GPU, and a DSP.
The related art of the embodiments of the present application is introduced below. In the prior art, when an AI model is deployed on a device (for example, a terminal device), the AI model is usually deployed on the same one of the multiple processors included in the device, resulting in a high running cost. To improve the running efficiency of the AI model executed on the device, in one technique, the operators of the key functions included in the AI model are manually optimized and encapsulated into a library; when the AI model is executed on the device, only those key-function operators are scheduled and executed across the multiple processors of the device. For example, the device includes a CPU, a GPU, and an NPU, the AI model includes operator 1, operator 2, and operator 3, and operator 3 is a key-function operator. On this basis, operator 1 and operator 2 of the AI model can be deployed on the CPU, and operator 3 on the NPU. However, this method suffers from poor flexibility and low deployment efficiency.
The present application provides a method and apparatus for developing an artificial intelligence AI model. When the AI model is deployed on a device according to the results of the development method, the deployment efficiency of the AI model on the device can be improved and the running cost of the AI model on the terminal device can be reduced. It should be understood that the results obtained by the AI model development method provided in the present application can be used in different application scenarios, which are not specifically limited. For example, according to user requirements, the results may be used in a scenario including a terminal device, or in a scenario including a network device.
Now, taking an application scenario including a terminal device as an example, the system architecture to which the AI model development method provided by the embodiments of the present application applies is introduced with reference to FIG. 1.
FIG. 1 is a schematic diagram of a system architecture 100 applicable to an embodiment of the present application. As shown in FIG. 1, the system architecture 100 includes an AI model 110, a development apparatus 120, and a terminal device 130. The terminal device 130 includes M processors, namely processor 1301, processor 1302, ..., processor 130M, where M is a positive integer greater than 1.
The AI model 110 can be understood as a model input by the user. The AI model 110 is input into the development apparatus 120 for processing. The development apparatus 120 can split the AI model 110 into multiple sub-models according to user requirements (for example, minimum running power consumption or minimum running time). The development apparatus 120 is also configured to output the multiple sub-models obtained by splitting the AI model and a corresponding computer program, where the computer program describes the execution order, running schedule, and communication relationships of the multiple sub-models. Further, the user can deploy the AI model on the M processors included in the terminal device 130 according to the output of the development apparatus 120.
In the embodiments of the present application, the AI model is not specifically limited. For example, the AI model may be, but is not limited to, one of the following types: a regression analysis (RA) model, a logistic regression (LR) model, a Bayesian model, a decision tree model, or a deep neural network model.
In the embodiments of the present application, the terminal device 130 may refer to a smartphone, a smart watch, a mobile device, a user terminal, a terminal device (for example, a terminal server), a wireless communication device, a handheld device with wireless communication capability, a vehicle-mounted device, a wearable device (for example, a smart band), and the like, which is not limited by the embodiments of the present application.
In the embodiments of the present application, the types of the M processors included in the terminal device 130 are not specifically limited. The types of the M processors may include at least two of the following: CPU, NPU, GPU, DSP, deep learning processing unit (DPU), TPU, and the like. For example, when M = 2, that is, when the terminal device 130 includes only 2 processors (processor 1301 and processor 1302), processor 1301 may be a CPU and processor 1302 may be an NPU.
Optionally, in some implementations, the type of the terminal device 130 and the types of the M processors it includes may be determined according to user requirements.
In the embodiments of the present application, the deployment of the M processors within the terminal device 130 is not specifically limited. In one example, the M processors included in the terminal device 130 may be deployed on one or more hardware devices (for example, SOCs) included in the terminal device 130. For example, the terminal device 130 includes a CPU, a GPU, and a DSP, all of which are deployed on an SOC in the terminal device 130. As another example, the terminal device 130 includes a CPU, a GPU, and an NPU, where the CPU and the GPU are deployed on one hardware device of the terminal device 130 and the NPU is deployed on another hardware device of the terminal device 130.
In the embodiments of the present application, the deployment of the development apparatus 120 within the system architecture 100 is not specifically limited. In one example, the development apparatus 120 may be an apparatus on a third-party platform independent of the terminal device 130. In another example, the development apparatus 120 may be an apparatus included in the terminal device 130.
It should be understood that FIG. 1 is only illustrative and does not constitute any limitation on the system architecture applicable to the embodiments of the present application. For example, in some scenarios, the system architecture 100 may include a larger number of development apparatuses 120. In some scenarios, the terminal device 130 may be replaced with a network device. In some scenarios, the terminal device 130 may also be understood as a device that includes the terminal device 130.
The AI model development method 200 provided by an embodiment of the present application is described in detail below with reference to FIG. 2 to FIG. 5. FIG. 2 is a schematic flowchart of the AI model development method 200 provided by an embodiment of the present application. As shown in FIG. 2, the method 200 includes step 210 to step 230, which are described in detail below. The method 200 may be performed by the development apparatus 120 described above.
Step 210: Split the AI model to obtain multiple split results, where each of the multiple split results includes a plurality of first sub-models, each of the plurality of first sub-models corresponds to at least one of M processors, M is a positive integer greater than 1, and each first sub-model can run on the corresponding at least one processor so that each split result has a running cost of running the plurality of first sub-models.
In this embodiment of the present application, splitting the AI model may include: splitting the AI model according to input information, where the input information includes at least one of the following: the execution order of the L first operators in the AI model, or attribute information of each of the L first operators, where L is a positive integer greater than 2.
It can be understood that different first operators do not all have the same attribute information. The attribute information of each first operator included in the AI model is not specifically limited in this embodiment of the present application. For example, when a first operator included in the AI model is a reshape operator, the attribute information of that first operator includes the dimensions of its input data and the dimensions of its output data.
In this embodiment of the present application, the structure of a first sub-model is not specifically limited. The AI model may include L first operators, where L is a positive integer greater than 2, and each of the plurality of first sub-models includes some of the L first operators. That is, a first sub-model may include one or more of the L first operators. It can be understood that when a first sub-model includes multiple first operators, those first operators are adjacent in execution order.
For example, when the AI model includes 3 first operators and splitting it yields 3 first sub-models, each first sub-model includes exactly one first operator. As another example, when the AI model includes 4 first operators and splitting it yields 3 first sub-models, then two of the first sub-models each include one first operator, and one first sub-model includes 2 first operators that are adjacent in execution order; the processors corresponding to those 2 first operators may be the same or different. A sketch of this grouping appears below.
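As an illustration of such grouping, the following Python sketch (the operator and processor names are illustrative assumptions, not taken from the embodiments) collects consecutive operators assigned to the same processor into one first sub-model:

```python
from typing import List, Tuple

def coalesce(assignment: List[Tuple[str, str]]) -> List[Tuple[List[str], str]]:
    """Group consecutive operators mapped to the same processor into one
    first sub-model, so adjacent operators in a group exchange data
    without crossing processors."""
    sub_models: List[Tuple[List[str], str]] = []
    for op, proc in assignment:
        if sub_models and sub_models[-1][1] == proc:
            sub_models[-1][0].append(op)      # extend the current sub-model
        else:
            sub_models.append(([op], proc))   # start a new sub-model
    return sub_models

print(coalesce([("op1", "CPU"), ("op2", "NPU"), ("op3", "NPU"), ("op4", "GPU")]))
# [(['op1'], 'CPU'), (['op2', 'op3'], 'NPU'), (['op4'], 'GPU')]
```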
The M processors can be understood as processors included in one device (for example, a terminal device or a network device). For example, when the M processors are processors included in a terminal device, they may be contained in an SOC included in that terminal device.
The running cost includes: the cost of running each first sub-model, the communication cost between two first sub-models adjacent in execution order, and the scheduling cost of dispatching each first sub-model to the corresponding at least one processor. It can be understood that the cost may be running time or running power consumption, which is not specifically limited in this embodiment of the present application.
For example, the M processors are processors included in a terminal device, and splitting the AI model yields 2 first sub-models, denoted first sub-model 1 and first sub-model 2, each including exactly one first operator. First sub-model 1 corresponds to the CPU in the terminal device, first sub-model 2 corresponds to the DSP in the terminal device, and first sub-model 1 is executed before first sub-model 2. In this case, the running cost of the split AI model when deployed and executed on the terminal device includes: the cost of running first sub-model 1 on the CPU, the cost of running first sub-model 2 on the DSP, the communication cost of transferring the output of first sub-model 1 from the CPU to the DSP where first sub-model 2 resides, the scheduling cost of dispatching first sub-model 1 to the CPU, and the scheduling cost of dispatching first sub-model 2 to the DSP.
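Plugging illustrative numbers into this example makes the accounting concrete; all the values below are assumptions chosen for illustration, not measurements from the embodiments:

```python
# All values in milliseconds, assumed for illustration only.
exec_sub1_on_cpu = 4.0     # run first sub-model 1 on the CPU
exec_sub2_on_dsp = 2.5     # run first sub-model 2 on the DSP
comm_cpu_to_dsp = 0.8      # hand the output of sub-model 1 to the DSP
sched_sub1 = 0.3           # dispatch sub-model 1 to the CPU
sched_sub2 = 0.3           # dispatch sub-model 2 to the DSP

running_cost = (exec_sub1_on_cpu + exec_sub2_on_dsp
                + comm_cpu_to_dsp + sched_sub1 + sched_sub2)
print(running_cost)  # 7.9 ms for this split result
```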
In this embodiment of the present application, the M processors include at least two of the following: a central processing unit CPU, a neural network processing unit NPU, a graphics processing unit GPU, a digital signal processor DSP, a deep learning processing unit DPU, or a tensor processing unit TPU. Optionally, before step 210, the method may further include the following steps: obtaining the AI model input by the user; and analyzing the AI model to obtain the input information.
Step 220: Determine a first split result among the multiple split results, where the first running cost of the first split result is less than the second running costs of one or more second split results among the multiple split results. In some embodiments, among the running costs of the multiple split results, the first running cost may be less than the largest running cost and greater than the smallest running cost. Optionally, in some embodiments, the first running cost is the smallest among the running costs of the multiple split results.
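The selection performed in steps 210 and 220 can be sketched in a few lines of Python. This is a minimal illustration under assumed cost tables (the operator names, processors, and numbers are all hypothetical); a real implementation would profile the target device to fill them in:

```python
from typing import List, Tuple

# Assumed per-operator execution costs and cross-processor transfer costs.
EXEC_COST = {("op1", "CPU"): 5.0, ("op1", "NPU"): 9.0,
             ("op2", "CPU"): 8.0, ("op2", "NPU"): 3.0}
COMM_COST = {("CPU", "NPU"): 1.5, ("NPU", "CPU"): 1.5}
SCHED_COST = 0.5  # flat dispatch cost per first sub-model, an assumption

Split = List[Tuple[List[str], str]]  # ordered (operators, processor) pairs

def running_cost(split: Split) -> float:
    """Execution + communication + scheduling cost of one split result."""
    total = 0.0
    for i, (ops, proc) in enumerate(split):
        total += SCHED_COST                          # dispatch this sub-model
        total += sum(EXEC_COST[(op, proc)] for op in ops)
        if i > 0 and split[i - 1][1] != proc:        # handoff between adjacent
            total += COMM_COST[(split[i - 1][1], proc)]
    return total

def pick_first_split(candidates: List[Split]) -> Split:
    """Return the candidate whose running cost is smallest (step 220)."""
    return min(candidates, key=running_cost)

candidates = [
    [(["op1"], "CPU"), (["op2"], "NPU")],  # split result 1
    [(["op1", "op2"], "CPU")],             # split result 2
]
print(pick_first_split(candidates))  # split result 1 wins: 10.5 vs 13.5
```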
Optionally, in some embodiments, the first split result includes N first sub-models, where N is a positive integer greater than 2. After the first split result is determined, the method may further include the following step: merging at least two of the N first sub-models to obtain a third split result, where the at least two first sub-models are adjacent in execution order, the third split result includes X second sub-models, X is a positive integer greater than 1 and less than N, each of the X second sub-models corresponds to one or more of the M processors, and each second sub-model can run on the corresponding one or more processors so that the third split result has a third running cost of running the X second sub-models, the third running cost being less than the first running cost.
For example, the M processors are processors included in a terminal device, and splitting the AI model yields 3 first sub-models, denoted first sub-model 1, first sub-model 2, and first sub-model 3. First sub-model 1 corresponds to the CPU in the terminal device, first sub-model 2 to the DSP, and first sub-model 3 to the GPU. The execution order of the first sub-models is: first sub-model 1, first sub-model 2, first sub-model 3. In this case, first sub-model 1 and first sub-model 2 can be merged into one second sub-model 1. That is, after this merge the split of the AI model includes second sub-model 1 and first sub-model 3.
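A merge of this kind can be expressed over the same (operators, processor) representation used in the sketch above; keeping the first sub-model's processor is an assumption made here for illustration (the method only requires that the merged result run on one or more of the M processors):

```python
def merge_adjacent(split, i, j):
    """Merge sub-models i..j (adjacent in execution order) into one
    second sub-model placed on a single processor."""
    ops = [op for sub in split[i:j + 1] for op in sub[0]]
    target_proc = split[i][1]   # assumption: reuse the first one's processor
    return split[:i] + [(ops, target_proc)] + split[j + 1:]

split = [(["op1"], "CPU"), (["op2"], "DSP"), (["op3"], "GPU")]
print(merge_adjacent(split, 0, 1))
# [(['op1', 'op2'], 'CPU'), (['op3'], 'GPU')]
```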
Step 230: Output the first split result. Step 230 may be performed by the heterogeneous scheduling description module 123 in the development apparatus 120.
Optionally, in some implementations, a computer program corresponding to the first split result may also be output. The computer program describes the running sequence and communication process, on the M processors, of the N first sub-models included in the first split result.
Optionally, in some embodiments, when step 220 includes merging first sub-models that are adjacent in execution order among the N first sub-models of the first split result, the running cost of the third split result is less than that of the first split result, and the running cost of the third split result can meet the user requirement, the third split result may also be output.
In the above technical solution, the first split result is determined as the result of splitting the AI model by comparing the running costs of the multiple split results corresponding to the AI model. Without changing the structure or parameters of the AI model, the method fully accounts for the running cost of each of the plurality of first sub-models on the M processors when splitting the AI model. When the AI model is deployed, according to the first split result, on a device including the M processors, the deployment efficiency of the AI model on the device is effectively improved and the running cost of executing the AI model on the device is reduced. In addition, in the above technical solution, the N first sub-models included in the obtained first split result may be merged to obtain a third split result whose running cost is less than that of the first split result, thereby further reducing the running cost of executing the AI model on the device.
Based on the above technical solution, a computer program may also be output, which describes the running sequence and communication process, on the M processors, of the N first sub-models obtained by splitting the AI model according to the first split result. On this basis, the user can use the computer program flexibly, for example by integrating it into other applications.
Below, with reference to FIG. 3 to FIG. 5, a specific embodiment of splitting an AI model using the development method 200 described above is introduced. In this embodiment of the present application, splitting an AI model according to the method 200 can yield multiple split results, each of which includes N first sub-models, where N is a positive integer greater than 2. For ease of description, the AI model development method provided by this embodiment of the present application is introduced below by taking the case of splitting an AI model into two split results as an example.
FIG. 3 is a schematic flowchart of the AI model development method 300 provided by an embodiment of the present application. As shown in FIG. 3, the method 300 includes step 310 to step 392, which are described in detail below. The method 300 may be performed by the development apparatus 120 described above. Before step 310, the method may further include the following step: determining the types of the processors included in the terminal device on which the AI model is to be deployed. In this embodiment of the present application, it is determined according to user requirements that the terminal device on which the AI model is deployed includes 3 processors, namely a CPU, an NPU, and a GPU.
Step 310: Input the AI model. Specifically, inputting the AI model can be understood as inputting the AI model into the development apparatus 120.
Step 320: Analyze the AI model to obtain the L first operators included in the AI model and the one processor corresponding to each first operator, where L is a positive integer greater than 2. Step 320 is performed by the model analysis module 121 in the development apparatus 120.
Specifically, referring to (a) in FIG. 4, analyzing the input AI model with the model analysis module 121 shows that the AI model includes 5 first operators (that is, L equals 5). For ease of description, the 5 first operators are denoted first operator 1, first operator 2, first operator 3, first operator 4, and first operator 5. First operator 1 corresponds to the CPU in the terminal device; that is, the running cost of executing first operator 1 on the CPU is less than that of executing it on the NPU or the GPU. First operator 2 corresponds to the NPU, first operator 3 to the NPU, first operator 4 to the GPU, and first operator 5 to the NPU.
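The per-operator assignment produced by this analysis step can be pictured as an argmin over profiled execution costs. The table below is an illustrative assumption; in practice the model analysis module would measure or estimate these values on the actual device:

```python
# Hypothetical standalone execution costs (ms) per operator per processor.
exec_cost = {
    "op1": {"CPU": 2.0, "NPU": 6.0, "GPU": 4.0},
    "op2": {"CPU": 9.0, "NPU": 1.0, "GPU": 5.0},
    "op3": {"CPU": 7.0, "NPU": 1.5, "GPU": 6.0},
}

def best_processor(op: str) -> str:
    """The processor on which this operator's standalone cost is lowest."""
    costs = exec_cost[op]
    return min(costs, key=costs.get)

print({op: best_processor(op) for op in exec_cost})
# {'op1': 'CPU', 'op2': 'NPU', 'op3': 'NPU'}
```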
Step 330: According to the L first operators and the processor corresponding to each first operator, obtain split result 1 and split result 2, as well as running cost 1 of split result 1 and running cost 2 of split result 2. Each split result includes N first sub-models, where N is a positive integer greater than 2.
In this embodiment of the present application, performing the method of step 330 yields two split results. The first split result, shown in (b1) of FIG. 4, is denoted split result 1. The second split result, shown in (b2) of FIG. 4, is denoted split result 2.
As shown in (b1) of FIG. 4, split result 1 includes 4 first sub-models, denoted first sub-model 1, first sub-model 2, first sub-model 3, and first sub-model 4. First sub-model 1 corresponds to the CPU and includes first operator 1. First sub-model 2 corresponds to the NPU and includes first operator 2 and first operator 3. First sub-model 3 corresponds to the GPU and includes first operator 4. First sub-model 4 corresponds to the NPU and includes first operator 5. That is, split result 1 represents splitting the AI model into these 4 first sub-models.
On this basis, running cost 1 of split result 1 includes: the cost of executing each first operator of each first sub-model on the corresponding processor, the scheduling cost of dispatching each first sub-model to its corresponding processor, the communication cost of transferring the output of first sub-model 1 from the CPU to the NPU where first sub-model 2 resides, the communication cost of transferring the output of first sub-model 2 from the NPU to the GPU where first sub-model 3 resides, and the communication cost of transferring the output of first sub-model 3 from the GPU to the NPU where first sub-model 4 resides.
As shown in (b2) of FIG. 4, split result 2 includes 3 first sub-models, denoted first sub-model 1, first sub-model 2, and first sub-model 3. First sub-model 1 corresponds to the CPU and includes first operator 1. First sub-model 2 corresponds to the NPU and includes first operator 2 and first operator 3. First sub-model 3 corresponds to the GPU and includes first operator 4 and first operator 5. That is, split result 2 represents splitting the AI model into these 3 first sub-models.
On this basis, running cost 2 of split result 2 includes: the cost of executing each first operator of each first sub-model on the corresponding processor, the scheduling cost of dispatching each first sub-model to its corresponding processor, the communication cost of transferring the output of first sub-model 1 from the CPU to the NPU where first sub-model 2 resides, and the communication cost of transferring the output of first sub-model 2 from the NPU to the GPU where first sub-model 3 resides.
Step 340: Determine, by comparing running cost 1 and running cost 2, the split result with the smallest running cost as the first split result. In this embodiment of the present application, comparing running cost 1 and running cost 2 shows that running cost 1 is the smallest. That is, split result 1 is determined as the first split result.
Step 350: Determine whether at least two of the N first sub-models included in the first split result need to be merged, where the at least two first sub-models are adjacent in execution order.
When it is determined that at least two first sub-models adjacent in execution order among the N first sub-models of the first split result need to be merged, steps 360 to 380 are performed after step 350. That is, steps 360 to 380 may also be performed after steps 310 to 350.
When it is determined that no such merging is needed, steps 391 and 392 are performed after step 350. That is, the running cost of the first split result of the AI model determined in steps 310 to 340 can meet the user requirement; in this case, the first split result determined in step 340 may be determined as the result of splitting the AI model. Whether at least two first sub-models adjacent in execution order among the N first sub-models of the first split result need to be merged may be determined according to user requirements or the actual application, which is not limited in this embodiment of the present application.
As known from step 340, in this embodiment of the present application, split result 1 is the first split result, that is, the split result shown in (b1) of FIG. 4. Steps 360 to 380 are introduced below by taking split result 1 shown in (b1) of FIG. 4 as an example.
Step 360: Merge at least two first sub-models adjacent in execution order among the N first sub-models to obtain split result 3 and split result 4, as well as running cost 3 of split result 3 and running cost 4 of split result 4.
In this embodiment of the present application, performing the method of step 360 yields two split results. The first, shown in (c1) of FIG. 5, is denoted split result 3. The second, shown in (c2) of FIG. 5, is denoted split result 4.
As shown in (c1) of FIG. 5, split result 3 includes 1 first sub-model and 1 second sub-model, denoted first sub-model 1 and second sub-model 1. First sub-model 1 corresponds to the CPU and includes first operator 1. Second sub-model 1 corresponds to the NPU and includes first sub-model 2, first sub-model 3, and first sub-model 4. That is, split result 3 is the result of merging first sub-model 2, first sub-model 3, and first sub-model 4 of the first split result.
On this basis, running cost 3 of split result 3 includes: the cost of executing first sub-model 1 on its corresponding processor, the cost of executing each first sub-model within second sub-model 1 on the corresponding processor, the scheduling cost of dispatching first sub-model 1 to its corresponding processor, the scheduling cost of dispatching each first sub-model within second sub-model 1 to the corresponding processor, and the communication cost of transferring the output of first sub-model 1 from the CPU to the NPU where second sub-model 1 resides.
As shown in (c2) of FIG. 5, split result 4 includes 2 first sub-models and 1 second sub-model, denoted first sub-model 1, first sub-model 2, and second sub-model 1. First sub-model 1 corresponds to the CPU and includes first operator 1. First sub-model 2 corresponds to the NPU and includes first operator 2 and first operator 3. Second sub-model 1 corresponds to the GPU and includes first sub-model 3 and first sub-model 4. That is, split result 4 is the result of merging first sub-model 3 and first sub-model 4 of the first split result.
On this basis, running cost 4 of split result 4 includes: the cost of executing each first sub-model on its corresponding processor, the cost of executing each first sub-model within second sub-model 1 on the corresponding processor, the scheduling cost of dispatching each first sub-model to its corresponding processor, the scheduling cost of dispatching each first sub-model within second sub-model 1 to the corresponding processor, the communication cost of transferring the output of first sub-model 1 from the CPU to the NPU where first sub-model 2 resides, and the communication cost of transferring the output of first sub-model 2 from the NPU to the GPU where second sub-model 1 resides.
Step 370: Determine the third split result by comparing running cost 1, running cost 3, and running cost 4, and determine the third split result as the result of splitting the AI model. In this embodiment of the present application, comparing running cost 1, running cost 3, and running cost 4 shows that running cost 3 < running cost 1 < running cost 4. On this basis, split result 3 is determined as the third split result.
Step 380: Output the third split result and the corresponding computer program. In this embodiment of the present application, the third split result is split result 3, that is, the split result shown in (c1) of FIG. 5.
Step 391: Determine the first split result as the result of splitting the AI model. In this embodiment of the present application, the first split result is split result 1, that is, the split result shown in (b1) of FIG. 4. Step 392: Output the first split result and the corresponding computer program.
In this embodiment of the present application, steps 330 to 370 and step 391 may be performed by the model heterogeneous decomposition module 122 in the development apparatus 120. The model heterogeneous decomposition module 122 may split or merge the model using an existing algorithm (for example, a genetic algorithm or a greedy algorithm). Steps 380 and 392 may be performed by the heterogeneous scheduling description module 123 in the development apparatus 120.
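As one concrete possibility for the greedy strategy mentioned here, the sketch below repeatedly applies whichever adjacent merge lowers the total cost the most and stops when no merge helps. It reuses the `running_cost` and `merge_adjacent` sketches given earlier, passed in as parameters; a genetic search over merge choices could be substituted without changing the interface:

```python
def greedy_merge(split, running_cost, merge_adjacent):
    """Greedily merge adjacent sub-models while the total cost decreases."""
    improved = True
    while improved and len(split) > 1:
        improved = False
        best_cost, best_split = running_cost(split), split
        for i in range(len(split) - 1):            # try every adjacent pair
            candidate = merge_adjacent(split, i, i + 1)
            cost = running_cost(candidate)
            if cost < best_cost:                   # keep the best single merge
                best_cost, best_split, improved = cost, candidate, True
        split = best_split
    return split
```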
It should be understood that FIG. 3 to FIG. 5 are only illustrative and do not constitute any limitation on the AI model development method provided by the embodiments of the present application. In some embodiments, the results obtained by the method of the embodiments of the present application can be used for, but are not limited to, terminal devices; for example, they can also be used in network devices, that is, the terminal device in the method 300 can be replaced with a network device. In some embodiments, the model of (a) in FIG. 4 can be split into a larger number (for example, 4 or 5) of first sub-models. In some embodiments, the model of (b1) in FIG. 4 can be merged into a single second sub-model; in this case, the running cost of executing that second sub-model on one processor is less than running cost 1.
The foregoing, with reference to FIG. 1 to FIG. 5, has described in detail the method for developing an artificial intelligence AI model provided by the present application and the system architecture to which the method applies. Below, with reference to FIG. 6 to FIG. 8, the development apparatus, development device, and system for an artificial intelligence AI model provided by the present application are described in detail. It should be understood that the descriptions of the method embodiments correspond to those of the apparatus, device, and system embodiments; for parts not described in detail, refer to the foregoing method embodiments.
FIG. 6 is a schematic structural diagram of an AI model development apparatus 600 provided by an embodiment of the present application. The development apparatus 600 may be the development apparatus 120 described above in FIG. 1. As shown in FIG. 6, the development apparatus 600 includes: a splitting unit 601, configured to split the AI model to obtain multiple split results, where each of the multiple split results includes a plurality of first sub-models, each of the plurality of first sub-models corresponds to at least one of M processors, M is a positive integer greater than 1, and each first sub-model can run on the corresponding at least one processor so that each split result has a running cost of running the plurality of first sub-models; a determining unit 602, configured to determine a first split result among the multiple split results, where a first running cost of the first split result is less than second running costs of one or more second split results among the multiple split results; and an output unit 604, configured to output the first split result.
Optionally, in some embodiments, the first running cost is the smallest among the running costs of the multiple split results. Optionally, in some embodiments, the first split result includes N first sub-models, where N is a positive integer greater than 2, and the development apparatus 600 further includes a merging unit 603.
The merging unit 603 is configured to merge at least two of the N first sub-models to obtain a third split result, where the at least two first sub-models are adjacent in execution order, the third split result includes X second sub-models, X is a positive integer greater than 1 and less than N, each of the X second sub-models corresponds to one or more of the M processors, and each second sub-model can run on the corresponding one or more processors so that the third split result has a third running cost of running the X second sub-models, the third running cost being less than the first running cost.
Optionally, in some embodiments, the AI model includes L first operators, where L is a positive integer greater than 2, and each of the plurality of first sub-models includes some of the L first operators.
Optionally, in some embodiments, the splitting unit 601 is specifically configured to:
split the AI model according to input information, where the input information includes at least one of the following: the execution order of the L first operators in the AI model, or attribute information of each of the L first operators, where L is a positive integer greater than 2.
Optionally, in some embodiments, the running cost includes: the cost of running each first sub-model, the communication cost between two first sub-models adjacent in execution order, and the scheduling cost of dispatching each first sub-model to the corresponding at least one processor.
The above running cost may be determined by means such as table lookup or formula calculation.
Optionally, in some embodiments, the M processors include at least two of the following: a central processing unit CPU, a neural network processing unit NPU, a graphics processing unit GPU, a digital signal processor DSP, a deep learning processing unit DPU, or a tensor processing unit TPU.
Optionally, in some embodiments, an input unit precedes the splitting unit 601 and is configured to obtain the AI model.
In this embodiment of the present application, the specific application form of the development apparatus 600 is not specifically limited. In some embodiments, the development apparatus 600 may be made available to users in the form of a software development kit (SDK). After inputting the AI model to be deployed into the SDK, the user selects the types of processors on which to deploy, and the SDK automatically outputs the split first sub-models and a computer program describing the running schedule among the first sub-models. The user can package the computer program and the multiple first sub-models into an Android application package (APK) for direct use with simple operations, or modify the computer program and integrate it into other applications.
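A toy facade in Python can mirror this workflow; every name below is an illustrative assumption rather than the actual SDK interface, and `coalesce` refers to the grouping sketch given earlier in this description:

```python
class SplitSDK:
    """Minimal stand-in for the SDK workflow described above."""

    def __init__(self, processors):
        self.processors = processors   # e.g. ["CPU", "NPU", "GPU"]

    def split(self, assignment):
        """assignment: ordered (operator, processor) pairs from analysis.
        Returns the first sub-models and a schedule/communication record
        like the one the output computer program would describe."""
        sub_models = coalesce(assignment)   # the grouping sketch from above
        schedule = [
            {"sub_model": i, "processor": proc,
             "send_to": sub_models[i + 1][1] if i + 1 < len(sub_models) else None}
            for i, (_ops, proc) in enumerate(sub_models)
        ]
        return sub_models, schedule

sdk = SplitSDK(["CPU", "NPU", "GPU"])
subs, sched = sdk.split([("op1", "CPU"), ("op2", "NPU"), ("op3", "NPU")])
print(subs)   # the split first sub-models
print(sched)  # the running schedule among the first sub-models
```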
It should be understood that FIG. 6 is only illustrative and does not constitute any limitation on the development apparatus 600 provided by the embodiments of the present application. For example, in some scenarios, the development apparatus 600 may further include a storage module, which may be used to store the processing results of the determining unit, the corresponding computer program, and the like.
In the embodiments of the present application, the AI model development device should include a processor. Optionally, in some implementations, the AI model development device may further include a memory. Below, with reference to FIG. 7, the case where the AI model development device includes a processor and a memory is introduced as an example.
FIG. 7 is a schematic structural diagram of an AI model development device 700 provided by an embodiment of the present application.
As shown in FIG. 7, the development device 700 includes a processor 701 and a memory 702. The processor 701 and the memory 702 communicate with each other through an internal connection path, transferring control and/or data signals. The memory 702 is configured to store a computer program, and the processor 701 is configured to call and run the computer program from the memory 702 to perform the method 200 and/or the method 300 described above.
Specifically, the functions of the processor 701 correspond to the specific functions of the splitting unit 601, the determining unit 602, and the merging unit 603 shown in FIG. 6, and are not repeated here.
Optionally, in some embodiments, the development device 700 may further include a receiver and/or an output device. The receiver may be configured to receive the AI model, and the functions of the output device correspond to the specific functions of the output unit 604 in FIG. 6, and are not repeated here.
FIG. 8 is a schematic structural diagram of a system 800 provided by an embodiment of the present application. As shown in FIG. 8, the system 800 includes: the AI model development apparatus 600 or the AI model development device 700.
An embodiment of the present application provides a computer program product which, when run on the development apparatus 600 or the development device 700, causes the development apparatus 600 or the development device 700 to perform the method 200 and/or the method 300 in the foregoing method embodiments.
A person of ordinary skill in the art may realize that the method steps and units described in connection with the embodiments disclosed herein can be implemented by electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the steps and compositions of each embodiment have been described above generally in terms of function. Whether these functions are performed by hardware or software depends on the specific application and design constraints of the technical solution. A person of ordinary skill in the art may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of the present application.
A person skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the systems, apparatuses, and units described above can refer to the corresponding processes in the foregoing method embodiments, and are not repeated here.
In the several embodiments provided in the present application, the disclosed systems, apparatuses, and methods may be implemented in other ways. For example, the apparatus embodiments described above are only illustrative; for example, the division of the units is only a logical function division, and there may be other divisions in actual implementation, for example multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may also be electrical, mechanical, or other forms of connection.
The units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purposes of the solutions of the embodiments of the present application.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The above descriptions are only specific implementations of the present application, but the protection scope of the present application is not limited thereto. Any person skilled in the art can easily think of various equivalent modifications or replacements within the technical scope disclosed in the present application, and these modifications or replacements shall all fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
In the above embodiments, implementation may be in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, implementation may be in whole or in part in the form of a computer program product. The computer program product includes one or more computer program instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions according to the embodiments of the present application are generated in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer program instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired or wireless manner. The computer-readable storage medium may be any available medium accessible by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available media may be magnetic media (for example, floppy disks, hard disks, magnetic tapes), optical media (for example, digital video discs (DVDs)), or semiconductor media (for example, solid-state drives), and the like.
A person of ordinary skill in the art can understand that all or part of the steps for implementing the above embodiments may be completed by hardware, or by a program instructing relevant hardware; the program may be stored in a computer-readable storage medium, and the aforementioned storage medium may be a read-only memory, a magnetic disk, an optical disc, or the like.
The above descriptions are only specific implementations of the present application, but the protection scope of the present application is not limited thereto. Any person skilled in the art can easily think of changes or replacements within the technical scope disclosed in the present application, and they shall all fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (17)

  1. A method for developing an artificial intelligence AI model, characterized by comprising:
    splitting the AI model to obtain multiple split results, wherein each of the multiple split results comprises a plurality of first sub-models, each of the plurality of first sub-models corresponds to at least one of M processors, M is a positive integer greater than 1, and each first sub-model can run on the corresponding at least one processor so that each split result has a running cost of running the plurality of first sub-models;
    determining a first split result among the multiple split results, wherein a first running cost of the first split result is less than second running costs of one or more second split results among the multiple split results; and
    outputting the first split result.
  2. The method according to claim 1, characterized in that, among the running costs of the multiple split results, the first running cost is the smallest.
  3. The method according to claim 1 or 2, characterized in that the first split result comprises N first sub-models, N being a positive integer greater than 2;
    the method further comprising:
    merging at least two of the N first sub-models to obtain a third split result, wherein the at least two first sub-models are adjacent in execution order, the third split result comprises X second sub-models, X is a positive integer greater than 1 and less than N, each of the X second sub-models corresponds to one or more of the M processors, and each second sub-model can run on the corresponding one or more processors so that the third split result has a third running cost of running the X second sub-models, the third running cost being less than the first running cost.
  4. The method according to any one of claims 1-3, characterized in that the AI model comprises L first operators, L being a positive integer greater than 2, and each of the plurality of first sub-models comprises some of the L first operators.
  5. The method according to any one of claims 1-4, characterized in that splitting the AI model comprises:
    splitting the AI model according to input information, the input information comprising at least one of the following: the execution order of the L first operators in the AI model, or attribute information of each of the L first operators, L being a positive integer greater than 2.
  6. The method according to any one of claims 1-5, characterized in that the running cost comprises:
    the cost of running each first sub-model,
    the communication cost between two first sub-models adjacent in execution order, and
    the scheduling cost of dispatching each first sub-model to the corresponding at least one processor.
  7. The method according to any one of claims 1-6, characterized in that the M processors comprise at least two of the following:
    a central processing unit CPU, a neural network processing unit NPU, a graphics processing unit GPU, a digital signal processor DSP, a deep learning processing unit DPU, or a tensor processing unit TPU.
  8. An apparatus for developing an artificial intelligence AI model, characterized by comprising:
    a splitting unit, configured to split the AI model to obtain multiple split results, wherein each of the multiple split results comprises a plurality of first sub-models, each of the plurality of first sub-models corresponds to at least one of M processors, M is a positive integer greater than 1, and each first sub-model can run on the corresponding at least one processor so that each split result has a running cost of running the plurality of first sub-models;
    a determining unit, configured to determine a first split result among the multiple split results, wherein a first running cost of the first split result is less than second running costs of one or more second split results among the multiple split results; and
    an output unit, configured to output the first split result.
  9. The apparatus according to claim 8, characterized in that, among the running costs of the multiple split results, the first running cost is the smallest.
  10. The apparatus according to claim 8 or 9, characterized in that the first split result comprises N first sub-models, N being a positive integer greater than 2, and the apparatus further comprises a merging unit,
    the merging unit being configured to merge at least two of the N first sub-models to obtain a third split result, wherein the at least two first sub-models are adjacent in execution order, the third split result comprises X second sub-models, X is a positive integer greater than 1 and less than N, each of the X second sub-models corresponds to one or more of the M processors, and each second sub-model can run on the corresponding one or more processors so that the third split result has a third running cost of running the X second sub-models, the third running cost being less than the first running cost.
  11. The apparatus according to any one of claims 8-10, characterized in that the AI model comprises L first operators, L being a positive integer greater than 2, and each of the plurality of first sub-models comprises some of the L first operators.
  12. The apparatus according to any one of claims 8-11, characterized in that the splitting unit is specifically configured to:
    split the AI model according to input information, the input information comprising at least one of the following: the execution order of the L first operators in the AI model, or attribute information of each of the L first operators, L being a positive integer greater than 2.
  13. The apparatus according to any one of claims 8-12, characterized in that the running cost comprises:
    the cost of running each first sub-model,
    the communication cost between two first sub-models adjacent in execution order, and
    the scheduling cost of dispatching each first sub-model to the corresponding at least one processor.
  14. The apparatus according to any one of claims 8-13, characterized in that the M processors comprise at least two of the following:
    a central processing unit CPU, a neural network processing unit NPU, a graphics processing unit GPU, a digital signal processor DSP, a deep learning processing unit DPU, or a tensor processing unit TPU.
  15. An artificial intelligence AI model deployment apparatus, characterized by comprising at least one processor, the at least one processor being configured to execute a computer program or instructions so that the apparatus performs the method according to any one of claims 1 to 7.
  16. An artificial intelligence AI model deployment apparatus, characterized in that the apparatus comprises a processor and a memory, the memory being configured to store instructions and the processor being configured to read the instructions stored in the memory to perform the method according to any one of claims 1 to 7.
  17. A computer-readable storage medium, characterized by being configured to store computer instructions which, when executed, implement the method according to any one of claims 1 to 7.
PCT/CN2020/136119 2020-12-14 2020-12-14 Development method and apparatus for artificial intelligence AI model WO2022126316A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202080107168.6A CN116472533A (zh) 2020-12-14 2020-12-14 Development method and apparatus for artificial intelligence AI model
PCT/CN2020/136119 WO2022126316A1 (zh) 2020-12-14 2020-12-14 Development method and apparatus for artificial intelligence AI model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/136119 WO2022126316A1 (zh) 2020-12-14 2020-12-14 Development method and apparatus for artificial intelligence AI model

Publications (1)

Publication Number Publication Date
WO2022126316A1 true WO2022126316A1 (zh) 2022-06-23

Family

ID=82058775

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/136119 WO2022126316A1 (zh) 2020-12-14 2020-12-14 Development method and apparatus for artificial intelligence AI model

Country Status (2)

Country Link
CN (1) CN116472533A (zh)
WO (1) WO2022126316A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117032936B (zh) * 2023-09-28 2024-02-06 之江实验室 Data scheduling method and apparatus, and computer device

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170243114A1 (en) * 2016-02-19 2017-08-24 International Business Machines Corporation Adaptation of model for recognition processing
CN110689121A (zh) * 2019-09-24 2020-01-14 上海寒武纪信息科技有限公司 Neural network model splitting method implemented with a multi-core processor, and related products
CN110826708A (zh) * 2019-09-24 2020-02-21 上海寒武纪信息科技有限公司 Neural network model splitting method implemented with a multi-core processor, and related products
CN111736986A (zh) * 2020-05-29 2020-10-02 浪潮(北京)电子信息产业有限公司 FPGA-accelerated execution method for a deep learning model, and related apparatus

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024114520A1 (zh) * 2022-11-28 2024-06-06 索尼集团公司 Electronic device, method, and storage medium for model inference
CN117155791A (zh) * 2023-10-31 2023-12-01 浪潮电子信息产业股份有限公司 Model deployment method, system, device, and medium based on cluster topology
CN117155791B (zh) * 2023-10-31 2024-02-13 浪潮电子信息产业股份有限公司 Model deployment method, system, device, and medium based on cluster topology

Also Published As

Publication number Publication date
CN116472533A (zh) 2023-07-21

Similar Documents

Publication Publication Date Title
WO2022126316A1 (zh) Development method and apparatus for artificial intelligence AI model
US20240104020A1 (en) Methods and systems for handling data received by a state machine engine
US9280329B2 (en) Methods and systems for detection in a state machine
US9448965B2 (en) Receiving data streams in parallel and providing a first portion of data to a first state machine engine and a second portion to a second state machine
US10489062B2 (en) Methods and systems for using state vector data in a state machine engine
KR101999590B1 (ko) 패턴 인식 프로세싱 시스템에서의 전력 관리를 위한 방법들 및 시스템들
US11379943B2 (en) Optimizing compilation of shaders
CN110650347B (zh) Multimedia data processing method and device
US20220100576A1 (en) Video processing method and device, electronic equipment and storage medium
EP4386579A1 (en) Retrieval model training method and apparatus, retrieval method and apparatus, device and medium
US20220012578A1 (en) Methods, apparatus, and articles of manufacture to increase utilization of neural network (nn) accelerator circuitry for shallow layers of an nn by reformatting one or more tensors
Koenen et al. Interpreting Deep Neural Networks with the Package innsight
US20140063025A1 (en) Pipelined Image Processing Sequencer
CN116483645A (zh) 设备虚拟调试方法、装置、设备、存储介质和程序产品
WO2021258964A1 (zh) Neural network structure search method, apparatus, and system
KR20200139909A (ko) Electronic apparatus and method for performing operations thereof
US11086634B2 (en) Data processing apparatus and method
Estivill-Castro et al. High-level executable models of reactive real-time systems with logic-labelled finite-state machines and FPGAs
CN114327958A (zh) Operation method of an inference service component, and TensorRT inference service component
CN110877332B (zh) Robot dance file generation method and apparatus, terminal device, and storage medium
CN114556408A (zh) Image rendering method, apparatus, and system, and computer-readable storage medium
CN111027682A (zh) Neural network processor, electronic device, and data processing method
WO2021017546A1 (zh) Neural network quantization method and apparatus, chip, electronic device, and board card
CN114298292A (zh) Device and method for obtaining operator data and performing offline model operations
CN114064298A (zh) Data processing method and apparatus, storage medium, and electronic apparatus

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20965336

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 202080107168.6

Country of ref document: CN

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20965336

Country of ref document: EP

Kind code of ref document: A1