WO2024108907A1 - Data processing method and apparatus, ai chip, electronic device, and storage medium - Google Patents


Info

Publication number
WO2024108907A1
Authority
WO
WIPO (PCT)
Prior art keywords
hardware
network model
execution command
hardware execution
memory space
Application number
PCT/CN2023/092113
Other languages
French (fr)
Chinese (zh)
Inventor
刘军
彭凡
杨媛静
王鸥
Original Assignee
成都登临科技有限公司
Priority claimed from CN202211486830.0A external-priority patent/CN115586972B/en
Priority claimed from CN202211486836.8A external-priority patent/CN115576699B/en
Application filed by 成都登临科技有限公司 filed Critical 成都登临科技有限公司
Publication of WO2024108907A1 publication Critical patent/WO2024108907A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]

Definitions

  • the present application belongs to the field of artificial intelligence technology, and specifically relates to a data processing method, device, AI chip, electronic device and storage medium.
  • AI: Artificial Intelligence.
  • AI chips: may be implemented as various kinds of processors.
  • hardware execution commands: also referred to as hardware commands, or simply commands.
  • the hardware execution commands usually need to include or reflect: the type or content of the operation, the read address of the data source required for the operation, and the write address for storing the operation results.
  • the processor relies on a large amount of memory information when translating the various operations of the network model into hardware execution commands.
  • the process of generating hardware execution commands for network models involves allocating and occupying memory resources, and the data of different network models occupies different memory spaces, which increases the pressure that network models place on limited memory resources.
  • because the data required to execute the network model occupies a large amount of memory, memory shortages arise easily, which makes it difficult to translate the hardware execution commands for the network model as expected, and in turn makes it difficult for the hardware to run the network model as expected.
  • this insufficient memory may also affect the performance of the hardware in other aspects.
  • one aspect of the present application provides a data processing method to address two problems: first, that a current processor incurs a large performance overhead and takes a long time each time it runs a network model for data processing; and second, that in the related art, generating hardware execution commands for a network model requires a large memory resource overhead, which easily leads to insufficient memory.
  • an embodiment of the present application provides a data processing method, which may include: obtaining a computational graph of a network model to be run; translating each operation in the computational graph of the network model into a hardware execution command that can be executed by a target hardware device of the AI chip, wherein the hardware execution command includes device information of the target hardware device; using a network execution graph to store the hardware execution command, wherein the network execution graph is used to record all hardware execution commands generated for the network model, and the target hardware device is used to run the network model by executing the hardware execution commands in the network execution graph.
  • translating each operation contained in the computation graph of the network model into a hardware execution command that can be executed by the target hardware device of the AI chip may include: compiling the source code of each operation in the computation graph of the network model into instructions, and obtaining relevant information required for the target hardware device to perform each operation; generating the hardware execution command according to the corresponding instructions of each operation and the relevant information required to execute each operation.
  • the preset first API function (such as object creation API, compile instruction API) can be used to compile the source code of each operation in the calculation graph of the network model into instructions
  • the preset second API function (such as memory allocation API, data transfer API) can be used to obtain the relevant information required for the target hardware device to perform each operation (such as the address and length of the instruction, the number of memory addresses that the instruction needs to operate, the execution order between instructions, etc.)
  • the preset third API function (such as execution API) is used to generate hardware execution commands according to the corresponding instructions of each operation and the relevant information required to execute each operation.
  • each operation in the calculation graph of the network model can be quickly and accurately translated into hardware execution commands that can be executed by the target hardware device.
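The three-stage translation flow described above (compile each operation's source code into instructions, gather the execution-related information, then assemble the hardware execution command carrying the target device's information) can be sketched as follows. All function names, field names, and the stubbed compilation step are illustrative assumptions, not the API of any actual driver.

```python
# Hypothetical sketch of the three translation stages; not a real driver API.

def compile_op(op_source: str) -> bytes:
    # Stage 1: compile the operation's source code into instructions
    # (stubbed here by simply encoding the source text).
    return op_source.encode("utf-8")

def gather_info(instructions: bytes) -> dict:
    # Stage 2: collect the information the target device needs to execute the
    # instructions: instruction address and length, operand memory slots, etc.
    return {"addr": 0x1000, "length": len(instructions), "operand_slots": 2}

def build_command(instructions: bytes, info: dict, device_id: int) -> dict:
    # Stage 3: assemble the hardware execution command; it carries the
    # device information of the target hardware device.
    return {"device": device_id, "instructions": instructions, **info}

def translate(graph_ops, device_id=0):
    commands = []
    for op in graph_ops:
        ins = compile_op(op)
        commands.append(build_command(ins, gather_info(ins), device_id))
    return commands
```

For example, `translate(["conv2d", "relu"])` would yield one command per operation, each tagged with the target device's identifier.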
  • using a network execution graph to store the hardware execution command may include: storing the hardware execution command corresponding to each operation in the network execution graph in sequence according to the execution order of each operation contained in the network model, and recording key information of each hardware execution command, wherein the key information is used to obtain the hardware execution command.
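The network execution graph described above can be sketched as a small container that appends commands in the model's execution order and records key information (here, a sequence number and the command's device) for later retrieval. The class and field names are assumptions for illustration.

```python
# Hypothetical sketch of a network execution graph: commands are stored in
# execution order, and key information is recorded to retrieve each one.

class NetworkExecutionGraph:
    def __init__(self):
        self._commands = []   # all hardware execution commands, in order
        self._key_info = []   # per-command key info used for retrieval

    def store(self, command: dict) -> None:
        self._key_info.append({"seq": len(self._commands),
                               "device": command.get("device")})
        self._commands.append(command)

    def fetch(self, seq: int) -> dict:
        # The key information (sequence number) locates the stored command.
        return self._commands[seq]

    def all_commands(self):
        return list(self._commands)
```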
  • the data processing method may also include: when it is necessary to run the network model, obtaining the hardware execution command pre-stored in the network execution graph; sending the hardware execution command to the target hardware device for execution, so that the target hardware device executes the hardware execution command to realize running the network model on the target hardware device.
  • sending the hardware execution command to the target hardware device for execution may include: modifying the read address used to obtain input data in the hardware execution command, and/or modifying the write address used to store output data in the hardware execution command; and sending the modified hardware execution command to the target hardware device for execution, so that the target hardware device executes the modified command, thereby running the network model on the target hardware device to process the input data.
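Reusing a pre-stored command for new data, as described above, only requires rewriting its input (read) and output (write) addresses before dispatch. A minimal sketch, with hypothetical field names:

```python
# Sketch of rebinding a stored command's I/O addresses before dispatch;
# field names ("read_addr", "write_addr") are illustrative assumptions.

def rebind_io(command: dict, new_read=None, new_write=None) -> dict:
    patched = dict(command)           # leave the stored command untouched
    if new_read is not None:
        patched["read_addr"] = new_read    # where the new input data lives
    if new_write is not None:
        patched["write_addr"] = new_write  # where the new output should go
    return patched
```

Copying before patching means the version kept in the network execution graph stays valid for the next run.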
  • the data processing method may further include: copying the hardware execution command according to the total number of hardware devices in the AI chip; modifying the device information contained in the copied hardware execution command according to the device information of other hardware devices in the AI chip except the target hardware device, to obtain a hardware execution command with modified device information, wherein the hardware execution command with modified device information can be provided to the other hardware devices for execution.
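The copy-and-repatch step above avoids translating the network model once per device: the commands are deep-copied per hardware device and only the device information is rewritten. A sketch under those assumptions:

```python
import copy

# Sketch of scaling one translated command set to several hardware devices
# by cloning the commands and patching only the device information.

def replicate_for_devices(commands, device_ids):
    per_device = {}
    for dev in device_ids:
        cloned = copy.deepcopy(commands)
        for cmd in cloned:
            cmd["device"] = dev   # rewrite the device information only
        per_device[dev] = cloned
    return per_device
```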
  • the data processing method may further include: determining a first number of hardware devices currently required to run the network model based on an amount of data to be processed.
  • translating each operation in the computation graph of the network model into a hardware execution command that can be executed by the target hardware device of the AI chip may include: allocating a corresponding virtual memory space to the network model; and based on the virtual memory space, translating each operation contained in the network model into a corresponding first hardware execution command, wherein the addresses in the first hardware execution command are all virtual addresses, and the virtual memory space has the same properties as the real memory space.
  • using a network execution graph to store the hardware execution command may include: after translating each operation contained in the network model into a corresponding first hardware execution command based on the virtual memory space, storing the first hardware execution command, wherein the first hardware execution command is used to be provided to a hardware device that needs to run the network model for execution after address replacement.
  • allocating a corresponding virtual memory space for the network model may include: allocating a virtual memory space corresponding to the data size required to execute the network model.
  • the data processing method may also include: determining whether the network model will be executed within a preset time period after the current moment; when it is determined that the network model will not be executed within that period, executing the step of translating, based on the virtual memory space, each operation contained in the network model into a corresponding first hardware execution command.
  • the data processing method may also include: when determining that the network model is to be executed within a preset time period starting from the current moment, based on the real memory space, translating each operation contained in the network model into a corresponding second hardware execution command, wherein the addresses contained in the second hardware execution command are all real addresses, and the real memory space is used to store the data required to execute the network model.
  • the data processing method may also include: when the network model needs to be executed, loading the data required to execute the network model into the real memory space; replacing the virtual addresses in the first hardware execution command with the real addresses corresponding to the real memory space; and sending the replaced first hardware execution command, as the second hardware execution command, to the corresponding hardware device for execution.
  • replacing the virtual address in the first hardware execution command with the real address corresponding to the real memory space may include: scanning the first hardware execution commands and determining, as target commands, the part or all of them that currently contain virtual addresses; and replacing the virtual addresses in the target commands with the real addresses corresponding to the real memory space.
  • the data processing method may also include: replacing the real address in the second hardware execution command with the virtual address corresponding to the virtual memory space, and caching the command whose address has been replaced.
  • translating each operation contained in the network model into a corresponding first hardware execution command can include: compiling the source code of each operation contained in the network model into instructions corresponding to each operation, and based on the virtual memory space, obtaining relevant information required to execute each operation contained in the network model, the relevant information including address information; and generating the first hardware execution command according to the corresponding instructions of each operation and the relevant information required to execute each operation.
  • different network models correspond to different virtual memory spaces.
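The virtual-address scheme above can be sketched as follows: commands are first generated against a per-model virtual memory space (so no real memory is consumed at translation time); at run time each virtual address is swapped for a real one, and the swap can be reversed for caching. The base addresses and fixed per-operation offsets are illustrative assumptions.

```python
# Sketch of translating against virtual addresses, then binding/unbinding
# real addresses at run time. Layout and offsets are hypothetical.

VIRT_BASE = 0x8000_0000  # per-model virtual base; holds no real memory

def translate_virtual(ops):
    # First hardware execution commands: every address is virtual.
    return [{"op": op, "addr": VIRT_BASE + i * 0x100}
            for i, op in enumerate(ops)]

def bind_real(commands, real_base):
    # Replace each virtual address with a real one when the model runs,
    # yielding the second hardware execution commands.
    return [{**c, "addr": real_base + (c["addr"] - VIRT_BASE)}
            for c in commands]

def unbind_real(commands, real_base):
    # Reverse replacement, so commands can be cached address-independently.
    return [{**c, "addr": VIRT_BASE + (c["addr"] - real_base)}
            for c in commands]
```

Because translation touches only virtual addresses, many models can be pre-translated without competing for the limited real memory.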
  • an embodiment of the present application also provides a data processing method, which may include: when it is necessary to run a network model, obtaining a pre-stored hardware execution command that can be executed by a target hardware device corresponding to the network model; sending the hardware execution command to the target hardware device so that the target hardware device executes the hardware execution command to achieve the purpose of running the network model on the target hardware device to process the input data.
  • obtaining a pre-stored hardware execution command that can be executed by a target hardware device corresponding to the network model may include: when it is necessary to execute the network model, loading the network original data corresponding to the network model into a real memory space, and obtaining a pre-stored first hardware execution command, wherein the first hardware execution command is obtained by translating each operation contained in the network model based on a virtual memory space, and the virtual memory space has the same properties as the real memory space; and replacing the virtual address in the first hardware execution command with the real address corresponding to the real memory space.
  • after replacing the virtual address in the first hardware execution command with the real address corresponding to the real memory space, the data processing method also includes: sending the replaced first hardware execution command as the second hardware execution command to the corresponding hardware device.
  • the data processing method may also include: replacing the real address in the second hardware execution command with the virtual address corresponding to the virtual memory space, and caching the second hardware execution command whose address has been replaced.
  • an embodiment of the present application also provides a data processing device, which may include: an acquisition module, a command generation module and a storage module, wherein the acquisition module is configured to: acquire the calculation graph of the network model to be run; the command generation module is configured to: translate each operation in the calculation graph of the network model into a hardware execution command that can be executed by the corresponding target hardware device, and the hardware execution command contains the device information of the target hardware device; the storage module is configured to: store the hardware execution command using a network execution graph, wherein the network execution graph is used to record all hardware execution commands generated for the network model, and the target hardware device can run the network model by executing the hardware execution commands in the network execution graph.
  • the command generation module may include: an allocation module and a translation module, wherein the allocation module is configured to: allocate a corresponding virtual memory space for the network model; the translation module is configured to: based on the virtual memory space, translate each operation contained in the network model into a corresponding first hardware execution command, the addresses in the first hardware execution command are all virtual addresses, and the virtual memory space has the same properties as the real memory space.
  • the storage module is configured to: store the first hardware execution command, and the first hardware execution command is used to be provided to the hardware device that needs to run the network model for execution after the address is replaced.
  • an embodiment of the present application further provides a data processing device, which may include: an acquisition module and a sending module, wherein the acquisition module is configured to: when it is necessary to run a network model, acquire a pre-stored hardware execution command that can be executed by a target hardware device corresponding to the network model; the sending module is configured to: send the hardware execution command to the target hardware device so that the target hardware device executes the hardware execution command to achieve the purpose of running the network model on the target hardware device to process input data.
  • the acquisition module may include: a first hardware execution command acquisition module and a translation module, wherein the first hardware execution command acquisition module is configured to: when the network model needs to be executed, load the network original data corresponding to the network model into the real memory space, and obtain the pre-stored first hardware execution command; wherein the first hardware execution command is obtained by translating each operation contained in the network model based on the virtual memory space, and the virtual memory space has the same properties as the real memory space; the translation module is configured to: replace the virtual address in the first hardware execution command with the real address corresponding to the real memory space.
  • the sending module is further configured to: send the replaced first hardware execution command as the second hardware execution command to the corresponding hardware device.
  • an embodiment of the present application further provides an AI chip, which may include: a kernel and a storage device, wherein the kernel is configured to: obtain a computational graph of a network model to be run, and translate each operation in the computational graph of the network model into a hardware execution command executable by a target hardware device, wherein the hardware execution command contains device information of the target hardware device; the storage device is configured to: store the hardware execution command using a network execution graph, wherein the network execution graph is used to record all hardware execution commands generated for the network model, and the target hardware device can be used to run the network model by executing the hardware execution commands in the network execution graph.
  • the kernel is configured to: allocate a corresponding virtual memory space for the network model, and based on the virtual memory space, translate each operation contained in the network model into a corresponding first hardware execution command, the addresses in the first hardware execution command are all virtual addresses, and the virtual memory space has the same properties as the real memory space;
  • the storage device is configured to: store the first hardware execution command, wherein the first hardware execution command is to be provided, after address replacement, to the hardware device that needs to run the network model.
  • an embodiment of the present application also provides an AI chip, which may include: a hardware device, a storage device and a kernel, wherein the storage device is configured to: store hardware execution commands corresponding to each operation in the computational graph of the network model; the kernel is configured to: when it is necessary to run the network model, obtain the previously stored hardware execution commands from the storage device, and send the hardware execution commands to the hardware device; the hardware device is configured to: execute the hardware execution commands to achieve the purpose of running the network model to process the input data.
  • the storage device is also configured to: store a first hardware execution command; wherein the first hardware execution command is obtained by translating each operation contained in the network model based on a virtual memory space, and the virtual memory space has the same properties as the real memory space;
  • the kernel is also configured to: when the network model needs to be executed, load the network original data corresponding to the network model into the real memory space, obtain the first hardware execution command stored in the storage device, replace the virtual address in the first hardware execution command with the real address corresponding to the real memory space, and send the replaced first hardware execution command as the second hardware execution command to the hardware device;
  • the hardware device is also configured to: execute the second hardware execution command to achieve the purpose of running the network model to process the input data.
  • an embodiment of the present application further provides an electronic device, which may include: a memory and a processor, wherein the processor is connected to the memory, the memory is configured to store a program, and the processor is configured to call the program stored in the memory to execute the data processing method provided by the above-mentioned first aspect embodiment and/or any possible implementation thereof, or the data processing method provided by the above-mentioned second aspect embodiment and/or any possible implementation thereof.
  • an embodiment of the present application further provides a computer-readable storage medium, on which a computer program is stored.
  • when executed, the computer program performs the data processing method provided by the above-mentioned first aspect embodiment and/or any possible implementation thereof, or the data processing method provided by the above-mentioned second aspect embodiment and/or any possible implementation thereof.
  • FIG. 1 shows a schematic flow chart of a data processing method provided in an embodiment of the present application.
  • FIG. 2 is a flow chart showing some steps in a data processing method provided in an embodiment of the present application.
  • FIG. 3 shows another flowchart of some steps in a data processing method provided in an embodiment of the present application.
  • FIG. 4 shows a flow chart of another data processing method provided in an embodiment of the present application.
  • FIG. 5 is a flow chart showing some steps in another data processing method provided in an embodiment of the present application.
  • FIG. 6 shows a module diagram of a data processing device provided in an embodiment of the present application.
  • FIG. 7 shows a more detailed module diagram of a data processing device provided in an embodiment of the present application.
  • FIG. 8 shows a module diagram of another data processing device provided in an embodiment of the present application.
  • FIG. 9 shows a more detailed module diagram of yet another data processing device provided in an embodiment of the present application.
  • FIG. 10 shows a schematic structural diagram of an electronic device provided in an embodiment of the present application.
  • FIG. 11 shows a schematic structural diagram of another electronic device provided in an embodiment of the present application.
  • a and/or B in this application is merely a term used to describe the association relationship between associated objects, indicating that three relationships may exist.
  • a and/or B may represent three situations: A exists alone, A and B exist at the same time, and B exists alone.
  • first and second in this application are only used to distinguish one entity or operation or object from another entity or operation or object, and do not require or imply any actual relationship or order between these entities or operations or objects.
  • the embodiments of the present application involve application scenarios in which network models (various neural network models) are used for data processing.
  • network models: various neural network models.
  • the relevant terms and concepts that may be involved in the embodiments of the present application are first introduced below.
  • the neural network model is composed of neural units and can be understood as a model with an input layer, hidden layers, and an output layer.
  • the first layer is the input layer
  • the last layer is the output layer
  • the layers in between are all hidden layers (there can be many hidden layers).
  • the neural network model uses one or more layers (such as hidden layers, output layers, etc.) to generate outputs for received inputs.
  • the output of each hidden layer is used as the input of the next layer (such as the next hidden layer or output layer).
  • Each layer of the neural network model generates outputs for received inputs according to the current relevant parameters of the layer (such as weights).
  • Each operation contained in the neural network model (such as convolution, pooling, activation, normalization, classification, etc.) needs to be translated into hardware execution commands before the hardware device can execute them.
  • the hardware device implements the functions of the corresponding operations in the network model by executing these hardware execution commands, thereby supporting the function of running the neural network on the hardware device to process the input data.
  • a computational graph is often used to reflect the computational logic of the network model.
  • Each node in the computational graph can correspond to an operation in the network model. These operations are also called operators in the network model. Each operator has its own unique features and is used to complete specific functions.
  • the computational graph of a network model usually contains many different operations, such as convolution operations, pooling operations, activation functions, etc.
  • the inventors proposed the following embodiments according to the characteristics of the network model to improve the above problems.
  • the inventor of the present application observes that when a network model is used for data processing, the structure of the network model itself is fixed; however, the input data processed each time the network model is loaded into the hardware for execution may differ, and different input data may produce different output results.
  • the present application generates (translates) in advance, for each operation contained in the network model, the hardware execution commands that can be executed by the corresponding target hardware device, but does not immediately send them to the target hardware device (that is, the hardware device capable of executing the hardware execution commands). Instead, the hardware execution commands corresponding to each operation are first stored (for example, in the constructed network execution graph). Each time the network model is needed to process input data (for recognition, classification, feature extraction, size transformation, etc.), the pre-stored hardware execution commands can then be distributed to the corresponding hardware for execution, which helps to quickly load the computing logic and computing tasks of the network model onto the hardware that needs to run it.
  • the embodiment of the present application provides a data processing method that can be applied to network models used in various artificial intelligence application scenarios.
  • Artificial intelligence application scenarios include but are not limited to: text processing, speech recognition and processing, multi-language translation, image recognition, biometric recognition, and intelligent control.
  • the data processing method can be implemented in a driver program and applied to an AI chip, which can be a homogeneous processor or a heterogeneous processor.
  • computational graphs are a commonly used method for representing computational processes. They are often used to represent the computational logic of neural network model design and are widely used in various data processing platforms.
  • Each node in the computational graph represents the corresponding operation (i.e., operator or operation) that the network model needs to perform, and the directed edges between nodes represent the dependencies between the operations corresponding to the nodes.
  • after each operation (or operator) in the computational graph is translated into a hardware execution command, the command is sent to the corresponding hardware device for execution, thereby completing the execution of the network model.
  • the operators corresponding to the nodes in the computational graph can be defined at the granularity of algebraic operators (such as vector addition, subtraction, multiplication, division, and matrix multiplication). When the abstraction granularity of the operators is this low, the computational graph of a network model may contain many nodes (thousands, for example).
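The graph structure described above can be sketched directly: nodes are operators, directed edges are dependencies, and a topological order over the dependencies gives a valid order in which to translate and execute the operators. This sketch uses Python's standard-library `graphlib`.

```python
from graphlib import TopologicalSorter

# Sketch of a computational graph as a dependency map: each node maps to the
# set of nodes it depends on. A topological sort yields an execution order
# in which every operator runs after the operators it depends on.

def execution_order(edges):
    return list(TopologicalSorter(edges).static_order())
```

For a chain conv -> relu -> pool, `execution_order({"conv": set(), "relu": {"conv"}, "pool": {"relu"}})` places conv before relu before pool.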
  • the computation graph of the network model to be run, obtained in step S1, may be the original computation graph or an optimized one, for example a computation graph obtained after operator fusion. After the network structure of the network model is converted into the original computation graph, the graph may be optimized one or more times to obtain an optimized computation graph.
  • the AI chip can directly or indirectly obtain the calculation graph of the network model, as long as the structure of the network model can be determined and the various operations that need to be implemented by the network model can be known.
  • the AI chip is provided with a corresponding driver, which can be deployed in the kernel of the AI chip.
  • the translation process of S2 can be executed by the driver corresponding to the AI chip.
  • the translated hardware execution command may contain the device information of the target hardware device (such as the device identification), which is used to indicate which hardware device can execute the hardware execution command. Different target hardware devices have different corresponding device information. After the operation in the network model is translated, the obtained hardware execution command can be provided to the corresponding hardware device for execution when the network model corresponding to the hardware execution command needs to be run.
  • the target hardware device refers to the hardware device that runs the hardware execution command and is the hardware object that is expected to have the ability to run the network model.
  • An AI chip may involve multiple hardware devices.
  • an AI chip can be a dedicated computing acceleration chip (or accelerator) designed to undertake heavy computing tasks, such as a graphics processing unit (GPU), a tensor processing unit (TPU), etc.
  • an AI chip can also be other homogeneous or heterogeneous processors.
  • an AI chip may include multiple hardware devices, any of which may be used as a target hardware device according to actual needs.
  • a hardware device may include multiple hardware execution units.
  • a hardware device in an AI chip may include but is not limited to: a first unit (CU, Compute engine Unit) for general computing, a second unit (TU, Tensor Unit) for AI accelerated computing, and a third unit (DMA, Direct Memory Access) for data transfer, etc.
  • a hardware device in an AI chip can also be regarded as a computing cluster containing multiple hardware execution units. The number of hardware execution units contained in different types of hardware devices may be different, and the types may also be different.
  • the specific hardware architecture should not be understood as a limitation on the embodiments of the method of the present application.
  • S2 may include: compiling the source code of each operation in the computational graph of the network model into instructions (hardware machine instructions), and obtaining the relevant information required for the target hardware device to perform each operation; generating a hardware execution command according to the corresponding instructions of each operation and the relevant information required to perform each operation.
  • the relevant information required for the target hardware device to perform an operation can reflect the following information related to the operation: the address and length of the hardware instruction, how many memory addresses the instruction operates on, where those memory addresses are located, how large the memory regions are, the processing order between instructions, and so on.
  • the source code of each operation in the computational graph of the network model can be compiled into instructions using a preset first API (Application Programming Interface) function, and the relevant information required for the target hardware device to perform each operation can be obtained using a preset second API function; a preset third API function can be used to generate hardware execution commands based on the corresponding instructions of each operation and the relevant information required to perform each operation.
  • hardware execution commands corresponding to each operation can be generated in advance for each operation of the network model. There may be hundreds of hardware execution commands generated for one operation.
  • the calculation graph of the network model contains many different operations (also called operators, each operator has its own unique features and is used to complete specific functions), such as convolution operations, pooling operations, activation functions, etc.
  • the driver provides a set of relatively general API functions, such as object creation API, compilation instruction API, memory allocation API, data transfer API, execution API, etc.
  • for each operation of the network model, the driver provides a programmable language with C++-like syntax, and the source code of the operation can be implemented using this syntax.
  • the driver also uses a preset first API function (such as an object creation API, a compilation instruction API) to compile the source code of an operation in the computational graph of the network model into a hardware instruction corresponding to the operation through a compiler.
  • a memory allocation API provided by the driver can be used to allocate a space on the memory and provide it to the convolution operation.
  • some operations may involve moving data, so the data handling API provided by the driver is used to transfer data during the operation.
  • the driver can obtain the relevant information required for the target hardware device to perform each operation by using the preset second API function (such as the aforementioned memory allocation API, data handling API), and can use the preset third API function (such as the execution API) to generate hardware execution commands according to the corresponding instructions of each operation and the relevant information required to execute each operation.
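As a rough illustration of the three-stage flow described above (compile each operation's source into instructions, collect the information needed to execute it, then assemble a hardware execution command), the following sketch may help. All function names, the toy allocator, and the command layout are hypothetical, not the driver's actual API:

```python
# Hypothetical sketch of the three-stage command generation flow; every name
# here is illustrative, not the driver's real API.

def compile_op(source: str) -> bytes:
    """Stand-in for the first (compile-instruction) API: turn an operation's
    source code into hardware machine instructions (here just encoded bytes)."""
    return source.encode("utf-8")

def collect_exec_info(op_name: str, alloc) -> dict:
    """Stand-in for the second API (memory allocation / data handling):
    gather the memory addresses, sizes, etc. needed to run the operation."""
    addr, size = alloc(op_name)
    return {"op": op_name, "mem_addr": addr, "mem_size": size}

def build_command(instr: bytes, info: dict, device_id: int) -> dict:
    """Stand-in for the third (execution) API: bundle instructions and
    execution info into a command tagged with the target device."""
    return {"device": device_id, "instructions": instr, **info}

def make_allocator(base=0x1000, size=1024):
    """Toy allocator handing out fixed-size regions at consecutive addresses."""
    state = {"next": base}
    def alloc(_name):
        addr = state["next"]
        state["next"] += size
        return addr, size
    return alloc
```

A computational graph would then be walked once, each operation translated through these three stages, and the resulting commands cached rather than dispatched immediately.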
  • the process of generating the hardware execution command corresponding to the network model for the network model only needs to be done once, and the generated hardware execution command is first cached, for example, stored in the constructed network execution graph.
  • the command is distributed based on the hardware execution command stored in the constructed network execution graph to enable the hardware to execute these commands.
  • the translated hardware execution command is stored.
  • the constructed network execution graph can be used for storage.
  • the network execution graph can also be used to reflect the computing logic of the network model and can be regarded as a new computing graph. However, the network execution graph does not need to record or store the source code of each operation like the original computing graph of the network model.
  • the network execution graph is used to record all hardware execution commands generated for the network model, and can also be used to record the key information of each hardware execution command.
  • the key information may include the starting address, offset, and command execution order, etc. The length and storage location of the command can be known based on the starting address and offset.
  • the key information is used to obtain the hardware execution command, and the target hardware device can obtain the hardware execution command based on the key information.
  • the network execution graph stores all commands that the hardware needs to execute for the network model. Storing the hardware execution commands in the constructed network execution graph converts the various operations contained in the network model (including their characteristic parameters) into commands that the hardware can recognize and execute. Once the network execution graph, or the commands it contains, are provided to the target hardware device, the target hardware device can run the network model based on the hardware execution commands cached in the network execution graph.
  • the storage device in the AI chip may use the network execution graph to store the hardware execution command.
  • the network execution graph may be located on the target hardware device or not. For example, it may be located on a storage device connected to the target hardware device.
  • each operation in the calculation graph of the network model is first translated into a hardware execution command that can be executed by the target hardware device, but it is not sent to the target hardware device for execution first.
  • the translated hardware execution commands are first stored using the network execution graph, so that each time the network model needs to be run subsequently, the pre-stored hardware execution commands can be distributed to the corresponding hardware for execution, and there is no need to re-translate each operation in the calculation graph of the network model into hardware execution commands, thereby improving the problem that the processor requires a large performance overhead and a long time each time it runs the network model.
  • the process of using a network execution graph to store translated hardware execution commands can store the hardware execution commands corresponding to each operation in sequence, according to the execution order of the operations contained in the network model, while recording the key information of each command. Compared with storing commands in arbitrary order, storing them in execution order improves the efficiency of subsequent execution: since the commands must ultimately be executed in the order of the operations they implement in order to guarantee the normal operation of the network model, commands stored in that order can later be dispatched directly in storage order.
  • because the hardware execution commands corresponding to each operation are stored in the network execution graph in turn, and the key information of each command is recorded, the computation logic of the network (the execution order of the operations in the network model) can later be quickly recovered from the network execution graph. When the network model is executed, the corresponding hardware execution commands can then be sent to the target hardware device in turn, according to the key information recorded in the graph and the execution order of the operations, thereby avoiding execution-logic errors and improving efficiency.
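A minimal sketch of such an execution-order store, assuming an invented in-memory layout (a packed byte buffer plus per-command key information of start address, offset, and order), might look like:

```python
# Illustrative sketch (not the patent's concrete format) of a network
# execution graph: commands are appended in the execution order of the
# model's operations, and key information (start, offset, order) is
# recorded so each command can be located and dispatched in order.

class NetworkExecutionGraph:
    def __init__(self):
        self.buffer = bytearray()   # packed hardware execution commands
        self.key_info = []          # start / offset / order per command

    def append(self, command: bytes):
        self.key_info.append({"start": len(self.buffer),
                              "offset": len(command),
                              "order": len(self.key_info)})
        self.buffer += command

    def commands(self):
        """Yield commands in recorded execution order via the key info."""
        for k in sorted(self.key_info, key=lambda k: k["order"]):
            yield bytes(self.buffer[k["start"]:k["start"] + k["offset"]])
```

Dispatch then simply iterates `commands()` in storage order, which by construction is the operations' execution order.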
  • because a network model performs the same computing operations on each run, differing only in input data and output results, all the operations contained in the network model can be translated into a command sequence and stored in advance through the driver program; each time the network model is executed, the hardware execution commands only need to be fine-tuned.
  • the read address used to obtain input data in the hardware execution command is replaced with the address where the new data is located, and/or the write address used to store output data is replaced with a new write address, so that the same AI model can process the new input data and store the corresponding output result in a new location. This greatly reduces the burden on the processor and improves the efficiency of data processing.
  • this method can be applied to scenarios with multiple network models, that is, when there are multiple network models, correspondingly, for each network model, each operation contained in the network model can be translated into a hardware execution command that can be executed by the corresponding target hardware device, the hardware execution command contains the device information of the target hardware device, and the translated hardware execution command is stored.
  • one network model corresponds to a unique network execution graph.
  • the pre-stored hardware execution command corresponding to the required network model can be selected, so as to distribute the command so that the hardware executes these hardware execution commands corresponding to the corresponding network model.
  • the data processing method may further include S4: when it is necessary to run the network model to process the input data, obtain a pre-stored hardware execution command, and send the hardware execution command to the target hardware device for execution, so as to implement the operation of the network model on the target hardware device. For example, according to the execution order of each operation included in the network model, the corresponding hardware execution command is sent to the target hardware device for execution in sequence, thereby supporting the function of running the network model on the target hardware device to process the input data.
  • the commands can be distributed directly based on the pre-stored hardware execution commands to enable the hardware to execute these commands.
  • sending the hardware execution command to the target hardware device for execution may include: modifying the read address in the hardware execution command for obtaining input data, and/or modifying the write address in the hardware execution command for storing output data; sending the modified hardware execution command to the target hardware device for execution, so that the target hardware device executes the modified hardware execution command, and realizes the function or purpose of running the network model on the hardware to process the input data.
  • the modified corresponding hardware execution commands can be sent to the target hardware device for execution in sequence according to the execution order of each operation contained in the network model, so that the target hardware device executes the modified hardware execution command.
  • the modified corresponding hardware execution command is then sent to the target hardware device for execution.
  • input data can be obtained from different places, and output data can be stored in different places, which has better flexibility.
  • the hardware execution commands generated for a target hardware device can be quickly expanded to other hardware devices in the AI chip, so that when multiple hardware devices are required to run the network model in parallel, there is no need to re-translate each operation in the calculation graph of the network model into corresponding hardware execution commands for different hardware devices respectively, which further reduces the performance overhead of the processor and improves the efficiency of data processing.
  • the read address used to obtain the input data in the hardware execution command can be modified (that is, the read address is changed from the original position A to position C), and the write address used to store the output data in the hardware execution command can be modified (such as changing the write address from the original position B to position D).
  • the modified hardware execution command is sent to the target hardware device for execution in sequence according to the execution order of each operation contained in the network model, so that the target hardware device executes the modified hardware execution command, thereby realizing that when the target hardware device runs the network model, the input data stored at position C can be processed and the processed output data can be stored in position D.
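The address fine-tuning step above can be sketched as follows; the dictionary layout of a command is invented for illustration:

```python
# Minimal sketch of the address fine-tuning step: before dispatch, the
# cached command's read address (input) and/or write address (output) is
# replaced so the same command sequence can process new data.

def patch_addresses(command: dict, new_read=None, new_write=None) -> dict:
    patched = dict(command)                 # leave the cached copy intact
    if new_read is not None:
        patched["read_addr"] = new_read     # e.g. position A -> position C
    if new_write is not None:
        patched["write_addr"] = new_write   # e.g. position B -> position D
    return patched
```

The cached command is copied rather than mutated, so the stored sequence stays reusable for the next run with yet another pair of addresses.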
  • the data processing method may also include: S5: According to the total number of hardware devices in the AI chip, copy the hardware execution command obtained in S2; S6: According to the device information of other hardware devices in the AI chip except the target hardware device, modify the device information contained in the copied hardware execution command to obtain the hardware execution command with the modified device information, wherein the hardware execution command with the modified device information can be provided to other hardware devices for execution.
  • the hardware execution command is copied a specified number of times, where the specified number is the total number of hardware devices in the AI chip.
  • each copied hardware execution command is modified according to the device information of the hardware devices other than the target hardware device in the AI chip (that is, the device information in the hardware execution command is modified), so that each modified copied hardware execution command can be executed by other hardware devices.
  • the hardware execution command generated for the network model can be sent to each hardware device with matching device information in the AI chip according to the principle of device information matching, so that each hardware device in the AI chip can obtain the hardware execution command that it can execute, so that each hardware device in the AI chip can run the network model. It is understandable that this can also be done by sending a network execution graph.
  • the distribution of commands means copying the network execution graph into multiple copies, modifying the device information in the network execution graph, and then sending the multiple network execution graphs with different device information to each hardware device with matching device information in the AI chip.
  • the hardware execution command generated for the target hardware device can be copied twice, and for one of the copied hardware execution commands, the copied hardware execution command is modified according to the device information of hardware device 1, so that the modified copied hardware execution command can be executed by hardware device 1.
  • the other copied hardware execution command can be modified according to the device information of hardware device 2, so that the modified other copied hardware execution command can be executed by hardware device 2.
  • the hardware execution commands that have been translated for the network model are expanded in the same chip so that each hardware device in the chip can run the network model.
  • the hardware execution command generated for one hardware device can be quickly extended to other hardware devices.
  • the network execution graph can first be copied for the other hardware devices, and the device-related information in each copy then modified (i.e., the network execution graph is fine-tuned). In other words, based on hardware execution commands that have already been generated and cached, multiple copies are made and cached for multiple hardware devices, and the information in each copied command is modified to match its hardware device.
  • the information modification here is fine-tuning, and the purpose is to modify the copied hardware execution command to a command that is adapted to each hardware device.
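A hedged sketch of S5/S6 (copying the cached commands once per hardware device and rewriting only the device information) could look like this; the command representation is hypothetical:

```python
# Hypothetical sketch of S5/S6: the commands generated for the target
# device are duplicated once per other device in the chip, and only the
# device information in each copy is rewritten (fine-tuned) so that every
# hardware device receives commands it can execute.

def replicate_for_devices(commands, all_device_ids, target_device_id):
    per_device = {target_device_id: list(commands)}
    for dev in all_device_ids:
        if dev != target_device_id:
            # copy each command, changing only its device information
            per_device[dev] = [dict(c, device=dev) for c in commands]
    return per_device
```

Because only the device field changes, no re-translation of the network model's operations is needed to run it on additional devices in parallel.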
  • the number of hardware devices that need to run the network model at the current moment can be manually configured or determined based on the amount of data to be processed.
  • the data processing method may also include: determining the first number of hardware devices that need to run the network model based on the amount of data to be processed. The first number is less than or equal to the total number mentioned above. Each time the network model is run, it is not necessary to use up all the hardware devices in the chip. The specific number of hardware devices that need to be run can be determined based on the actual application scenario.
  • because the processor occupies memory space while generating hardware execution commands (referred to below simply as hardware commands or commands) for each network model, and the data required by different network models occupies different amounts of memory, memory can easily run short, leaving the hardware unable to execute the network model as expected and possibly degrading hardware performance.
  • the embodiment of the present application proposes a command generation method to improve the problem that each time the processor generates a hardware execution command for a network model, it needs to occupy a large amount of memory resource overhead, which easily leads to insufficient memory.
  • translating each operation in the computational graph of the network model into a hardware execution command that can be executed by the target hardware device of the AI chip may include the following S21 and S22.
  • a virtual memory space matching the size of the data required to execute the network model can be allocated; this data includes the input data to be processed by the network model, and may also include the characteristic data of the network model itself (such as weights and parameters). Since the virtual memory space is allocated according to the size of the data required to execute the network model, the requirements of command generation can be met without occupying the real memory resources of the hardware.
  • the creation or allocation of fake memory space can be regarded as occupying essentially no physical storage space. Even if a small amount of real physical storage must be used to allocate, record or mark the fake memory space, the total amount occupied is on the order of 1 KB to a few KB (these figures are only examples).
  • in other words, allocating fake memory space does not occupy the real memory of the hardware device that is to run the network model, although the process of allocating or recording the fake memory space may occupy a small amount of storage on hardware that does not need to run the model. Since the total amount occupied is extremely small (for example, about 1 KB), this occupation can be ignored, and the allocation is regarded as occupying no real memory and no real physical storage space.
  • a fake memory space of 2G can be allocated.
  • the fake memory space is like real memory space in that each storage row has an independent address, and its size is consistent with the real memory space that the data required to cache and execute the network model would be expected to occupy.
  • however, 2G of physical memory is not actually allocated.
  • the fake memory space has the same properties as the real memory space, such as the space size (the fake memory space can be the same size as the real memory space) and independent addresses (the address format and address search method can also be designed like the real memory).
  • the only difference is that the fake memory space is not a real physical storage space.
  • the fake memory allocated here is neither real physical memory nor traditional virtual memory (sometimes also called logical memory), which must be mapped to physical memory.
  • the address of the fake memory can be regarded as a fake address, which is a fake address deliberately created and allocated to meet the need of generating hardware execution commands.
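One way to picture such a fake memory space is a trivial bump allocator that hands out addresses from a deliberately non-physical range while keeping only a few bytes of real bookkeeping. The base address and sizes below are examples, not values from this application:

```python
# Illustrative fake-memory allocator: it hands out addresses without backing
# them with physical storage, so only the small bookkeeping below (a couple
# of integers) occupies real memory, however large the fake space is.

class FakeMemory:
    BASE = 0xF000_0000_0000          # deliberately non-physical base address

    def __init__(self, size: int):
        self.size = size             # e.g. 2 << 30 for a "2G" fake space
        self.used = 0                # bookkeeping only, not 2G of storage

    def alloc(self, nbytes: int) -> int:
        if self.used + nbytes > self.size:
            raise MemoryError("fake memory space exhausted")
        addr = FakeMemory.BASE + self.used
        self.used += nbytes
        return addr                  # a fake address: valid for command
                                     # generation, not for real data access
```

Addresses returned by `alloc` can be looked up, offset, and embedded in commands exactly like real addresses, which is all that command generation requires.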
  • each operation (i.e., each operator) contained in the network model is translated into a corresponding first hardware execution command; the first hardware execution commands generated for one operation may number in the hundreds. Since the command generation process is based on the fake memory space, it does not occupy real memory space. In this way, even if first hardware execution commands are generated for multiple network models, memory space will not run short.
  • the hardware execution commands of each network model usually need to include, for the various operations, a read address for obtaining the data source required by the operation, a write address for storing the operation result, and other operation information; therefore, addresses for reading and writing data must be available in order to generate a command.
  • the hardware execution commands are generated based on (the addresses of) the fake memory space, so the addresses in the commands are all fake addresses; when it is determined that certain commands need to be executed, the addresses in those commands are replaced. In this way, hardware execution commands can be translated in advance for each operation of the network model, which improves the execution efficiency of the network model, while the advance translation does not occupy excessive memory resources.
  • S22 can be executed by the core of the AI chip, and a driver for translating each operation included in the network model into corresponding hardware execution commands can be deployed in the core.
  • the implementation process of S22 may include: compiling the source code of each operation included in the network model into instructions corresponding to each operation, and obtaining relevant information required to execute each operation included in the network model based on the virtual memory space; generating a first hardware execution command according to the corresponding instructions of each operation and the relevant information required to execute each operation.
  • the source code of each operation included in the network model is compiled into instructions, and based on the virtual memory space, the relevant information required to execute each operation included in the network model is obtained, and then the hardware execution command is generated according to the corresponding instructions of each operation and the relevant information required to execute each operation.
  • each operation included in the network model can be quickly and accurately translated into the corresponding hardware execution command, and the relevant information required to execute each operation included in the network model is obtained based on the virtual memory space, and will not occupy the real memory resources of the hardware while meeting the requirements for generating commands.
  • in order to translate the various operations contained in the network model into hardware execution commands that can be executed by the hardware, the driver provides a set of relatively general API functions, such as a create-compiled-object API, a compile-instruction API, a create-memory API, a data transfer API, an execution API, etc.
  • for each operation of the network model, the driver provides a programming language with C++-like syntax, and the source code of the operation can be implemented using this syntax. The driver also uses a preset first API function (such as the create-compilation-object API and the compile-instruction API) to compile, via a compiler, the source code of an operation included in the network model into the hardware instructions corresponding to that operation.
  • a memory creation API provided by the driver can be used to allocate a space on the virtual memory space and provide it to the operator of the convolution operation.
  • some of the operations may involve the handling of data, so the driver provides a data handling API for handling data during the operation. In this way, the driver can obtain the relevant information required for the hardware device to execute each operation contained in the network model based on the virtual memory space by using the preset second API function (such as the aforementioned memory creation API and data handling API).
  • the relevant information required for the hardware device to execute an operation can be used to reflect the following related information: the address and length of the instruction, how many memory addresses the instruction needs to operate, where the specific location of the memory address is, how much the memory size is, what is the processing order between instructions, etc.
  • the preset third API function (such as the execution API) can be used to generate a first hardware execution command according to the corresponding instructions of each operation and the relevant information required to execute each operation.
  • the process of compiling an operation's source code into hardware instructions also requires some memory information, and since the instructions need not be executed immediately, nor any data actually read, written or loaded immediately, the aforementioned fake memory space can also be used when compiling the instructions. When data actually needs to be read, written or loaded, the fake addresses in the instructions are replaced with real addresses.
  • the first hardware execution command may also include device information of the hardware device (such as a device identifier) to indicate which hardware device executes the first hardware execution command, and different hardware devices have different corresponding device information.
  • the hardware device is a hardware object that is expected to have the ability to run the network model.
  • An AI chip may involve multiple hardware devices. The hardware execution command obtained after the network model is translated can be provided to the corresponding hardware device for execution when the network model corresponding to the hardware execution command needs to be run.
  • using a network execution graph to store the hardware execution commands may include S31, performed after S22: storing the first hardware execution command.
  • the first hardware execution command is stored for subsequent use.
  • the first hardware execution command may be stored in a pre-constructed network execution graph.
  • the first hardware execution command may be stored by a storage device in the AI chip using a network execution graph.
  • the network execution graph is used to record all first hardware execution commands generated for the network model, and can also be used to record key information of each first hardware execution command.
  • the key information may include a starting address, an offset, and a command execution order, etc., and the length and storage location of the command can be known based on the starting address and the offset.
  • the hardware device can obtain the first hardware execution command based on the key information.
  • each operation (i.e., each operator) contained in the network model is translated into a corresponding first hardware execution command based on a fake memory space. Since the process is based on a fake memory space, the command generation process does not occupy a large amount of real memory. In this way, even if many commands are generated for one network model, or first hardware execution commands are generated for multiple network models, real memory will not run short because of excessive occupation during command generation.
  • the various operations contained in the network model can be translated into a first hardware execution command that can be executed by the corresponding hardware device, but it is not sent to the hardware device for execution first.
  • the translated first hardware execution command is stored first, so that each subsequent time the network model is needed to process the input data, it does not need to be retranslated, but only needs to be fine-tuned by replacing the address in the first hardware execution command, for example, the address information related to the input data and output data in the command can be modified.
  • there is no need for the driver to re-translate the various operations in the network model into first hardware execution commands, thereby saving the performance overhead the processor would otherwise incur each time the network model is run.
  • the addresses in the first hardware execution command are all fake addresses.
  • although the fake addresses can be looked up and used during command generation, they cannot actually be used to store or load data during command execution. Therefore, when the network model subsequently needs to be executed, the command generation method may also include: loading the data required to execute the network model into real memory space, replacing the fake addresses in the first hardware execution command with the real addresses corresponding to that real memory space, and sending the replaced first hardware execution command, as the second hardware execution command, to the corresponding hardware device for execution.
  • the process of replacing the false address in the first hardware execution command with the real address may include: identifying the first hardware execution command, determining part or all of the first hardware execution command currently containing the false address as the target command; and replacing the false address in the target command with the real address corresponding to the real memory space.
  • before replacing, it is necessary to identify which first hardware execution commands use fake addresses.
  • the false address in part or all of the first hardware execution command can be replaced with the real address.
  • when performing address replacement, the first hardware execution commands are examined and only those containing false addresses are replaced, thereby avoiding erroneous or missed replacements.
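  • the identify-then-replace step can be sketched as follows. How a false address is recognized is not specified in this application; the reserved-range check below is purely an assumption for illustration:

```python
# Hypothetical sketch: identify which cached commands still hold false
# addresses, then rewrite only those. Commands already holding real
# addresses are skipped, avoiding erroneous or missed replacements.

FALSE_BASE = 0xF000_0000  # assumed marker range for false addresses

def is_false(addr):
    return addr >= FALSE_BASE

def replace_addresses(commands, addr_map):
    """Replace false addresses with real ones; return how many commands changed."""
    replaced = 0
    for cmd in commands:
        if not any(is_false(a) for a in cmd["addrs"]):
            continue  # not a target command: leave it untouched
        cmd["addrs"] = [addr_map.get(a, a) for a in cmd["addrs"]]
        replaced += 1
    return replaced
```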
  • the false addresses in the first hardware execution commands corresponding to all network models may be replaced at one time, or only the false addresses in the first hardware execution commands corresponding to some network models may be replaced.
  • the decision to replace the addresses of all commands or only some commands can be made based on factors such as the currently remaining available memory capacity, the processing progress of the network model, the amount of data to be processed, and the processing capacity supported by the chip.
  • the present application does not limit the number of commands to be replaced each time and/or the type of operation corresponding to each command.
  • the false addresses contained in a batch of commands corresponding to some operations in a network can be replaced each time, or the false addresses in all commands of an entire network model (or multiple network models) can be replaced at one time.
  • the replacement can be performed in batches, such as first replacing the address of the command of one of the network models, and then replacing the address of the command of the other network model.
  • the "real memory" mentioned in this application is physical memory, and the "real address" is the physical address that a physical storage medium has, while the "false address" is not a physical address, but an address that can be designed to have properties or a format similar to those of a physical address.
  • the "real address corresponding to the real memory space” used for address replacement can be either the physical address of the physical memory or the address of the virtual memory that has established a mapping relationship with the physical memory in advance.
  • this virtual memory, in contrast to the false memory of the present application, also has a real address and occupies physical storage space.
  • a mapping relationship can be established between one physical storage space (which may be physical external memory or physical internal memory) and another physical storage space (usually physical internal memory), so that originally discontinuous physical addresses become logically mapped and associated; physical addresses that were originally scattered, unrelated, or disordered thus become logically associated and ordered in some scenarios, and the actual loading, reading, and writing of data can also be completed through this mapping relationship.
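  • the mapping idea can be sketched as a simple page-mapping table that gives scattered physical pages a contiguous logical view. The class below is purely illustrative and assumes a tiny page size for demonstration:

```python
# Sketch of the mapping relationship: logical (contiguous) addresses are
# translated through a mapping table to scattered physical pages, and data
# loading/reading/writing is completed through that mapping.

class MappedSpace:
    def __init__(self, physical_pages, page_size=4):
        # logical page i is backed by physical page physical_pages[i]
        self.pages = physical_pages
        self.page_size = page_size
        self.backing = {p: bytearray(page_size) for p in physical_pages}

    def write(self, logical_addr, data):
        for i, byte in enumerate(data):
            addr = logical_addr + i
            phys = self.pages[addr // self.page_size]
            self.backing[phys][addr % self.page_size] = byte

    def read(self, logical_addr, n):
        out = bytearray()
        for i in range(n):
            addr = logical_addr + i
            phys = self.pages[addr // self.page_size]
            out.append(self.backing[phys][addr % self.page_size])
        return bytes(out)
```

  • a write that spans a logical page boundary lands in two unrelated physical pages, yet reads back as one contiguous block.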
  • the first hardware execution commands corresponding to each cached network model can be processed and executed on demand. For example, assuming the cache holds first hardware execution commands corresponding to 20 network models, but only the commands corresponding to one of the network models currently need to be executed, then only the addresses in those commands need to be replaced for now, and the replaced new commands (which can be called second hardware execution commands) are distributed to the specific hardware devices for execution.
  • the real addresses in the second hardware execution command can be replaced with the false addresses corresponding to the false memory space, so that the second hardware execution command again becomes a first hardware execution command containing false addresses (that is, the addresses that had been replaced with real addresses are changed back to false addresses), thereby releasing the corresponding physical memory resources.
  • the command generation method may also include: when it is determined that the network model will not be executed within a preset time period starting from the current moment, replacing the real addresses in the second hardware execution command corresponding to the network model with false addresses based on the false memory space, and caching the hardware execution command whose addresses have been replaced with false addresses so that it can be used the next time the same network model needs to be run.
  • the real address in the command is replaced with a false address corresponding to the false memory space and cached, thereby freeing up a portion of the corresponding real memory space.
  • the addresses in the first hardware execution command are all false addresses, and the addresses in the second hardware execution command are all real addresses. If the false addresses in the first hardware execution command are replaced with real addresses, the first hardware execution command after replacement is the second hardware execution command. Similarly, after replacing the real addresses in the replaced first hardware execution command (second hardware execution command) with false addresses, the first hardware execution command can be obtained again.
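  • the two-way conversion between the first and second command forms described above can be sketched as follows; the dictionary-based command representation and the false-to-real mapping are assumptions for illustration:

```python
# Illustrative sketch of the release path: swapping false addresses to real
# ones yields second commands ready to run; swapping back recovers the first
# commands so the backing real memory can be freed while the model is idle.

def to_second(cmds, false_to_real):
    """First -> second command form: false addresses become real addresses."""
    return [{**c, "addr": false_to_real[c["addr"]]} for c in cmds]

def to_first(cmds, false_to_real):
    """Second -> first command form: real addresses become false again,
    allowing the real memory behind them to be released."""
    real_to_false = {r: f for f, r in false_to_real.items()}
    return [{**c, "addr": real_to_false[c["addr"]]} for c in cmds]
```

  • applying `to_first` after `to_second` recovers the original cached commands exactly, which is what allows the same model to be re-run later without retranslation.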
  • the data processing method of the embodiment of the present application further includes another step, as shown in Figure 3. The principle will be described below in conjunction with Figure 3.
  • S210: Determine whether the network model will be executed within a preset time period starting from the current moment.
  • the preset time period can be set according to actual needs, such as minutes, hours, etc.
  • if it is determined that the network model will not be executed within the preset time period, the various operations contained in the network model are translated into corresponding first hardware execution commands based on the virtual memory space. This makes it possible to translate the network model in advance without reducing its processing efficiency, improves translation efficiency, and thereby helps improve the processing efficiency of the network model.
  • if it is determined that the network model will be executed within the preset time period, each operation contained in the network model is translated into a corresponding second hardware execution command based on the real memory space, wherein the addresses contained in the second hardware execution command are all real addresses, and the real memory space is used to store the data required when executing the network model.
  • each operation contained in the network model is directly translated into a corresponding second hardware execution command based on the real memory space.
  • compared with first generating the first hardware execution command, there is no need to subsequently convert its addresses into the real addresses required for execution, thereby improving the command translation efficiency and the processing efficiency of the network model to be executed.
  • a real memory space corresponding to the data size can be allocated according to the data size required to execute the network model, and each operation contained in the network model can be translated into a corresponding second hardware execution command.
  • the data required for executing the network model (the data here includes the input data to be processed by the network model, and can also include the characteristic data of the network model itself (such as weights, parameters, etc.)) is loaded into the real memory space, so that after translating each operation contained in the network model into the corresponding second hardware execution command, the second hardware execution command is directly sent to the corresponding hardware device for execution, so that the hardware device executes these second hardware execution commands to execute the network model.
  • after translating each operation contained in the network model into a corresponding second hardware execution command based on the real memory space, the second hardware execution command is stored. When it needs to be executed later, the second hardware execution command is directly sent to the corresponding hardware device for execution, so that the hardware device runs the network model by executing these second hardware execution commands.
  • the implementation principle of the above S230 is consistent with the implementation principle of S31 in Figure 2, except that the first hardware execution command is stored in S31, while the second hardware execution command is stored in this step.
  • the second hardware execution command can also be stored in the network execution graph.
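  • the S210 decision branch described above can be sketched as follows; the helper names and the address bases are assumptions, not the actual implementation:

```python
# Sketch of the S210 branch: if the model will not run within the preset
# period, translate against false memory (no physical backing) and cache the
# result; otherwise translate directly against real memory for dispatch.

def translate_model(ops, runs_within_period, alloc_false, alloc_real):
    """Return (kind, commands): "first" commands for later reuse, or
    "second" commands ready to send to the hardware device."""
    if not runs_within_period:
        base = alloc_false()  # false memory space: no data can be loaded here
        kind = "first"
    else:
        base = alloc_real()   # real memory space: execution data lives here
        kind = "second"
    cmds = [(op, base + i) for i, op in enumerate(ops)]
    return kind, cmds
```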
  • each operation included in the network model is translated into a corresponding first hardware execution command based on the virtual memory space.
  • the above-mentioned process of translating each operation in the network model into hardware execution commands can be implemented by the same AI chip or by two AI chips respectively.
  • AI chip 1 is only responsible for translating each operation in the network model into hardware execution commands
  • AI chip 2 is responsible for executing these hardware execution commands. The two processes are completed through the cooperation between the two AI chips.
  • AI chip 1 can translate each operation in the network model into hardware execution commands (including a first hardware execution command and a second hardware execution command) and store them; when the network model is to be run subsequently, the corresponding hardware execution command is sent to the hardware device of AI chip 2 for execution, or the corresponding first hardware execution command is converted into a second hardware execution command and then sent to the hardware device of AI chip 2 for execution.
  • the command conversion process includes: replacing the false address in the first hardware execution command with the real address corresponding to the real memory space, thereby obtaining the second hardware execution command.
  • AI chip 1 translates each operation in the network model into a hardware execution command, sends the hardware execution command to AI chip 2 for storage, and when the network model is to be run subsequently, AI chip 2 obtains the corresponding hardware execution command and sends it to the hardware device of AI chip 2 for execution.
  • AI chip 1 translates each operation in the network model into a first hardware execution command, sends the first hardware execution command to AI chip 2 for storage, and when the network model is to be run subsequently, AI chip 2 replaces the false address in the first hardware execution command with the real address corresponding to the real memory space to obtain the second hardware execution command, which is then sent to the hardware device of AI chip 2 for execution.
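  • the two-chip division of labor described above can be sketched as follows; both classes and the false-to-real address map are hypothetical illustrations of the cooperation, not actual chip interfaces:

```python
# Minimal sketch of the two-chip split: AI chip 1 only translates and stores
# commands; AI chip 2 receives patched commands and executes them.

class TranslatorChip:
    """Plays the role of AI chip 1: translate once, store, dispatch later."""
    def __init__(self):
        self.store = {}

    def translate(self, model, ops):
        # first hardware execution commands, holding false addresses
        self.store[model] = [{"op": op, "addr": 0xF000 + i}
                             for i, op in enumerate(ops)]

    def dispatch(self, model, executor, false_to_real):
        # convert to second commands and hand them to the executing chip
        for cmd in self.store[model]:
            patched = {**cmd, "addr": false_to_real[cmd["addr"]]}
            executor.execute(patched)

class ExecutorChip:
    """Plays the role of AI chip 2: executes whatever commands it receives."""
    def __init__(self):
        self.log = []

    def execute(self, cmd):
        self.log.append((cmd["op"], cmd["addr"]))
```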
  • the embodiment of the present application also provides another data processing method for the scenario of running a network model for data processing, and its principle is explained below in conjunction with Figures 4 and 5. Compared with Figure 1, Figure 4 is described only from the perspective of executing hardware execution commands.
  • the various operations contained in the network model can be translated into hardware execution commands executable by the target hardware device and stored in advance (the aforementioned network execution graph can be used for storage).
  • the pre-stored hardware execution commands corresponding to the network model are obtained and provided to the target hardware device for execution.
  • the hardware execution commands stored in the network execution graph in advance are distributed to the corresponding hardware devices for execution, and there is no need to re-translate each operation in the network model into hardware execution commands, thereby solving the problem that the processor incurs a large performance overhead and takes a long time each time it runs the network model.
  • the corresponding hardware execution commands may be sent to the target hardware device for execution in sequence according to the execution order of each operation contained in the network model.
  • the corresponding hardware execution commands may be sent to the target hardware device for execution in sequence according to the execution order of each operation contained in the network model, so that the target hardware device executes the hardware execution commands, thereby realizing the function of running the network model on the target hardware device, and facilitating the network model to be run on the target hardware device to process the input data.
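  • the in-order dispatch described above can be sketched as follows; the mapping from sequence index to command and the `send` callback are assumptions:

```python
# Sketch: send pre-stored commands to the target device in the recorded
# operation execution order, without any retranslation.

def dispatch_in_order(stored_commands, send):
    """stored_commands maps an operation-order index to a cached command."""
    for seq in sorted(stored_commands):
        send(stored_commands[seq])
    return len(stored_commands)
```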
  • the embodiment of the present application obtains the pre-stored hardware execution command executable by the target hardware device corresponding to the network model, which may include the following S110 and S120, as shown in Figure 5. The principle will be described below in conjunction with Figure 5.
  • the original network data corresponding to the network model (the data at this time includes the input data to be processed by the network model and the characteristic data of the network itself) is loaded into the real memory space, and the pre-stored first hardware execution command is obtained.
  • the first hardware execution command is obtained by translating each operation contained in the network model based on the virtual memory space, and the virtual memory space has the same properties as the real memory space.
  • the data processing method further includes: sending the replaced first hardware execution command to the corresponding hardware device.
  • the addresses in the first hardware execution command are all false addresses. Therefore, when executing the network model subsequently, it is necessary to use the real address corresponding to the real memory space to replace the false address in the first hardware execution command, and send the replaced first hardware execution command as the second hardware execution command to the corresponding hardware device for execution.
  • the real addresses in the second hardware execution command can be replaced with the false addresses corresponding to the false memory space.
  • the corresponding memory resources can be released; that is, the addresses in the first hardware execution command that had been replaced with real addresses are changed back to false addresses.
  • the embodiment of the present application also provides a data processing device 100, as shown in Figure 6, the data processing device 100 may include: an acquisition module 110, a command generation module 120 and a storage module 130.
  • the acquisition module 110 can also be referred to as a first acquisition module.
  • the acquisition module 110 may be configured to: acquire a computational graph of a network model to be run.
  • the command generation module 120 may be configured to: translate each operation in the computation graph of the network model into a hardware execution command executable by a corresponding target hardware device, wherein the hardware execution command includes device information of the target hardware device.
  • the storage module 130 can be configured to: store the hardware execution commands using a network execution graph, wherein the network execution graph is used to record all hardware execution commands generated for the network model, and the target hardware device can run the network model by executing the hardware execution commands in the network execution graph.
  • the command generation module 120 can be configured to use a preset first API function to compile the source code of each operation in the computational graph of the network model into instructions, and use a preset second API function to obtain the relevant information required for the target hardware device to perform each operation; and use a preset third API function to generate the hardware execution command according to the corresponding instructions of each operation and the relevant information required to perform each operation.
  • the storage module 130 can be configured to: store the hardware execution commands corresponding to each operation in the network execution graph in sequence according to the execution order of each operation contained in the network model, and record the key information of each hardware execution command, and the key information is used to obtain the hardware execution command.
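  • the network execution graph behavior described above can be sketched as a small container; the class and its method names are illustrative assumptions:

```python
# Hypothetical sketch of a network execution graph: commands are appended in
# the model's operation execution order, and key information is recorded so
# that each hardware execution command can be retrieved later.

class NetworkExecutionGraph:
    def __init__(self):
        self.commands = []   # all commands, kept in execution order
        self.key_info = {}   # key information -> index of the command

    def append(self, key, command):
        self.key_info[key] = len(self.commands)
        self.commands.append(command)

    def get(self, key):
        """Use the recorded key information to obtain one command."""
        return self.commands[self.key_info[key]]

    def in_order(self):
        """All commands, in the order the operations were stored."""
        return list(self.commands)
```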
  • the data processing device 100 may further include: a sending module.
  • the acquisition module 110 may also be configured to: when it is necessary to run the network model, acquire the hardware execution command pre-stored in the network execution graph.
  • the sending module may also be configured to: send the hardware execution command to the target hardware device for execution, so that the target hardware device executes the hardware execution command, thereby implementing the operation of the network model on the target hardware device.
  • the sending module can be configured to: modify the read address in the hardware execution command for obtaining input data, and/or modify the write address in the hardware execution command for storing output data; send the modified corresponding hardware execution command to the target hardware device for execution, so that the target hardware device executes the modified hardware execution command, thereby achieving the purpose of running the network model on the target hardware device to process the input data.
  • the data processing device 100 may further include: a copy module, which is configured to: copy the hardware execution command according to the total number of hardware devices in the AI chip; modify the device information contained in the copied hardware execution command according to the device information of other hardware devices in the AI chip except the target hardware device, to obtain a hardware execution command with modified device information, wherein the hardware execution command with modified device information can be provided to the other hardware devices for execution.
  • the copy module may also be configured to: determine a first number of hardware devices currently required to run the network model based on the amount of data to be processed.
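  • the copy-and-retarget behavior of the copy module can be sketched as follows; the per-device capacity parameter and dictionary command representation are assumptions for illustration:

```python
# Sketch of the copy module: derive how many devices the data volume needs,
# then replicate a translated command per device, rewriting only the device
# information so the copies can run on the other hardware devices.

def devices_needed(data_size, per_device_capacity, total_devices):
    """First number of hardware devices required for the data to be processed."""
    needed = -(-data_size // per_device_capacity)  # ceiling division
    return min(needed, total_devices)

def replicate(command, device_ids):
    """One copy per device, identical except for the device information."""
    return [{**command, "device": d} for d in device_ids]
```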
  • the command generation module 120 in the embodiment of the present application may include: an allocation module 121 and a translation module 122.
  • the allocation module 121 may be configured to allocate a corresponding virtual memory space for the network model.
  • the translation module 122 can be configured to: translate each operation included in the network model into a corresponding first hardware execution command based on the virtual memory space, the addresses in the first hardware execution command are all virtual addresses, and the virtual memory space has the same properties as the real memory space.
  • the translation module 122 can be configured to translate each operation contained in each network model into a corresponding first hardware execution command based on different virtual memory spaces for different network models, and different network models correspond to different virtual memory spaces.
  • the storage module 130 can also be configured to store the first hardware execution command, where the first hardware execution command is provided to the hardware device that needs to run the network model for execution after the address is replaced.
  • the allocation module 121 may be configured to allocate a virtual memory space corresponding to a data size required for executing the network model.
  • the command generation module 120 may further include a judgment module, which may be configured to: judge whether the network model is executed within a preset time period starting from the current moment.
  • the translation module 122 may be configured to: translate each operation included in the network model into a corresponding first hardware execution command based on the virtual memory space.
  • the translation module 122 may also be configured to: translate each operation included in the network model into a corresponding second hardware execution command based on the real memory space, wherein the addresses included in the second hardware execution command are all real addresses, and the real memory space stores the data required for executing the network model.
  • the storage module 130 can also be configured to store the second hardware execution command.
  • the command generation module 120 may also include an acquisition module and a sending module, and the acquisition module may be configured to: when executing the network model, load the data required for executing the network model into the real memory space.
  • the translation module 122 may also be configured to: replace the false address in the first hardware execution command with the real address corresponding to the real memory space.
  • the sending module may be configured to: send the replaced first hardware execution command as the second hardware execution command to the corresponding hardware device.
  • the translation module 122 may also be configured to replace the real address in the replaced first hardware execution command with the false address corresponding to the false memory space when it is determined that the network model is not executed within a preset time period starting from the current moment.
  • the translation module 122 may be configured to: examine the first hardware execution commands to identify those containing false addresses; and replace the false addresses in the identified commands with real addresses corresponding to the real memory space.
  • the translation module 122 can be configured to: compile the source code of each operation included in the network model into instructions, and based on the virtual memory space, obtain the relevant information required to execute each operation included in the network model; generate the first hardware execution command according to the corresponding instructions of each operation and the relevant information required to execute each operation.
  • the command generation module 120 provided in the embodiment of the present application has the same implementation principle and technical effects as those of the aforementioned method embodiment. For the sake of brief description, for matters not mentioned in the device embodiment, reference may be made to the corresponding contents in the aforementioned method embodiment.
  • the process of executing each module in the above-mentioned command generation module is beneficial to translating the network model in advance without reducing the processing efficiency of the network model, and can improve the translation efficiency, which is beneficial to improving the processing efficiency of the network model, thereby saving the performance overhead required by the processor each time the network model runs.
  • the embodiment of the present application also provides another data processing device 200 for a scenario where a network model is run for data processing; as shown in Figure 8, the data processing device 200 includes: an acquisition module 210 and a sending module 220.
  • the acquisition module 210 can also be referred to as a second acquisition module.
  • the acquisition module 210 may be configured to: when it is necessary to run the network model, acquire the pre-stored hardware execution command that can be executed by the target hardware device corresponding to the network model.
  • the sending module 220 may be configured to: send the hardware execution command to the target hardware device for execution, so that the target hardware device executes the hardware execution command, so as to achieve the purpose of running the network model on the target hardware device to process the input data.
  • the acquisition module 210 may include: a first hardware execution command acquisition module 211 and a translation module 212.
  • the first hardware execution command acquisition module 211 can be configured to: when the network model needs to be executed, load the network original data corresponding to the network model into the real memory space, and acquire the pre-stored first hardware execution command.
  • the first hardware execution command is obtained by translating each operation included in the network model based on the virtual memory space, and the virtual memory space has the same properties as the real memory space.
  • the translation module 212 may be configured to replace the false address in the first hardware execution command with the real address corresponding to the real memory space.
  • the sending module 220 may also be configured to send the replaced first hardware execution command to the corresponding hardware device.
  • the translation module 212 can also be configured to: when it is determined that the network model is not executed within a preset time period starting from the current moment, replace the real address in the replaced first hardware execution command with a virtual address corresponding to the virtual memory space, and the virtual memory space has the same properties as the real memory space.
  • the acquisition module 210 provided in the embodiment of the present application has the same implementation principle and technical effects as the aforementioned method embodiment.
  • the modules in the acquisition module 210 and the modules in the aforementioned command generation module 120 can be integrated together or used independently.
  • the data processing device 100 or data processing device 200 provided in the embodiment of the present application has the same implementation principle and technical effects as the aforementioned method embodiment.
  • the parts not mentioned in the device embodiment can refer to the corresponding content in the aforementioned method embodiment.
  • the embodiment of the present application also provides an AI chip, which may include: a core and a storage device.
  • the AI chip can be used to execute the above data processing method.
  • the kernel is used to obtain the computation graph of the network model to be run, and translate each operation in the computation graph of the network model into a hardware execution command executable by the target hardware device, wherein the hardware execution command contains the device information of the target hardware device;
  • a driver is deployed in the kernel, which can translate the various operations in the computation graph of the network model into hardware execution commands executable by the target hardware device, and sends the hardware execution commands to the storage device.
  • the kernel may compile the source code of each operation in the computation graph of the network model into instructions using a preset first API function, use a preset second API function to obtain the relevant information required for the target hardware device to perform each operation, and use a preset third API function to generate the hardware execution command according to the corresponding instructions of each operation and the relevant information required to perform each operation.
  • the storage device can be configured to store the hardware execution command using a network execution graph, wherein the network execution graph is used to record the hardware execution command, and the hardware execution command is used to run the network model.
  • the storage device may store the hardware execution commands corresponding to each operation in the network execution graph in sequence according to the execution order of each operation contained in the network model, and record key information of each hardware execution command, which is used to obtain the hardware execution command.
  • the kernel in the embodiment of the present application can also be configured to: allocate a corresponding virtual memory space for the network model, and based on the virtual memory space, translate each operation contained in the network model into a corresponding first hardware execution command, the addresses in the first hardware execution command are all virtual addresses, and the virtual memory space has the same properties as the real memory space.
  • a driver is deployed in the kernel, and the driver can translate various operations included in the network model into first hardware execution commands, and send the first hardware execution commands to the storage device for storage.
  • the kernel can also be configured to: compile the source code of each operation included in the network model into instructions, and based on the virtual memory space, obtain the relevant information required to execute each operation included in the network model; generate the first hardware execution command according to the corresponding instructions of each operation and the relevant information required to execute each operation.
  • the storage device may also be configured to store a first hardware execution command, where the first hardware execution command is provided to a hardware device that needs to run the network model for execution after address replacement.
  • the kernel may be configured to: allocate a virtual memory space corresponding to a data size required for executing the network model.
  • the kernel before the kernel translates each operation contained in the network model into a corresponding first hardware execution command based on the virtual memory space, the kernel can also be configured to: determine whether the network model is executed within a preset time period after the current moment, and only translate each operation contained in the network model into a corresponding first hardware execution command based on the virtual memory space when it is determined that the network model is not executed within the preset time period after the current moment.
  • the kernel is further configured to translate each operation contained in the network model into a corresponding second hardware execution command based on the real memory space, wherein the addresses contained in the second hardware execution command are all real addresses, and the real memory space stores the data required for executing the network model.
  • the storage device can also be configured to store the second hardware execution command.
  • the kernel can also be configured to: when executing the network model, load the data required to execute the network model into the real memory space; replace the false address in the first hardware execution command with the real address corresponding to the real memory space, and send the replaced first hardware execution command as the second hardware execution command to the corresponding hardware device.
  • the kernel can be configured to: identify the first hardware execution command, determine part or all of the first hardware execution command currently containing a false address as a target command; and replace the false address in the target command with a real address corresponding to the real memory space.
  • the kernel can also be configured to: after sending the replaced first hardware execution command as the second hardware execution command to the corresponding hardware device for execution, replace the real address in the replaced first hardware execution command (i.e., the second hardware execution command) with the false address corresponding to the false memory space, and cache the second hardware execution command whose addresses have been replaced with false addresses.
  • the embodiment of the present application also provides an AI chip, which may include: a hardware device, a kernel, and a storage device.
  • the AI chip can be used to execute the aforementioned data processing method.
  • the storage device may be configured to store hardware execution commands corresponding to each operation in the computation graph of the network model.
  • the kernel may be configured to: when the network model needs to be run, obtain the previously stored hardware execution command from the storage device, and send the hardware execution command to the hardware device.
  • the hardware device can be configured to: execute hardware execution commands to achieve the purpose of running the network model to process input data.
  • the AI chip can receive the hardware execution command sent by other AI chips and execute the hardware execution command.
  • the kernel is also used to receive the hardware execution command sent by other AI chips and store it for execution by the hardware device.
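The translate-once, run-many behaviour of the storage device and kernel could be modelled, under stated assumptions, with a simple cache keyed by network model; `CommandStore` and `translate_fn` are hypothetical names standing in for the storage device and the kernel's translator.

```python
# Sketch: the first request for a model triggers translation; later runs
# reuse the stored hardware execution commands without re-translating.

class CommandStore:
    def __init__(self, translate_fn):
        self._translate = translate_fn
        self._cache = {}          # model name -> list of hardware commands
        self.translations = 0     # how many times translation actually ran

    def commands_for(self, model):
        """Return stored commands for the model, translating only on a miss."""
        if model not in self._cache:
            self._cache[model] = self._translate(model)
            self.translations += 1
        return self._cache[model]
```

Every run after the first skips the translation cost entirely, which is the performance saving the application describes.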
  • the storage device in the embodiment of the present application can also be configured to store a first hardware execution command; wherein the first hardware execution command is obtained by translating each operation included in the network model based on a virtual memory space, and the virtual memory space has the same properties as the real memory space;
  • the kernel may also be configured to: when the network model needs to be executed, load the network raw data corresponding to the network model into the real memory space, obtain the first hardware execution command stored in the storage device, replace the virtual address in the first hardware execution command with the real address corresponding to the real memory space, and send the replaced first hardware execution command to the hardware device;
  • the hardware device may also be configured to: execute the replaced first hardware execution command to achieve the purpose of running the network model to process input data.
  • the AI chip provided in the embodiment of the present application has the same implementation principle and technical effects as those in the aforementioned method embodiment.
  • FIG. 10 shows a structural block diagram of an electronic device 300 provided in an embodiment of the present application.
  • the electronic device 300 may include: a transceiver 310, a memory 320, a communication bus 330 and a processor 340.
  • the transceiver 310, the memory 320, and the processor 340 are directly or indirectly electrically connected to each other to realize data transmission or interaction.
  • these elements can be electrically connected to each other through one or more communication buses 330 or signal lines.
  • the transceiver 310 can be configured to receive and send data.
  • the memory 320 can be configured to store computer programs, such as storing the software function modules shown in Figures 6 to 9, that is, the data processing device 100 of Figure 6 or the data processing device 200 of Figure 8.
  • the data processing device 100 includes at least one software function module that can be stored in the memory 320 in the form of software or firmware or fixed in the operating system (OS) of the electronic device 300.
  • the processor 340 can be configured to execute the executable module stored in the memory 320.
  • the processor 340 can be configured to execute the software function module or computer program included in the data processing device 100; for example:
  • the processor 340 can be configured to: obtain the calculation graph of the network model to be run; translate each operation in the calculation graph of the network model into a hardware execution command that can be executed by the target hardware device of the AI chip, and the hardware execution command contains the device information of the target hardware device; use the network execution graph to store the hardware execution command, wherein the network execution graph is used to record all hardware execution commands generated for the network model, and the target hardware device can run the network model by executing the hardware execution commands in the network execution graph.
  • the processor 340 can be configured to execute the software function modules or computer programs included in the data processing device 200; for example, the processor 340 can be configured to: when the network model needs to be run, obtain pre-stored hardware execution commands executable by the target hardware device corresponding to the network model; and send the hardware execution commands to the target hardware device, so that the target hardware device executes them, thereby running the network model on the target hardware device to process the input data.
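One way the pre-stored commands might be adapted to fresh input and output buffers on each run (as the read-address/write-address modification described elsewhere in this application suggests) is sketched below; the `read_addr`/`write_addr` field names are illustrative assumptions, not the actual command layout.

```python
# Sketch: stored commands carry I/O addresses from a previous run; before a
# new run the driver may rewrite them to point at the new buffers.

def patch_io(commands, new_read, new_write):
    """Return copies of the stored commands with fresh I/O addresses."""
    patched = []
    for cmd in commands:
        cmd = dict(cmd)                      # leave the stored copy untouched
        if "read_addr" in cmd:
            cmd["read_addr"] = new_read      # where the new input data lives
        if "write_addr" in cmd:
            cmd["write_addr"] = new_write    # where the new result should go
        patched.append(cmd)
    return patched
```

Working on copies keeps the stored command set reusable for the run after this one.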
  • the electronic device 300 may also include two processors 340, wherein one processor 340 is responsible for translating the operations in the network model into hardware execution commands, and the other processor 340 is responsible for executing the hardware execution commands.
  • the memory 320 can be, but is not limited to, random access memory (Random Access Memory, RAM), read-only memory (Read Only Memory, ROM), programmable read-only memory (Programmable Read-Only Memory, PROM), erasable read-only memory (Erasable Programmable Read-Only Memory, EPROM), electrically erasable read-only memory (Electric Erasable Programmable Read-Only Memory, EEPROM), etc.
  • the processor 340 may be an integrated circuit chip with signal processing capabilities.
  • the above-mentioned processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), etc.; it may also be a graphics processor (GPU), a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic devices, discrete gates or transistor logic devices, discrete hardware components.
  • the various methods, steps and logic block diagrams disclosed in the embodiments of the present application can be implemented or executed.
  • the general-purpose processor may be a microprocessor or the processor 340 may also be any conventional processor, etc.
  • the above-mentioned electronic devices 300 include but are not limited to smart phones, tablets, computers, industrial computers, vehicle-mounted equipment, servers, smart wearable devices, edge boxes, etc.
  • the memory can be configured to store the network model, and can also be configured to store the original data required to execute the network model, such as input data to be processed, and characteristic data of the network itself.
  • the first processor may be configured to allocate a corresponding virtual memory space for the network model, translate each operation contained in the network model into a corresponding first hardware execution command, and store the first hardware execution command.
  • the electronic device may also include a central processing unit (CPU), and the first processor may be a coprocessor that assists the central processing unit in data processing, such as a graphics processing unit (GPU) or a general-purpose graphics processing unit (GPGPU).
  • the CPU and the first processor may be regarded as the above-mentioned AI chip.
  • the first processor loads the data required to execute the network model into the real memory space of the first processor, and replaces the false address in the first hardware execution command with the real address corresponding to the real memory space, and sends the replaced first hardware execution command as the second hardware execution command to the corresponding hardware device for execution.
  • the real address in the second hardware execution command can then be replaced with the virtual address corresponding to the virtual memory space, and the command whose address has been replaced with the virtual address is cached.
  • Step 1: Initially, the network raw data (including the input data to be processed and the characteristic data of the network itself) and the network model are stored in a storage device (which may be a disk).
  • Step 2: Before the network model needs to be translated into hardware execution commands, the original network data and the data of the network model itself need to be loaded into the DDR (Double Data Rate) memory of the CPU. According to the CPU DDR space occupied by the data, a real DDR space of the same size is allocated and occupied in the dedicated DDR of the first processor. Through the collaboration of the CPU and the first processor, all data (including the input data) stored in the DDR of the CPU is moved to the DDR of the first processor.
  • Step 3: When translating the network model into hardware execution commands, the first processor combines all the operation operators in the network model with the DDR addresses of the feature data, the DDR addresses of the input data, and the DDR addresses for storing the operation results, based on the allocated real DDR space, to generate a series of hardware execution commands.
  • Step 4: These hardware execution commands are then executed directly.
  • a process of using the command generation method shown in this application may include:
  • Step 1: The original network data (including input data, and possibly the characteristic data of the network itself) and the network model may likewise be stored in a storage device (which may be a disk).
  • Step 2: Before the network model needs to be translated into the first hardware execution commands, a virtual memory space corresponding to the size of the data required to execute the network model is allocated, and the network model is loaded into the DDR of the first processor.
  • Step 3: When translating the network model into the first hardware execution commands, the first processor combines all the operation operators in the network model with the virtual DDR addresses of the feature data, the virtual DDR addresses of the input data, and the virtual DDR addresses for storing the operation results, based on the allocated virtual DDR space (the virtual memory space), to generate a series of first hardware execution commands, and stores these commands.
  • Step 4: When the network model is subsequently executed, the original network data is loaded into the DDR of the CPU. According to the DDR space occupied by the data, a DDR space of the same size (the real memory space) is allocated in the DDR of the first processor. Through the collaboration between the CPU and the first processor, all the data in the DDR of the CPU is moved to the DDR of the first processor. Then, according to the real addresses corresponding to the allocated real memory space, the virtual addresses in the first hardware execution commands are replaced, and the replaced first hardware execution commands are sent to the corresponding hardware device.
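Steps 1-4 above can be illustrated with a minimal sketch, assuming a bump allocator over a placeholder DDR base: fake addresses are handed out at translation time without allocating any real memory, and the same offsets are rebased onto the really allocated DDR space at run time. All names (`FAKE_DDR_BASE`, the buffer names) are hypothetical.

```python
# Sketch: hand out fake DDR addresses for each buffer, then rebase the same
# offsets onto the real DDR space once it has actually been allocated.

FAKE_DDR_BASE = 0x4000_0000


def allocate_fake(sizes):
    """Bump-allocate fake addresses for each named buffer; return total size."""
    addrs, offset = {}, 0
    for name, size in sizes.items():
        addrs[name] = FAKE_DDR_BASE + offset
        offset += size
    return addrs, offset          # total size = real space to allocate later


def rebase(addrs, real_base):
    """Step 4: replace fake addresses with real ones, preserving offsets."""
    return {n: real_base + (a - FAKE_DDR_BASE) for n, a in addrs.items()}
```

Because only offsets matter, the commands translated against the fake space line up exactly with the real space allocated later.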
  • the embodiment of the present application also provides a non-volatile computer-readable storage medium (hereinafter referred to as the storage medium), on which a computer program is stored.
  • the storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk, etc.
  • the functional modules in the various embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
  • the functions are implemented in the form of software function modules and sold or used as independent products, they can be stored in a computer-readable storage medium.
  • the technical solution of the present application, or the part thereof that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product, which is stored in a computer-readable storage medium and includes several instructions for enabling a computer device (which can be a personal computer, a laptop, a server, or an electronic device, etc.) to perform all or part of the steps of the methods described in the various embodiments of the present application.
  • the present application relates to a data processing method, device, AI chip, electronic device and storage medium, and belongs to the field of data processing technology.
  • the data processing method includes: obtaining a calculation graph of a network model to be run; translating each operation in the calculation graph of the network model into a hardware execution command that can be executed by the target hardware device of the AI chip; and storing the hardware execution command using the network execution graph.
  • the data processing method, device, AI chip, electronic device and storage medium of the present application are reproducible and can be used in a variety of industrial applications.
  • the data processing method, device, AI chip, electronic device and storage medium of the present application can be used in any device that needs to reduce the performance overhead of the processor and improve the efficiency of data processing.


Abstract

The present application relates to the technical field of data processing, and relates to a data processing method and apparatus, an AI chip, an electronic device, and a storage medium. The data processing method comprises: acquiring a computational graph of a network model to be run; translating each operation in the computational graph of the network model into a hardware execution command executable by a target hardware device of an AI chip; and storing the hardware execution command by using a network execution graph. Because each operation in the computational graph of the network model is translated into a corresponding hardware execution command executable by the target hardware device and the command is stored, every subsequent run of the network model can directly distribute the pre-stored hardware execution commands to the corresponding hardware for execution, without translating the operations into hardware execution commands again; this mitigates the problem that the processor incurs a large performance overhead and takes a long time every time it runs the network model.

Description

Data processing method, apparatus, AI chip, electronic device and storage medium

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority to the Chinese patent application No. 202211486836.8, entitled "Data processing method, device, AI chip, electronic device and storage medium", filed with the State Intellectual Property Office of China on November 25, 2022, and to the Chinese patent application No. 202211486830.0, entitled "Command generation method, device, AI chip, electronic device and storage medium", filed with the State Intellectual Property Office of China on November 25, 2022, the entire contents of which are incorporated herein by reference.

Technical Field

The present application belongs to the field of artificial intelligence technology, and specifically relates to a data processing method and apparatus, an AI chip, an electronic device, and a storage medium.

Background
When an AI (Artificial Intelligence) network needs to be run for data processing and computation, the computing tasks corresponding to the network model usually need to be loaded onto the hardware device used to execute them. This process requires generating, for the network model, hardware execution commands that the hardware can recognize and execute.

At present, each time the processor runs a network model, it needs to re-translate each operation (also called an operator) in the network model into a corresponding hardware execution command and provide it to the hardware for execution as soon as possible, and translating each operation into a hardware execution command takes some time. Each time the processor needs to run the network model, for every operation it processes, the driver must translate that operation into a hardware execution command and send it to the hardware for execution before moving on to the next operation, and so on, until the driver has translated the last operation of the network model and has sent the hardware execution command corresponding to that last operation to the hardware for execution. This processing method incurs a large performance overhead and takes a long time every time the processor runs the network model for data processing.

In addition, in current scenarios where AI chips (which may be various processors) use network models for data processing, when the processor generates hardware execution commands (referred to as hardware commands or commands for short) for each network model, a hardware execution command usually needs to contain or reflect information such as: the type or content of the operation, the read address of the data source required by the operation, and the write address for storing the operation result. The processor therefore uses a large amount of memory information when translating the operations of a network model into hardware execution commands.

The process of generating hardware execution commands for a network model involves the allocation and occupation of memory resources, and the data of different network models occupies different memory spaces, which aggravates the occupation of limited memory resources by network models. Especially when the data required to execute a network model occupies a large amount of memory, insufficient memory can easily occur, making it difficult to translate hardware execution commands for the network model as expected, and in turn making it difficult for the hardware to execute the network model as expected. In addition, such memory shortages may also affect the performance of the hardware in other respects.
Summary

In view of this, one aspect of the present application provides a data processing method, to alleviate the problem that a processor currently incurs a large performance overhead and takes a long time each time it runs a network model for data processing, and to alleviate the problem in the related art that translating a network model into hardware execution commands occupies a large amount of memory resources, which easily leads to insufficient memory.

To achieve the above purpose, the embodiments of the present application are implemented as follows:

In a first aspect, an embodiment of the present application provides a data processing method, which may include: obtaining a computational graph of a network model to be run; translating each operation in the computational graph of the network model into a hardware execution command executable by a target hardware device of an AI chip, the hardware execution command containing device information of the target hardware device; and storing the hardware execution command using a network execution graph, wherein the network execution graph is used to record all hardware execution commands generated for the network model, and the target hardware device is used to run the network model by executing the hardware execution commands in the network execution graph.

In combination with a possible implementation of the first aspect, translating each operation contained in the computational graph of the network model into hardware execution commands executable by the target hardware device of the AI chip may include: compiling the source code of each operation in the computational graph of the network model into instructions, and obtaining the relevant information required for the target hardware device to execute each operation; and generating the hardware execution commands according to the instructions corresponding to each operation and the relevant information required to execute each operation.
A preset first API function (such as an object creation API or an instruction compilation API) may be used to compile the source code of each operation in the computational graph of the network model into instructions; a preset second API function (such as a memory allocation API or a data transfer API) may then be used to obtain the relevant information required for the target hardware device to execute each operation (such as the address and length of the instructions, how many memory addresses an instruction needs to operate on, and the execution order among instructions); and a preset third API function (such as an execution API) may then be used to generate the hardware execution commands according to the instructions corresponding to each operation and the relevant information required to execute each operation. In this way, each operation in the computational graph of the network model can be quickly and accurately translated into hardware execution commands executable by the target hardware device.

In combination with a possible implementation of the first aspect, storing the hardware execution commands using the network execution graph may include: storing the hardware execution command corresponding to each operation into the network execution graph in sequence, according to the execution order of the operations contained in the network model, and recording key information of each hardware execution command, the key information being used to retrieve the hardware execution command.

In combination with a possible implementation of the first aspect, the data processing method may further include: when the network model needs to be run, obtaining the hardware execution commands pre-stored in the network execution graph; and sending the hardware execution commands to the target hardware device for execution, so that the target hardware device executes the hardware execution commands, thereby running the network model on the target hardware device.

In combination with a possible implementation of the first aspect, sending the hardware execution commands to the target hardware device for execution may include: modifying the read address used to obtain input data in the hardware execution commands, and/or modifying the write address used to store output data in the hardware execution commands; and sending the modified hardware execution commands to the target hardware device for execution, so that the target hardware device executes the modified hardware execution commands, thereby achieving the purpose of running the network model on the target hardware device to process the input data.

In combination with a possible implementation of the first aspect, the data processing method may further include: copying the hardware execution commands according to the total number of hardware devices in the AI chip; and modifying the device information contained in the copied hardware execution commands according to the device information of hardware devices in the AI chip other than the target hardware device, to obtain hardware execution commands with modified device information, wherein the hardware execution commands with modified device information can be provided to the other hardware devices for execution.

In combination with a possible implementation of the first aspect, the data processing method may further include: determining, according to the amount of data to be processed, a first number of hardware devices currently required to run the network model.
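The copy-and-patch replication of commands to the chip's other hardware devices, described just above, could look roughly like the following sketch; the `device` field name is an assumption for illustration, not the actual command layout.

```python
# Sketch: the command stream translated for the target device is deep-copied
# once per additional device, and only the device-info field is rewritten.
import copy


def replicate(commands, all_devices, target_device):
    """Return one command stream per hardware device of the chip."""
    streams = {target_device: commands}       # original stream, unmodified
    for dev in all_devices:
        if dev == target_device:
            continue
        clone = copy.deepcopy(commands)       # keep the original intact
        for cmd in clone:
            cmd["device"] = dev               # point the copy at another device
        streams[dev] = clone
    return streams
```

Only the device information changes between copies, so one translation pass can serve every hardware device on the chip.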
结合上述第一方面实施例的一种可能的实施方式,将所述网络模型的计算图中的各个操作翻译成AI芯片的目标硬件设备能够执行的硬件执行命令,可以包括:为所述网络模型分配对应的虚假内存空间;以及基于所述虚假内存空间,将所述网络模型中包含的各个操作翻译成对应的第一硬件执行命令,所述第一硬件执行命令中的地址均为虚假地址,所述虚假内存空间与真实内存空间具备相同属性。In combination with a possible implementation manner of the first aspect embodiment above, translating each operation in the computation graph of the network model into a hardware execution command that can be executed by the target hardware device of the AI chip may include: allocating a corresponding virtual memory space to the network model; and based on the virtual memory space, translating each operation contained in the network model into a corresponding first hardware execution command, wherein the addresses in the first hardware execution command are all virtual addresses, and the virtual memory space has the same properties as the real memory space.
结合上述第一方面实施例的一种可能的实施方式,利用网络执行图存储所述硬件执行命令,可以包括:在基于所述虚假内存空间,将所述网络模型中包含的各个操作翻译成对应的第一硬件执行命令之后,存储所述第一硬件执行命令,所述第一硬件执行命令用于在被进行地址替换后提供给需要运行所述网络模型的硬件设备执行。In combination with a possible implementation manner of the first aspect embodiment mentioned above, using a network execution graph to store the hardware execution command may include: after translating each operation contained in the network model into a corresponding first hardware execution command based on the virtual memory space, storing the first hardware execution command, wherein the first hardware execution command is used to be provided to a hardware device that needs to run the network model for execution after address replacement.
结合上述第一方面实施例的一种可能的实施方式,为所述网络模型分配对应的虚假内存空间,可以包括:根据执行所述网络模型所需的数据大小,分配与所述数据大小对应的虚假内存空间。In combination with a possible implementation manner of the first aspect embodiment described above, allocating a corresponding virtual memory space for the network model may include: allocating a virtual memory space corresponding to the data size required to execute the network model.
以此有利于在不过多占用硬件的真实内存资源的情况下,满足生成命令所需的要求。This helps to meet the requirements for generating commands without occupying too much real memory resources of the hardware.
结合上述第一方面实施例的一种可能的实施方式,在基于所述虚假内存空间,将所述网络模型中包含的各个操作翻译成对应的第一硬件执行命令之前,所述数据处理方法还可以包括:判断从当前时刻开始后的预设时间段内是否执行所述网络模型;在确定从当前时刻开始后的预设时间段内不执行所述网络模型时,执行步骤:基于所述虚假内存空间,将所述网络模型中包含的各个操作翻译成对应的第一硬件执行命令。In combination with a possible implementation manner of the first aspect embodiment above, before translating the various operations contained in the network model into corresponding first hardware execution commands based on the virtual memory space, the data processing method may also include: determining whether the network model is executed within a preset time period after the current moment; when it is determined that the network model is not executed within the preset time period after the current moment, executing the steps: based on the virtual memory space, translating the various operations contained in the network model into corresponding first hardware execution commands.
结合上述第一方面实施例的一种可能的实施方式,所述数据处理方法还可以包括:在确定从当前时刻开始后的预设时间段内要执行所述网络模型时,基于所述真实内存空间,将所述网络模型中包含的各个操作翻译成对应的第二硬件执行命令,其中,所述第二硬件执行命令中包含的地址均为真实地址,所述真实内存空间用于存储执行所述网络模型时所需的数据。In combination with a possible implementation manner of the first aspect embodiment above, the data processing method may also include: when determining that the network model is to be executed within a preset time period starting from the current moment, based on the real memory space, translating each operation contained in the network model into a corresponding second hardware execution command, wherein the addresses contained in the second hardware execution command are all real addresses, and the real memory space is used to store the data required to execute the network model.
结合上述第一方面实施例的一种可能的实施方式,在存储所述第一硬件执行命令之后,所述数据处理方法还可以包括:在需要执行所述网络模型时,将执行所述网络模型所需的数据加载到所述真实内存空间;利用所述真实内存空间对应的真实地址替换掉所述第一硬件执行命令中的虚假地址;将替换后的第一硬件执行命令作为第二硬件执行命令,发送给对应的硬件设备,以供所述对应的硬件设备执行所述第二硬件执行命令。In combination with a possible implementation manner of the first aspect embodiment above, after storing the first hardware execution command, the data processing method may also include: when the network model needs to be executed, loading the data required to execute the network model into the real memory space; replacing the false address in the first hardware execution command with the real address corresponding to the real memory space; and sending the replaced first hardware execution command as the second hardware execution command to the corresponding hardware device, so that the corresponding hardware device can execute the second hardware execution command.
通过该实施方式,便可以在保证不占用硬件的真实内存资源的情况下,又不会影响网络模型的正常使用。Through this implementation, it is possible to ensure that the actual memory resources of the hardware are not occupied and the normal use of the network model is not affected.
With reference to a possible implementation of the embodiment of the first aspect, replacing the fake addresses in the first hardware execution command with the real addresses corresponding to the real memory space may include: identifying the first hardware execution commands to determine, as target commands, some or all of the first hardware execution commands that currently contain fake addresses; and replacing the fake addresses in the target commands with the real addresses corresponding to the real memory space.
With reference to a possible implementation of the embodiment of the first aspect, after the replaced first hardware execution command is sent, as the second hardware execution command, to the corresponding hardware device for execution, the data processing method may further include: replacing the real addresses in the second hardware execution command with the fake addresses corresponding to the fake memory space, and caching the command whose addresses have been replaced with fake addresses.
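The replace–dispatch–restore cycle in the two implementations above might be sketched as follows. This is a hedged illustration only: the command layout, the address-range test used to spot fake addresses, and the helper names are all invented here for clarity and are not part of the claimed method.

```python
FAKE_BASE = 0xDEAD0000        # assumed placeholder base address
FAKE_SIZE = 0x10000           # assumed size of the fake memory space

def patch_dispatch_restore(cached_cmds, real_base, send):
    """For each cached command: swap fake addresses for real ones, hand the
    command to the device via `send`, then swap the real addresses back so
    the cache stays valid after the real memory space is released."""
    for cmd in cached_cmds:
        # Identify the target fields that currently hold fake addresses.
        targets = [k for k, v in cmd.items()
                   if isinstance(v, int) and FAKE_BASE <= v < FAKE_BASE + FAKE_SIZE]
        for k in targets:                     # fake -> real
            cmd[k] = cmd[k] - FAKE_BASE + real_base
        send(cmd)                             # dispatch to the hardware device
        for k in targets:                     # real -> fake, re-cache
            cmd[k] = cmd[k] - real_base + FAKE_BASE
```

Because the fake addresses are restored immediately after dispatch, the cached commands remain reusable for the next run even if the real memory space is freed in between.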
With reference to a possible implementation of the embodiment of the first aspect, translating, based on the fake memory space, each operation contained in the network model into a corresponding first hardware execution command may include: compiling the source code of each operation contained in the network model into instructions corresponding to the respective operations, and obtaining, based on the fake memory space, the related information required to execute each operation contained in the network model, the related information including address information; and generating the first hardware execution command according to the instruction corresponding to each operation and the related information required to execute each operation.
With reference to a possible implementation of the embodiment of the first aspect, different network models correspond to different fake memory spaces.
According to a second aspect, an embodiment of the present application further provides a data processing method, which may include: when a network model needs to be run, obtaining pre-stored hardware execution commands that can be executed by the target hardware device corresponding to the network model; and sending the hardware execution commands to the target hardware device, so that the target hardware device executes the hardware execution commands, thereby running the network model on the target hardware device to process input data.
With reference to a possible implementation of the embodiment of the second aspect, when the network model needs to be run, obtaining the pre-stored hardware execution commands that can be executed by the target hardware device corresponding to the network model may include: when the network model needs to be executed, loading the network raw data corresponding to the network model into a real memory space, and obtaining a pre-stored first hardware execution command, where the first hardware execution command is obtained by translating each operation contained in the network model based on a fake memory space, and the fake memory space has the same attributes as the real memory space; and replacing the fake addresses in the first hardware execution command with the real addresses corresponding to the real memory space.
With reference to a possible implementation of the embodiment of the second aspect, after the fake addresses in the first hardware execution command are replaced with the real addresses corresponding to the real memory space, the data processing method further includes: sending the replaced first hardware execution command, as a second hardware execution command, to the corresponding hardware device.
With reference to a possible implementation of the embodiment of the second aspect, after the replaced first hardware execution command is sent, as the second hardware execution command, to the corresponding hardware device for execution, the data processing method may further include: replacing the real addresses in the second hardware execution command with the fake addresses corresponding to the fake memory space, and caching the second hardware execution command whose addresses have been replaced with fake addresses.
According to a third aspect, an embodiment of the present application further provides a data processing apparatus, which may include an obtaining module, a command generation module, and a storage module. The obtaining module is configured to obtain a computational graph of a network model to be run. The command generation module is configured to translate each operation in the computational graph of the network model into hardware execution commands that can be executed by a corresponding target hardware device, the hardware execution commands containing device information of the target hardware device. The storage module is configured to store the hardware execution commands by means of a network execution graph, where the network execution graph is used to record all hardware execution commands generated for the network model, and the target hardware device can run the network model by executing the hardware execution commands in the network execution graph.
With reference to a possible implementation of the embodiment of the third aspect, the command generation module may include an allocation module and a translation module. The allocation module is configured to allocate a corresponding fake memory space for the network model. The translation module is configured to translate, based on the fake memory space, each operation contained in the network model into a corresponding first hardware execution command, where the addresses in the first hardware execution command are all fake addresses, and the fake memory space has the same attributes as the real memory space.
With reference to a possible implementation of the embodiment of the third aspect, the storage module is configured to store the first hardware execution command, and the first hardware execution command is provided, after address replacement, to the hardware device that needs to run the network model for execution.
According to a fourth aspect, an embodiment of the present application further provides a data processing apparatus, which may include an obtaining module and a sending module. The obtaining module is configured to, when a network model needs to be run, obtain pre-stored hardware execution commands that can be executed by the target hardware device corresponding to the network model. The sending module is configured to send the hardware execution commands to the target hardware device, so that the target hardware device executes the hardware execution commands, thereby running the network model on the target hardware device to process input data.
With reference to a possible implementation of the embodiment of the fourth aspect, the obtaining module may include a first-hardware-execution-command obtaining module and a translation module. The first-hardware-execution-command obtaining module is configured to, when the network model needs to be executed, load the network raw data corresponding to the network model into a real memory space and obtain a pre-stored first hardware execution command, where the first hardware execution command is obtained by translating each operation contained in the network model based on a fake memory space, and the fake memory space has the same attributes as the real memory space. The translation module is configured to replace the fake addresses in the first hardware execution command with the real addresses corresponding to the real memory space.
With reference to a possible implementation of the embodiment of the fourth aspect, the sending module is further configured to send the replaced first hardware execution command, as a second hardware execution command, to the corresponding hardware device.
According to a fifth aspect, an embodiment of the present application further provides an AI chip, which may include a kernel and a storage device. The kernel is configured to obtain a computational graph of a network model to be run and to translate each operation in the computational graph of the network model into hardware execution commands that can be executed by a target hardware device, the hardware execution commands containing device information of the target hardware device. The storage device is configured to store the hardware execution commands by means of a network execution graph, where the network execution graph is used to record all hardware execution commands generated for the network model, and the target hardware device can run the network model by executing the hardware execution commands in the network execution graph.
With reference to a possible implementation of the embodiment of the fifth aspect, the kernel is configured to allocate a corresponding fake memory space for the network model and to translate, based on the fake memory space, each operation contained in the network model into a corresponding first hardware execution command, where the addresses in the first hardware execution command are all fake addresses, and the fake memory space has the same attributes as the real memory space. The storage device is configured to store the first hardware execution command, which is provided, after address replacement, to the hardware device that needs to run the network model for execution.
According to a sixth aspect, an embodiment of the present application further provides an AI chip, which may include a hardware device, a storage device, and a kernel. The storage device is configured to store the hardware execution commands corresponding to the operations in the computational graph of a network model. The kernel is configured to, when the network model needs to be run, obtain the previously stored hardware execution commands from the storage device and send them to the hardware device. The hardware device is configured to execute the hardware execution commands, thereby running the network model to process input data.
With reference to a possible implementation of the embodiment of the sixth aspect, the storage device is further configured to store a first hardware execution command, where the first hardware execution command is obtained by translating each operation contained in the network model based on a fake memory space, and the fake memory space has the same attributes as the real memory space. The kernel is further configured to, when the network model needs to be executed, load the network raw data corresponding to the network model into the real memory space, obtain the first hardware execution command stored in the storage device, replace the fake addresses in the first hardware execution command with the real addresses corresponding to the real memory space, and send the replaced first hardware execution command, as a second hardware execution command, to the hardware device. The hardware device is further configured to execute the second hardware execution command, thereby running the network model to process input data.
According to a seventh aspect, an embodiment of the present application further provides an electronic device, which may include a memory and a processor connected to the memory. The memory is configured to store a program; the processor is configured to call the program stored in the memory, so as to execute the data processing method provided by the embodiment of the first aspect and/or any possible implementation of the embodiment of the first aspect, or to execute the data processing method provided by the embodiment of the second aspect and/or any possible implementation of the embodiment of the second aspect.
According to an eighth aspect, an embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored. When the computer program is run by a processor, it executes the data processing method provided by the embodiment of the first aspect and/or any possible implementation of the embodiment of the first aspect, or executes the data processing method provided by the embodiment of the second aspect and/or any possible implementation of the embodiment of the second aspect.
Other features and advantages of the present application will be set forth in the description that follows. The objectives and other advantages of the present application may be realized and obtained through the structures particularly pointed out in the written description and the drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
To describe the technical solutions in the embodiments of the present application or the related art more clearly, the drawings required in the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application, and a person of ordinary skill in the art may derive other drawings from these drawings without creative effort. The above and other objectives, features, and advantages of the present application will become clearer from the drawings. The same reference numerals denote the same parts throughout the drawings. The drawings are not deliberately drawn to scale; the emphasis is on illustrating the gist of the present application.
FIG. 1 is a schematic flowchart of a data processing method provided in an embodiment of the present application.
FIG. 2 is a schematic flowchart of some steps in a data processing method provided in an embodiment of the present application.
FIG. 3 is another schematic flowchart of some steps in a data processing method provided in an embodiment of the present application.
FIG. 4 is a schematic flowchart of another data processing method provided in an embodiment of the present application.
FIG. 5 is a schematic flowchart of some steps in another data processing method provided in an embodiment of the present application.
FIG. 6 is a schematic block diagram of a data processing apparatus provided in an embodiment of the present application.
FIG. 7 is a more detailed schematic block diagram of a data processing apparatus provided in an embodiment of the present application.
FIG. 8 is a schematic block diagram of another data processing apparatus provided in an embodiment of the present application.
FIG. 9 is a more detailed schematic block diagram of another data processing apparatus provided in an embodiment of the present application.
FIG. 10 is a schematic structural diagram of an electronic device provided in an embodiment of the present application.
FIG. 11 is a schematic structural diagram of another electronic device provided in an embodiment of the present application.
DETAILED DESCRIPTION
The technical solutions in the embodiments of the present application will be described below with reference to the accompanying drawings. Where no conflict arises, the embodiments of the present application, or the more specific implementation details within the embodiments, may be combined with one another.
It should be noted that similar reference numerals and letters denote similar items in the following drawings; therefore, once an item is defined in one drawing, it need not be further defined or explained in subsequent drawings. Meanwhile, in the description of the present application, terms such as "comprise", "include", or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device including a series of elements includes not only those elements but also other elements not explicitly listed, or also includes elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article, or device that includes the element.
Furthermore, the term "and/or" in the present application merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may represent three cases: A exists alone, both A and B exist, and B exists alone.
The terms "first" and "second" in the present application are used only to distinguish one entity, operation, or object from another, and do not require or imply any actual relationship or order between these entities, operations, or objects.
The embodiments of the present application relate to application scenarios in which network models (various neural network models) are used for data processing. For a better understanding of the solutions of the embodiments, the relevant terms and concepts that may be involved are introduced first.
A neural network model may be composed of neural units and may be understood as a model having an input layer, hidden layers, and an output layer: generally, the first layer is the input layer, the last layer is the output layer, and all layers in between are hidden layers (there may be many of them). A neural network model uses one or more layers (e.g., hidden layers and an output layer) to generate an output for a received input; the output of each hidden layer serves as the input of the next layer (e.g., the next hidden layer or the output layer), and each layer generates its output from the received input according to that layer's current parameters (e.g., weights).
The operations contained in a neural network model (such as convolution, pooling, activation, normalization, and classification) must be translated into hardware execution commands before a hardware device can execute them. By executing these hardware execution commands, the hardware device implements the functions of the corresponding operations in the network model, thereby supporting the ability to run the neural network on the hardware device to process input data.
To express the computational logic of a network model, a computational graph is commonly used. Each node in the computational graph may correspond to one operation in the network model; these operations are also called operators, and each operator has its own characteristics and performs a specific function. The computational graph of a network model usually contains many different operations, for example convolution operations, pooling operations, and activation functions.
The inventors found that, at present, when data processing needs to be performed by running a network model, each time the processor runs the network model it must re-translate each operation in the model into hardware execution commands on the fly and send the commands temporarily generated for a single operation to the hardware for execution (that is, only after part of the hardware execution commands has been generated for one operation of the network model and sent to the hardware for execution are the hardware execution commands for the next operation of the same network model generated and sent to the hardware). As a result, the processor incurs a large performance overhead, and takes a long time, every time it runs the network model.
In view of this, based on the characteristics of network models, the inventors propose the following embodiments to mitigate the above problems.
The inventors of the present application observed that when a network model is used for data processing, the structure of the model itself is fixed; only the input data processed each time the model is loaded onto hardware for execution may differ, and different inputs may yield different outputs. Based on this, the present application generates (or translates) in advance, for each operation contained in the network model, the hardware execution commands that the corresponding target hardware device can execute, but does not immediately send them to the target hardware device (i.e., the hardware device capable of executing the commands). Instead, the hardware execution commands corresponding to the operations are stored first (for example, in a constructed network execution graph), so that each subsequent time the network model is needed to process input data (e.g., for recognition, classification, feature extraction, or resizing), the pre-stored hardware execution commands can be distributed to the corresponding hardware for execution. This helps quickly load the computational logic and computing tasks of the network model onto the hardware that needs to run it. With the hardware execution commands generated in advance for each operation of the network model, each execution only requires modifying the content related to input and output; there is no need to re-translate each operation in the model into hardware execution commands. This alleviates the problem that the processor incurs a large performance overhead and takes a long time every time it runs the network model.
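The "translate in advance, patch at run time" idea above can be sketched as follows. This is a minimal illustration only: the command format, the placeholder base address, and the function names are assumptions made for clarity, not the actual driver interface.

```python
# Sketch: operations are translated once into commands that reference a
# placeholder ("fake") memory space; every later run only rebases the
# addresses onto the real memory space instead of re-translating.

FAKE_BASE = 0xDEAD0000  # assumed placeholder base address

def translate_model(ops):
    """Build the cached command list (the 'network execution graph')."""
    commands, offset = [], 0
    for op in ops:
        commands.append({
            "op": op["name"],
            "device": op["device"],                   # target device id
            "src": FAKE_BASE + offset,                # fake input address
            "dst": FAKE_BASE + offset + op["size"],   # fake output address
        })
        offset += 2 * op["size"]
    return commands

def run_model(cached, real_base):
    """Per-run step: patch fake -> real addresses, then dispatch."""
    patched = []
    for cmd in cached:
        c = dict(cmd)
        c["src"] = cmd["src"] - FAKE_BASE + real_base
        c["dst"] = cmd["dst"] - FAKE_BASE + real_base
        patched.append(c)   # a real driver would send this to hardware
    return patched
```

Because `run_model` copies each command before patching, the cached fake-address version survives unchanged for the next run, which is the point of the scheme.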
An embodiment of the present application provides a data processing method applicable to network models used in various artificial-intelligence application scenarios, including but not limited to: text processing, speech recognition and processing, multilingual translation, image recognition, biometric recognition, and intelligent control. The data processing method can be applied to a driver program and to an AI chip, and the AI chip may be a homogeneous processor or a heterogeneous processor.
For better understanding, the data processing method provided in the embodiment of the present application is described below with reference to FIG. 1, FIG. 2, and FIG. 3.
S1: obtain the computational graph of the network model to be run.
Before the operations of the network model are translated into hardware execution commands, the computational graph of the network model to be run and translated is obtained.
In the field of artificial intelligence, a computational graph is a common representation of a computation process, often used to express the computational logic of a neural network model, and is widely used on various data processing platforms. Each node in the computational graph represents an operation (i.e., an operator) that the network model needs to perform, and the directed edges between nodes represent the dependencies between the operations corresponding to the nodes. After the operations (or operators) in the computational graph are translated into hardware execution commands, the commands are sent to the corresponding hardware devices for execution, thereby completing the execution of the network model. The operators corresponding to the nodes may be defined at the granularity of algebraic operators (such as vector addition, subtraction, multiplication, division, and matrix multiplication); when the abstraction granularity of the operators is low, the computational graph of a network model may include many nodes (for example, several thousand).
The computational graph of the network model to be run and translated, obtained in step S1, may be the original computational graph or an optimized computational graph, for example a graph obtained after operator fusion. After the network structure of the network model is converted into the original computational graph, the graph may be optimized one or more times to obtain the optimized computational graph.
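As a concrete illustration of the structure just described, a computational graph can be reduced to operations plus dependency edges, with a valid execution order given by a topological sort. This is a minimal, framework-agnostic sketch; the function name and data shapes are invented here.

```python
def topo_order(nodes, edges):
    """nodes: list of op names; edges: (producer, consumer) dependency pairs.
    Returns one valid execution order of the graph's operations."""
    indeg = {n: 0 for n in nodes}
    succ = {n: [] for n in nodes}
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    ready = [n for n in nodes if indeg[n] == 0]   # ops with no pending inputs
    order = []
    while ready:
        n = ready.pop(0)
        order.append(n)
        for m in succ[n]:
            indeg[m] -= 1
            if indeg[m] == 0:
                ready.append(m)
    if len(order) != len(nodes):
        raise ValueError("graph has a cycle")
    return order
```

For example, a conv -> pool -> act chain with a residual add consuming both conv and act is ordered so that every operation runs only after its producers.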
In one implementation, the AI chip may obtain the computational graph of the network model directly or indirectly, as long as the structure of the network model can be determined and the operations the network model needs to implement are known. The AI chip is provided with a corresponding driver program, which may be deployed in the kernel of the AI chip.
S2: translate each operation in the computational graph of the network model into hardware execution commands that the target hardware device of the AI chip can execute.
The translation process of S2 may be performed by the driver program corresponding to the AI chip.
翻译得到的硬件执行命令中可包含目标硬件设备的设备信息(如设备标识),用于表示该硬件执行命令可以由哪个硬件设备来执行,不同的目标硬件设备对应的设备信息不同。网络模型中的操作被翻译后,得到的硬件执行命令可以在需要运行该硬件执行命令对应的网络模型时,被提供给对应的硬件设备执行。The translated hardware execution command may contain the device information of the target hardware device (such as the device identification), which is used to indicate which hardware device can execute the hardware execution command. Different target hardware devices have different corresponding device information. After the operation in the network model is translated, the obtained hardware execution command can be provided to the corresponding hardware device for execution when the network model corresponding to the hardware execution command needs to be run.
目标硬件设备是指运行该硬件执行命令的硬件设备,是期望能够具有运行该网络模型这一能力的硬件对象。一个AI芯片可能会涉及多个硬件设备。The target hardware device refers to the hardware device that runs the hardware execution command and is the hardware object that is expected to have the ability to run the network model. An AI chip may involve multiple hardware devices.
For example, the AI chip may be a dedicated compute acceleration chip (or accelerator) designed for heavy computing workloads, such as a Graphics Processing Unit (GPU) or a Tensor Processing Unit (TPU); it may also be another homogeneous or heterogeneous processor.
Optionally, one AI chip may contain multiple hardware devices, any of which may serve as the target hardware device according to actual needs. A hardware device may contain multiple kinds of hardware execution units. For example, a hardware device in an AI chip may include, but is not limited to: a first unit for general-purpose computing (CU, Compute engine Unit), a second unit for AI-accelerated computing (TU, Tensor Unit), and a third unit for data transfer (DMA, Direct Memory Access). A hardware device in an AI chip may also be regarded as a compute cluster containing multiple hardware execution units. Different types of hardware devices may differ in both the number and the kinds of hardware execution units they contain; the specific hardware architecture should not be construed as limiting the method embodiments of the present application.
In an optional implementation, S2 may include: compiling the source code of each operation in the computation graph of the network model into instructions (hardware machine instructions) and obtaining the related information required by the target hardware device to perform each operation; and generating the hardware execution commands according to the instructions corresponding to each operation and the related information required to perform it. For example, the related information required by the target hardware device to perform an operation may reflect, for that operation: the address and length of the hardware instructions, how many memory addresses the instructions need to operate on, where those memory addresses are located, how large the memory is, the processing order between instructions, and so on.
For example, a preset first API (Application Programming Interface) function may be used to compile the source code of each operation in the computation graph into instructions, a preset second API function may be used to obtain the related information required by the target hardware device to perform each operation, and a preset third API function may be used to generate the hardware execution commands from the instructions of each operation and the related information required to perform it. The hardware execution commands corresponding to each operation of the network model may be generated in advance, and hundreds of hardware execution commands may be generated for a single operation.
The computation graph of a network model contains many different operations (also called operators; each operator has its own characteristics and performs a specific function), such as convolution, pooling, and activation functions. To translate the operations in the computation graph into hardware execution commands that the hardware can execute, the driver provides a set of relatively general API functions, such as an object-creation API, an instruction-compilation API, a memory-allocation API, a data-transfer API, and an execution API.
For example, for each operation of the network model, the driver provides a programmable language with C++-like syntax in which the source code of the operation can be written. The driver also uses the preset first API functions (such as the object-creation API and the instruction-compilation API) to compile, through a compiler, the source code of an operation in the computation graph into the hardware instructions corresponding to that operation. The implementation details of compiling source code into instructions with a compiler are well known in the art and are not described here.
Each operation needs operands, i.e., data to operate on. For example, the convolution operation convolves input data with weights, so a memory-allocation API provided by the driver can be used to allocate a block of memory for the convolution. In addition, some operations may involve moving data, so the data-transfer API provided by the driver is used to move data during computation. The driver can obtain the related information required by the target hardware device to perform each operation by using the preset second API functions (such as the aforementioned memory-allocation API and data-transfer API), and can use the preset third API function (such as the execution API) to generate the hardware execution commands from the instructions of each operation and the related information required to perform it. How to organize the instructions and related information of a single operation into the hardware execution commands for that operation is well known in the art and is not described here.
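The three-stage translation described above can be sketched as follows. All function names and data shapes here are illustrative stand-ins for the preset first, second, and third API functions; they are not a real driver API:

```python
# Hypothetical sketch of the S2 translation pipeline: compile each
# operation's source into instructions (first API), gather the related
# information such as buffers (second API), and assemble per-operation
# hardware execution commands (third API). Names are assumptions.

def compile_op(op_name):                      # stands in for the first API
    return f"machine-code({op_name})"

def query_related_info(op_name):              # stands in for the second API
    # e.g. instruction length and the memory buffers the op touches
    return {"instr_len": 64, "buffers": [f"{op_name}_in", f"{op_name}_out"]}

def build_commands(op_name):                  # stands in for the third API
    instr = compile_op(op_name)
    info = query_related_info(op_name)
    # one operation may expand into many commands; here, one per buffer
    return [{"op": op_name, "instr": instr, "buffer": buf,
             "len": info["instr_len"]} for buf in info["buffers"]]

commands = [cmd for op in ["conv", "pool"] for cmd in build_commands(op)]
```

The point of the sketch is the division of labor: compilation, information gathering, and command assembly are separate API calls whose outputs are combined per operation.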
With the data processing method provided in the embodiments of the present application, the process of generating the hardware execution commands for a network model needs to be performed only once. The generated commands are first cached, for example, stored in a constructed network execution graph; each time the model is to be executed, the commands stored in the graph are dispatched so that the hardware executes them. There is no need, as in the related art, to repeat the process many times, sending the generated commands to the hardware immediately each time (in the related art, a translation/conversion pass is needed before every dispatch to generate the required commands, so many dispatches require many translations).
S3: Store the hardware execution commands in a network execution graph.
After each operation of the network model is translated into hardware execution commands executable by the corresponding target hardware device, the translated commands are stored; in one implementation, a constructed network execution graph may be used for storage.
The network execution graph also reflects the computing logic of the network model and can be regarded as a new kind of computation graph, but unlike the original computation graph of the network model it does not need to record or store the source code of each operation.
The network execution graph records all hardware execution commands generated for the network model, and may also record key information for each command. The key information may include the start address, the offset, and the command execution order; from the start address and offset, the length and storage location of the command can be determined. The key information is used to obtain the hardware execution commands: the target hardware device can fetch a command according to its key information.
The network execution graph stores all commands of the network model that the hardware needs to execute. After the hardware execution commands are stored in the constructed network execution graph (this process converts the operations contained in the network model, including their characteristic parameters, into commands that the hardware can recognize and execute), and once the graph or the commands in it are provided to the target hardware device, the target hardware device can run the network model based on the commands cached in the graph.
A storage device in the AI chip may store the hardware execution commands in the network execution graph. The network execution graph may or may not reside on the target hardware device; for example, it may reside on a storage device connected to the target hardware device.
In the embodiments of the present application, each operation in the computation graph of the network model is first translated into hardware execution commands executable by the target hardware device, but the commands are not sent to the device immediately; instead, they are first stored in the network execution graph. Each subsequent time the network model needs to be run, the pre-stored commands are simply dispatched to the corresponding hardware, without re-translating the operations of the computation graph into hardware execution commands. This alleviates the problem that the processor incurs a large performance overhead and a long delay every time it runs the network model.
In an optional implementation, storing the translated hardware execution commands in the network execution graph may proceed as follows: the commands corresponding to each operation are stored in the graph in the execution order of the operations contained in the network model, and the key information of each command is recorded accordingly. Compared with random storage, storing the commands in execution order improves the efficiency of later command execution: since the commands must later be executed in the execution order of the operations contained in the network model to guarantee its correct operation, storing them in that order means that, at execution time, the commands can simply be dispatched in storage order.
By storing the commands corresponding to each operation in the execution order of the operations and recording the key information of each command, the computing logic of the network (the execution order of the operations in the network model) can later be quickly determined from the network execution graph. When executing the network model, the corresponding hardware execution commands can then be sent to the target hardware device in order, according to the key information recorded in the graph and the execution order of the operations, thereby avoiding execution-logic errors and improving efficiency.
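A minimal data-structure sketch of such a network execution graph follows. The field names (start address, offset used as command length, sequence index) are illustrative assumptions drawn from the key information described above:

```python
# Minimal sketch of a network execution graph: command payloads are
# appended in the operations' execution order, and key information
# (start address, offset/length, order index) is recorded per command.
# Field names are illustrative, not taken from any real driver.

class NetworkExecutionGraph:
    def __init__(self):
        self.commands = []      # command payloads, in execution order
        self.key_info = []      # one key-information record per command

    def append(self, payload, start_addr):
        seq = len(self.commands)            # dispatch position of this command
        self.key_info.append({"start": start_addr,
                              "offset": len(payload),  # command length
                              "order": seq})
        self.commands.append(payload)
        return seq

graph = NetworkExecutionGraph()
graph.append(b"\x01\x02\x03\x04", start_addr=0x1000)
graph.append(b"\x05\x06", start_addr=0x1004)
```

Dispatch then simply walks `commands` front to back, which reproduces the operations' execution order without consulting the original computation graph.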
Since a network model performs the same computing operations on every execution (only the input data and output results differ), all operations contained in the model can be translated in advance by the driver into a sequence of commands and stored; each time the network model is executed, the hardware execution commands need only be fine-tuned.
For example, in a scenario where an AI model is run to recognize different face images, two face recognition passes use the same AI model, so each time the AI model is executed, the essential computing logic of the network execution graph used (including the generated and cached hardware execution commands) is unchanged. When a new face image needs to be recognized after a previous one, only the input/output-related content and parameters in some of the commands need to be fine-tuned: for example, the read address used to fetch input data is replaced with the address of the new data, and/or the write address used to store output data is replaced with a new write address, so that the same AI model processes the new input data and stores the corresponding output at a new location. This greatly reduces the burden on the processor and improves data processing efficiency.
This method also applies to scenarios with multiple network models. When there are multiple network models, for each network model the operations it contains can be translated into hardware execution commands executable by the corresponding target hardware device, with the commands containing the device information of that device, and the translated commands stored. Each network model corresponds to a unique network execution graph. Because the operations contained in the different network models are translated into hardware execution commands and stored, whenever a particular network model needs to be executed later (the required model can be selected according to the task to be processed), the pre-stored commands corresponding to that model can be selected and dispatched so that the hardware executes the commands of that model.
Optionally, the data processing method may further include S4: when the network model needs to be run to process input data, obtain the pre-stored hardware execution commands and send them to the target hardware device for execution, so that the network model runs on the target hardware device. For example, the corresponding commands are sent to the target hardware device in the execution order of the operations contained in the network model, thereby supporting running the network model on the target hardware device to process the input data.
When all hardware execution commands corresponding to the network model have been stored in advance, subsequent runs of the network model on input data can dispatch commands directly from the pre-stored commands so that the hardware executes them.
Sending the hardware execution commands to the target hardware device for execution may include: modifying the read address used to fetch input data in a command, and/or modifying the write address used to store output data in a command; and sending the modified commands to the target hardware device so that it executes them, thereby running the network model on the hardware to process the input data. Optionally, the modified commands may be sent to the target hardware device in the execution order of the operations contained in the network model, so that the target hardware device executes the modified commands.
By modifying the read address used to fetch input data and/or the write address used to store output data in the hardware execution commands, and then sending the modified commands to the target hardware device, successive executions of the network model can fetch input data from different locations and store output data at different locations, providing greater flexibility.
In this way, hardware execution commands generated for one target hardware device can be quickly extended to other hardware devices in the AI chip. When multiple hardware devices need to run the network model in parallel, the operations of the computation graph of the network model do not have to be re-translated into commands separately for each device, further reducing processor overhead and improving data processing efficiency.
Suppose that in the previous execution of the network model the input data was stored at location A and the output data at location B. If the model now needs to process input data stored at location C, the read address used to fetch input data in the hardware execution commands can be changed from A to C, and the write address used to store output data can be changed, for example, from B to D. The modified commands are then sent to the target hardware device in the execution order of the operations contained in the network model, so that when the target hardware device runs the network model it processes the input data stored at C and stores the resulting output data at D.
If the fetch address of the input data is unchanged (for example, the previous input data was at location A and the new input data is also placed at A, replacing the original data), the read address in the hardware execution commands need not be modified. Likewise, if the storage address of the output data is unchanged (for example, the output of processing the new input is still to be stored at B), the write address need not be modified. If the output of processing the new input at A is to be stored at D instead, the write address in the commands can be changed from B to D.
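The address fine-tuning described in the preceding paragraphs can be sketched as a small patching pass over the cached commands. The command representation (dicts with `read_addr`/`write_addr` fields) is an illustrative assumption:

```python
# Hedged sketch of fine-tuning cached commands before dispatch: only
# the input read address and/or output write address are patched; the
# cached command sequence itself is reused unchanged.

def patch_addresses(commands, new_read=None, new_write=None):
    patched = []
    for cmd in commands:
        cmd = dict(cmd)                       # keep the cached copy intact
        if new_read is not None and "read_addr" in cmd:
            cmd["read_addr"] = new_read       # e.g. location A -> location C
        if new_write is not None and "write_addr" in cmd:
            cmd["write_addr"] = new_write     # e.g. location B -> location D
        patched.append(cmd)
    return patched

cached = [{"op": "load", "read_addr": 0xA000},
          {"op": "conv"},
          {"op": "store", "write_addr": 0xB000}]
ready = patch_addresses(cached, new_read=0xC000, new_write=0xD000)
```

Passing `None` for either argument leaves that address untouched, matching the case above where the fetch or storage address is unchanged between runs.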
The process described above concerns a single hardware device inside the AI chip. In one implementation, if the AI chip contains multiple parallel hardware devices, then to support running the network model on multiple parallel hardware devices the data processing method may further include: S5: replicate the hardware execution commands obtained in S2 according to the total number of hardware devices in the AI chip; S6: according to the device information of the hardware devices in the AI chip other than the target hardware device, modify the device information contained in the replicated commands to obtain commands with modified device information, where a command with modified device information can be provided to the corresponding other hardware device for execution.
By replicating the hardware execution commands into a specified number of copies, the specified number being determined by the total number of hardware devices in the AI chip, and modifying each copy according to the device information of a hardware device other than the target hardware device (i.e., modifying the device information in that copy), each modified copy can be run by the corresponding other hardware device.
After the commands with modified device information are obtained, the commands generated for the network model can be sent, according to the principle of device-information matching, to the hardware devices in the AI chip whose device information matches, so that every hardware device in the AI chip obtains commands it can execute and can therefore run the network model. Commands may also be distributed by sending the network execution graph: the graph is replicated into multiple copies, the device information in each copy is modified, and the copies with different device information are sent to the hardware devices in the AI chip whose device information matches.
Suppose the total number of hardware devices in the AI chip capable of running the network model is 3 (denoted hardware device 0, hardware device 1, and hardware device 2, with hardware device 0 being the target hardware device). The hardware execution commands generated for the target hardware device can then be copied twice. One copy is modified according to the device information of hardware device 1 so that the modified copy can be executed by hardware device 1; likewise, the other copy is modified according to the device information of hardware device 2 so that it can be executed by hardware device 2. The hardware execution commands already translated for the network model are thus extended within the same chip, so that every hardware device in the chip can run the network model.
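The replicate-and-retarget steps of S5/S6 can be sketched as follows. The command representation and the idea that device information is a single `device` field are illustrative assumptions; a real driver would patch device identifiers embedded in the command encoding:

```python
# Illustrative sketch of S5/S6: replicate the commands cached for the
# target device (device 0) and rewrite only the device information so
# each copy matches another hardware device in the same chip.

def extend_to_devices(commands, all_device_ids, target_id=0):
    per_device = {target_id: commands}        # original copy stays as-is
    for dev in all_device_ids:
        if dev == target_id:
            continue
        # replicate, then fine-tune the device information of the copy
        per_device[dev] = [dict(cmd, device=dev) for cmd in commands]
    return per_device

cached = [{"op": "conv", "device": 0}, {"op": "pool", "device": 0}]
per_device = extend_to_devices(cached, all_device_ids=[0, 1, 2])
```

Each entry of `per_device` can then be dispatched to the device whose information it matches, so translation still happens only once for the whole chip.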
In this implementation, hardware execution commands generated for one hardware device can be quickly extended to other hardware devices. In one application scenario, the network execution graph is first copied to the other hardware devices, and then the device-related information in each copy is modified (the graph is fine-tuned). Based on one set of generated and cached hardware execution commands, multiple copies are made and cached on multiple hardware devices, and the information in each copied set is modified for the matched device. The modification here is a fine-tuning whose purpose is to adapt the copied commands to each hardware device. For example, if a set of hardware execution commands has been generated and cached for hardware device 0 and an AI model, then after these commands are copied to hardware devices 1, 2, and 3, the three copies are modified into commands suitable for hardware device 1, hardware device 2, and hardware device 3, respectively.
Without the above command-extension approach, one would have to designate a hardware device, generate a set of hardware execution commands for it, then designate another device and generate another set for it, which is repetition at another level. Generating commands separately for different devices multiple times also affects processor performance to some extent, resulting in high power consumption and low efficiency.
Optionally, the number of hardware devices that need to run the network model at the current moment may be configured manually or determined from the amount of data to be processed. In the latter case, the data processing method may further include: determining, from the amount of data to be processed, a first number of hardware devices currently needed to run the network model. The first number is less than or equal to the aforementioned total number; running the network model does not necessarily use every hardware device in the chip, and the number of devices actually needed can be decided according to the actual application scenario.
Some or all of the hardware devices can be selected to run the network model, based on the actual amount of data to be processed and the total number of hardware devices that support running the network model, so as to process the input data. This determines the required number of hardware devices as reasonably as possible and improves processing efficiency.
When the amount of data to be processed is small, one hardware device may suffice; when the workload is large, multiple hardware devices may need to run in parallel. For example, when an AI model is used to recognize CT (Computed Tomography) images in a medical scenario, it may only be necessary to recognize one image or a small amount of image data, in which case one hardware device running the AI network meets the computing demand. For scenarios requiring a large amount of recognition in a short time, i.e., where a large amount of image data must be recognized, or where the computing results must be produced in real time, multiple hardware devices can run the AI network in parallel.
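One way to pick the first number of devices from the workload, consistent with the description above, is a simple capacity rule. The per-device capacity parameter and the ceiling-division policy are assumptions for illustration only:

```python
# Hypothetical policy for choosing how many devices should run the
# model: scale with the pending workload, capped by the chip's total
# device count and never below one device.

def devices_needed(num_items, items_per_device, total_devices):
    # ceiling division of the workload across devices
    needed = -(-num_items // items_per_device)
    return max(1, min(needed, total_devices))
```

With 3 devices and a capacity of 100 items each, a single CT image uses one device, while a 250-image batch engages all three.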
In addition, in current scenarios where an AI chip (which may be any of various processors) uses network models for data processing, the processor occupies memory space whenever it generates hardware execution commands (referred to simply as hardware commands or commands) for a network model, and the data required by different network models occupies different amounts of memory. Memory can therefore easily run short, preventing the hardware from executing the network model as expected and potentially degrading hardware performance. For this reason, embodiments of the present application propose a command generation method to mitigate the problem that generating hardware execution commands for a network model consumes a large amount of memory each time and easily exhausts available memory.
Based on the characteristics of network models, the inventors propose the following exemplary embodiments to address the above problem. For clarity, one possible approach provided by an embodiment of the present application is described below with reference to Figure 2. In an optional implementation, translating the operations in the computational graph of the network model into hardware execution commands executable by the target hardware device of the AI chip may include the following S21 and S22.
S21: Allocate a corresponding fake memory space for the network model.
In this embodiment, before generating the corresponding first hardware execution commands for the operations included in the network model, a corresponding fake memory space (which may be denoted as fake memory) is allocated for the network model. As one implementation, S21 may be executed by a kernel in the AI chip.
In one implementation, a fake memory space can be allocated whose size corresponds to the size of the data required to execute the network model (this data includes the input data to be processed by the network model, and may also include the network model's own feature data, such as weights and parameters). Because the fake memory space is sized according to the data required to execute the network model, the requirements for generating commands can be met without occupying the hardware's real memory resources.
It should be noted that creating or allocating a fake memory space can be regarded as occupying essentially no physical storage. Even if a small amount of real physical storage must be used to allocate, record, or mark the fake memory space, the total is only on the order of 1 KB to a few KB (this figure is merely illustrative). Allocating fake memory does not occupy the real memory of the hardware device that will run the network model; the bookkeeping involved may occupy a small amount of storage on hardware that does not need to run the model, but since that amount is tiny (around 1 KB, for example), it can be neglected and treated as occupying no real memory and no real physical storage.
For example, if the data required to execute a network model is 2 GB, then when allocating the fake memory resource, a fake memory space of 2 GB can be allocated. Like real memory, every storage row in the fake memory space has an independent address, and the size of the fake memory space matches the amount of real memory that caching the data required to execute the network model would be expected to occupy. No actual 2 GB of physical memory is allocated: when fake memory is allocated, the data required to execute the network model is not actually loaded (or written) into the real memory of the hardware device that will run the model, so no real memory resources of the hardware are occupied.
It should be understood that the fake memory space has the same attributes as a real memory space: it can have a size (the fake and real spaces can be equal in size) and independent addresses (the address format and lookup scheme can be designed just like real memory). The only difference is that the fake memory space is not a physically existing storage space. The fake memory allocated here is neither real physical memory nor conventional virtual memory (sometimes called logical memory) that must be mapped to physical memory. The addresses held by the fake memory can be regarded as fake addresses, deliberately created and allocated to satisfy the need to generate hardware execution commands.
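The fake memory described above can be sketched as pure address bookkeeping: an address range with real-memory-like attributes (size, independent addresses) but no backing storage. The class and field names below are illustrative assumptions, not the actual driver implementation.

```python
class FakeMemory:
    """Sketch of a fake memory space: hands out fake addresses in a
    real-address-like format without allocating backing physical storage."""

    def __init__(self, size_bytes: int, base: int = 0x4000_0000):
        self.base = base      # fake base address (format mimics a real one)
        self.size = size_bytes
        self._next = base     # simple bump allocator over fake addresses

    def alloc(self, nbytes: int) -> int:
        """Return a fake address; no data is loaded or written anywhere."""
        if self._next + nbytes > self.base + self.size:
            raise MemoryError("fake memory space exhausted")
        addr = self._next
        self._next += nbytes
        return addr

# A 2 GB fake space costs only a few bookkeeping fields, not 2 GB of RAM.
fm = FakeMemory(2 * 1024 ** 3)
a = fm.alloc(4096)  # fake address for, e.g., an operator's input buffer
```

Allocation succeeds within the declared size and fails beyond it, mirroring how a real allocator would behave while consuming only a handful of bytes of actual storage.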
S22: Based on the fake memory space, translate each operation included in the network model into a corresponding first hardware execution command.
In this embodiment, each operation (i.e., each operator) included in the network model is translated, based on the fake memory space (denoted, for example, as fake memory), into corresponding first hardware execution commands; a single operation may yield hundreds of such commands. Because command generation is performed against the fake memory space, it occupies no real memory space. Thus, even when first hardware execution commands are generated for each of multiple network models, memory will not run short.
Each network model's hardware execution commands typically contain operation information such as the operations themselves, read addresses for fetching the data sources an operation needs, and write addresses for storing its results; generating a command therefore requires addresses for reading and writing data. In the present application, considering that practice does not always demand that a command be executed, with immediate memory reads and writes, as soon as it is generated, the hardware execution commands are generated from (the addresses of) the fake memory space, so that all addresses in the commands are fake addresses. When it is determined that certain commands need to be executed, the addresses in those commands can be replaced. This makes it possible to translate hardware execution commands for the operations of a network model in advance, which improves the model's execution efficiency while avoiding the excessive memory consumption that translating commands ahead of time would otherwise cause.
S22 may be executed by the kernel of the AI chip, in which a driver for translating the operations included in the network model into corresponding hardware execution commands can be deployed.
Optionally, the implementation of S22 may include: compiling the source code of each operation included in the network model into instructions corresponding to that operation, and obtaining, based on the fake memory space, the related information required to execute each operation; then generating the first hardware execution commands from each operation's instructions and the related information required to execute it.
In this implementation, the source code of each operation in the network model is compiled into instructions, the related information needed to execute each operation is obtained from the fake memory space, and the hardware execution commands are then generated from the instructions and that related information. The operations of the network model can thus be translated into hardware execution commands quickly and accurately, and because the related information is obtained from the fake memory space, the requirements for command generation are met without occupying the hardware's real memory resources.
To translate the operations of a network model into hardware execution commands that the hardware can run, the driver provides a set of fairly general API functions, such as a create-compilation-object API, a compile-instruction API, a create-memory API, a data-transfer API, and an execute API.
Illustratively, for each operation of the network model, the driver provides a programmable language similar in syntax to C++, in which the operation's source code can be written. Backed by a compiler, the driver uses preset first API functions (for example, the create-compilation-object API and the compile-instruction API) to compile the source code of an operation into the hardware instructions for that operation. The details of compiling source code into hardware instructions with a compiler are well known in the art and are not repeated here.
Each operation needs operands, that is, data to operate on. For example, the convolution operation convolves input data with weights; a create-memory API provided by the driver can allocate a block within the fake memory space and hand it to the convolution operator. Some operations may also involve moving data, so the driver provides a data-transfer API for moving data during computation. Using preset second API functions (for example, the aforementioned create-memory API and data-transfer API), the driver can obtain, based on the fake memory space, the related information the hardware device needs to execute each operation. Illustratively, the related information for one operation can reflect: the address and length of its instructions, how many memory addresses the instructions operate on, where those addresses are located, how large the memory regions are, the processing order among instructions, and so on. Finally, a preset third API function (for example, the execute API) can be used to generate the first hardware execution commands from each operation's instructions and the related information required to execute it.
In some application scenarios, if compiling an operation's source code into hardware instructions itself requires some memory information, and the instructions need not be executed immediately nor any actual data read, written, or loaded for the time being, the fake memory space described above can also be used during instruction compilation; when data reads, writes, or loads become necessary, the fake addresses in the instructions are replaced with real addresses.
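The S22 flow can be sketched end to end: compile an operation, reserve fake addresses for its operands, and emit first hardware execution commands carrying those fake addresses. Every function name below is an assumption standing in for the driver's real APIs (create-compilation-object, compile-instruction, create-memory, execute).

```python
def compile_instructions(op_source):
    """Stand-in for the create-compilation-object / compile-instruction
    APIs; a real compiler would emit hardware instructions here."""
    return [f"{op_source}:instr{i}" for i in range(3)]

# Minimal bump allocator over an assumed fake address range.
_next_fake = [0x4000_0000]
def fake_alloc(nbytes):
    addr = _next_fake[0]
    _next_fake[0] += nbytes
    return addr

def translate_op(op_source, in_bytes, out_bytes):
    """Translate one operation into first hardware execution commands.

    Read/write addresses come from fake memory; no data is loaded yet,
    so no real memory is consumed while the commands are generated.
    """
    instrs = compile_instructions(op_source)
    read_addr = fake_alloc(in_bytes)    # create-memory API on fake memory
    write_addr = fake_alloc(out_bytes)
    return [{"instr": ins, "read": read_addr, "write": write_addr}
            for ins in instrs]

cmds = translate_op("conv2d", 1 << 20, 1 << 20)
```

The resulting commands hold fake addresses only; they become executable later, once those addresses are swapped for real ones.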
Optionally, in addition to the operations, the read addresses for fetching each operation's data sources, and the write addresses for storing its results, the first hardware execution command may also carry device information of the hardware device (such as a device identifier) indicating which hardware device is to execute it; different hardware devices have different device information. A hardware device is a hardware object expected to be capable of running the network model, and one AI chip may involve multiple hardware devices. The hardware execution commands obtained by translating a network model can be supplied to the corresponding hardware device for execution whenever the model corresponding to those commands needs to be run.
When there are multiple network models, translating each model's operations into corresponding first hardware execution commands based on fake memory space may proceed as follows: for different network models, different fake memory spaces are used to translate each model's operations into its first hardware execution commands, with each network model having its own fake memory space (i.e., one fake memory space per network model). Because fake memory space does not really exist, even a large number of network models adds little memory consumption.
In this embodiment, different fake memory spaces are used for different network models, so that subsequent address translation causes no logical confusion, ensuring efficient command conversion.
In an optional implementation, storing the hardware execution commands using a network execution graph may include, after S22, S31: storing the first hardware execution commands.
After S22, the first hardware execution commands are stored for later use. As one implementation, they may be stored in a previously constructed network execution graph; for example, a storage device in the AI chip may store them using the graph. The network execution graph records all first hardware execution commands generated for the network model, and may also record key information for each command. This key information may include the start address, the offset, and the command execution order; from the start address and offset, the length and storage location of the command can be determined. The hardware device can retrieve the first hardware execution commands from this key information.
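The per-command key information (start address, offset, execution order) can be sketched as a small record structure. The class and field names are illustrative assumptions; the point is that start address plus offset is enough to recover a command's location and length.

```python
from dataclasses import dataclass, field

@dataclass
class ExecGraph:
    """Sketch of a network execution graph: one record of key information
    per stored first hardware execution command."""
    records: list = field(default_factory=list)

    def append(self, start_addr, length):
        # Insertion order doubles as the command execution order here.
        order = len(self.records)
        self.records.append({"start": start_addr,
                             "offset": length,
                             "order": order})

    def locate(self, order):
        """Recover a command's storage span from its key information."""
        r = self.records[order]
        return r["start"], r["start"] + r["offset"]

g = ExecGraph()
g.append(0x1000, 64)
g.append(0x1040, 128)
```

A device looks up a command by execution order and reads exactly `offset` bytes starting at `start`.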
In this embodiment, each operation (i.e., each operator) included in the network model is translated into corresponding first hardware execution commands based on the fake memory space (denoted, for example, as fake memory). Because the process runs against the fake memory space, command generation does not consume large amounts of real memory. Thus, even when many commands are generated for one network model, or first hardware execution commands are generated for several models, real memory will not run short because of command generation. Furthermore, this method supports translating hardware execution commands for network models in advance, which helps generate each model's commands ahead of time for one or more models while avoiding the excessive use of limited memory resources that advance translation would otherwise cause.
Optionally, given an inherent feature of network models, namely that the model itself is fixed while only the input data differs between runs (with different inputs possibly yielding different outputs), the operations of the network model can first be translated into first hardware execution commands executable by the corresponding hardware device, but not immediately sent for execution. Instead, the translated first hardware execution commands are stored, so that each later time the model is needed to process input data, it need not be retranslated; only minor adjustment is required, replacing the addresses in the first hardware execution commands, for example modifying the address information related to the input and output data. The driver does not have to retranslate the model's operations into first hardware execution commands, saving the performance overhead the processor would otherwise incur on every run of the model.
Because the first hardware execution commands above are generated from the fake memory space, all addresses in them are fake. Although fake addresses can be looked up and used while generating commands, they cannot actually store loaded data during command execution. Therefore, when the network model later needs to be executed, the command generation method may further include: loading the data required to execute the network model into real memory space, replacing the fake addresses in the first hardware execution commands with the real addresses of that real memory space, and sending the replaced first hardware execution commands, as second hardware execution commands, to the corresponding hardware device for execution.
Replacing the fake addresses in the first hardware execution commands with real addresses may include: examining the first hardware execution commands to determine which of them (some or all) currently contain fake addresses, taking those as target commands; and replacing the fake addresses in the target commands with the real addresses of the real memory space. During replacement, it is necessary to identify which first hardware execution commands use fake addresses; once identified, the fake addresses in some or all of them can be replaced with the corresponding real addresses.
In this embodiment, during address replacement the first hardware execution commands are examined, and only those containing fake addresses are replaced, avoiding erroneous or missed replacements.
The identification may yield the following results: 1) when none of the first hardware execution commands has been executed, the addresses in all of them are fake; 2) while the network model is executing, some commands are being executed by the hardware device, and the addresses of those commands may already have been replaced with real (valid) addresses, while other, not-yet-executed commands may still hold fake addresses.
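The identify-and-replace step can be sketched as follows. The command layout and the identification rule (an address is fake if it falls inside the fake address range) are assumptions for illustration; already-patched commands pass through unchanged, matching result 2) above.

```python
# Assumed fake address range used when the commands were generated.
FAKE_BASE, FAKE_SIZE = 0x4000_0000, 0x1000_0000

def is_fake(addr):
    return FAKE_BASE <= addr < FAKE_BASE + FAKE_SIZE

def patch_commands(cmds, real_base):
    """Replace fake read/write addresses with real ones; commands that
    already hold real addresses are left untouched."""
    patched = []
    for c in cmds:
        c = dict(c)  # leave the cached original intact
        for key in ("read", "write"):
            if is_fake(c[key]):
                c[key] = real_base + (c[key] - FAKE_BASE)
        patched.append(c)
    return patched

cmds = [{"instr": "conv", "read": FAKE_BASE, "write": FAKE_BASE + 0x100}]
patched = patch_commands(cmds, real_base=0x8000_0000)
```

After patching, the commands reference real memory and can be dispatched as second hardware execution commands.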
When multiple network models are involved, the fake addresses in the first hardware execution commands of all models may be replaced at once, or only those of some of the models.
Whether to replace the addresses of all commands or only some can be decided from optional factors such as the currently remaining available memory, the processing progress of the network models, the amount of pending data, and the processing capacity the chip supports. The present application places no requirement on the number of commands replaced per pass or on the operation types they correspond to: each pass may replace the fake addresses in a batch of commands corresponding to some operations of one network, or replace the fake addresses in all commands of an entire network model (or several models) at once.
For example, if a task requires two network models, then during replacement, if the currently remaining available memory or the chip's processing capacity supports processing both models' data at once, the fake addresses in both models' first hardware execution commands can be replaced in one pass; if replacing both models' fake addresses with real addresses at once is not currently supported, replacement can proceed in batches, first replacing the addresses of one model's commands and then those of the other's.
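The batching decision in the example above can be sketched as a greedy grouping by available memory. The function and its inputs are illustrative assumptions; real drivers would also weigh processing progress and chip capacity, as noted earlier.

```python
def plan_replacement(model_mem, free_mem):
    """Greedily group models whose commands can have their addresses
    replaced in one pass, given the real memory each model's data needs.

    model_mem: list of (model_name, required_bytes) — illustrative only.
    """
    batches, current, used = [], [], 0
    for name, need in model_mem:
        if current and used + need > free_mem:
            batches.append(current)   # flush: next pass starts here
            current, used = [], 0
        current.append(name)
        used += need
    if current:
        batches.append(current)
    return batches

# Two 2 GB models but only 3 GB free: replace one model per pass.
batches = plan_replacement([("net_a", 2 << 30), ("net_b", 2 << 30)], 3 << 30)
```

With enough free memory the same call returns a single batch, matching the one-pass replacement case in the example.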
It should be understood that the "real memory" referred to in this application is physical memory, and a "real address" is a physical address possessed by a physical storage medium, whereas a "fake address" is not a physical address but an artificial address that can be designed with attributes or a format similar to a physical address. In the step of replacing the fake addresses in the first hardware execution commands with the real addresses of real memory space, the "real address of real memory space" used for replacement may be either a physical address of physical memory or an address of virtual memory for which a mapping to physical memory has been established in advance; relative to the fake memory of this application, such virtual memory also has real addresses and occupies physical storage. In general, virtual memory mapping can establish a mapping between one physical storage space (possibly external storage or physical memory) and another (usually physical memory), logically associating originally discontinuous physical addresses so that scattered, unrelated, unordered physical addresses become logically related and ordered in some scenarios; actual loading, reading, and writing of data can also be completed through this mapping. As for the real address used to replace a fake address, whether it is a physical-memory address or another physical address mapped to physical memory in advance, it suffices that the replaced first hardware execution command can be executed correctly.
The cached first hardware execution commands of each network model can be processed and executed on demand. For example, if first hardware execution commands for 20 network models are cached but only the commands of one model currently need to run, replacement may temporarily be done only for those commands, and the resulting new commands (which may be called second hardware execution commands) dispatched to the specific hardware device for execution.
Furthermore, optionally, the real addresses in a second hardware execution command can be replaced with the fake addresses of the fake memory space, so that the second hardware execution command again becomes a first hardware execution command containing fake addresses (i.e., the addresses of the address-replaced first hardware execution command are changed back to fake addresses), thereby releasing the corresponding physical memory resources. After the replaced first hardware execution commands have been sent to the corresponding hardware device as second hardware execution commands, when a network model no longer needs to be executed, the command generation method may further include: upon determining that the network model will not be executed within a preset time period starting from the current moment, replacing, based on the fake memory space, the real addresses in the model's second hardware execution commands with fake addresses, and caching the resulting commands for use the next time the same network model must be run.
Through this implementation, after the replaced first hardware execution commands are sent to the corresponding hardware device as second hardware execution commands, the real addresses in the commands are replaced with the fake addresses of the fake memory space and the commands are cached, freeing part of the corresponding real memory space.
It should be understood that the addresses in a first hardware execution command are all fake addresses, and the addresses in a second hardware execution command are all real addresses. If the fake addresses of a first hardware execution command are replaced with real addresses, the replaced command is a second hardware execution command; likewise, after the real addresses in the replaced first hardware execution command (the second hardware execution command) are changed back to fake addresses, the first hardware execution command is obtained again.
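The reverse replacement (second command back to first command) is the mirror image of the forward patch, and under the offset-based scheme assumed here it is an exact round trip. The base constants and command layout are illustrative assumptions.

```python
# Assumed fake and real base addresses; the fake space was laid out to
# mirror the real one, so the conversion is a fixed offset either way.
FAKE_BASE, REAL_BASE = 0x4000_0000, 0x8000_0000

def to_fake(cmd):
    """Turn a second hardware execution command back into a first one so
    the backing real memory can be released; the result is cacheable."""
    return dict(cmd,
                read=FAKE_BASE + (cmd["read"] - REAL_BASE),
                write=FAKE_BASE + (cmd["write"] - REAL_BASE))

second = {"instr": "conv", "read": REAL_BASE, "write": REAL_BASE + 0x100}
first_again = to_fake(second)  # identical to the originally generated command
```

Caching `first_again` lets the next run of the same model skip retranslation entirely: only the fake-to-real patch is repeated.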
Based on the same inventive concept as the above data processing method for alleviating the problem that the processor runs short of memory each time it generates hardware execution commands for a network model, before the operations included in the network model are translated into corresponding first hardware execution commands based on the false memory space, the data processing method of this embodiment of the present application further includes additional steps, as shown in Figure 3. The principle is described below with reference to Figure 3.
S210: Determine whether the network model will be executed within a preset time period starting from the current moment.
It is determined whether the network model will be executed or used within a preset time period starting from the current moment; if it is determined that the network model is to be executed within the preset time period, S220 is executed, and if it is determined that the network model will not be executed within the preset time period, S240 is executed. The preset time period can be set according to actual needs, for example in minutes or hours.
Through this implementation, the operations included in the network model are translated into corresponding first hardware execution commands based on the false memory space only when it is determined that the network model will not be executed within the preset time period starting from the current moment. This makes it possible to translate the network model in advance without reducing its processing efficiency, improves translation efficiency, and helps to improve the overall processing efficiency for the network model.
S220: Translate, based on the real memory space, each operation included in the network model into a corresponding second hardware execution command.
When it is determined that the network model is to be executed within the preset time period starting from the current moment, each operation included in the network model is translated into a corresponding second hardware execution command based on the real memory space. The addresses contained in the second hardware execution command are all real addresses, and the real memory space is used to store the data required when the network model is executed.
Through this implementation, when it is determined that the network model is to be executed within the preset time period starting from the current moment, each operation included in the network model is translated directly into a corresponding second hardware execution command based on the real memory space. When the network model needs to be executed as soon as possible, this avoids first generating a first hardware execution command based on the false memory space and then having to convert the addresses in that command into the real addresses required for execution, thereby improving the command translation efficiency and processing efficiency of the network model about to be executed.
When it is determined that the network model is to be executed within the preset time period starting from the current moment, a real memory space matching the size of the data required to execute the network model can be allocated, and each operation included in the network model is translated into a corresponding second hardware execution command on that basis. At the same time, the data required to execute the network model (which includes the input data to be processed by the network model and may also include the characteristic data of the network model itself, such as weights and parameters) is loaded into the real memory space, so that after the operations are translated, the second hardware execution commands can be sent directly to the corresponding hardware device, which executes them to run the network model.
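The sizing step above can be sketched as follows; the field names and the use of a `bytearray` standing in for device memory are illustrative assumptions, since the text does not fix a concrete data layout.

```python
def required_bytes(model):
    # Real memory must hold the input data plus the model's own
    # characteristic data (e.g. weights, parameters).
    return model["input_bytes"] + sum(model["weight_bytes"])

model = {"input_bytes": 1024, "weight_bytes": [4096, 2048]}
real_mem = bytearray(required_bytes(model))   # allocate a matching real space
assert len(real_mem) == 7168                  # 1024 + 4096 + 2048
```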
S230: Store the second hardware execution command.
After each operation included in the network model has been translated into a corresponding second hardware execution command based on the real memory space, the second hardware execution command is stored. When execution is needed later, the second hardware execution command is sent directly to the corresponding hardware device, which executes these second hardware execution commands to run the network model.
The implementation principle of S230 is the same as that of S31 in Figure 2, the difference being that S31 stores the first hardware execution command, whereas this step stores the second hardware execution command. The second hardware execution command can also be stored in the network execution graph.
S240: Translate, based on the false memory space, each operation included in the network model into a corresponding first hardware execution command.
When it is determined that the network model will not be executed within the preset time period starting from the current moment, each operation included in the network model is translated into a corresponding first hardware execution command based on the false memory space.
S250: Store the first hardware execution command.
The implementation principles of S240 and S250 are the same as those of S22 and S31 in Figure 2, respectively, and are not repeated here.
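The S210–S250 decision flow can be summarized with the hedged sketch below. `translate`, the memory-base arguments, and the command store are hypothetical stand-ins; the text does not prescribe a concrete interface.

```python
def translate(model, mem_base):
    # Stand-in for translating each operation into a hardware command whose
    # addresses are offsets from the given (real or false) memory base.
    return [(op, mem_base + i * 64) for i, op in enumerate(model["ops"])]

def prepare_commands(model, will_run_soon, real_base, fake_base, store):
    if will_run_soon:    # S210 -> S220/S230: real addresses, ready to dispatch
        store[model["name"]] = ("second", translate(model, real_base))
    else:                # S210 -> S240/S250: false addresses, cached for later
        store[model["name"]] = ("first", translate(model, fake_base))
    return store

store = {}
model = {"name": "resnet", "ops": ["conv", "relu"]}
prepare_commands(model, will_run_soon=False,
                 real_base=0x1000, fake_base=0xF000, store=store)
assert store["resnet"][0] == "first"   # idle model: translated ahead of time
```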
It should be noted that the process of translating the operations in the network model into hardware execution commands (including first hardware execution commands and second hardware execution commands) and the process of executing those hardware execution commands can be implemented by the same AI chip, or by two AI chips separately. For example, AI chip 1 may be responsible only for translating the operations in the network model into hardware execution commands while AI chip 2 is responsible for executing them, the two processes being completed through cooperation between the two chips.
When the two processes are implemented by two AI chips, AI chip 1 may translate the operations in the network model into hardware execution commands (including first hardware execution commands and second hardware execution commands) and store them; when the network model is to be run later, the corresponding hardware execution commands are sent to the hardware devices of AI chip 2 for execution, or the corresponding first hardware execution commands are first converted into second hardware execution commands and then sent to the hardware devices of AI chip 2 for execution, the command conversion process including: replacing the false addresses in the first hardware execution command with the real addresses corresponding to the real memory space to obtain the second hardware execution command. Alternatively, AI chip 1 may translate the operations in the network model into hardware execution commands and send them to AI chip 2 for storage; when the network model is to be run later, AI chip 2 retrieves the corresponding hardware execution commands and sends them to its hardware devices for execution. Alternatively, AI chip 1 may translate the operations in the network model into first hardware execution commands and send them to AI chip 2 for storage; when the network model is to be run later, AI chip 2 replaces the false addresses in the first hardware execution commands with the real addresses corresponding to the real memory space to obtain second hardware execution commands, which are then sent to its hardware devices for execution.
Based on the same inventive concept as the above data processing method for alleviating the problem that the processor incurs a large performance overhead and takes a long time each time it runs a network model, this embodiment of the present application further provides yet another data processing method applied to the scenario of running a network model for data processing; its principle is described below with reference to Figures 4 and 5. Compared with Figure 1, Figure 4 describes the method only from the perspective of executing hardware execution commands.
S10: When the network model needs to be run, obtain pre-stored hardware execution commands, corresponding to the network model, that the target hardware device is able to execute.
S20: Send the hardware execution commands to the target hardware device for execution, so that the target hardware device executes the hardware execution commands, thereby running the network model on the target hardware device.
To reduce the performance overhead of the processor and improve efficiency, the operations included in the network model can be translated in advance into hardware execution commands that the target hardware device is able to execute, and these commands can be stored (for example, using the aforementioned network execution graph). When the network model later needs to be run to process input data, the pre-stored hardware execution commands corresponding to the network model are obtained and provided to the target hardware device for execution.
By storing the hardware execution commands corresponding to the operations in the network model in advance, the hardware execution commands stored in the network execution graph can simply be distributed to the corresponding hardware for execution when the network model is subsequently run, with no need to translate the operations in the network model into hardware execution commands again. This solves the problem that the processor incurs a large performance overhead and takes a long time each time it runs the network model.
In one implementation, the corresponding hardware execution commands may be sent to the target hardware device for execution one by one in the execution order of the operations included in the network model. After the pre-stored hardware execution commands that the target hardware device is able to execute are obtained, they are sent in that order to the target hardware device, which executes them, thereby running the network model on the target hardware device so that it can process the input data.
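A minimal sketch of this replay idea, assuming a hypothetical `Device.submit` interface and a plain list standing in for the network execution graph:

```python
class Device:
    """Illustrative stand-in for a target hardware device."""
    def __init__(self):
        self.executed = []
    def submit(self, cmd):
        self.executed.append(cmd)

# Commands stored ahead of time, in the execution order of the operations.
execution_graph = ["cmd_conv", "cmd_relu", "cmd_fc"]

def run_model(graph, device):
    for cmd in graph:          # dispatch in the recorded execution order
        device.submit(cmd)

dev = Device()
run_model(execution_graph, dev)
assert dev.executed == execution_graph   # no re-translation was needed
```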
The principle and technical effects of the data processing method shown in Figure 4 are the same as those of the method embodiment shown in Figure 1. For brevity, for anything not mentioned in the embodiment of Figure 4, reference may be made to the corresponding content of the method embodiment shown in Figure 1.
Based on the same inventive concept as the above data processing method for alleviating the problem that the processor runs short of memory each time it generates hardware execution commands for a network model, in this embodiment of the present application, obtaining the pre-stored hardware execution commands, corresponding to the network model, that the target hardware device is able to execute when the network model needs to be run may include the following S110 and S120, as shown in Figure 5. The principle is described below with reference to Figure 5.
S110: When the network model needs to be used, load the network raw data corresponding to the network model into the real memory space, and obtain the pre-stored first hardware execution command.
When the network model needs to be used to process input data (for example, for image recognition or classification), the network raw data corresponding to the network model (which here includes the input data to be processed by the network model and the characteristic data of the network itself) is loaded into the real memory space, and the pre-stored first hardware execution command is obtained. The first hardware execution command is obtained by translating each operation included in the network model based on the false memory space, and the false memory space has the same attributes as the real memory space.
In this approach, each operation included in the network model needs to be translated in advance into a corresponding first hardware execution command based on the false memory space, and the command needs to be cached.
S120: Replace the false addresses in the first hardware execution command with the real addresses corresponding to the real memory space.
In this embodiment of the present application, after the false addresses in the first hardware execution command are replaced with the real addresses corresponding to the real memory space, the data processing method further includes: sending the replaced first hardware execution command to the corresponding hardware device.
Since the above first hardware execution command is generated based on the false memory space, the addresses in it are all false addresses. Therefore, when the network model is subsequently executed, the false addresses in the first hardware execution command need to be replaced with the real addresses corresponding to the real memory space, and the replaced first hardware execution command is sent, as a second hardware execution command, to the corresponding hardware device for execution.
When a certain network model no longer needs to be executed, the real addresses in the second hardware execution command can be replaced with the false addresses corresponding to the false memory space, so that the corresponding memory resources can be released; that is, the addresses in the first hardware execution command that had been replaced with real addresses are changed back to false addresses. In this case, after the replaced first hardware execution command has been sent as a second hardware execution command to the corresponding hardware device for execution, the command generation method may further include: when it is determined that the network model will not be executed within a preset time period starting from the current moment, replacing the real addresses in the replaced first hardware execution command (that is, the second hardware execution command) with the false addresses corresponding to the false memory space, and caching the hardware execution command whose addresses have been replaced with false addresses, where the false memory space has the same attributes and the same size as the real memory space.
This command generation process makes it possible to translate the network model in advance without reducing its processing efficiency, improves translation efficiency, and helps to improve the overall processing efficiency for the network model, thereby saving the performance overhead otherwise required each time the processor runs the network model.
Based on the same inventive concept as the above data processing method for alleviating the problem that the processor incurs a large performance overhead and takes a long time each time it runs a network model, this embodiment of the present application further provides a data processing apparatus 100. As shown in Figure 6, the data processing apparatus 100 may include: an acquisition module 110, a command generation module 120, and a storage module 130. The acquisition module 110 may also be referred to as a first acquisition module.
The acquisition module 110 may be configured to: acquire a computational graph of a network model to be run.
The command generation module 120 may be configured to: translate each operation in the computational graph of the network model into a hardware execution command that a corresponding target hardware device is able to execute, the hardware execution command containing device information of the target hardware device.
The storage module 130 may be configured to: store the hardware execution commands using a network execution graph, where the network execution graph is used to record all hardware execution commands generated for the network model, and the target hardware device is able to run the network model by executing the hardware execution commands in the network execution graph.
Optionally, the command generation module 120 may be configured to: compile the source code of each operation in the computational graph of the network model into instructions using a preset first API function, and obtain, using a preset second API function, the related information required by the target hardware device to perform each operation; and generate the hardware execution command, using a preset third API function, according to the instructions corresponding to each operation and the related information required to perform each operation. The storage module 130 may be configured to: store the hardware execution commands corresponding to the operations into the network execution graph one by one in the execution order of the operations included in the network model, and record key information of each hardware execution command, the key information being used to obtain the hardware execution command.
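The three preset API functions are not named in the text; the placeholder callables below only illustrate the shape of the pipeline (compile each operation, query execution information, assemble the command) and are assumptions for illustration.

```python
def compile_op(source):
    # Preset first API function: operation source code -> instructions.
    return f"insns({source})"

def query_device_info(device, op):
    # Preset second API function: related info the device needs for the op.
    return {"device": device, "op": op}

def build_command(insns, info):
    # Preset third API function: instructions + related info -> command.
    return {"insns": insns, **info}

cmd = build_command(compile_op("conv2d"), query_device_info("dev0", "conv2d"))
assert cmd["device"] == "dev0"   # command carries the device information
```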
Optionally, the data processing apparatus 100 may further include a sending module.
The acquisition module 110 may also be configured to: acquire, when the network model needs to be run, the hardware execution commands pre-stored in the network execution graph.
The sending module may also be configured to: send the hardware execution commands to the target hardware device for execution, so that the target hardware device executes the hardware execution commands, thereby running the network model on the target hardware device.
Optionally, the sending module may be configured to: modify the read address used to obtain input data in the hardware execution command, and/or modify the write address used to store output data in the hardware execution command; and send the modified hardware execution command to the target hardware device for execution, so that the target hardware device executes the modified hardware execution command, thereby running the network model on the target hardware device to process the input data.
Optionally, the data processing apparatus 100 may further include a replication module configured to: replicate the hardware execution command according to the total number of hardware devices in the AI chip; and modify the device information contained in the replicated hardware execution commands according to the device information of the hardware devices in the AI chip other than the target hardware device, to obtain hardware execution commands with modified device information, where a hardware execution command with modified device information can be provided to one of the other hardware devices for execution.
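The replication module's copy-and-rewrite behavior can be sketched as follows; the command fields and device identifiers are assumptions for illustration.

```python
import copy

def replicate(cmd, all_devices):
    """Copy a command once per device, rewriting the embedded device info."""
    cmds = []
    for dev in all_devices:
        c = copy.deepcopy(cmd)   # keep the original command untouched
        c["device"] = dev        # modify the copied device information
        cmds.append(c)
    return cmds

base = {"device": "dev0", "insns": "insns(conv2d)"}
copies = replicate(base, ["dev0", "dev1", "dev2"])
assert [c["device"] for c in copies] == ["dev0", "dev1", "dev2"]
```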
Optionally, the replication module may also be configured to: determine, according to the amount of data to be processed, a first number of hardware devices currently required to run the network model.
Based on the same inventive concept as the above data processing method for alleviating the problem that the processor runs short of memory each time it generates hardware execution commands for a network model, the command generation module 120 in this embodiment of the present application may include: an allocation module 121 and a translation module 122.
The allocation module 121 may be configured to allocate a corresponding false memory space for the network model.
The translation module 122 may be configured to: translate, based on the false memory space, each operation included in the network model into a corresponding first hardware execution command, where the addresses in the first hardware execution command are all false addresses and the false memory space has the same attributes as the real memory space.
If there are multiple network models, the translation module 122 may be configured to translate, for each network model and based on a different false memory space, the operations included in that network model into corresponding first hardware execution commands, different network models corresponding to different false memory spaces.
In this embodiment of the present application, optionally, the storage module 130 may also be configured to store the first hardware execution command, which, after address replacement, is provided to the hardware device that needs to run the network model for execution.
Optionally, the allocation module 121 may be configured to: allocate, according to the size of the data required to execute the network model, a false memory space corresponding to that data size.
Optionally, the command generation module 120 may further include a judgment module configured to: determine whether the network model will be executed within a preset time period starting from the current moment. When it is determined that the network model will not be executed within the preset time period starting from the current moment, the translation module 122 may be configured to: translate, based on the false memory space, each operation included in the network model into a corresponding first hardware execution command.
When it is determined that the network model is to be executed within the preset time period starting from the current moment, the translation module 122 may also be configured to: translate, based on the real memory space, each operation included in the network model into a corresponding second hardware execution command, where the addresses contained in the second hardware execution command are all real addresses and the real memory space stores the data required to execute the network model. The storage module 130 may also be configured to store the second hardware execution command.
Optionally, the command generation module 120 may also include an acquisition module and a sending module. The acquisition module may be configured to: load, when the network model is executed, the data required to execute the network model into the real memory space. The translation module 122 may also be configured to: replace the false addresses in the first hardware execution command with the real addresses corresponding to the real memory space. The sending module may be configured to: send the replaced first hardware execution command, as a second hardware execution command, to the corresponding hardware device.
The translation module 122 may also be configured to: replace, when it is determined that the network model will not be executed within a preset time period starting from the current moment, the real addresses in the replaced first hardware execution command with the false addresses corresponding to the false memory space.
Optionally, the translation module 122 may be configured to: examine the first hardware execution commands to identify those containing false addresses; and replace the false addresses in the identified first hardware execution commands with the real addresses corresponding to the real memory space.
The translation module 122 may be configured to: compile the source code of each operation included in the network model into instructions, and obtain, based on the false memory space, the related information required to perform each operation included in the network model; and generate the first hardware execution command according to the instructions corresponding to each operation and the related information required to perform each operation.
The implementation principle and technical effects of the command generation module 120 provided in this embodiment of the present application are the same as those of the foregoing method embodiments. For brevity, for anything not mentioned in the apparatus embodiment, reference may be made to the corresponding content of the foregoing method embodiments.
The processes performed by the modules of the above command generation module make it possible to translate the network model in advance without reducing its processing efficiency, improve translation efficiency, and help to improve the overall processing efficiency for the network model, thereby saving the performance overhead otherwise required each time the processor runs the network model.
Based on the same inventive concept as the above data processing method of mitigating the problem that the processor incurs a large, time-consuming performance overhead each time it runs a network model, an embodiment of the present application further provides another data processing apparatus 200 applied to scenarios in which a network model is run to process data. As shown in FIG. 8, the data processing apparatus 200 includes an acquisition module 210 and a sending module 220. The acquisition module 210 may also be referred to as a second acquisition module.

The acquisition module 210 may be configured to: when the network model needs to be run, acquire a pre-stored hardware execution command that the target hardware device corresponding to the network model is able to execute. The sending module 220 may be configured to: send the hardware execution command to the target hardware device for execution, so that the target hardware device executes the hardware execution command, thereby running the network model on the target hardware device to process the input data.
Based on the same inventive concept as the above data processing method of mitigating the problem that the processor runs out of memory each time it generates hardware execution commands for a network model, as shown in FIG. 9, the acquisition module 210 may include a first hardware execution command acquisition module 211 and a translation module 212.

The first hardware execution command acquisition module 211 may be configured to: when the network model needs to be executed, load the raw network data corresponding to the network model into the real memory space and acquire the pre-stored first hardware execution command. The first hardware execution command is obtained by translating each operation included in the network model based on the fake memory space, and the fake memory space has the same properties as the real memory space.

The translation module 212 may be configured to: replace the fake addresses in the first hardware execution command with the real addresses corresponding to the real memory space.

The sending module 220 may further be configured to: send the replaced first hardware execution command to the corresponding hardware device.

The translation module 212 may further be configured to: when it is determined that the network model will not be executed within a preset time period starting from the current moment, replace the real addresses in the replaced first hardware execution command with the fake addresses corresponding to the fake memory space, the fake memory space having the same properties as the real memory space.
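The two address swaps performed by the translation module 212 can be pictured as simple rebasing, as in this hedged sketch (the command layout and the base addresses are invented for illustration):

```python
# Illustrative sketch: swapping commands between fake and real addresses
# by rebasing their offsets. The bases and command shape are hypothetical.
FAKE_BASE = 0xF000_0000   # fake memory space (same properties as real DDR)
REAL_BASE = 0x4000_0000   # real memory space holding the loaded network data


def patch_to_real(commands):
    """Before execution: replace each fake address with the real address
    at the same offset in the real memory space."""
    return [{**c, "addr": REAL_BASE + (c["addr"] - FAKE_BASE)} for c in commands]


def patch_to_fake(commands):
    """When the model will not run within the preset period: restore fake
    addresses so the real memory space can be released."""
    return [{**c, "addr": FAKE_BASE + (c["addr"] - REAL_BASE)} for c in commands]


first_cmds = [{"op": "conv", "addr": FAKE_BASE},
              {"op": "fc", "addr": FAKE_BASE + 1024}]
real_cmds = patch_to_real(first_cmds)
```

Since the fake and real spaces share the same layout, the two patches are exact inverses of each other.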
The acquisition module 210 provided in this embodiment of the present application has the same implementation principle and technical effects as the foregoing method embodiments. For brevity, for matters not mentioned in this apparatus embodiment, reference may be made to the corresponding content of the foregoing method embodiments. The modules in the acquisition module 210 and the modules in the aforementioned command generation module 120 may be integrated together or used independently.

The process performed by the modules of the acquisition module above makes it possible to translate the network model in advance without reducing its processing efficiency, improves translation efficiency, and helps raise the processing efficiency of the network model, thereby saving the performance overhead the processor would otherwise incur each time the network model is run. The data processing apparatus 100 or data processing apparatus 200 provided in this embodiment of the present application has the same implementation principle and technical effects as the foregoing method embodiments; for brevity, for matters not mentioned in this apparatus embodiment, reference may be made to the corresponding content of the foregoing method embodiments.
Based on the same inventive concept as the above data processing method of mitigating the problem that the processor incurs a large, time-consuming performance overhead each time it runs a network model, an embodiment of the present application further provides an AI chip, which may include a kernel and a storage device. The AI chip can be used to perform the aforementioned data processing method.

The kernel is used to obtain the computation graph of the network model to be run, and to translate each operation in the computation graph of the network model into hardware execution commands executable by the target hardware device, the hardware execution commands containing the device information of the target hardware device.

A driver is deployed in the kernel; the driver translates each operation in the computation graph of the network model into hardware execution commands executable by the target hardware device and sends the hardware execution commands to the storage device.

The kernel may use a preset first API function to compile the source code of each operation in the computation graph of the network model into instructions, use a preset second API function to obtain the related information required by the target hardware device to execute each operation, and use a preset third API function to generate the hardware execution commands from the instructions corresponding to each operation and the related information required to execute each operation. The storage device may be configured to store the hardware execution commands using a network execution graph, where the network execution graph records the hardware execution commands, and the hardware execution commands are used to run the network model.

In an optional implementation, the storage device may store the hardware execution command corresponding to each operation into the network execution graph in sequence, following the execution order of the operations included in the network model, and record the key information of each hardware execution command, the key information being used to retrieve the hardware execution command.
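The ordered storage plus key information described here might look like the following sketch; the graph class and its key scheme are assumptions made for illustration, not the application's actual data structure.

```python
# Illustrative sketch: a network execution graph that stores commands in
# the model's execution order and records key information for retrieval.
class NetworkExecutionGraph:
    def __init__(self):
        self.commands = []   # hardware execution commands, in execution order
        self.key_info = {}   # key information: here, operation name -> index

    def append(self, op_name, command):
        """Store the next operation's command and record its key info."""
        self.key_info[op_name] = len(self.commands)
        self.commands.append(command)

    def get(self, op_name):
        """Use the key information to fetch a stored command."""
        return self.commands[self.key_info[op_name]]


graph = NetworkExecutionGraph()
graph.append("conv", "CMD_CONV")
graph.append("relu", "CMD_RELU")
```

Replaying `graph.commands` in order then corresponds to running the whole model without re-translating it.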
Based on the same inventive concept as the above data processing method of mitigating the problem that the processor runs out of memory each time it generates hardware execution commands for a network model, in an optional implementation the kernel in the embodiment of the present application may further be configured to: allocate a corresponding fake memory space for the network model, and, based on the fake memory space, translate each operation included in the network model into a corresponding first hardware execution command, where the addresses in the first hardware execution command are all fake addresses, and the fake memory space has the same properties as the real memory space.

A driver is deployed in the kernel; the driver translates each operation included in the network model into first hardware execution commands and sends the first hardware execution commands to the storage device for storage.

The kernel may further be configured to: compile the source code of each operation included in the network model into instructions, and obtain, based on the fake memory space, the related information required to execute each operation included in the network model; and generate the first hardware execution command from the instructions corresponding to each operation and the related information required to execute each operation.

The storage device may also be configured to store the first hardware execution command, which, after address replacement, is provided for execution to the hardware device that needs to run the network model.

Optionally, the kernel may be configured to: allocate, according to the size of the data required to execute the network model, a fake memory space corresponding to that data size.

Optionally, before translating each operation included in the network model into the corresponding first hardware execution command based on the fake memory space, the kernel may further be configured to: determine whether the network model will be executed within a preset time period starting from the current moment, and only when it is determined that the network model will not be executed within that period, translate each operation included in the network model into the corresponding first hardware execution command based on the fake memory space.

When it is determined that the network model will be executed within the preset time period starting from the current moment, the kernel is further used to translate, based on the real memory space, each operation included in the network model into a corresponding second hardware execution command, where the addresses contained in the second hardware execution command are all real addresses, and the real memory space stores the data required to execute the network model. In this case, the storage device may also be configured to store the second hardware execution command.
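The choice between the two translation paths can be sketched as a small dispatch function; the names and the time representation are hypothetical.

```python
# Illustrative sketch: deciding whether translation should target the fake
# or the real memory space, per the preset time window described above.
def choose_memory_space(seconds_until_next_run, preset_window_seconds):
    """Return which memory space the translation should target.

    A run due within the preset window is translated against real memory
    (second hardware execution commands, real addresses); otherwise the
    model is pre-translated against fake memory (first hardware execution
    commands, fake addresses), holding no real DDR in the meantime.
    """
    if seconds_until_next_run <= preset_window_seconds:
        return "real"   # -> second hardware execution command
    return "fake"       # -> first hardware execution command
```

For example, with a 60-second window, a model scheduled in 5 seconds would be translated against real memory, while one scheduled an hour out would be pre-translated against fake memory.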
Optionally, the kernel may further be configured to: when executing the network model, load the data required to execute the network model into the real memory space; replace the fake addresses in the first hardware execution command with the real addresses corresponding to the real memory space; and send the replaced first hardware execution command, as the second hardware execution command, to the corresponding hardware device.

Optionally, the kernel may be configured to: identify the first hardware execution commands, determine some or all of the first hardware execution commands that currently contain fake addresses as target commands, and replace the fake addresses in the target commands with the real addresses corresponding to the real memory space.

Optionally, the kernel may further be configured to: after sending the replaced first hardware execution command, as the second hardware execution command, to the corresponding hardware device for execution, replace the real addresses in the replaced first hardware execution command (that is, the second hardware execution command) with the fake addresses corresponding to the fake memory space, and cache the second hardware execution command whose addresses have been replaced with fake addresses.

Since the process of translating each operation in the network model into hardware execution commands and the process of executing the hardware execution commands may be implemented by different AI chips, after AI chip 1 obtains the hardware execution commands, it may store them in the storage device of AI chip 1, in the storage device of AI chip 2, or in a storage device shared by AI chip 1 and AI chip 2. Based on the same inventive concept as the above data processing method of mitigating the problem that the processor incurs a large, time-consuming performance overhead each time it runs a network model, an embodiment of the present application further provides an AI chip, which may include a hardware device, a kernel, and a storage device. The AI chip can be used to perform the aforementioned data processing method.
The storage device may be configured to store the hardware execution commands corresponding to each operation in the computation graph of the network model.

The kernel may be configured to: when the network model needs to be run, obtain the previously stored hardware execution commands from the storage device and send them to the hardware device.

The hardware device may be configured to: execute the hardware execution commands, thereby running the network model to process the input data.

Since the process of translating each operation in the network model into hardware execution commands and the process of executing them may be implemented by different AI chips, in one implementation this AI chip may receive hardware execution commands sent by another AI chip and execute them. In this case, the kernel is also used to receive the hardware execution commands sent by the other AI chip and store them for execution by the hardware device.
Based on the same inventive concept as the above data processing method of mitigating the problem that the processor runs out of memory each time it generates hardware execution commands for a network model, in an optional implementation the storage device in the embodiment of the present application may further be configured to store the first hardware execution command, where the first hardware execution command is obtained by translating each operation included in the network model based on the fake memory space, and the fake memory space has the same properties as the real memory space.

The kernel may further be configured to: when the network model needs to be executed, load the raw network data corresponding to the network model into the real memory space, obtain the first hardware execution command stored in the storage device, replace the fake addresses in the first hardware execution command with the real addresses corresponding to the real memory space, and send the replaced first hardware execution command to the hardware device.

The hardware device may further be configured to: execute the replaced first hardware execution command, thereby running the network model to process the input data.

The AI chip provided in this embodiment of the present application has the same implementation principle and technical effects as the foregoing method embodiments. For brevity, for matters not mentioned in the AI chip embodiment, reference may be made to the corresponding content of the foregoing method embodiments.
Based on the same inventive concept as the above data processing method of mitigating the problem that the processor incurs a large, time-consuming performance overhead each time it runs a network model, FIG. 10 shows a structural block diagram of an electronic device 300 provided in an embodiment of the present application. The electronic device 300 may include a transceiver 310, a memory 320, a communication bus 330, and a processor 340. The transceiver 310, the memory 320, and the processor 340 are electrically connected to one another, directly or indirectly, to enable data transmission or interaction; for example, these elements may be electrically connected through one or more communication buses 330 or signal lines. The transceiver 310 may be configured to transmit and receive data. The memory 320 may be configured to store a computer program, such as the software function modules shown in FIG. 6 to FIG. 9, that is, the data processing apparatus 100 of FIG. 6 or the data processing apparatus 200 of FIG. 8. The data processing apparatus 100 includes at least one software function module that may be stored in the memory 320 in the form of software or firmware, or solidified in the operating system (OS) of the electronic device 300. The processor 340 may be configured to execute the executable modules stored in the memory 320.

For example, when the processor 340 is configured to execute the software function modules or computer program included in the data processing apparatus 100, the processor 340 may be configured to: obtain the computation graph of the network model to be run; translate each operation in the computation graph of the network model into hardware execution commands executable by the target hardware device of the AI chip, the hardware execution commands containing the device information of the target hardware device; and store the hardware execution commands using a network execution graph, where the network execution graph records all hardware execution commands generated for the network model, and the target hardware device is able to run the network model by executing the hardware execution commands in the network execution graph.

When the processor 340 is configured to execute the software function modules or computer program included in the data processing apparatus 200, the processor 340 may be configured to: when the network model needs to be run, obtain the pre-stored hardware execution commands that the target hardware device corresponding to the network model is able to execute; and send the hardware execution commands to the target hardware device for execution, so that the target hardware device executes the hardware execution commands, thereby running the network model on the target hardware device to process the input data.

It can be understood that the electronic device 300 may also include two processors 340, one responsible for translating each operation in the network model into hardware execution commands and the other responsible for executing the hardware execution commands.
The memory 320 may be, but is not limited to, a random access memory (RAM), a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), or the like.

The processor 340 may be an integrated circuit chip with signal processing capabilities. The processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a graphics processing unit (GPU), a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and can implement or execute the methods, steps, and logic block diagrams disclosed in the embodiments of the present application. The general-purpose processor may be a microprocessor, or the processor 340 may be any conventional processor.

The aforementioned electronic device 300 includes, but is not limited to, a smartphone, a tablet, a computer, an industrial control computer, a vehicle-mounted device, a server, a smart wearable device, an edge box, and the like.
Based on the same inventive concept as the above data processing method of mitigating the problem that the processor runs out of memory each time it generates hardware execution commands for a network model, in an optional implementation the memory may be configured to store the network model, and may also be configured to store the raw data required to execute the network model, such as the input data to be processed and the feature data of the network itself.

The first processor may be configured to allocate a corresponding fake memory space for the network model, translate each operation included in the network model into corresponding first hardware execution commands, and store the first hardware execution commands. The electronic device may also include a central processing unit (CPU); the first processor may be a coprocessor that assists the CPU in data processing, such as a graphics processing unit (GPU) or a general-purpose graphics processing unit (General Purpose computing on Graphics Processing Units, GPGPU). Both the CPU and the first processor may be regarded as the aforementioned AI chip.

When the network model is to be executed, the first processor loads the data required to execute the network model into the real memory space of the first processor, replaces the fake addresses in the first hardware execution command with the real addresses corresponding to the real memory space, and sends the replaced first hardware execution command, as the second hardware execution command, to the corresponding hardware device for execution. Afterwards, when it is determined that the network model will not be executed within a preset time period starting from the current moment, the real addresses in the second hardware execution command may be replaced with the fake addresses corresponding to the fake memory space, and the second hardware execution command whose addresses have been replaced with fake addresses may be cached.
To better understand the principles of the present application, the command generation method provided in this application is compared below with a command generation method that does not use fake memory, with reference to the electronic device shown in FIG. 11.
Conventional process:

Step 1. Initially, the raw network data (including the input data to be processed, and possibly the feature data of the network itself) and the network model are stored in a storage device (which may be a disk).

Step 2. Before the network model can be translated into hardware execution commands, the raw network data and the data of the network model itself must be loaded into the CPU's DDR (Double Data Rate synchronous dynamic random access memory). According to the CPU DDR space occupied by the data, a real DDR space of the same size is allocated and locked in the dedicated DDR of the first processor, and through the cooperation of the CPU and the first processor, all the data stored in the CPU's DDR (including the input data) is moved into the DDR of the first processor.

Step 3. When translating the network model into hardware execution commands, the first processor, based on the allocated real DDR space, combines each operation operator of the network model with the DDR addresses of the feature data, the DDR addresses of the input data, and the DDR addresses for storing the operation results to produce a series of hardware execution commands.

Step 4. These hardware execution commands are then executed directly.
A process using the command generation method shown in this application may include:

Step 1. Initially, the raw network data (including the input data, and possibly the feature data of the network itself) and the network model may likewise be stored in a storage device (which may be a disk).

Step 2. Before the network model needs to be translated into first hardware execution commands, a fake memory space corresponding to the size of the data required to execute the network model is allocated, and the network model is loaded into the DDR of the first processor.

Step 3. When translating the network model into first hardware execution commands, the first processor, based on the allocated fake DDR space (fake memory space), combines each operation operator of the network model with the fake DDR addresses of the feature data, the fake DDR addresses of the input data, and the fake DDR addresses for storing the operation results to produce a series of first hardware execution commands, and stores these hardware execution commands.

Step 4. When the network model is subsequently executed, the raw network data is loaded into the CPU's DDR; according to the DDR space occupied by the data, a DDR space of the same size (real memory space) is allocated in the DDR of the first processor; through the cooperation of the CPU and the first processor, all the data in the CPU's DDR is moved into the DDR of the first processor; then the fake addresses in the first hardware execution commands are replaced with the real addresses corresponding to the allocated real memory space, and the replaced first hardware execution commands are sent to the corresponding hardware device.
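Putting these steps together, a minimal end-to-end sketch of the proposed flow might look like this; all addresses, sizes, and names are invented for illustration.

```python
# Illustrative end-to-end sketch of steps 2-4: pre-translate against fake
# memory, then patch to real addresses only when the model actually runs.
FAKE_BASE, REAL_BASE = 0xF000_0000, 0x4000_0000


def pre_translate(operations):
    """Steps 2-3: emit first hardware execution commands over fake memory,
    holding no real DDR space."""
    cmds, offset = [], 0
    for name, size in operations:
        cmds.append({"op": name, "addr": FAKE_BASE + offset})
        offset += size
    return cmds


def run(first_cmds):
    """Step 4: after the data is loaded into real memory, patch each fake
    address to its real counterpart and dispatch the commands."""
    patched = [{**c, "addr": REAL_BASE + (c["addr"] - FAKE_BASE)} for c in first_cmds]
    return patched  # in the real flow these go to the hardware device


stored = pre_translate([("conv", 2048), ("fc", 512)])  # no real DDR held yet
dispatched = run(stored)                               # real DDR needed only now
```

This mirrors the key point of the comparison: `pre_translate` can be repeated for many models without exhausting DDR, while `run` is invoked only for the model that actually executes.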
When multiple different network models need to be executed on the electronic device and the DDR of the CPU and of the first processor is limited, the situation is as follows:

Conventional process: steps 1, 2, and 3 are executed repeatedly, which quickly fills up the DDR.

With the command generation method shown in this application: steps 1 and 2 (and 3) above are executed repeatedly, so all the required hardware execution commands can be generated while occupying almost no DDR of the CPU or the first processor; when a network model needs to be executed, the corresponding step 4 is then performed.

It should be noted that the process shown in FIG. 11 is only one of many embodiments; the process of translating each operation included in the network model into corresponding hardware execution commands may also be performed by the aforementioned CPU. The embodiment of the present application further provides a non-volatile computer-readable storage medium (hereinafter, storage medium) on which a computer program is stored; when the computer program is run by a computer, such as the aforementioned electronic device 300, it performs the data processing method described above. The aforementioned computer-readable storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
需要说明的是,本说明书中的各个实施例均采用递进的方式描述,每个实施例重点说明的都是与其他实施例的不同之处,各个实施例之间相同相似的部分互相参见即可。It should be noted that the various embodiments in this specification are described in a progressive manner, and each embodiment focuses on the differences from other embodiments. The same or similar parts between the various embodiments can be referenced to each other.
另外,在本申请各个实施例中的各功能模块可以集成在一起形成一个独立的部分,也可以是各个模块单独存在,也可以两个或两个以上模块集成形成一个独立的部分。In addition, the functional modules in the various embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
所述功能如果以软件功能模块的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个计算机可读存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,笔记本电脑,服务器,或者电子设备等)执行本申请各个实施例所述方法的全部或部分步骤。If the functions are implemented in the form of software function modules and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, or the part that contributes to the prior art, or the part of the technical solution, can be embodied in the form of a software product, which is stored in a computer-readable storage medium and includes several instructions for enabling a computer device (which can be a personal computer, a laptop, a server, or an electronic device, etc.) to perform all or part of the steps of the methods described in the various embodiments of the present application.
以上所述，仅为本申请的具体实施方式，但本申请的保护范围并不局限于此，任何熟悉本技术领域的技术人员在本申请揭露的技术范围内，可轻易想到变化或替换，都应涵盖在本申请的保护范围之内。因此，本申请的保护范围应以权利要求的保护范围为准。The above is only a specific implementation of the present application, but the protection scope of the present application is not limited thereto; any person skilled in the art can easily conceive of changes or substitutions within the technical scope disclosed in the present application, and such changes or substitutions shall be covered by the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
工业实用性Industrial Applicability
本申请涉及一种数据处理方法、装置、AI芯片、电子设备及存储介质，属于数据处理技术领域。该数据处理方法包括：获取待运行的网络模型的计算图；将网络模型的计算图中的各个操作翻译成AI芯片的目标硬件设备能够执行的硬件执行命令；利用网络执行图存储硬件执行命令。通过将网络模型的计算图中的各个操作翻译成对应的目标硬件设备能够执行的硬件执行命令，并存储起来，使得后续每次需要运行该网络模型时，直接将事先存储的硬件执行命令分发给对应的硬件执行，不需要重新将该网络模型的计算图中的各个操作翻译成硬件执行命令，从而改善处理器每次运行网络模型时都需要很大的性能开销，且耗时长的问题。The present application relates to a data processing method and apparatus, an AI chip, an electronic device, and a storage medium, and belongs to the field of data processing technology. The data processing method includes: obtaining a computation graph of a network model to be run; translating each operation in the computation graph of the network model into hardware execution commands executable by a target hardware device of the AI chip; and storing the hardware execution commands using a network execution graph. By translating each operation in the computation graph of the network model into hardware execution commands executable by the corresponding target hardware device and storing them, whenever the network model subsequently needs to be run, the pre-stored hardware execution commands are distributed directly to the corresponding hardware for execution, without re-translating each operation in the computation graph into hardware execution commands, thereby alleviating the problem that the processor incurs a large performance overhead and a long delay each time it runs the network model.
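A further aspect summarized in the claims (claims 8–15) is that commands may first be generated against a "false" (placeholder) memory space, so that no real memory needs to be reserved at generation time, and the real addresses are patched in only when the model actually runs. The following is an illustrative sketch only, assuming a flat address layout; `FAKE_BASE`, `translate_with_fake_addresses`, and `patch_real_addresses` are hypothetical names, not APIs from the application.

```python
FAKE_BASE = 0xF000_0000   # placeholder base of the false memory space (assumed value)

def translate_with_fake_addresses(ops):
    """Generate 'first hardware execution commands' whose addresses all
    point into the placeholder memory space; real DDR is not touched."""
    cmds, offset = [], 0
    for op in ops:
        cmds.append({"op": op["name"], "addr": FAKE_BASE + offset})
        offset += op["size"]
    return cmds

def patch_real_addresses(cmds, real_base):
    """Once a real memory space is allocated and the model's data loaded,
    replace each false address with the matching real address, yielding
    the 'second hardware execution commands' ready for dispatch."""
    return [{"op": c["op"], "addr": real_base + (c["addr"] - FAKE_BASE)}
            for c in cmds]

ops = [{"name": "conv1", "size": 0x100}, {"name": "fc1", "size": 0x40}]
first_cmds = translate_with_fake_addresses(ops)
second_cmds = patch_real_addresses(first_cmds, real_base=0x2000_0000)
```

Because the false memory space has the same layout properties as a real one, the patched commands need only a base-address substitution, which is far cheaper than re-translating the whole computation graph.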
此外，可以理解的是，本申请的数据处理方法、装置、AI芯片、电子设备及存储介质是可以重现的，并且可以用在多种工业应用中。例如，本申请的数据处理方法、装置、AI芯片、电子设备及存储介质可以用于需要降低处理器的性能开销和提高数据处理时的效率的任何装置。In addition, it can be understood that the data processing method and apparatus, AI chip, electronic device, and storage medium of the present application are reproducible and can be used in a variety of industrial applications. For example, they can be used in any apparatus that needs to reduce the performance overhead of the processor and improve the efficiency of data processing.

Claims (32)

  1. 一种数据处理方法,其中,所述数据处理方法包括:A data processing method, wherein the data processing method comprises:
    获取待运行的网络模型的计算图;Get the computational graph of the network model to be run;
    将所述网络模型的计算图中的各个操作翻译成AI芯片的目标硬件设备能够执行的硬件执行命令,所述硬件执行命令中包含所述目标硬件设备的设备信息;Translate each operation in the computation graph of the network model into a hardware execution command executable by the target hardware device of the AI chip, wherein the hardware execution command includes device information of the target hardware device;
    利用网络执行图存储所述硬件执行命令,其中,所述网络执行图用于记录为所述网络模型生成的所有硬件执行命令,所述目标硬件设备用于通过执行所述网络执行图中的硬件执行命令来运行所述网络模型。The hardware execution commands are stored using a network execution graph, wherein the network execution graph is used to record all hardware execution commands generated for the network model, and the target hardware device is used to run the network model by executing the hardware execution commands in the network execution graph.
  2. 根据权利要求1所述的数据处理方法,其中,将所述网络模型的计算图中包含的各个操作翻译成AI芯片的目标硬件设备能够执行的硬件执行命令,包括:The data processing method according to claim 1, wherein translating each operation contained in the computation graph of the network model into a hardware execution command executable by a target hardware device of the AI chip comprises:
    利用预设第一API函数将所述网络模型的计算图中的各个操作的源代码编译成指令,并利用预设第二API函数获得目标硬件设备执行各个操作所需的相关信息;Compile the source code of each operation in the computation graph of the network model into instructions using a preset first API function, and obtain relevant information required for the target hardware device to perform each operation using a preset second API function;
    利用预设第三API函数根据各个操作的对应的指令与执行各个操作所需的相关信息,生成所述硬件执行命令。The hardware execution command is generated by using the preset third API function according to the corresponding instructions of each operation and the relevant information required to execute each operation.
  3. 根据权利要求1或2所述的数据处理方法,其中,利用网络执行图存储所述硬件执行命令,包括:The data processing method according to claim 1 or 2, wherein storing the hardware execution command using a network execution graph comprises:
    按照所述网络模型中包含的各个操作的执行顺序,依次将各个操作对应的硬件执行命令存储至所述网络执行图中,并记录每个硬件执行命令的关键信息,所述关键信息用于获取所述硬件执行命令。According to the execution order of each operation contained in the network model, the hardware execution command corresponding to each operation is stored in the network execution graph in sequence, and the key information of each hardware execution command is recorded, and the key information is used to obtain the hardware execution command.
  4. 根据权利要求1至3中任一项所述的数据处理方法,其中,所述数据处理方法还包括:The data processing method according to any one of claims 1 to 3, wherein the data processing method further comprises:
    在需要运行所述网络模型时,获取预先存储在所述网络执行图中的所述硬件执行命令;When the network model needs to be run, obtaining the hardware execution command pre-stored in the network execution graph;
    将所述硬件执行命令发送给所述目标硬件设备执行,以使所述目标硬件设备执行所述硬件执行命令,以实现在所述目标硬件设备上运行所述网络模型。The hardware execution command is sent to the target hardware device for execution, so that the target hardware device executes the hardware execution command, thereby realizing running the network model on the target hardware device.
  5. 根据权利要求4所述的数据处理方法,其中,将所述硬件执行命令发送给所述目标硬件设备执行,包括:The data processing method according to claim 4, wherein sending the hardware execution command to the target hardware device for execution comprises:
    修改所述硬件执行命令中用于获取输入数据的读地址,以及/或者修改所述硬件执行命令中用于存储输出数据的写地址;Modifying a read address in the hardware execution command for obtaining input data, and/or modifying a write address in the hardware execution command for storing output data;
    将修改后的硬件执行命令发送给所述目标硬件设备执行,以使所述目标硬件设备执行修改后的所述硬件执行命令,以实现在所述目标硬件设备上运行所述网络模型来对输入数据进行处理。The modified hardware execution command is sent to the target hardware device for execution, so that the target hardware device executes the modified hardware execution command, thereby running the network model on the target hardware device to process the input data.
  6. 根据权利要求1至5中任一项所述的数据处理方法,其中,所述数据处理方法还包括:The data processing method according to any one of claims 1 to 5, wherein the data processing method further comprises:
    根据所述AI芯片中的硬件设备的总数量,对所述硬件执行命令进行复制;Copying the hardware execution command according to the total number of hardware devices in the AI chip;
    根据所述AI芯片中除所述目标硬件设备外的其他硬件设备的设备信息,对复制出的所述硬件执行命令中包含的设备信息进行修改,得到修改过设备信息的硬件执行命令,其中,修改过设备信息的硬件执行命令能够被提供给所述其他硬件设备执行。According to the device information of other hardware devices in the AI chip except the target hardware device, the device information contained in the copied hardware execution command is modified to obtain the hardware execution command with the modified device information, wherein the hardware execution command with the modified device information can be provided to the other hardware devices for execution.
  7. 根据权利要求6所述的数据处理方法,其中,所述数据处理方法还包括:The data processing method according to claim 6, wherein the data processing method further comprises:
    根据待处理的数据量,确定当前需要运行所述网络模型的硬件设备的第一数量。According to the amount of data to be processed, a first number of hardware devices currently required to run the network model is determined.
  8. 根据权利要求1至7中任一项所述的数据处理方法,其中,将所述网络模型的计算图中的各个操作翻译成AI芯片的目标硬件设备能够执行的硬件执行命令,包括:The data processing method according to any one of claims 1 to 7, wherein translating each operation in the computation graph of the network model into a hardware execution command executable by a target hardware device of the AI chip comprises:
    为所述网络模型分配对应的虚假内存空间;以及Allocating corresponding false memory space for the network model; and
    基于所述虚假内存空间,将所述网络模型中包含的各个操作翻译成对应的第一硬件执行命令,所述第一硬件执行命令中的地址均为虚假地址,所述虚假内存空间与真实内存空间具备相同属性。Based on the virtual memory space, each operation included in the network model is translated into a corresponding first hardware execution command, the addresses in the first hardware execution command are all virtual addresses, and the virtual memory space has the same properties as the real memory space.
  9. 根据权利要求8所述的数据处理方法，其中，利用网络执行图存储所述硬件执行命令，包括：在基于所述虚假内存空间，将所述网络模型中包含的各个操作翻译成对应的第一硬件执行命令之后，存储所述第一硬件执行命令，所述第一硬件执行命令用于在被进行地址替换后提供给需要运行所述网络模型的硬件设备执行。The data processing method according to claim 8, wherein storing the hardware execution command using a network execution graph comprises: after translating each operation contained in the network model into a corresponding first hardware execution command based on the virtual memory space, storing the first hardware execution command, wherein the first hardware execution command is provided, after address replacement, to a hardware device that needs to run the network model for execution.
  10. 根据权利要求8或9所述的数据处理方法,其中,为所述网络模型分配对应的虚假内存空间,包括:The data processing method according to claim 8 or 9, wherein allocating a corresponding virtual memory space to the network model comprises:
    根据执行所述网络模型所需的数据大小,分配与所述数据大小对应的虚假内存空间。According to the data size required for executing the network model, a virtual memory space corresponding to the data size is allocated.
  11. 根据权利要求8至10中任一项所述的数据处理方法,其中,在基于所述虚假内存空间,将所述网络模型中包含的各个操作翻译成对应的第一硬件执行命令之前,所述数据处理方法还包括:The data processing method according to any one of claims 8 to 10, wherein, before translating each operation included in the network model into a corresponding first hardware execution command based on the virtual memory space, the data processing method further comprises:
    判断从当前时刻开始后的预设时间段内是否执行所述网络模型;Determining whether the network model is executed within a preset time period starting from the current moment;
    在确定从当前时刻开始后的预设时间段内不执行所述网络模型时，执行步骤：基于所述虚假内存空间，将所述网络模型中包含的各个操作翻译成对应的第一硬件执行命令。When it is determined that the network model will not be executed within the preset time period starting from the current moment, executing the step of: translating each operation included in the network model into a corresponding first hardware execution command based on the virtual memory space.
  12. 根据权利要求11所述的数据处理方法,其中,所述数据处理方法还包括:The data processing method according to claim 11, wherein the data processing method further comprises:
    在确定从当前时刻开始后的预设时间段内要执行所述网络模型时，基于所述真实内存空间，将所述网络模型中包含的各个操作翻译成对应的第二硬件执行命令，其中，所述第二硬件执行命令中包含的地址均为真实地址，所述真实内存空间用于存储执行所述网络模型时所需的数据。When it is determined that the network model is to be executed within the preset time period starting from the current moment, translating each operation included in the network model into a corresponding second hardware execution command based on the real memory space, wherein the addresses included in the second hardware execution command are all real addresses, and the real memory space is used to store the data required for executing the network model.
  13. 根据权利要求9至12中任一项所述的数据处理方法,其中,在存储所述第一硬件执行命令之后,所述数据处理方法还包括:The data processing method according to any one of claims 9 to 12, wherein, after storing the first hardware execution command, the data processing method further comprises:
    在需要执行所述网络模型时,将执行所述网络模型所需的数据加载到所述真实内存空间;When the network model needs to be executed, the data required for executing the network model is loaded into the real memory space;
    利用所述真实内存空间对应的真实地址替换掉所述第一硬件执行命令中的虚假地址;Replacing the false address in the first hardware execution command with the real address corresponding to the real memory space;
    将替换后的第一硬件执行命令作为第二硬件执行命令,发送给对应的硬件设备,以供所述对应的硬件设备执行所述第二硬件执行命令。The replaced first hardware execution command is sent to the corresponding hardware device as the second hardware execution command, so that the corresponding hardware device executes the second hardware execution command.
  14. 根据权利要求13所述的数据处理方法,其中,利用所述真实内存空间对应的真实地址替换掉所述第一硬件执行命令中的虚假地址,包括:The data processing method according to claim 13, wherein replacing the false address in the first hardware execution command with the real address corresponding to the real memory space comprises:
    对所述第一硬件执行命令进行识别,确定出当前包含虚假地址的部分或全部第一硬件执行命令,作为目标命令;Identifying the first hardware execution command, and determining part or all of the first hardware execution command currently containing the false address as the target command;
    利用所述真实内存空间对应的真实地址,替换掉所述目标命令中的虚假地址。The real address corresponding to the real memory space is used to replace the false address in the target command.
  15. 根据权利要求13或14所述的数据处理方法,其中,在将替换后的第一硬件执行命令作为第二硬件执行命令,发送给对应的硬件设备执行之后,所述数据处理方法还包括:The data processing method according to claim 13 or 14, wherein after sending the replaced first hardware execution command as the second hardware execution command to the corresponding hardware device for execution, the data processing method further comprises:
    将所述第二硬件执行命令中的真实地址替换为所述虚假内存空间对应的虚假地址,并将地址替换为虚假地址的命令进行缓存。The real address in the second hardware execution command is replaced with the false address corresponding to the false memory space, and the command in which the address is replaced with the false address is cached.
  16. 根据权利要求8至15任一项所述的数据处理方法,其中,基于所述虚假内存空间,将所述网络模型中包含的各个操作翻译成对应的第一硬件执行命令,包括:The data processing method according to any one of claims 8 to 15, wherein, based on the virtual memory space, translating each operation included in the network model into a corresponding first hardware execution command comprises:
    将所述网络模型中包含的各个操作的源代码编译成各操作分别对应的指令,并基于所述虚假内存空间,获得执行所述网络模型中包含的各个操作所需的相关信息,所述相关信息包括地址信息;以及Compiling source codes of various operations included in the network model into instructions corresponding to the various operations, and obtaining relevant information required to execute various operations included in the network model based on the virtual memory space, wherein the relevant information includes address information; and
    根据各个操作的对应的指令与执行各个操作所需的相关信息,生成所述第一硬件执行命令。The first hardware execution command is generated according to the corresponding instructions of each operation and the relevant information required to execute each operation.
  17. 一种数据处理方法,其中,所述数据处理方法包括:A data processing method, wherein the data processing method comprises:
    在需要运行网络模型时,获取预先存储的所述网络模型对应的目标硬件设备能够执行的硬件执行命令;When the network model needs to be run, a pre-stored hardware execution command executable by a target hardware device corresponding to the network model is obtained;
    将所述硬件执行命令发送给所述目标硬件设备,以使所述目标硬件设备执行所述硬件执行命令,以实现在所述目标硬件设备上运行所述网络模型对输入数据进行处理的目的。The hardware execution command is sent to the target hardware device so that the target hardware device executes the hardware execution command, thereby achieving the purpose of running the network model on the target hardware device to process the input data.
  18. 根据权利要求17所述的数据处理方法,其中,在需要运行网络模型时,获取预先存储的所述网络模型对应的目标硬件设备能够执行的硬件执行命令,包括:The data processing method according to claim 17, wherein, when it is necessary to run the network model, obtaining the pre-stored hardware execution command executable by the target hardware device corresponding to the network model comprises:
    在需要执行所述网络模型时,将所述网络模型对应的网络原始数据加载到真实内存空间,获取预先存储的第一硬件执行命令,其中,所述第一硬件执行命令是基于虚假内存空间对所述网络模型中包含的各个操作进行翻译得到的,所述虚假内存空间与所述真实内存空间具备相同属性;以及When the network model needs to be executed, the network original data corresponding to the network model is loaded into the real memory space, and a pre-stored first hardware execution command is obtained, wherein the first hardware execution command is obtained by translating each operation included in the network model based on the virtual memory space, and the virtual memory space has the same attributes as the real memory space; and
    利用所述真实内存空间对应的真实地址替换掉所述第一硬件执行命令中的虚假地址。The false address in the first hardware execution command is replaced by the real address corresponding to the real memory space.
  19. 根据权利要求18所述的数据处理方法,其中,在利用所述真实内存空间对应的真实地址替换掉所述第一硬件执行命令中的虚假地址之后,所述数据处理方法还包括:将替换后的第一硬件执行命令作为第二硬件执行命令,发送给对应的硬件设备。According to the data processing method of claim 18, after replacing the false address in the first hardware execution command with the real address corresponding to the real memory space, the data processing method further includes: sending the replaced first hardware execution command as the second hardware execution command to the corresponding hardware device.
  20. 根据权利要求19所述的数据处理方法，其中，在将替换后的第一硬件执行命令作为第二硬件执行命令，发送给对应的硬件设备执行之后，所述数据处理方法还包括：将所述第二硬件执行命令中的真实地址替换为虚假内存空间对应的虚假地址，并缓存地址替换为虚假地址的第二硬件执行命令。The data processing method according to claim 19, wherein, after the replaced first hardware execution command is sent, as the second hardware execution command, to the corresponding hardware device for execution, the data processing method further comprises: replacing the real address in the second hardware execution command with a false address corresponding to the false memory space, and caching the second hardware execution command in which the address has been replaced with the false address.
  21. 一种数据处理装置,其中,所述数据处理装置包括:A data processing device, wherein the data processing device comprises:
    获取模块,所述获取模块被配置成用于:获取待运行的网络模型的计算图;An acquisition module, wherein the acquisition module is configured to: acquire a computational graph of a network model to be run;
    命令生成模块,所述命令生成模块被配置成用于:将所述网络模型的计算图中的各个操作翻译成目标硬件设备能够执行的硬件执行命令,所述硬件执行命令中包含所述目标硬件设备的设备信息;以及A command generation module, the command generation module is configured to: translate each operation in the computation graph of the network model into a hardware execution command executable by a target hardware device, the hardware execution command including device information of the target hardware device; and
    存储模块,所述存储模块被配置成用于:利用网络执行图存储所述硬件执行命令,其中,所述网络执行图用于记录为所述网络模型生成的所有硬件执行命令,所述目标硬件设备用于通过执行所述网络执行图中的硬件执行命令来运行所述网络模型。A storage module, wherein the storage module is configured to: store the hardware execution commands using a network execution graph, wherein the network execution graph is used to record all hardware execution commands generated for the network model, and the target hardware device is used to run the network model by executing the hardware execution commands in the network execution graph.
  22. 根据权利要求21所述的数据处理装置,其中,所述命令生成模块包括:The data processing device according to claim 21, wherein the command generation module comprises:
    分配模块,所述分配模块被配置成用于:为所述网络模型分配对应的虚假内存空间;以及An allocation module, the allocation module being configured to: allocate a corresponding virtual memory space for the network model; and
    翻译模块,所述翻译模块被配置成用于:基于所述虚假内存空间,将所述网络模型中包含的各个操作翻译成对应的第一硬件执行命令,所述第一硬件执行命令中的地址均为虚假地址,所述虚假内存空间与真实内存空间具备相同属性。A translation module, wherein the translation module is configured to: based on the virtual memory space, translate each operation contained in the network model into a corresponding first hardware execution command, wherein the addresses in the first hardware execution command are all virtual addresses, and the virtual memory space has the same properties as the real memory space.
  23. 根据权利要求22所述的数据处理装置,其中,所述存储模块还被配置成用于:存储所述第一硬件执行命令,所述第一硬件执行命令用于在被进行地址替换后提供给需要运行所述网络模型的硬件设备执行。The data processing device according to claim 22, wherein the storage module is further configured to: store the first hardware execution command, and the first hardware execution command is used to be provided to the hardware device that needs to run the network model for execution after the address is replaced.
  24. 一种数据处理装置,所述数据处理装置包括:A data processing device, comprising:
    获取模块，所述获取模块被配置成用于：在需要运行网络模型时，获取预先存储的所述网络模型对应的目标硬件设备能够执行的硬件执行命令；以及An acquisition module, the acquisition module being configured to: when a network model needs to be run, acquire pre-stored hardware execution commands executable by a target hardware device corresponding to the network model; and
    发送模块,所述发送模块被配置成用于:将所述硬件执行命令发送给所述目标硬件设备,以使所述目标硬件设备执行所述硬件执行命令,以实现在所述目标硬件设备上运行所述网络模型对输入数据进行处理的目的。A sending module is configured to: send the hardware execution command to the target hardware device so that the target hardware device executes the hardware execution command to achieve the purpose of running the network model on the target hardware device to process the input data.
  25. 根据权利要求24所述的数据处理装置,其中,所述获取模块包括:The data processing device according to claim 24, wherein the acquisition module comprises:
    第一硬件执行命令获取模块,所述第一硬件执行命令获取模块被配置成用于:在需要执行所述网络模型时,将所述网络模型对应的网络原始数据加载到真实内存空间,获取预先存储的第一硬件执行命令;其中,所述第一硬件执行命令是基于虚假内存空间对所述网络模型中包含的各个操作进行翻译得到的,所述虚假内存空间与真实内存空间具备相同属性;以及A first hardware execution command acquisition module, the first hardware execution command acquisition module is configured to: when the network model needs to be executed, load the network original data corresponding to the network model into the real memory space, and acquire a pre-stored first hardware execution command; wherein the first hardware execution command is obtained by translating each operation included in the network model based on the virtual memory space, and the virtual memory space has the same properties as the real memory space; and
    翻译模块,所述翻译模块被配置成用于:利用所述真实内存空间对应的真实地址替换掉所述第一硬件执行命令中的虚假地址。A translation module is configured to replace the false address in the first hardware execution command with the real address corresponding to the real memory space.
  26. 根据权利要求25所述的数据处理装置,The data processing device according to claim 25,
    其中,所述发送模块还被配置成用于:将替换后的第一硬件执行命令作为第二硬件执行命令,发送给对应的硬件设备。The sending module is further configured to send the replaced first hardware execution command as the second hardware execution command to the corresponding hardware device.
  27. 一种AI芯片,其中,所述AI芯片包括:An AI chip, wherein the AI chip comprises:
    内核,所述内核被配置成用于:获取待运行的网络模型的计算图,并将所述网络模型的计算图中的各个操作翻译成目标硬件设备能够执行的硬件执行命令,所述硬件执行命令中包含所述目标硬件设备的设备信息;以及A kernel, the kernel being configured to: obtain a computation graph of a network model to be run, and translate each operation in the computation graph of the network model into a hardware execution command executable by a target hardware device, wherein the hardware execution command includes device information of the target hardware device; and
    存储设备,所述存储设备被配置成用于:利用网络执行图存储所述硬件执行命令,其中,所述网络执行图用于记录为所述网络模型生成的所有硬件执行命令,所述目标硬件设备用于通过执行所述网络执行图中的硬件执行命令来运行所述网络模型。A storage device, wherein the storage device is configured to: store the hardware execution commands using a network execution graph, wherein the network execution graph is used to record all hardware execution commands generated for the network model, and the target hardware device is used to run the network model by executing the hardware execution commands in the network execution graph.
  28. 根据权利要求27所述的AI芯片,其中,The AI chip according to claim 27, wherein:
    所述内核还被配置成:用于为所述网络模型分配对应的虚假内存空间,并基于所述虚假内存空间,将所述网络模型中包含的各个操作翻译成对应的第一硬件执行命令,所述第一硬件执行命令中的地址均为虚假地址,所述虚假内存空间与真实内存空间具备相同属性;以及The kernel is further configured to: allocate a corresponding virtual memory space for the network model, and translate each operation included in the network model into a corresponding first hardware execution command based on the virtual memory space, wherein addresses in the first hardware execution command are all virtual addresses, and the virtual memory space has the same attributes as the real memory space; and
    所述存储设备还被配置成用于:存储所述第一硬件执行命令,所述第一硬件执行命令用于在被进行地址替换后提供给需要运行所述网络模型的硬件设备执行。The storage device is further configured to store the first hardware execution command, where the first hardware execution command is provided to a hardware device that needs to run the network model for execution after address replacement.
  29. 一种AI芯片,其中,所述AI芯片包括:An AI chip, wherein the AI chip comprises:
    硬件设备;hardware equipment;
    存储设备,所述存储设备被配置成用于:存储网络模型的计算图中的各个操作对应的硬件执行命令;以及A storage device, the storage device being configured to: store hardware execution commands corresponding to each operation in a computational graph of a network model; and
    内核,所述内核被配置成用于:在需要运行所述网络模型时,从所述存储设备中获取先存储的所述硬件执行命令,并将所述硬件执行命令发送给所述硬件设备,A kernel, wherein the kernel is configured to: when it is necessary to run the network model, obtain the previously stored hardware execution command from the storage device and send the hardware execution command to the hardware device,
    其中,所述硬件设备被配置成用于:执行所述硬件执行命令,以实现运行所述网络模型对输入数据进行处理的目的。Wherein, the hardware device is configured to: execute the hardware execution command to achieve the purpose of running the network model to process the input data.
  30. 根据权利要求29所述的AI芯片,其中,The AI chip according to claim 29, wherein:
    所述存储设备还被配置成:用于存储第一硬件执行命令,其中,所述第一硬件执行命令是基于虚假内存空间对所述网络模型中包含的各个操作进行翻译得到的,所述虚假内存空间与真实内存空间具备相同属性;The storage device is further configured to: store a first hardware execution command, wherein the first hardware execution command is obtained by translating each operation included in the network model based on a virtual memory space, and the virtual memory space has the same properties as the real memory space;
    所述内核还被配置成用于：在需要执行所述网络模型时，将所述网络模型对应的网络原始数据加载到所述真实内存空间，获取存储于所述存储设备中的第一硬件执行命令，利用所述真实内存空间对应的真实地址替换掉所述第一硬件执行命令中的虚假地址，并将替换后的第一硬件执行命令作为第二硬件执行命令，发送给所述硬件设备；The kernel is further configured to: when the network model needs to be executed, load the network original data corresponding to the network model into the real memory space, obtain the first hardware execution command stored in the storage device, replace the false address in the first hardware execution command with the real address corresponding to the real memory space, and send the replaced first hardware execution command, as the second hardware execution command, to the hardware device;
    所述硬件设备还被配置成用于:执行所述第二硬件执行命令。The hardware device is further configured to: execute the second hardware execution command.
  31. 一种电子设备,其中,所述电子设备包括:An electronic device, wherein the electronic device comprises:
    存储器和处理器,所述处理器与所述存储器连接;A memory and a processor, wherein the processor is connected to the memory;
    所述存储器被配置成用于:存储程序;The memory is configured to: store a program;
    所述处理器被配置成用于:调用存储于所述存储器中的程序,以执行如权利要求1至20中任一项所述的方法。The processor is configured to call a program stored in the memory to execute the method according to any one of claims 1 to 20.
  32. 一种计算机可读存储介质,其中,在所述计算机可读存储介质上存储有计算机程序,所述计算机程序被处理器运行时,执行如权利要求1至20中任一项所述的方法。 A computer-readable storage medium, wherein a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the method according to any one of claims 1 to 20 is executed.
PCT/CN2023/092113 2022-11-25 2023-05-04 Data processing method and apparatus, ai chip, electronic device, and storage medium WO2024108907A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN202211486830.0A CN115586972B (en) 2022-11-25 2022-11-25 Command generation method and device, AI chip, electronic device and storage medium
CN202211486830.0 2022-11-25
CN202211486836.8A CN115576699B (en) 2022-11-25 2022-11-25 Data processing method, device, AI chip, electronic equipment and storage medium
CN202211486836.8 2022-11-25

Publications (1)

Publication Number Publication Date
WO2024108907A1 true WO2024108907A1 (en) 2024-05-30

Family

ID=91195095

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/092113 WO2024108907A1 (en) 2022-11-25 2023-05-04 Data processing method and apparatus, ai chip, electronic device, and storage medium

Country Status (1)

Country Link
WO (1) WO2024108907A1 (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070079106A1 (en) * 2005-09-22 2007-04-05 International Business Machines Corporation Method and apparatus for translating a virtual address to a real address using blocks of contiguous page table entries
CN110647981A (en) * 2019-09-23 2020-01-03 北京中科寒武纪科技有限公司 Data processing method, data processing device, computer equipment and storage medium
CN112529169A (en) * 2019-09-18 2021-03-19 华为技术有限公司 Data processing method, model optimization device and model execution device
CN114461221A (en) * 2022-01-27 2022-05-10 北京奕斯伟计算技术有限公司 Compiling method, compiling device, electronic device, and storage medium
CN114528022A (en) * 2015-04-24 2022-05-24 优创半导体科技有限公司 Computer processor implementing pre-translation of virtual addresses
CN115576699A (en) * 2022-11-25 2023-01-06 成都登临科技有限公司 Data processing method, data processing device, AI chip, electronic device and storage medium
CN115586972A (en) * 2022-11-25 2023-01-10 成都登临科技有限公司 Command generation method and device, AI chip, electronic device and storage medium

Similar Documents

Publication Publication Date Title
US11010681B2 (en) Distributed computing system, and data transmission method and apparatus in distributed computing system
US9047196B2 (en) Usage aware NUMA process scheduling
US8700838B2 (en) Allocating heaps in NUMA systems
CN111078323B (en) Data processing method and device based on coroutine, computer equipment and storage medium
US20220292082A1 (en) Method, apparatus and device for parallel execution of smart contract, and medium
CN111309649B (en) Data transmission and task processing method, device and equipment
CA2616070A1 (en) Adaptive process dispatch in a computer system having a plurality of processors
JP2014504768A (en) Method, computer program product, and apparatus for progressively unloading classes using a region-based garbage collector
US9063805B2 (en) Method and system for enabling access to functionality provided by resources outside of an operating system environment
CN115576699B (en) Data processing method, device, AI chip, electronic equipment and storage medium
JP2000347876A (en) Method and device for stack slot allocation
US11366689B2 (en) Hardware for supporting OS driven observation and anticipation based on more granular, variable sized observation units
US20230289187A1 (en) Method and apparatus for rectifying weak memory ordering problem
US20240143397A1 (en) Data processing method and system, and related device
CN111666210A (en) Chip verification method and device
KR102326280B1 (en) Method, apparatus, device and medium for processing data
CN114253713B (en) Asynchronous batch processing method and system based on reactor
CN115586972B (en) Command generation method and device, AI chip, electronic device and storage medium
WO2024108907A1 (en) Data processing method and apparatus, ai chip, electronic device, and storage medium
CN116680209A (en) WASM-based multi-intelligent contract instance management method
CN112214443B (en) Secondary unloading device and method arranged in graphic processor
US20120137300A1 (en) Information Processor and Information Processing Method
CN113177211A (en) FPGA chip for privacy computation, heterogeneous processing system and computing method
US20130166887A1 (en) Data processing apparatus and data processing method
CN110618794A (en) Method and system for accessing NandFlash by SSD firmware