WO2022105743A1 - Operator computing method, apparatus, device and system - Google Patents

Operator computing method, apparatus, device and system

Info

Publication number
WO2022105743A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
shape
dimension
computing
computing unit
Prior art date
Application number
PCT/CN2021/130883
Other languages
English (en)
French (fr)
Inventor
鲍旭
Original Assignee
华为技术有限公司
Application filed by 华为技术有限公司
Priority to EP21893890.0A priority Critical patent/EP4242880A1/en
Publication of WO2022105743A1 publication Critical patent/WO2022105743A1/zh
Priority to US18/319,680 priority patent/US20230289183A1/en

Classifications

    • G PHYSICS
        • G06 COMPUTING; CALCULATING OR COUNTING
            • G06F ELECTRIC DIGITAL DATA PROCESSING
                • G06F 17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
                    • G06F 17/10 Complex mathematical operations
                        • G06F 17/15 Correlation function computation including computation of convolution operations
                        • G06F 17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
                • G06F 9/00 Arrangements for program control, e.g. control units
                    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
                        • G06F 9/30 Arrangements for executing machine instructions, e.g. instruction decode
                            • G06F 9/30003 Arrangements for executing specific machine instructions
                                • G06F 9/30007 Arrangements for executing specific machine instructions to perform operations on data operands
                                    • G06F 9/3001 Arithmetic instructions
                            • G06F 9/30098 Register arrangements
                                • G06F 9/30105 Register structure
                                    • G06F 9/30112 Register structure comprising data of variable length
            • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
                • G06N 3/00 Computing arrangements based on biological models
                    • G06N 3/02 Neural networks
                        • G06N 3/04 Architecture, e.g. interconnection topology
                            • G06N 3/045 Combinations of networks
                            • G06N 3/0464 Convolutional networks [CNN, ConvNet]
                        • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
                            • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
                        • G06N 3/08 Learning methods

Definitions

  • the present application relates to the field of computer technology, and in particular, to an operator computing method, apparatus, device, and system.
  • AI (Artificial Intelligence) is a theory, method, technology, and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results.
  • Artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that responds in a way similar to human intelligence.
  • Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that machines have the functions of perception, reasoning, and decision-making.
  • Research in the field of artificial intelligence includes robotics, natural language processing, computer vision, decision-making and reasoning, human-computer interaction, recommendation and search, and basic AI theory.
  • AI networks have also been widely used.
  • AI networks are becoming more and more complex, there are more and more types of AI operators in AI networks, and the data shapes that need to be processed are increasing in number.
  • For each new data shape, an AI operator needs to be recompiled, which makes compilation more and more time-consuming and also reduces the startup speed of the AI network.
  • the embodiments of the present application provide an operator computing method, apparatus, device, and system, which support changes of the data shape in any range by combining at least two computing units, thereby realizing the AI dynamic-shape operator function and improving the startup speed of the AI network.
  • an embodiment of the present application provides a method for calculating an operator, and the method includes:
  • Acquire parameter data of a first data shape of the AI network, where the first data shape is the data length in each dimension that the AI network supports processing, and the parameter data includes combination information of at least two computing units; the data that each computing unit supports processing is data with a second data shape, and after the second data shapes of the computing units are combined according to the combination information, the data length in any dimension is greater than or equal to the data length of the first data shape in the same dimension;
  • the at least two calculation units are called to calculate the first target data having the first data shape.
  • In the method, the parameter data of the first data shape is obtained; the parameter data includes the combination information of at least two computing units, and these computing units are invoked to calculate the first target data with the first data shape. This avoids recompiling an AI operator for each different first data shape: by combining at least two computing units, changes of the data shape in any range are supported, the AI dynamic-shape operator function is realized, and the startup speed of the AI network is improved.
  • the number of first data shapes in the method may be one or more. Since the first data shape actually refers to an attribute of an operator, that is, the data length in each dimension that each of one or more operators in the AI network supports processing, the number of first data shapes may be one or more.
  • the multiple operators here can be operators of the same type or operators of different types. For operators of the same type, if the data shapes they support processing differ, the number of first data shapes may be multiple; for operators of different types, the number of first data shapes may likewise be multiple.
  • the second data shape refers to the length of data in each dimension that the computing unit supports processing.
  • the relationship between the second data shape and the first data shape is: after the second data shapes of the computing units are combined according to the combination information, the data length in any dimension is greater than or equal to the data length of the first data shape in the same dimension.
  • the second data shape includes three dimensions: length, width, and height. After the second data shapes of the computing units are combined according to the combination information, the data length in the length dimension is greater than or equal to the data length of the first data shape in the length dimension, the data length in the width dimension is greater than or equal to that of the first data shape in the width dimension, and the data length in the height dimension is greater than or equal to that of the first data shape in the height dimension.
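The per-dimension condition above can be sketched in a few lines of Python; the helper name `covers` and the example shapes are illustrative assumptions, not part of the patent:

```python
# Illustrative sketch: check that the combined second data shapes cover the
# first data shape in every dimension (length, width, height).

def covers(combined_shape, first_shape):
    """True if the combined shape is at least as long as the first data
    shape in every dimension."""
    return all(c >= f for c, f in zip(combined_shape, first_shape))

# Three units of height 4 stacked along the height dimension:
unit = (16, 16, 4)
combined = (unit[0], unit[1], unit[2] * 3)   # (16, 16, 12)
print(covers(combined, (16, 16, 11)))        # True: 12 >= 11 in height
print(covers(combined, (16, 16, 13)))        # False: 12 < 13
```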
  • the computing unit in this method may be equivalent to an operator, and the computing unit may be an operator of the AI network, or may be a component of the operator.
  • the combination information in the method may include combination patterns of at least two computing units.
  • For example, if the data length of the first data shape in a certain dimension is 11, the combination mode of at least two computing units in the combination information can be: a computing unit with data length 5 + a computing unit with data length 5 + a computing unit with data length 5; or a computing unit with data length 5 + a computing unit with data length 5 + a computing unit with data length 1.
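One way to produce such a combination mode for a single dimension can be sketched as a greedy tiling; the function name `tile_dimension` is a hypothetical illustration, not from the patent:

```python
# Illustrative sketch: cover a target data length in one dimension with
# fixed computing-unit lengths, largest units first.

def tile_dimension(target_len, unit_lens):
    """Return the list of unit lengths used to cover target_len."""
    tiles, remaining = [], target_len
    for u in sorted(unit_lens, reverse=True):
        while remaining >= u:
            tiles.append(u)
            remaining -= u
    if remaining > 0:
        # Tail not covered exactly: add one more smallest unit.
        tiles.append(min(unit_lens))
    return tiles

print(tile_dimension(11, [5, 1]))  # [5, 5, 1]  -> the 5 + 5 + 1 mode
print(tile_dimension(11, [5]))     # [5, 5, 5]  -> covers 15 >= 11
```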
  • the parameter data in this method can be saved in the cache in the form of a parameter table.
  • the at least two computing units include identical computing units; or different computing units; or both identical and different computing units;
  • the second data shapes of identical computing units have the same data length in every dimension; the second data shapes of different computing units differ in data length in at least one dimension.
  • the same computing unit and different computing units can be determined by whether the data lengths in each dimension are the same.
  • the at least two computing units are both computing units of the AI network.
  • the calculation of the first target data of the first data shape supported and processed by the AI network can be realized by invoking at least two computing units of the AI network.
  • the at least two computing units in this manner may be computing units of the AI network, or may be computing units of networks other than the AI network.
  • the AI network and other networks here can be used to implement different functions, such as: object detection, image classification, audio processing, natural language processing and other functions.
  • AI networks and other networks that implement different functions can include the same computing unit.
  • For example, both the AI network and other networks include convolution computing units; they can also include different computing units.
  • For example, the AI network does not include a convolution computing unit, while another network does.
  • When the AI network does not include a convolution computing unit but another network does, and the AI network needs to use a convolution computing unit, the convolution computing unit included in the other network can be used.
  • the combination information includes a combination mode of the at least two computing units, such that after the second data shapes are combined according to the combination mode, the data length in any dimension is greater than or equal to the data length of the first data shape in the same dimension.
  • the relationship between the second data shape and the first data shape is: after the second data shape of each computing unit is combined according to a certain combination pattern, the data length in any dimension is greater than or equal to The data length of the first data shape in the same dimension.
  • the parameter data further includes identification information for a specified computing unit;
  • the specified computing unit refers to a computing unit, among the at least two computing units, whose data to be processed is data with a third data shape, where the data length of the third data shape in at least one dimension is smaller than the data length, in the same dimension, of the second data shape that the specified computing unit supports processing.
  • In this manner, identification information can be added for the specified computing unit in the parameter data, so that the specified computing unit can subsequently be called to calculate the data with the third data shape, thereby improving the accuracy of the operator calculation.
  • the parameter data further includes a specified processing manner of the specified calculation unit for the data having the third data shape.
  • In this manner, a specified processing manner can also be added for the specified computing unit in the parameter data, so that when the specified computing unit is subsequently called, the data with the third data shape is calculated using the specified processing manner.
  • the specified processing manner includes: discarding invalid data; or data overlap, where data overlap means overlapping the invalid data with the data to be processed by another computing unit.
  • In this manner, the specified processing manner can be discarding invalid data or data overlap, so that the data with the third data shape can subsequently be calculated according to the specified processing manner, which enriches the implementations of operator calculation and also improves its reliability.
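The two processing manners for tail data can be sketched as follows; the helper `tile_offsets` and the string modes `"discard"`/`"overlap"` are illustrative assumptions:

```python
# Illustrative sketch: per-dimension unit offsets under the two specified
# processing manners. "discard" lets the last unit run past the end (the
# invalid part is dropped); "overlap" shifts the last unit back so it
# overlaps already-covered data and ends exactly at the boundary.

def tile_offsets(total_len, unit_len, mode):
    """Offsets of each computing unit along one dimension."""
    offsets = list(range(0, total_len, unit_len))
    tail = total_len - offsets[-1]
    if tail < unit_len:
        if mode == "overlap":
            offsets[-1] = total_len - unit_len
        # "discard": keep the offset; invalid tail output is discarded.
    return offsets

print(tile_offsets(11, 5, "discard"))  # [0, 5, 10]
print(tile_offsets(11, 5, "overlap"))  # [0, 5, 6]
```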
  • the parameter data further includes that the specified calculation unit supports a specified variation range of the third data shape in each dimension.
  • the third data shape to be processed by the specified computing unit can change, but only within a certain range. Therefore, the specified variation range of the third data shape in each dimension can be added to the parameter data, so that the same computing unit can support data shape changes within a certain variation range.
  • the specified variation range is the full data length, in each dimension, of the second data shape that the specified computing unit supports processing; or a specified part of the data length in each dimension of the second data shape.
  • In this manner, different variation ranges can be selected according to the actual situation. If the data length in a dimension is relatively small, it can vary over the entire data length; for example, with a data length of 16, it can vary from 0 to 16. If the data length in a dimension is relatively large, it can vary within a small range at the end of the data length; for example, with a data length of 100, it can vary from 90 to 100. In this way, the efficiency of operator calculation is ensured and a large number of repeated calculations is avoided.
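Binning shapes into shared variation ranges can be sketched as below; the bin edges and the helper name `bin_for` are illustrative assumptions, not from the patent:

```python
# Illustrative sketch: shapes whose length in a dimension falls within the
# same specified variation range share one (binned) parameter-data entry.

def bin_for(length, bin_edges):
    """Return the smallest bin upper bound that covers `length`."""
    for edge in sorted(bin_edges):
        if length <= edge:
            return edge
    raise ValueError(f"length {length} exceeds largest bin {max(bin_edges)}")

edges = [16, 32, 64, 128]
print(bin_for(11, edges))   # 16: lengths up to 16 share one entry
print(bin_for(100, edges))  # 128
```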
  • the parameter data includes binning parameter data, and the binning parameter data is used to support a data shape of a specified variation range.
  • In this manner, the parameter data of different first data shapes may be the same, that is, binned parameter data, so that each different data shape need not correspond to different parameter data, thereby effectively reducing the amount of parameter data in the cache and avoiding wasted resources.
  • the invoking of the at least two computing units to calculate the first target data having the first data shape includes:
  • obtaining the at least two computing units from a computing unit operator library;
  • calculating, by the at least two computing units, the first target data having the first data shape.
  • In this manner, the computing unit operator library may include many pre-compiled computing units, which can be obtained directly from the library when performing operator calculation, thereby improving the efficiency of operator computation and also the startup speed of the AI network.
  • the computing units included in the computing unit operator library can be used to implement different operations, such as: convolution, addition, matrix multiplication, and so on.
  • These computing units that implement different operations can be used by multiple AI networks, where multiple AI networks can be used to implement different functions, such as: target detection, image classification, audio processing, natural language processing and other functions.
  • the invoking of the at least two computing units to calculate the first target data having the first data shape includes:
  • for any computing unit, determining a target position, in the first target data, of second target data to be processed by that computing unit;
  • acquiring, according to the target position, the second target data from the memory space in which the first target data is stored;
  • calculating the second target data by that computing unit.
  • In this manner, the target position, in the first target data, of the second target data to be processed can be determined first; the second target data is then obtained from the memory space according to the target position, and the computing unit completes the calculation of the second target data, thereby improving the reliability of the operator calculation.
  • the memory space in this manner may refer to a storage space used for storing data in the memory, and the address thereof is one-dimensional.
  • Since the second target data may be multi-dimensional, it needs to be obtained from the memory space by means of skip read; after the calculation is completed, the calculated output data is saved to the memory space by means of skip write.
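Because memory addresses are one-dimensional, a multi-dimensional sub-block is gathered row by row with a stride. A minimal 2-D sketch of the skip-read side over a flat list; the helper name `skip_read` is an illustrative assumption:

```python
# Illustrative sketch: "skip read" a rows x cols sub-block from a flat
# row-major buffer, reading one contiguous row then skipping ahead.

def skip_read(flat, row_stride, row_off, col_off, rows, cols):
    out = []
    for r in range(rows):
        start = (row_off + r) * row_stride + col_off
        out.append(flat[start:start + cols])  # contiguous row, then skip
    return out

# A 4x4 matrix stored flat; read the 2x2 block at offset (row 1, col 2).
flat = list(range(16))                 # row-major 4x4: 0..15
print(skip_read(flat, 4, 1, 2, 2, 2))  # [[6, 7], [10, 11]]
```

Skip write is the mirror operation: each calculated row is written back to the same strided positions.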
  • When determining the target position of the second target data in the first target data, if the parameter data includes the position information of each second data shape in the first data shape, the target position of the second target data in the first target data can be determined according to that position information.
  • the target position includes: each dimension in which the second target data is located; and, for any of those dimensions, the offset and data length of the second target data in that dimension.
  • Since the second target data may be multi-dimensional, the target position needs to include each dimension in which the second target data is located, as well as its offset and data length in each dimension, thereby improving the accuracy and efficiency of target data acquisition.
  • the at least two computing units belong to different types of operators.
  • At least two computing units in the parameter data may belong to the same type of operator, that is, implement the same function; or belong to different types of operators, that is, implement different functions, such as the conv (convolution) operator, the add (addition) operator, the matmul (matrix multiplication) operator, etc.
  • the different types of operators in this manner may refer to respective operators that are cascaded into fusion operators.
  • the fusion operator refers to the cascaded operators of different types, which are fused into one operator for one-time calculation.
  • At least two computing units in the parameter data may be computing units of these different types of operators, such as the computing unit of the conv operator, the computing unit of the relu operator, the computing unit of the abs operator, the computing unit of the exp operator, etc. In this way, during operator calculation, the computing units of these different types of operators in the parameter data can be called to complete the calculation of the fusion operator, thereby avoiding calculating one type of operator and then calling another type of operator to calculate, which improves the calculation efficiency of the fusion operator.
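The fusion idea can be sketched as composing cascaded operators into one computing unit so each piece of data is processed in a single pass. The element-wise stand-in functions below are illustrative assumptions, not the patent's operators:

```python
# Illustrative sketch of a fusion operator: compose cascaded operators
# (stand-ins for conv -> relu -> abs) into one unit applied per element.

def relu(x):
    return max(x, 0.0)

def fuse(*ops):
    """Compose operators into one fused computing unit."""
    def fused(x):
        for op in ops:
            x = op(x)
        return x
    return fused

fused_unit = fuse(lambda x: x * 2.0,  # stand-in for a conv computing unit
                  relu,
                  abs)
print([fused_unit(v) for v in [-3.0, 0.5]])  # [0.0, 1.0]
```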
  • the computing unit is a pre-compiled operator.
  • the computing unit in the method may be equivalent to an operator, and is a pre-compiled operator.
  • the computing unit operator library can include many pre-compiled computing units. When performing operator calculation, it can be obtained directly from the computing unit operator library, thereby improving the efficiency of operator computing and improving the AI network startup speed.
  • The pre-compiled operator in this method can be compiled by a compiling host into a releasable static computing-unit binary package, which all execution hosts only need to import; it can also be pre-compiled by the execution host itself.
  • These pre-compiled computing units are stored in the cache, so that during operator calculation they can be obtained directly from the cache, which also improves the efficiency of operator calculation and the startup speed of the AI network.
  • the static in the static computing unit means that the data shape supported by the computing unit is fixed, so that the pre-compiled computing unit can be directly used for operator computing without recompiling.
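The compile-once, fetch-from-cache behavior can be sketched as a dictionary keyed by operator and static shape; `compile_unit` and the key layout are hypothetical stand-ins:

```python
# Illustrative sketch: computing units are compiled once for a fixed
# (static) second data shape and cached; later calls fetch them instead
# of recompiling.

_unit_cache = {}

def compile_unit(op_name, shape):
    # Stand-in for an expensive ahead-of-time compilation step.
    return f"{op_name}-kernel-{'x'.join(map(str, shape))}"

def get_unit(op_name, shape):
    key = (op_name, shape)
    if key not in _unit_cache:
        _unit_cache[key] = compile_unit(op_name, shape)
    return _unit_cache[key]

print(get_unit("conv", (16, 16, 4)))  # conv-kernel-16x16x4
# Second lookup returns the cached object, no recompilation:
print(get_unit("conv", (16, 16, 4)) is get_unit("conv", (16, 16, 4)))  # True
```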
  • an embodiment of the present application provides an operator computing device, the device comprising:
  • an acquisition module, configured to acquire parameter data of a first data shape of the AI network, where the first data shape is the data length in each dimension that the AI network supports processing, and the parameter data includes combination information of at least two computing units; the data that each computing unit supports processing is data with a second data shape, and after the second data shapes of the computing units are combined according to the combination information, the data length in any dimension is greater than or equal to the data length of the first data shape in the same dimension;
  • a calculation module configured to invoke the at least two calculation units to calculate the first target data having the first data shape.
  • the at least two computing units include the same computing unit; or different computing units; or the same computing unit and different computing units;
  • the second data shape of the same computing unit has the same data length in each dimension; the second data shape of different computing units has different data lengths in at least one dimension.
  • the at least two computing units are both computing units of the AI network.
  • the combination information includes a combination mode of the at least two computing units
  • the data length in any dimension is greater than or equal to the data length of the first data shape in the same dimension.
  • the parameter data further includes identification information for a specified computing unit
  • the specified computing unit refers to a computing unit, among the at least two computing units, whose data to be processed is data with a third data shape, where the data length of the third data shape in at least one dimension is smaller than the data length, in the same dimension, of the second data shape that the specified computing unit supports processing.
  • the parameter data further includes a specified processing manner of the specified calculation unit for the data having the third data shape.
  • the specified processing manner includes: discarding invalid data; or data overlap, where data overlap means overlapping the invalid data with the data to be processed by another computing unit.
  • the parameter data further includes that the specified calculation unit supports a specified variation range of the third data shape in each dimension.
  • the specified variation range is the full data length, in each dimension, of the second data shape that the specified computing unit supports processing; or a specified part of the data length in each dimension of the second data shape.
  • the parameter data includes binning parameter data, and the binning parameter data is used to support a data shape of a specified variation range.
  • the computing module includes:
  • a first obtaining submodule configured to obtain the at least two computing units from a computing unit operator library
  • the first calculation sub-module is configured to calculate the first target data having the first data shape by the at least two calculation units.
  • the computing module includes:
  • a determination submodule, configured to determine, for any computing unit, a target position, in the first target data, of second target data to be processed by that computing unit;
  • a second acquisition sub-module configured to acquire the second target data that needs to be processed by any of the computing units from the memory space in which the first target data is stored according to the target location;
  • the second calculation sub-module is configured to calculate the second target data by using any of the calculation units.
  • the target position includes: each dimension where the second target data is located; and, for any dimension, the offset of the second target data in the any dimension and Data length.
  • the at least two computing units belong to different types of operators.
  • the computing unit is a pre-compiled operator.
  • an operator computing device including:
  • At least one memory for storing programs
  • At least one processor is configured to execute the program stored in the memory, and when the program stored in the memory is executed, the processor is configured to execute the method in the first aspect.
  • an embodiment of the present application provides an operator computing device, including the apparatus provided in the second aspect or the third aspect.
  • an embodiment of the present application provides an operator computing system, including the operator computing device and the operator compiling device provided in the fourth aspect;
  • the operator computing device includes the apparatus provided in the second aspect or the third aspect;
  • the operator compiling device is used to compile a releasable computing unit package
  • the operator computing device is used for importing the computing unit package.
  • an embodiment of the present application provides a computer storage medium, where instructions are stored in the computer storage medium, and when the instructions are executed on the computer, the computer is made to execute the method provided in the first aspect.
  • an embodiment of the present application provides a computer program product including instructions, which, when the instructions are run on a computer, cause the computer to execute the method provided in the first aspect.
  • an embodiment of the present application provides a chip, including at least one processor and an interface
  • At least one processor is configured to execute program instructions to implement the method provided in the first aspect.
  • The present application discloses an operator computing method, apparatus, device, and system. Parameter data of a first data shape of an AI network is acquired, where the first data shape is the data length in each dimension that the AI network supports processing; the parameter data includes combination information of at least two computing units, and the data that each computing unit supports processing is data with a second data shape. After the second data shapes of the computing units are combined according to the combination information, the data length in any dimension is greater than or equal to the data length of the first data shape in the same dimension. The at least two computing units are called to calculate first target data with the first data shape, so that data shape changes in any range can be supported by combining at least two computing units, thereby realizing the AI dynamic-shape operator function and improving the startup speed of the AI network.
  • Figure 1 is a schematic diagram of an artificial intelligence main frame
  • FIG. 2 is a schematic diagram of a system architecture of operator computing
  • Fig. 3 is a kind of schematic diagram of data shape change
  • Fig. 4 is a kind of schematic diagram of operator calculation process
  • FIG. 5 is a schematic diagram of a system architecture of operator computing
  • Fig. 6 is a kind of component structure diagram of terminal equipment
  • FIG. 7 is a hardware structure diagram of an AI chip
  • Fig. 8 is a kind of schematic diagram of skip read skip write scene used in operator calculation process
  • Fig. 9 is a kind of schematic diagram of skip read skip write support mode
  • Figure 10 is a schematic diagram of a tail data processing method
  • Figure 11 is a schematic diagram of a parameter table structure
  • FIG. 12 is a schematic diagram of an application scenario of a binned data table
  • FIG. 13 is a schematic diagram of a fusion operator
  • FIG. 15 is a schematic diagram of an operator calculation process
  • FIG. 16 is a schematic diagram of an operator calculation process
  • 17 is a schematic flowchart of an operator calculation method provided by an embodiment of the present application.
  • FIG. 18 is a schematic structural diagram of an operator computing device provided by an embodiment of the present application.
  • FIG. 19 is a schematic structural diagram of an operator computing device provided by an embodiment of the present application.
  • FIG. 20 is a schematic structural diagram of a chip provided by an embodiment of the present application.
  • words such as “exemplary”, “such as” or “for example” are used to mean serving as an example, instance, or illustration. Any embodiment or design described in the embodiments of the present application as “exemplary”, “such as”, or “for example” should not be construed as preferred or advantageous over other embodiments or designs. Rather, use of such words is intended to present the related concepts in a specific manner.
  • the term "and/or" describes only an association relationship between associated objects, indicating that three relationships may exist; for example, A and/or B may indicate three cases: only A exists, only B exists, and both A and B exist.
  • the term "plurality" means two or more.
  • multiple systems refer to two or more systems
  • multiple screen terminals refer to two or more screen terminals.
  • the terms “first” and “second” are used for descriptive purposes only, and cannot be understood as indicating or implying relative importance or implicitly indicating the number of the indicated technical features. Thus, a feature defined as “first” or “second” may explicitly or implicitly include one or more of that feature.
  • the terms “including”, “comprising”, “having” and their variants mean “including but not limited to”, unless specifically emphasized otherwise.
  • Figure 1 is a schematic diagram of an artificial intelligence main frame, which describes the overall workflow of the artificial intelligence system and is suitable for general artificial intelligence field requirements.
  • the "intelligent information chain” reflects a series of processes from data acquisition to processing. For example, it can be the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, intelligent execution and output. In this process, data has gone through the process of "data-information-knowledge-wisdom".
• the "IT value chain" reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure and information (providing and processing technology implementations) of human intelligence to the industrial ecology of the system.
• the infrastructure provides computing power support for the artificial intelligence system, realizes communication with the outside world, and provides support through the basic platform. Communication with the outside world is achieved through sensors; computing power is provided by smart chips (hardware acceleration chips such as CPUs, NPUs, GPUs, ASICs, and FPGAs); the basic platform includes distributed computing frameworks and network-related platform guarantees and support, which can include cloud storage and computing, interconnection networks, etc. For example, sensors communicate with the outside to obtain data, and the data is provided to the smart chips in the distributed computing system provided by the basic platform for calculation.
  • the data on the upper layer of the infrastructure is used to represent the data sources in the field of artificial intelligence.
  • the data involves graphics, images, voice, and text, as well as IoT data from traditional devices, including business data from existing systems and sensory data such as force, displacement, liquid level, temperature, and humidity.
  • Data processing usually includes data training, machine learning, deep learning, search, reasoning, decision-making, etc.
  • machine learning and deep learning can perform symbolic and formalized intelligent information modeling, extraction, preprocessing, training, etc. on data.
  • Reasoning refers to the process of simulating human's intelligent reasoning method in a computer or intelligent system, using formalized information to carry out machine thinking and solving problems according to the reasoning control strategy, and the typical function is search and matching.
  • Decision-making refers to the process of making decisions after intelligent information is reasoned, usually providing functions such as classification, sorting, and prediction.
  • some general capabilities can be formed based on the results of data processing, such as algorithms or a general system, such as translation, text analysis, computer vision processing, speech recognition, image identification, etc.
  • Intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields. They are the encapsulation of the overall artificial intelligence solution, and the productization of intelligent information decision-making and implementation of applications. Its application areas mainly include: intelligent manufacturing, intelligent transportation, Smart home, smart medical care, smart security, autonomous driving, safe city, smart terminals, etc.
  • FIG. 2 is a schematic diagram of a system architecture for operator computing.
• the machine learning platform 4011 can, in the AI network initialization 4012 stage, parse out all AI operators in the AI network and the data shape that each AI operator needs to support; the AI compiler 4018 is used to complete operator compilation. In the execution stage of the AI network (i.e., the runtime engine 4014), the operator 4016 can be called and the operator calculation performed (i.e., by the execution module 4017).
  • the AI operator can refer to the unit module that implements a specific calculation in the AI network. For example: convolution operator, add (addition) operator, matmul (matrix multiplication) operator, etc.
  • the data shape (shape) involved in this application refers to the data length of the data calculated by the operator in each dimension.
  • the shape of the data can vary in one dimension or in multiple dimensions simultaneously.
  • the graphic data may vary in one dimension of length, or may vary in both dimensions of length and width.
• the present application provides an operator computing method, device, equipment and system, which realize the AI dynamic shape operator function by combining binary static computing units, can support changes of data shape in any range, and improve the AI network startup speed.
• "static" in the static computing unit involved in this application means that the data shape supported by the computing unit is fixed, so that the pre-compiled static computing unit can be used directly for operator computing, without recompilation.
• the parameter table involved in this application refers to parameter data, in table form, that describes the first data shape supported by the AI network.
• the parameter data includes the combination information of at least two computing units, and the data that each computing unit supports processing is data with the second data shape.
  • FIG. 4 is a schematic diagram of an operator calculation process. As shown in Figure 4, this operator calculation can be used in AI networks, where there are many operators. For example: convolution operator, add (addition) operator, matmul (matrix multiplication) operator, etc.
  • the type of operator required is obtained by analyzing the operator and simplifying the network.
• the data shapes that some operators need to support are variable, covering the two scenarios in Figure 4: "data shape unchanged during execution" on the left and "data shape variable during execution" on the right.
• for an AI network whose data shape does not change during execution, only the scenario on the left side of Figure 4, "data shape unchanged during execution", applies.
  • the static computing unit in FIG. 4 may be stored in the static computing unit operator library in advance, and when the static computing unit is called to complete the calculation, it may be taken out from the static computing unit operator library for use. The meaning of each part in Figure 4 is as follows:
• Static computing unit: an operator unit that completes the calculation of a fixed data shape only; such an operator unit is also equivalent to an operator.
• Each type of operator can include several optimized static computing units with different data shapes. Different types of operators implement different functions, such as the convolution operator, the add (addition) operator, the matmul (matrix multiplication) operator, etc.
• Parameter table: a data structure used to describe the combination mode of static computing units. Each static computing unit completes the calculation of one piece of data; after combining multiple static computing units according to the parameters in the parameter table, the calculation of all the data can be completed.
• AI network initialization: the stage of initialization operations such as analyzing the AI network's operator types.
• AI network execution: the process of invoking operators and completing calculations.
  • the operator calculation can be completed through the combination of the parameter table and the static computing unit, and the parameter table is used as an input parameter of the static computing unit.
  • the parameter table can be generated when the AI network is initialized and stored in the cache, and can be obtained directly from the cache when the AI network is executing.
• a parameter table can also be generated during execution of the AI network, and then the static computing unit is called to complete the calculation according to the parameter table. Once generated, the table can be fetched from the cache, so it does not need to be regenerated every time it is used.
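The generate-at-initialization, fetch-from-cache-at-execution behaviour described above can be sketched as follows (a minimal illustration; the cache key, `generate_table`, and the table contents are hypothetical, not the patent's actual data layout):

```python
# Hypothetical sketch: generate the parameter table once at network
# initialization and fetch it from a cache on later executions.
_table_cache = {}

def generate_table(shape):
    # Placeholder for the combination algorithm: one entry per call
    # of a static computing unit (described elsewhere in this document).
    return [("unit_16x16", tuple(shape))]

def get_parameter_table(shape):
    key = tuple(shape)
    if key not in _table_cache:          # first use: generate and cache
        _table_cache[key] = generate_table(shape)
    return _table_cache[key]             # later uses: fetch from cache

t1 = get_parameter_table([32, 32])
t2 = get_parameter_table([32, 32])
assert t1 is t2  # same cached object, no regeneration
```

The design choice here mirrors the text: the costly combination algorithm runs once per data shape, and every subsequent execution of the AI network pays only a lookup.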
  • FIG. 5 is a schematic diagram of a system architecture of operator computing.
• the product realization form of this application is the program code included in the AI compiler and the machine learning/deep learning platform software, deployed on the host hardware.
  • the program code of the present application exists in the static computing unit compilation module of the AI compiler, the initialization module of the platform software, and the runtime engine.
• when compiling, the program code of the present application runs in the CPU of the compilation host; when running, the static computing unit 4016 of the present application runs in the AI chip of the execution host, and the AI chip can be equipped with the binary static computing unit and the software program of the operator calculation process provided by the present application.
  • FIG. 5 shows the implementation form of the present application in the host AI compiler and platform software, wherein the parts 4013 , 4015 , 4016 , 4017 , and 4019 shown in dotted boxes are newly added modules based on the existing platform software of the present application.
• the present application designs a combination algorithm module 4013; inside the runtime engine 4014, the execution module 4017 can complete the operator calculation according to the parameter table 4015 and the called static computing unit 4016; the AI compiler 4018 includes a static computing unit compiling module 4019, which produces the static computing unit binary package 4020 after compiling the static computing units.
  • FIG. 5 shows a typical application scenario of binary distribution of static computing units.
  • the compilation host 4002 and the execution host 4001 are separated.
• the releasable static computing unit binary package 4020 is compiled on the compilation host 4002; all execution hosts 4001 only need to import the static computing unit binary package 4020.
  • the machine learning platform 4011 includes an AI compiler 4018, and the static computing unit is compiled on the execution host 4001. That is, the function of the compilation host 4002 is implemented on the execution host 4001 .
• the static computing unit 4016 is compiled when initialization 4012 is performed.
  • FIG. 6 is a component structure diagram of a terminal device.
  • a static computing unit published in binary is used on the terminal device to provide AI network execution capability to all APPs on the terminal device through the general interface NNAPI (Neural Networks Application Programming Interface) 4011.
  • the static computing unit 4016 of the present application uses the binary-released operator package 4020, which does not need to be recompiled. When initializing 4012, it is only necessary to call the combination algorithm 4013 to generate the parameter table 4015 of the data shape corresponding to the operator. This embodiment can minimize the startup time of the AI network when an APP (application) is opened, thereby greatly improving user experience.
  • the static computing unit 4016 of the present application may run in an AI chip of a terminal device, and the AI chip may be equipped with a binary static computing unit and a software program for the operator computing process provided by the present application.
  • the terminal device can use the published binary package of the static computing unit to realize the decoupling between the shape information of the data and the operator code, which reduces the difficulty of operator development and tuning.
  • using the released binary package of static computing units can greatly improve the initialization speed of AI networks in APPs.
  • FIG. 7 is a hardware structure diagram of an AI chip, and the AI chip can be equipped with a binary static computing unit and a software program of the operator computing process provided by the present application.
• the neural network processor (NPU) 50 is mounted on the main CPU (Host CPU) as a co-processor, and tasks are assigned by the Host CPU.
• the core part of the NPU is the operation circuit 503, which is controlled by the controller 504 to extract the matrix data in the memory and perform multiplication operations.
• the arithmetic circuit 503 includes multiple processing units (Process Engine, PE). In some implementations, the arithmetic circuit 503 is a two-dimensional systolic array; it may also be a one-dimensional systolic array or other electronic circuitry capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 503 is a general-purpose matrix processor.
  • the arithmetic circuit fetches the data corresponding to the matrix B from the weight memory 502 and buffers it on each PE in the arithmetic circuit.
• the arithmetic circuit fetches the matrix A data from the input memory 501, performs matrix operations with matrix B, and stores the obtained partial or final result of the matrix in the accumulator 508.
  • Unified memory 506 is used to store input data and output data.
• the weight data is transferred to the weight memory 502 through the storage unit access controller (DMAC) 505.
  • Input data is also moved to unified memory 506 via the DMAC.
• the BIU is the bus interface unit 510, which is used for the interaction between the AXI bus on one side and the DMAC and the instruction fetch buffer 509 on the other.
  • the bus interface unit 510 (Bus Interface Unit, BIU for short) is used for the instruction fetch memory 509 to obtain instructions from the external memory, and also for the storage unit access controller 505 to obtain the original data of the input matrix A or the weight matrix B from the external memory.
  • the DMAC is mainly used to transfer the input data in the external memory DDR to the unified memory 506 , the weight data to the weight memory 502 , or the input data to the input memory 501 .
  • the vector calculation unit 507 has multiple operation processing units, and if necessary, further processes the output of the operation circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison and so on.
• it is mainly used for non-convolution/FC layer network calculations in neural networks, such as pooling, batch normalization, local response normalization, etc.
  • vector computation unit 507 can store the processed output vectors to unified buffer 506 .
  • the vector calculation unit 507 may apply a nonlinear function to the output of the arithmetic circuit 503, such as a vector of accumulated values, to generate activation values.
  • vector computation unit 507 generates normalized values, merged values, or both.
  • the vector of processed outputs can be used as activation input to the arithmetic circuit 503, eg, for use in subsequent layers in a neural network.
  • the instruction fetch memory (instruction fetch buffer) 509 connected to the controller 504 is used to store the instructions used by the controller 504;
• the unified memory 506, the input memory 501, the weight memory 502 and the instruction fetch memory 509 are all on-chip memories, while the external memory is a memory external to the NPU hardware architecture.
  • FIG. 8 is a schematic diagram of a skip read skip write scenario used in an operator calculation process.
  • the static calculation unit in this application only completes the calculation of data with a fixed data shape, and the key to its realization is that the data required for each calculation needs to be segmented in the data space.
• the data space refers to the logical space, defined by the data shape, for saving data; its addresses are multi-dimensional, and it actually resides in the memory space. The memory space refers to the storage space used to save data in memory; its addresses are one-dimensional.
  • the gray part is the data that needs to be calculated by a certain static calculation unit.
• the gray part is discontinuous, so skip reading and skip writing can be used to obtain the data required for each calculation of the static computing unit from the memory space, and after the calculation is completed, the output data is saved back to the memory space.
  • FIG. 9 is a schematic diagram of a skip read skip write support mode.
• one of the ways to support skip read and skip write in this application is to add an interface, such as a bind_buffer(axis, stride, offset) interface, whose function is to establish the mapping relationship between the Tensor (tensor) of the static computing unit and the buffer (cache).
• in FIG. 9, the Tensor corresponds to the data to be processed by the static computing unit (the gray part in Figure 9), and the size of the buffer is consistent with the size of the data stored in the data space.
• axis refers to each dimension in which the Tensor lies; offset is the Tensor's offset, in a given dimension, of the data that needs to be read; stride refers to the length of data to be skipped in that dimension when the Tensor is read. It can be seen that, by mapping multiple dimensions, the application can realize skip reading and skip writing in a multi-dimensional data space.
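As an illustration of the offset/stride idea (not the actual bind_buffer implementation), skip reading a tile out of a row-major buffer is plain flat-address arithmetic:

```python
# Sketch: skip-read a 2x2 tile out of a 4x4 row-major buffer.
# The tile's rows are discontinuous in memory, so between consecutive
# tile rows a full buffer row (the stride) must be skipped.
buf = list(range(16))            # 4x4 data space, flat addresses 0..15
rows, cols = 4, 4
tile_h, tile_w = 2, 2
off_y, off_x = 1, 2              # offsets: tile starts at row 1, column 2

tile = [buf[(off_y + y) * cols + off_x + x]
        for y in range(tile_h) for x in range(tile_w)]
print(tile)  # [6, 7, 10, 11]
```

Skip writing is the same mapping in reverse: the computed results are scattered back to the same discontinuous addresses in the output data space.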
  • FIG 10 is a schematic diagram of a tail data processing method.
• the tail data refers to a piece of data whose length to be processed is smaller than the data length supported by the static computing unit when the data is logically divided by the static computing unit in one dimension. As shown in Figure 10, there are four processing methods for the tail data:
• Mode 1: Discard invalid data, that is, the data read by the static computing unit beyond the tail data is treated as invalid and its results are discarded (see the description of "discarding invalid data" later in this document).
• Mode 2: Partial data overlap, that is, the excess part of the tail data is shifted forward so that some data overlaps. Advantage: little scalar calculation. Disadvantage: many repeated calculations. That is, the excess part of the tail data can overlap with the data to be processed by another computing unit: when the static computing unit that calculates the tail data reads data, its starting position can be set inside the data to be processed by another computing unit, causing some data to overlap.
• Mode 3: Support data changes over the entire range, that is, when the data length in each dimension is relatively small, the data can change within the entire data length.
• for example, with a data length of 16, changes from 0 to 16 can be supported.
• Mode 4: Support changes within part of the data range, that is, when the data length of each dimension is relatively large, only changes within a small range at the end of the data length are supported.
• for example, with a data length of 100, changes from 90 to 100 can be supported.
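As an arithmetic sketch of the shift-forward trick in Mode 2, the read start of the tail unit is clamped so that a fixed-length unit still fits inside the data (the lengths below are illustrative, not taken from the patent):

```python
def tail_start(total_len, unit_len, nominal_start):
    # Shift the read window back so a fixed-length static unit never
    # runs past the end of the data; the shifted span overlaps data
    # already processed by the previous unit (Mode 2's trade-off).
    return min(nominal_start, max(0, total_len - unit_len))

# 100 elements processed by units of 16: the 7th unit nominally
# starts at 96, but only 4 elements remain, so it is shifted back
# to 84 and recomputes 12 overlapping elements.
assert tail_start(100, 16, 96) == 84
assert tail_start(100, 16, 0) == 0    # non-tail units are unchanged
```

The overlapped elements are computed twice, which is exactly the "many repeated calculations" disadvantage the text names.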
  • Fig. 11 is a schematic diagram of the structure of a parameter table, which is a form of parameter data used to describe the data shape supported by the AI network.
• the left side shows a logical segmentation of the data shape of the target data processed by the AI network (it could also be divided into data shapes of different sizes).
• each data shape obtained by logical segmentation corresponds to one static computing unit call.
• the right side shows the data structure of the parameter table.
  • the parameter table is a data structure used to describe the combination mode of the static computing unit.
  • Each item in the table corresponds to a call of the static computing unit and the position of the data to be calculated by the static computing unit to be called in the target data.
  • the data parameters in the table entry mainly include:
• the data length of the target data in each dimension: for example, data length 0 in Figure 11 is the data length of the target data in the width dimension, and data length 1 is its data length in the length dimension.
• the offset, in each dimension, of the data to be calculated by the static computing unit to be called: for example, offset 0 in Figure 11 is the offset in the width dimension, and offset 1 is the offset in the length dimension.
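For illustration only, a table entry carrying the fields just described might be modeled as follows (the field and unit names are hypothetical, not the patent's actual layout):

```python
from dataclasses import dataclass

@dataclass
class TableEntry:
    unit_id: str      # which static computing unit this entry calls
    lengths: tuple    # data length of the unit's block, per dimension
    offsets: tuple    # position of the block in the target data, per dimension

# One entry: a 16x16 unit whose block starts at width 0, length 16.
entry = TableEntry("conv_16x16", lengths=(16, 16), offsets=(0, 16))
assert entry.offsets[1] == 16   # offset 1: position in the length dimension
```

A full parameter table is then simply an ordered list of such entries, one per static-unit call.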
  • the parameter table is generated by the combinatorial algorithm.
• in 2D, the combination algorithm is equivalent to the splicing of rectangular blocks; in 3D, it is equivalent to the splicing of cubes.
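As an illustration of the 2D case, a uniform tiling produces one parameter-table entry per static-unit call (a sketch assuming identical 16x16 units; a real combination algorithm may mix unit sizes for an optimal splicing):

```python
import math

def tile_2d(height, width, uh=16, uw=16):
    # Emit one (offset_y, offset_x) entry per static-unit call; the
    # entries in the last row/column cover the tail data, handled by
    # one of the tail-processing modes described above.
    entries = []
    for ty in range(math.ceil(height / uh)):
        for tx in range(math.ceil(width / uw)):
            entries.append((ty * uh, tx * uw))
    return entries

# A 40x33 target needs ceil(40/16) * ceil(33/16) = 3 * 3 = 9 calls.
assert len(tile_2d(40, 33)) == 9
```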
• parameter tables can be binned data tables, because a static computing unit can support a range of data lengths when processing tail data. Exemplarily, as shown in Figure 12, the three examples can support data shapes with a certain range of variation, so it is not necessary to generate a parameter table for each data shape; the binned parameter table form can effectively reduce the number of parameter tables in the cache.
• the variation range on the left in Figure 12 refers to a global change, for example: with a data length of 16 in a certain dimension, the global variation range can be 0 to 16. The variation range in the middle refers to a partial change, for example: with a data length of 100 in a certain dimension, the partial variation range can be 90 to 100. The variation range on the right means that only the tail data changes.
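Selecting a binned table for a given length is then a range lookup (the bin boundaries and table names below are made up for illustration; they loosely mirror the two examples just given):

```python
# Hypothetical bins: (low, high, table name). One table serves every
# data length inside its range, so far fewer tables need to be cached.
BINS = [(0, 16, "table_global_0_16"),
        (90, 100, "table_partial_90_100")]

def pick_table(length):
    for low, high, name in BINS:
        if low <= length <= high:
            return name
    return None   # no bin covers this length: a new table is needed

assert pick_table(12) == "table_global_0_16"
assert pick_table(95) == "table_partial_90_100"
```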
  • the parameter table corresponding to the data shape that the AI network needs to support is automatically generated and does not require user configuration.
• the features of the parameter table may include: 1) it is a data structure that can describe the calling sequence of the static computing units; 2) it can also describe the position of the data block calculated by each static computing unit in the original data (that is, the position of the data block with the second data shape corresponding to each static computing unit within the data block with the first data shape); 3) it can describe the combination mode of the static computing units and control them to complete the calculation of all the data.
  • the static computing unit may be a binary static computing unit. That is, the static computing unit can be compiled into a binary file before the version is released, and is provided to the user in the form of a binary file when the version is released.
• the features of this method may include: 1) one call can only complete the calculation of one piece of data; 2) it supports extracting a piece of data from the original data space by skip reading, and writing it to the corresponding position in the output data space by skip writing after the calculation is completed; 3) it supports data-driven calculation: which static computing unit to call and which piece of data to process can be imported in the form of data through the parameter table.
  • FIG 13 is a schematic diagram of a fusion operator.
  • a fusion operator refers to cascaded operators of different types, which are fused into one operator for one-time calculation.
  • the static computing unit 4016 may contain static computing units of different types of operators, which are distinguished by IDs when called. Then the parameter table 4015 is used to describe the calling sequence and related parameters of the static computing units of different types of operators, so that the calculation of the fusion operator can be completed.
  • the static computing unit 4016 includes the static computing unit of the conv operator, the static computing unit of the relu operator, the static computing unit of the abs operator, the static computing unit of the exp operator, etc., and is distinguished by ID when calling;
  • the parameter table 4015 describes the calling sequence and related parameters of the static computing unit of the conv operator, the static computing unit of the relu operator, the static computing unit of the abs operator, and the static computing unit of the exp operator.
  • the relevant parameters may include the size of the second data shape supported by each static computing unit; and may also include position information of the data of the second data shape supported by each static computing unit in the data having the first data shape.
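The ID-based dispatch of different operator types, driven by the calling sequence in the parameter table, might be sketched as follows (the unit IDs and the scalar stand-in functions are hypothetical placeholders, not the actual conv/relu/abs implementations):

```python
# Hypothetical static units keyed by ID; the parameter table supplies
# the calling sequence, so the fused result is produced in one pass
# without returning to the framework between operator types.
UNITS = {
    "conv": lambda x: x * 2,      # stand-in for a conv static unit
    "relu": lambda x: max(x, 0),  # stand-in for a relu static unit
    "abs":  lambda x: abs(x),     # stand-in for an abs static unit
}

def run_fused(call_sequence, x):
    for unit_id in call_sequence:   # order comes from the parameter table
        x = UNITS[unit_id](x)
    return x

assert run_fused(["conv", "relu"], -3) == 0   # conv(-3) = -6, relu -> 0
```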
  • Figure 14 is a schematic diagram of an operator calculation process.
• the parameter table is used as one of the input parameters of the static computing unit, and the static computing units are called in sequence according to the content of the parameter table to complete the operator calculation.
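The sequential calling just described can be sketched as a driver loop (an illustrative one-dimensional case; the stand-in unit, table layout, and block length are made up):

```python
# Sketch of the execution stage: the parameter table drives which
# slice each static-unit call reads, computes, and writes back.
def static_unit(block):
    # Stand-in fixed-shape unit: processes exactly one block per call.
    return [v + 1 for v in block]

def execute(table, data, unit_len=4):
    out = data[:]
    for offset in table:                  # one table entry per unit call
        out[offset:offset + unit_len] = static_unit(
            data[offset:offset + unit_len])
    return out

table = [0, 4]                            # two calls cover 8 elements
assert execute(table, [0] * 8) == [1] * 8
```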
  • FIG. 15 is a schematic diagram of an operator calculation process.
• the calculation process schematic diagram 1 on the left shows the operator calculation process in a three-dimensional single-input scenario; the calculation process schematic diagram 2 on the right shows an operator with two inputs, one two-dimensional input data and one one-dimensional input data.
  • Figure 16 is a schematic diagram of an operator calculation process.
• the simplified combination algorithm means that only the same static computing unit is used, so the combination algorithm takes the shortest time; the optimized combination algorithm refers to the optimal combination of different static computing units, so the combination algorithm takes longer.
  • FIG. 17 is a schematic flowchart of an operator calculation method provided by an embodiment of the present application.
  • the operator calculation method can be used for AI network.
  • the operator calculation method may include the following steps:
• S171: Acquire parameter data of a first data shape of the AI network, where the first data shape is the data length, in each dimension, that the AI network supports processing; the parameter data includes combination information of at least two computing units; the data that each computing unit supports processing is data with a second data shape; and, after the computing units are combined according to the combination information, the data length of the second data shapes of the combined computing units in any dimension is greater than or equal to the data length of the first data shape in the same dimension.
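The coverage condition just stated, that the combined unit lengths per dimension must reach at least the first data shape's length in that dimension, can be checked directly (a minimal sketch; real combination information is richer than per-dimension totals):

```python
def covers(first_shape, combined_lengths):
    # combined_lengths[d] is the total length the combined second data
    # shapes span in dimension d; coverage requires it to be >= the
    # first data shape's length in the same dimension.
    return all(c >= f for f, c in zip(first_shape, combined_lengths))

# Three 16-long units per dimension over a 40 x 33 target: 48 >= 40
# and 48 >= 33, so the combination covers the first data shape.
assert covers((40, 33), (48, 48))
assert not covers((40, 33), (32, 48))   # 32 < 40: dimension 0 uncovered
```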
• the at least two computing units in the parameter data may include the same computing unit, or different computing units, or both the same and different computing units; the second data shapes of the same computing unit have the same data length in each dimension, while the second data shapes of different computing units have different data lengths in at least one dimension.
  • At least two computing units in the parameter data may both be computing units of the AI network.
  • the combination information of at least two calculation units in the parameter data may include a combination mode, so that the data length of the second data shape of each calculation unit in any dimension after being combined according to the combination mode is greater than or equal to the first data shape in Length of data in the same dimension.
• the parameter data may also include identification information of a specified computing unit, where the specified computing unit refers to a computing unit, among the at least two computing units, whose data to be processed is data having a third data shape, and the data length of the third data shape in at least one dimension is smaller than the data length, in the same dimension, of the second data shape that the specified computing unit supports processing.
  • the data of the third data shape may be the tail data involved in FIG. 10 .
  • the parameter data may further include a specified processing method of the specified calculation unit for the data having the third data shape.
• the specified processing method may include: discarding invalid data, where the invalid data is the data, other than the data having the third data shape, in the second data shape that the specified computing unit supports processing; or data overlapping, where the data overlapping is overlapping the invalid data with the data to be processed by another computing unit.
  • discarding invalid data may be method 1 involved in FIG. 10 ; data overlapping may be method 2 involved in FIG. 10 .
  • the parameter data may further include a specified calculation unit to support a specified variation range of the third data shape in each dimension.
  • the specified variation range may be the data length in each dimension of the second data shape supported by the specified computing unit for processing; or a specified part length of the data length in each dimension of the second data shape.
  • the specified variation range may be the way 3 and the way 4 involved in FIG. 10 .
  • the parameter data includes binned parameter data, which is used to support the data shape of the specified variation range.
• the three examples can support data shapes with a certain range of variation, so it is not necessary to generate a parameter table for each data shape; the binned parameter table form can effectively reduce the number of parameter tables in the cache.
• the at least two computing units in the parameter data can belong to different types of operators, so that the calculation of a fusion operator can be completed by calling the computing units of these different types of operators according to the parameter data. This avoids calling another type of operator for calculation after one type of operator finishes, thereby improving the calculation efficiency of the fusion operator.
  • the static computing unit 4016 includes the static computing unit of the conv operator, the static computing unit of the relu operator, the static computing unit of the abs operator, the static computing unit of the exp operator, etc. Distinguish by ID; parameter table 4015 describes the calling sequence and related parameters of the static computing unit of the conv operator, the static computing unit of the relu operator, the static computing unit of the abs operator, and the static computing unit of the exp operator.
• S172: Invoke the at least two computing units to calculate the first target data having the first data shape.
  • At least two calculation units may be obtained from the calculation unit operator library; the first target data having the first data shape is calculated by the at least two calculation units.
  • the execution host 4001 may import the static computing unit binary package 4020 published by the compiling host 4002, so that at least two computing units may be obtained from the static computing unit binary package 4020.
  • the calculation unit calculates the first target data having the first data shape.
• the target position, in the first target data, of the second target data to be processed by any computing unit can be determined; according to the target position, the second target data to be processed by that computing unit is obtained from the memory space where the first target data is stored; and the second target data is calculated by that computing unit.
• the left side shows a logical segmentation of the data shape of the target data processed by the AI network (it could also be divided into data shapes of different sizes); each data shape obtained by logical segmentation corresponds to one static computing unit call; and the right side shows the data structure of the parameter table.
  • the parameter table is a data structure used to describe the combination mode of static computing units.
  • Each item in the table corresponds to a call of a static computing unit and the position of the data to be calculated by the static computing unit to be called in the target data.
  • the data to be computed is obtained from its position, within the target data, of the data to be computed by the static computing unit to be invoked, and that static computing unit is invoked to compute the obtained data.
  • an embodiment of the present application further provides an operator computing device, wherein the operator computing device is used in an AI network.
  • FIG. 18 is a schematic structural diagram of an operator computing device provided by an embodiment of the present application. As shown in FIG. 18, the operator computing device includes:
  • the obtaining module 181 is configured to obtain parameter data of a first data shape of the AI network, where the first data shape is the data length in each dimension that the AI network supports processing, and the parameter data includes combination information for at least two computing units; the data that each computing unit supports processing is data with a second data shape, and the data length, in any dimension, of the second data shapes of the computing units after combination according to the combination information is greater than or equal to the data length of the first data shape in the same dimension;
  • the calculation module 182 is configured to invoke the at least two calculation units to calculate the first target data having the first data shape.
  • the at least two computing units include the same computing unit; or different computing units; or the same computing unit and different computing units;
  • the second data shape of the same computing unit has the same data length in each dimension; the second data shape of different computing units has different data lengths in at least one dimension.
  • the at least two computing units are both computing units of an AI network.
  • the combination information includes a combination mode of the at least two computing units;
  • after the second data shapes of the computing units are combined according to the combination mode, the data length in any dimension is greater than or equal to the data length of the first data shape in the same dimension.
  • the parameter data further includes identification information for a designated computing unit;
  • the designated computing unit refers to a computing unit, among the at least two computing units, whose data to be processed is data with a third data shape, where the data length of the third data shape in at least one dimension is smaller than the data length, in the same dimension, of the second data shape that the designated computing unit supports processing.
  • the parameter data further includes a specified processing manner of the specified calculation unit for the data having the third data shape.
  • the specified processing manner includes: discarding invalid data, where the invalid data is the data in the second data shape that the designated computing unit supports processing other than the data having the third data shape; or
  • data overlap, where the invalid data is overlapped with the data to be processed by another computing unit.
  • the parameter data further includes that the specified calculation unit supports a specified variation range of the third data shape in each dimension.
  • the specified variation range is the data length in each dimension of the second data shape supported by the specified computing unit to process; or the data in each dimension of the second data shape The specified part of the length in length.
  • the parameter data includes binning parameter data, and the binning parameter data is used to support a data shape of a specified variation range.
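The binning idea above — one set of parameter data serving every shape inside a supported variation range — can be sketched as a lookup that matches a runtime shape against per-dimension ranges. This is a minimal illustration only; the bin names, the dictionary layout, and the first-match strategy are assumptions for the example, not structures defined by the patent.

```python
def match_bin(shape, bins):
    """Pick the first bin whose per-dimension (lo, hi) ranges cover the
    runtime shape; one bin's parameter data then serves a whole range
    of data shapes instead of one fixed shape."""
    for bin_id, ranges in bins.items():
        if all(lo <= d <= hi for d, (lo, hi) in zip(shape, ranges)):
            return bin_id
    return None

# One hypothetical bin covers widths 90..100 at height exactly 16,
# another covers anything up to 16x16 (full-range variation).
bins = {"tail90_100": [(16, 16), (90, 100)], "full16": [(0, 16), (0, 16)]}
print(match_bin((16, 96), bins))  # → 'tail90_100'
print(match_bin((8, 8), bins))    # → 'full16'
```

A cache keyed by bin rather than by exact shape is what keeps the number of stored parameter tables small.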
  • the computing module 182 includes:
  • a first obtaining submodule configured to obtain the at least two computing units from a computing unit operator library
  • the first calculation sub-module is configured to calculate the first target data having the first data shape by the at least two calculation units.
  • the computing module 182 includes:
  • a determination submodule configured to determine, for any computing unit, a target position in the first target data of the second target data to be processed in the any computing unit;
  • a second acquisition sub-module configured to acquire the second target data that needs to be processed by any of the computing units from the memory space in which the first target data is stored according to the target location;
  • the second calculation sub-module is configured to calculate the second target data by using any of the calculation units.
  • the target position includes: each dimension where the second target data is located; and, for any dimension, the offset of the second target data in the any dimension and Data length.
  • the at least two computing units belong to different types of operators.
  • the computing unit is a pre-compiled operator.
  • FIG. 19 is a schematic structural diagram of an operator computing apparatus provided by an embodiment of the present application. As shown in FIG. 19 , the operator computing apparatus provided by the embodiment of the present application can be used to implement the method described in the foregoing method embodiment.
  • the operator computing device includes at least one processor 1601, and the at least one processor 1601 can support the operator computing device in implementing the methods provided in the embodiments of this application.
  • the processor 1601 may be a general purpose processor or a special purpose processor.
  • the processor 1601 may include a central processing unit (CPU) and/or a baseband processor.
  • the baseband processor may be used for processing communication data (for example, determining a target screen terminal), and the CPU may be used for implementing corresponding control and processing functions, executing software programs, and processing data of software programs.
  • the operator computing apparatus may further include a transceiving unit 1605 to implement signal input (reception) and output (send).
  • the transceiver unit 1605 may include a transceiver or a radio frequency chip.
  • Transceiver unit 1605 may also include a communication interface.
  • the operator computing device may further include an antenna 1606, which may be used to support the transceiver unit 1605 to implement the transceiver function of the operator computing device.
  • the operator computing device may include one or more memories 1602 on which programs (or instructions or code) 1604 are stored, and the programs 1604 may be executed by the processor 1601, causing the processor 1601 to perform the methods described in the foregoing method embodiments.
  • data may also be stored in the memory 1602 .
  • the processor 1601 may also read data stored in the memory 1602 (for example, pre-stored first feature information); the data may be stored at the same storage address as the program 1604 or at a different storage address.
  • the processor 1601 and the memory 1602 can be provided separately, or can be integrated together, for example, integrated on a single board or a system on chip (system on chip, SOC).
  • an embodiment of the present application further provides an operator computing device, where the operator computing device includes any operator computing device provided in the foregoing embodiments.
  • the operator computing device may be a terminal device such as a mobile phone, a tablet computer, a digital camera, a personal digital assistant (PDA), a wearable device, a smart TV, and a Huawei smart screen.
  • exemplary embodiments of terminal devices include, but are not limited to, terminal devices equipped with iOS, android, Windows, Harmony OS or other operating systems.
  • the above-mentioned terminal device may also be other terminal devices, such as a laptop or the like with a touch-sensitive surface (eg, a touch panel).
  • the embodiment of the present application does not specifically limit the type of the terminal device.
  • the component structure diagram of the terminal device is shown in FIG. 6 .
  • the embodiments of the present application further provide an operator computing system, an operator computing device, and an operator compiling device; wherein, the operator computing device includes the operator computing devices provided in the foregoing embodiments.
  • the operator compiling device is used to compile a releasable static computing unit binary package; the operator computing device is used to import the static computing unit binary package.
  • the operator computing device may be the execution host 4001 in FIG. 5 or the terminal device in FIG. 6
  • the operator compiling device may be the compilation host 4002 in FIG. 5 .
  • an embodiment of the present application further provides a chip.
  • FIG. 20 is a schematic structural diagram of a chip provided by an embodiment of the present application.
  • the chip 1900 includes one or more processors 1901 and an interface circuit 1902 .
  • the chip 1900 may further include a bus 1903, where:
  • the processor 1901 may be an integrated circuit chip with signal processing capability. In the implementation process, each step of the above-mentioned method can be completed by an integrated logic circuit of hardware in the processor 1901 or an instruction in the form of software.
  • the above-mentioned processor 1901 may be a general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
  • the interface circuit 1902 can be used to send or receive data, instructions or information.
  • the processor 1901 can use the data, instructions or other information received by the interface circuit 1902 to process, and can send the processing completion information through the interface circuit 1902.
  • the chip further includes a memory, which may include a read-only memory and a random access memory, and provides operation instructions and data to the processor.
  • a portion of the memory may also include non-volatile random access memory (NVRAM).
  • the memory stores executable software modules or data structures
  • the processor may execute corresponding operations by calling operation instructions stored in the memory (the operation instructions may be stored in the operating system).
  • the interface circuit 1902 can be used to output the execution result of the processor 1901 .
  • processor 1901 and the interface circuit 1902 can be realized by hardware design, software design, or a combination of software and hardware, which is not limited here.
  • processor in the embodiments of the present application may be a central processing unit (central processing unit, CPU), and may also be other general-purpose processors, digital signal processors (digital signal processors, DSP), application-specific integrated circuits (application specific integrated circuit, ASIC), field programmable gate array (field programmable gate array, FPGA) or other programmable logic devices, transistor logic devices, hardware components or any combination thereof.
  • a general-purpose processor may be a microprocessor or any conventional processor.
  • the method steps in the embodiments of the present application may be implemented in a hardware manner, or may be implemented in a manner in which a processor executes software instructions.
  • Software instructions can be composed of corresponding software modules, and software modules can be stored in random access memory (RAM), flash memory, read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), registers, a hard disk, a removable hard disk, a CD-ROM, or any other form of storage medium known in the art.
  • An exemplary storage medium is coupled to the processor, such that the processor can read information from, and write information to, the storage medium.
  • the storage medium can also be an integral part of the processor.
  • the processor and storage medium may reside in an ASIC.
  • the above-mentioned embodiments it may be implemented in whole or in part by software, hardware, firmware or any combination thereof.
  • software it can be implemented in whole or in part in the form of a computer program product.
  • the computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, all or part of the processes or functions described in the embodiments of the present application are generated.
  • the computer may be a general purpose computer, special purpose computer, computer network, or other programmable device.
  • the computer instructions may be stored in or transmitted over a computer-readable storage medium.
  • the computer instructions can be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or wirelessly (for example, infrared, radio, or microwave).
  • the computer-readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server, data center, etc. that includes an integration of one or more available media.
  • the usable media may be magnetic media (eg, floppy disks, hard disks, magnetic tapes), optical media (eg, DVDs), or semiconductor media (eg, solid state disks (SSDs)), and the like.

Abstract

An operator computation method, apparatus, device, and system in the field of artificial intelligence. The method includes: obtaining parameter data for a first data shape of an AI network, where the first data shape is the data length in each dimension that the AI network supports processing, the parameter data includes combination information for at least two computing units, the data each computing unit supports processing is data having a second data shape, and after the second data shapes of the computing units are combined according to the combination information, the data length in any dimension is greater than or equal to the data length of the first data shape in the same dimension (S171); and invoking the at least two computing units to compute first target data having the first data shape (S172). By combining at least two computing units, the method supports data-shape variation over an arbitrary range, implements dynamic-shape AI operators, and improves AI network startup speed.

Description

Operator computation method, apparatus, device, and system
This application claims priority to Chinese Patent Application No. 2020113019355, entitled "Operator computation method, apparatus, device, and system", filed with the China National Intellectual Property Administration on November 19, 2020, which is incorporated herein by reference in its entirety.
Technical Field
This application relates to the field of computer technologies, and in particular to an operator computation method, apparatus, device, and system.
Background
AI (Artificial Intelligence) is the theory, method, technology, and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results. In other words, artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and produce new intelligent machines that respond in a manner similar to human intelligence. It studies the design principles and implementation methods of intelligent machines so that machines can perceive, reason, and make decisions. Research in the field of artificial intelligence includes robotics, natural language processing, computer vision, decision-making and reasoning, human-computer interaction, recommendation and search, basic AI theory, and so on.
At present, with the continuous development of computer technology, AI networks are also widely used. AI networks are becoming more and more complex, and the types of AI operators in them are increasing; even for a single type of AI operator, the number of data shapes it must support keeps growing. If a new AI operator must be recompiled for every different data shape, compilation becomes increasingly time-consuming and AI network startup slows down.
Summary
Embodiments of this application provide an operator computation method, apparatus, device, and system that support data-shape variation over an arbitrary range by combining at least two computing units, thereby implementing dynamic-shape AI operators and improving AI network startup speed.
According to a first aspect, an embodiment of this application provides an operator computation method, the method including:
obtaining parameter data for a first data shape of an AI network, where the first data shape is the data length in each dimension that the AI network supports processing, the parameter data includes combination information for at least two computing units, the data each computing unit supports processing is data having a second data shape, and after the second data shapes of the computing units are combined according to the combination information, the data length in any dimension is greater than or equal to the data length of the first data shape in the same dimension; and
invoking the at least two computing units to compute first target data having the first data shape.
In other words, for a first data shape that the AI network supports processing, instead of recompiling an operator, parameter data for the first data shape is obtained; the parameter data includes combination information for at least two computing units, and these computing units are invoked to compute the first target data having the first data shape. This avoids having to recompile an AI operator for every different first data shape. Instead, combining at least two computing units supports data-shape variation over an arbitrary range, implements dynamic-shape AI operators, and improves AI network startup speed.
There may be one or more first data shapes in the method. Because a first data shape is in fact an attribute of an operator, namely the data length in each dimension that each of one or more operators in the AI network supports processing, the number of first data shapes may be one or more. The multiple operators may be of the same type or of different types. For operators of the same type, different supported data shapes lead to multiple first data shapes; for operators of different types, the differing operator types likewise lead to multiple first data shapes.
The second data shape is the data length in each dimension that a computing unit supports processing.
The relationship between the second data shapes and the first data shape is: after the second data shapes of the computing units are combined according to the combination information, the data length in any dimension is greater than or equal to the data length of the first data shape in the same dimension.
For example, suppose the second data shape has three dimensions: length, width, and height. After the second data shapes of the computing units are combined according to the combination information, the combined data length in the length dimension is greater than or equal to that of the first data shape in the length dimension, and likewise for the width and height dimensions. A computing unit in this method can be regarded as an operator; it may be an operator of the AI network or a component of an operator. The combination information in this method may include a combination mode of the at least two computing units. For example, if the data length of the first data shape in some dimension is 11, the combination mode of the at least two computing units may be a unit of data length 5 + a unit of data length 5 + a unit of data length 5, or a unit of data length 5 + a unit of data length 5 + a unit of data length 1.
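A minimal sketch of such a one-dimensional decomposition, where the combined unit lengths must reach at least the target length. This is an illustrative greedy strategy only; the function name and the largest-first policy are assumptions for the example, not the combination algorithm defined by the patent.

```python
def combine_lengths(target_len, unit_lens):
    """Greedily cover a dimension of length `target_len` with available
    static-unit lengths (largest first); if a tail remains, append the
    smallest unit so the combined length meets or exceeds the target."""
    picks = []
    covered = 0
    for u in sorted(unit_lens, reverse=True):
        while covered + u <= target_len:
            picks.append(u)
            covered += u
    if covered < target_len:          # tail remains: overshoot with the smallest unit
        picks.append(min(unit_lens))
        covered += min(unit_lens)
    return picks

# For a dimension of length 11 with units of lengths 5 and 1:
print(combine_lengths(11, [5, 1]))   # → [5, 5, 1]
# With only a length-5 unit available, the combination overshoots:
print(combine_lengths(11, [5]))      # → [5, 5, 5]
```

The overshooting case corresponds to the tail-data handling (discard or overlap) described in the implementations below.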
The parameter data in the method may be stored in a cache in the form of a parameter table.
In a possible implementation, the at least two computing units include identical computing units, or different computing units, or both identical and different computing units;
where identical computing units have second data shapes with the same data length in every dimension, and different computing units have second data shapes that differ in data length in at least one dimension.
That is, in this implementation, whether computing units are identical or different can be determined by whether their data lengths are the same in every dimension.
In a possible implementation, the at least two computing units are all computing units of the AI network.
That is, in this implementation, first target data having the first data shape supported by the AI network can be computed by invoking at least two computing units of the AI network.
The at least two computing units in this implementation may be computing units of the AI network, or computing units of networks other than the AI network. The AI network and the other networks may implement different functions, such as object detection, image classification, audio processing, or natural language processing.
AI networks and other networks implementing different functions may include the same computing units, for example both include a convolution computing unit; or different computing units, for example the AI network lacks a convolution computing unit while another network includes one. Illustratively, if the AI network does not include a convolution computing unit but another network does, the AI network can use the other network's convolution computing unit when it needs one.
In a possible implementation, the combination information includes a combination mode of the at least two computing units;
after the second data shapes of the computing units are combined according to the combination mode, the data length in any dimension is greater than or equal to the data length of the first data shape in the same dimension.
That is, in this implementation, the relationship between the second data shapes and the first data shape is: after combination according to a certain combination mode, the data length in any dimension is greater than or equal to that of the first data shape in the same dimension.
In a possible implementation, the parameter data further includes identification information for a designated computing unit;
where the designated computing unit is a computing unit, among the at least two computing units, whose data to be processed has a third data shape, and the data length of the third data shape in at least one dimension is smaller than the data length, in the same dimension, of the second data shape that the designated computing unit supports processing.
That is, in this implementation, for a designated computing unit that needs to process data of the third data shape, identification information can be added to the parameter data so that the unit can later be invoked to compute the data having the third data shape, improving the accuracy of operator computation.
In a possible implementation, the parameter data further includes a designated processing manner of the designated computing unit for the data having the third data shape.
That is, in this implementation, a designated processing manner can also be recorded for the designated computing unit in the parameter data, so that the unit is later invoked to compute the third-shape data in the designated manner.
In a possible implementation, the designated processing manner includes:
discarding invalid data, where the invalid data is the data in the second data shape that the designated computing unit supports processing other than the data having the third data shape; or
data overlap, where the invalid data is overlapped with data to be processed by another computing unit.
That is, in this implementation, the designated processing manner may be discarding invalid data or data overlap, so that the third-shape data is later computed according to that manner; this enriches the ways operator computation can be implemented and improves its reliability.
In a possible implementation, the parameter data further includes a designated variation range, in each dimension, of the third data shape supported by the designated computing unit.
That is, in this implementation, because the second data shape a designated computing unit supports processing is fixed while the third data shape it must process can vary within a certain range, the designated variation range of the third data shape in each dimension can be added to the parameter data, so that a single computing unit can support data-shape variation within a certain range.
In a possible implementation, the designated variation range is the data length in each dimension of the second data shape that the designated computing unit supports processing; or a designated portion of that data length in each dimension.
That is, in this implementation, different variation ranges can be chosen according to the actual situation. When the data length in a dimension is small, variation over the whole length can be supported: for example, a data length of 16 can vary from 0 to 16. When the data length in a dimension is large, variation can be limited to a small range at the tail of the length: for example, a data length of 100 can vary from 90 to 100. This preserves computation efficiency and avoids a large amount of repeated computation.
In a possible implementation, the parameter data includes binned parameter data, which is used to support data shapes within a designated variation range.
That is, in this implementation, different first data shapes can share the same parameter data, namely the binned parameter data, so separate parameter data is not needed for every different data shape; this effectively reduces the amount of parameter data in the cache and avoids wasting resources.
In a possible implementation, invoking the at least two computing units to compute the first target data having the first data shape includes:
obtaining the at least two computing units from a computing-unit operator library; and
computing, by the at least two computing units, the first target data having the first data shape.
That is, in this implementation, the computing-unit operator library may include many precompiled computing units, which can be fetched directly when performing operator computation, improving the efficiency of operator computation and AI network startup speed.
The computing units in the library may implement different operations, such as convolution, addition, and matrix multiplication. These computing units can be used by multiple AI networks, which may implement different functions such as object detection, image classification, audio processing, or natural language processing.
In a possible implementation, invoking the at least two computing units to compute the first target data having the first data shape includes:
for any computing unit, determining a target position, within the first target data, of second target data to be processed by that computing unit;
obtaining, according to the target position, the second target data to be processed by that computing unit from the memory space in which the first target data is stored; and
computing the second target data by that computing unit.
That is, in this implementation, during operator computation, the target position of the second target data within the first target data is determined first, the second target data is then fetched from memory according to that position, and the computing unit completes the computation of the second target data, improving the reliability of operator computation.
The memory space in this implementation refers to the storage space in memory used to hold data, whose addresses are one-dimensional. Because the second target data may be multidimensional, skip-read/skip-write is used to fetch the second target data from memory, and after computation skip-read/skip-write is used again to store the computed output data back into memory.
It is worth noting that, when determining the target position of the second target data within the first target data, if the parameter data includes position information of each second data shape within the first data shape, the target position can be determined from that position information.
In a possible implementation, the target position includes: each dimension in which the second target data lies; and, for any dimension, the offset and data length of the second target data in that dimension.
That is, in this implementation, because the second target data may be multidimensional, the target position needs to include the dimensions the second target data lies in as well as its offset and data length in each dimension, improving the accuracy and efficiency of data retrieval.
In a possible implementation, the at least two computing units belong to operators of different types.
That is, in this implementation, the at least two computing units in the parameter data may belong to operators of the same type, implementing the same function; or to operators of different types, implementing different functions, such as a convolution operator, an add operator, or a matmul (matrix multiplication) operator.
The different operator types in this implementation may be the operators cascaded into a fusion operator. A fusion operator fuses cascaded operators of different types into a single operator computed in one pass. In this case, the at least two computing units in the parameter data may be the computing units of these different operator types, for example the computing units of the conv, relu, abs, and exp operators. The fusion operator can then be computed by invoking these computing units from the parameter data, avoiding finishing one type of operator before invoking another type, which improves the computation efficiency of the fusion operator.
In a possible implementation, the computing units are precompiled operators.
That is, in this implementation, a computing unit in the method can be regarded as an operator, specifically a precompiled one. For example, the computing-unit operator library may include many precompiled computing units that can be fetched directly during operator computation, improving the efficiency of operator computation and AI network startup speed.
The precompiled operators in this implementation may be a releasable static-computing-unit binary package compiled on a compilation host, which every execution host simply imports; or the execution host may precompile many computing units and store them in a cache, from which they are fetched directly during computation, likewise improving computation efficiency and AI network startup speed. It is worth noting that "static" in "static computing unit" means the data shape the unit supports processing is fixed, so the precompiled unit can be used directly for operator computation without recompilation.
According to a second aspect, an embodiment of this application provides an operator computation apparatus, the apparatus including:
an obtaining module configured to obtain parameter data for a first data shape of an AI network, where the first data shape is the data length in each dimension that the AI network supports processing, the parameter data includes combination information for at least two computing units, the data each computing unit supports processing is data having a second data shape, and after the second data shapes of the computing units are combined according to the combination information, the data length in any dimension is greater than or equal to the data length of the first data shape in the same dimension; and
a computing module configured to invoke the at least two computing units to compute first target data having the first data shape.
In a possible implementation, the at least two computing units include identical computing units, or different computing units, or both identical and different computing units;
where identical computing units have second data shapes with the same data length in every dimension, and different computing units have second data shapes differing in data length in at least one dimension.
In a possible implementation, the at least two computing units are all computing units of the AI network.
In a possible implementation, the combination information includes a combination mode of the at least two computing units;
after the second data shapes of the computing units are combined according to the combination mode, the data length in any dimension is greater than or equal to that of the first data shape in the same dimension.
In a possible implementation, the parameter data further includes identification information for a designated computing unit;
where the designated computing unit is a computing unit, among the at least two computing units, whose data to be processed has a third data shape, and the data length of the third data shape in at least one dimension is smaller than the data length, in the same dimension, of the second data shape that the designated computing unit supports processing.
In a possible implementation, the parameter data further includes a designated processing manner of the designated computing unit for the data having the third data shape.
In a possible implementation, the designated processing manner includes:
discarding invalid data, where the invalid data is the data in the second data shape that the designated computing unit supports processing other than the data having the third data shape; or
data overlap, where the invalid data is overlapped with data to be processed by another computing unit.
In a possible implementation, the parameter data further includes a designated variation range, in each dimension, of the third data shape supported by the designated computing unit.
In a possible implementation, the designated variation range is the data length in each dimension of the second data shape that the designated computing unit supports processing; or a designated portion of that data length in each dimension.
In a possible implementation, the parameter data includes binned parameter data, which is used to support data shapes within a designated variation range.
In a possible implementation, the computing module includes:
a first obtaining submodule configured to obtain the at least two computing units from a computing-unit operator library; and
a first computing submodule configured to compute, by the at least two computing units, the first target data having the first data shape.
In a possible implementation, the computing module includes:
a determining submodule configured to determine, for any computing unit, a target position, within the first target data, of second target data to be processed by that computing unit;
a second obtaining submodule configured to obtain, according to the target position, the second target data from the memory space in which the first target data is stored; and
a second computing submodule configured to compute the second target data by that computing unit.
In a possible implementation, the target position includes: each dimension in which the second target data lies; and, for any dimension, the offset and data length of the second target data in that dimension.
In a possible implementation, the at least two computing units belong to operators of different types.
In a possible implementation, the computing units are precompiled operators.
According to a third aspect, an embodiment of this application provides an operator computation apparatus, including:
at least one memory configured to store a program; and
at least one processor configured to execute the program stored in the memory, where when the program stored in the memory is executed, the processor is configured to perform the method in the first aspect.
According to a fourth aspect, an embodiment of this application provides an operator computation device, including the apparatus provided in the second or third aspect.
According to a fifth aspect, an embodiment of this application provides an operator computation system, including the operator computation device provided in the fourth aspect and an operator compilation device;
where the operator computation device includes the apparatus provided in the second or third aspect;
the operator compilation device is configured to compile a releasable computing-unit package; and
the operator computation device is configured to import the computing-unit package.
According to a sixth aspect, an embodiment of this application provides a computer storage medium storing instructions that, when run on a computer, cause the computer to perform the method provided in the first aspect.
According to a seventh aspect, an embodiment of this application provides a computer program product including instructions that, when run on a computer, cause the computer to perform the method provided in the first aspect.
According to an eighth aspect, an embodiment of this application provides a chip, including at least one processor and an interface;
the interface is configured to provide program instructions or data to the at least one processor; and
the at least one processor is configured to execute the program instructions to implement the method provided in the first aspect.
This application discloses an operator computation method, apparatus, device, and system. Parameter data for a first data shape of an AI network is obtained, where the first data shape is the data length in each dimension that the AI network supports processing, the parameter data includes combination information for at least two computing units, the data each computing unit supports processing is data having a second data shape, and after the second data shapes of the computing units are combined according to the combination information, the data length in any dimension is greater than or equal to that of the first data shape in the same dimension; the at least two computing units are then invoked to compute first target data having the first data shape. In this way, combining at least two computing units supports data-shape variation over an arbitrary range, implementing dynamic-shape AI operators and improving AI network startup speed.
Brief Description of Drawings
FIG. 1 is a schematic diagram of an artificial intelligence framework;
FIG. 2 is a schematic diagram of a system architecture for operator computation;
FIG. 3 is a schematic diagram of data-shape variation;
FIG. 4 is a schematic diagram of an operator computation process;
FIG. 5 is a schematic diagram of a system architecture for operator computation;
FIG. 6 is a component structure diagram of a terminal device;
FIG. 7 is a hardware structure diagram of an AI chip;
FIG. 8 is a schematic diagram of a skip-read/skip-write scenario in an operator computation process;
FIG. 9 is a schematic diagram of a skip-read/skip-write support method;
FIG. 10 is a schematic diagram of tail-data processing methods;
FIG. 11 is a schematic diagram of a parameter table structure;
FIG. 12 is a schematic diagram of an application scenario of a binned data table;
FIG. 13 is a schematic diagram of a fusion operator;
FIG. 14 is a schematic diagram of an operator computation process;
FIG. 15 is a schematic diagram of an operator computation process;
FIG. 16 is a schematic diagram of an operator computation process;
FIG. 17 is a schematic flowchart of an operator computation method provided by an embodiment of this application;
FIG. 18 is a schematic structural diagram of an operator computation apparatus provided by an embodiment of this application;
FIG. 19 is a schematic structural diagram of an operator computation apparatus provided by an embodiment of this application;
FIG. 20 is a schematic structural diagram of a chip provided by an embodiment of this application.
Detailed Description
To make the objectives, technical solutions, and advantages of the embodiments of this application clearer, the technical solutions in the embodiments of this application are described below with reference to the accompanying drawings.
In the description of the embodiments of this application, words such as "exemplary", "for example", or "for instance" indicate an example, illustration, or explanation. Any embodiment or design described as "exemplary", "for example", or "for instance" should not be construed as preferred over, or advantageous to, other embodiments or designs; rather, these words are intended to present related concepts in a concrete manner.
In the description of the embodiments of this application, the term "and/or" merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may indicate: A alone, B alone, or both A and B. In addition, unless otherwise stated, "multiple" means two or more; for example, multiple systems means two or more systems, and multiple screen terminals means two or more screen terminals.
In addition, the terms "first" and "second" are used for descriptive purposes only and shall not be understood as indicating or implying relative importance or implicitly indicating the indicated technical features. Thus, features defined with "first" or "second" may explicitly or implicitly include one or more of the features. The terms "include", "comprise", "have", and their variants mean "including but not limited to" unless specifically emphasized otherwise.
FIG. 1 is a schematic diagram of an artificial intelligence framework; the framework describes the overall workflow of an artificial intelligence system and applies to general requirements of the AI field.
The framework is described below along two dimensions: the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis).
The "intelligent information chain" reflects the process from data acquisition to processing, for example the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, and intelligent execution and output. In this process, data undergoes a refinement of "data - information - knowledge - wisdom".
The "IT value chain" reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure and information (technologies for providing and processing it) of AI to the industrial ecology of the system.
(1) Infrastructure:
The infrastructure provides computing-power support for the AI system, enables communication with the external world, and is supported by a base platform. Communication with the outside is via sensors; computing power is provided by intelligent chips (hardware acceleration chips such as CPUs, NPUs, GPUs, ASICs, and FPGAs); the base platform includes distributed computing frameworks and networks and related platform assurance and support, and may include cloud storage and computing, interconnection networks, and so on. For example, sensors communicate with the outside to obtain data, and the data is provided to the intelligent chips in the distributed computing system provided by the base platform for computation.
(2) Data
Data at the layer above the infrastructure indicates the data sources of the AI field. The data involves graphics, images, speech, and text, as well as Internet of Things data of conventional devices, including business data of existing systems and perception data such as force, displacement, liquid level, temperature, and humidity.
(3) Data processing
Data processing usually includes data training, machine learning, deep learning, search, reasoning, decision-making, and the like.
Machine learning and deep learning can perform symbolic and formal intelligent information modeling, extraction, preprocessing, training, and so on of data.
Reasoning refers to the process of simulating human intelligent reasoning in a computer or intelligent system, performing machine thinking and problem solving with formal information according to reasoning control strategies; typical functions are search and matching.
Decision-making refers to the process of making decisions after intelligent information has been reasoned about, and usually provides functions such as classification, ranking, and prediction.
(4) General capabilities
After the data undergoes the data processing mentioned above, some general capabilities can further be formed based on the results of the data processing, for example an algorithm or a general system, such as translation, text analysis, computer vision processing, speech recognition, image recognition, and the like.
(5) Intelligent products and industry applications
Intelligent products and industry applications refer to the products and applications of AI systems in various fields; they encapsulate the overall AI solution, productize intelligent-information decision-making, and realize practical applications. The application fields mainly include smart manufacturing, smart transportation, smart home, smart healthcare, smart security, autonomous driving, safe cities, smart terminals, and so on.
It should be noted that the operator computation involved in this application belongs to the data processing stage in (3) above.
FIG. 2 is a schematic diagram of a system architecture for operator computation. As shown in FIG. 2, in the initialization stage 4012 of the AI network, the machine learning platform 4011 can parse out all the AI operators in the AI network and the data shapes each AI operator needs to support processing, and use the AI compiler 4018 to complete operator compilation; in the execution stage of the AI network (that is, the runtime engine 4014), the operators 4016 can be invoked and operator computation performed (that is, the execution module 4017). An AI operator refers to a unit module that implements a specific computation in the AI network, for example a convolution operator, an add operator, or a matmul (matrix multiplication) operator.
As can be seen, the more complex the AI network, the more types of AI operators it contains, and even a single type of AI operator must support ever more data shapes. If an AI operator must be recompiled for every different data shape, compilation becomes increasingly time-consuming and AI network startup slows down.
It should be noted that a data shape in this application refers to the data length, in each dimension, of the data an operator computes. The data shape may vary in one dimension or in multiple dimensions simultaneously. As shown in FIG. 3, the graphic data may vary in the length dimension alone, or in both the length and width dimensions.
To solve the above technical problem, this application provides an operator computation method, apparatus, device, and system that implement dynamic-shape AI operators by combining binary static computing units, can support data-shape variation over an arbitrary range, and improve AI network startup speed.
It should be noted that "static" in the static computing units of this application means the data shape a unit supports processing is fixed, so precompiled static computing units can be used directly for operator computation without recompilation.
The parameter table in this application describes, in tabular form, the parameter data for the first data shape supported by the AI network; the parameter data includes combination information for at least two computing units, and the data each computing unit supports processing is data having a second data shape.
Specific embodiments are described below.
FIG. 4 is a schematic diagram of an operator computation process. As shown in FIG. 4, the operator computation can be used in an AI network, which contains many operators, for example convolution, add, and matmul (matrix multiplication) operators. During AI network initialization, the operators are analyzed and the network is simplified to obtain the required operator types. For an AI network whose data shapes are variable at execution time, some operators must support variable data shapes, covering both the "shape fixed at execution" scenario on the left of FIG. 4 and the "shape variable at execution" scenario on the right. An AI network whose data shapes are fixed at execution time involves only the left-hand "shape fixed at execution" scenario. The static computing units in FIG. 4 can be stored in advance in a static-computing-unit operator library and fetched from the library when invoked to complete a computation. The parts of FIG. 4 are as follows:
Static computing unit: an operator unit that computes only a fixed data shape; such an operator unit is also equivalent to an operator. Each operator type may include several tuned static computing units for different data shapes, and different operator types implement different functions, for example convolution, add, and matmul (matrix multiplication) operators.
Parameter table: a data structure describing the combination mode of static computing units. Each static computing unit computes one block of data; combining multiple static computing units according to the parameters in the table completes the computation of all the data.
AI network initialization: the initialization stage, including analyzing the AI network's operator types.
AI network execution: the process of invoking operators and completing the computation. Operator computation is completed by combining the parameter table with the static computing units, the parameter table serving as an input parameter of the static computing units.
Shape fixed at execution: for operators whose data shape does not change, the parameter table can be generated during AI network initialization and saved in a cache, and fetched directly from the cache at AI network execution time.
Shape variable at execution: for operators whose data shape changes, the parameter table can be generated at AI network execution time, and the static computing units are then invoked according to the table to complete the computation. If a caching mechanism is used, subsequent uses of the table can fetch it from the cache instead of regenerating it on every use.
FIG. 5 is a schematic diagram of a system architecture for operator computation. The product form of this application is program code contained in an AI compiler and in machine learning / deep learning platform software and deployed on host hardware. Taking the application scenario shown in FIG. 5 as an example, the program code of this application resides in the static-computing-unit compilation module of the AI compiler, in the initialization module of the platform software, and in the runtime engine. At compile time, the program code of this application runs on the CPU of the compilation host; at run time, the static computing units 4016 of this application run on the AI chip of the execution host, which can carry the binary static computing units and the software program of the operator computation process provided by this application. FIG. 5 shows the implementation form of this application in the host AI compiler and platform software, in which the parts 4013, 4015, 4016, 4017, and 4019 shown in dashed boxes are modules newly added by this application on the basis of existing platform software. Inside the initialization module 4012, this application designs a combination algorithm module 4013; inside the runtime engine 4014, the execution module 4017 can complete operator computation according to the parameter table 4015 and the invoked static computing units 4016; the AI compiler 4018 includes a static-computing-unit compilation module 4019, which, after compiling the static computing units, yields the static-computing-unit binary package 4020.
In addition, FIG. 5 shows a typical application scenario of binary release of static computing units. In this scenario, the compilation host 4002 and the execution host 4001 are separate: before a software version is released, the releasable static-computing-unit binary package 4020 is compiled on the compilation host 4002, and every execution host 4001 only needs to import the static-computing-unit binary package 4020.
In another application scenario, however, the machine learning platform 4011 contains the AI compiler 4018 and the static computing units are compiled on the execution host 4001; that is, the functions of the compilation host 4002 are implemented on the execution host 4001. In that application scenario, the static computing units 4016 are compiled during initialization 4012.
FIG. 6 is a component structure diagram of a terminal device. As shown in FIG. 6, the binary-released static computing units are used on the terminal device, and AI network execution capability is provided to all the APPs on the terminal device through the general interface NNAPI (Neural Networks Application Programming Interface) 4011. The static computing units 4016 of this application use the binary-released operator package 4020 and need no recompilation. At initialization 4012, only the combination algorithm 4013 needs to be invoked to generate the parameter table 4015 for the data shapes corresponding to the operators. This embodiment minimizes the AI network startup time when an APP (application) is opened, greatly improving user experience. The static computing units 4016 of this application can run on the AI chip of the terminal device, which can carry the binary static computing units and the software program of the operator computation process provided by this application.
It is worth noting that using the released static-computing-unit binary package on a terminal device decouples the shape information of data from the operator code, reducing the difficulty of operator development and tuning. For example, in application scenarios such as mobile phones, tablets, and smart TVs, using the released static-computing-unit binary package can greatly speed up AI network initialization in an APP.
FIG. 7 is a hardware structure diagram of an AI chip, which can carry the binary static computing units and the software program of the operator computation process provided by this application. The neural network processor (NPU) 50 is mounted as a coprocessor onto the host CPU, which assigns tasks. The core part of the NPU is the operation circuit 503; the controller 504 controls the operation circuit 503 to extract matrix data from memory and perform multiplication.
In some implementations, the operation circuit 503 internally includes multiple processing engines (PEs). In some implementations, the operation circuit 503 is a two-dimensional systolic array; it may also be a one-dimensional systolic array or another electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the operation circuit 503 is a general-purpose matrix processor.
For example, suppose there are an input matrix A, a weight matrix B, and an output matrix C. The operation circuit fetches the data corresponding to matrix B from the weight memory 502 and caches it on each PE in the operation circuit. The operation circuit fetches the matrix A data from the input memory 501, performs matrix operations with matrix B, and stores the partial or final results of the resulting matrix in the accumulator 508.
The unified memory 506 stores input data and output data. Weight data is moved into the weight memory 502 directly through the direct memory access controller (DMAC) 505; input data is also moved into the unified memory 506 through the DMAC.
The bus interface unit (BIU) 510 is used for the interaction between the AXI bus and both the DMAC and the instruction fetch buffer 509.
The bus interface unit 510 is used by the instruction fetch buffer 509 to obtain instructions from external memory, and by the direct memory access controller 505 to obtain the original data of input matrix A or weight matrix B from external memory.
The DMAC is mainly used to move input data from the external memory DDR into the unified memory 506, to move weight data into the weight memory 502, or to move input data into the input memory 501.
The vector computation unit 507 includes multiple operation processing units and, when necessary, further processes the output of the operation circuit, for example vector multiplication, vector addition, exponential operations, logarithmic operations, and magnitude comparison. It is mainly used for non-convolution/FC layer computations in the neural network, such as pooling, batch normalization, and local response normalization.
In some implementations, the vector computation unit 507 can store processed output vectors into the unified memory 506. For example, the vector computation unit 507 can apply a nonlinear function to the output of the operation circuit 503, such as a vector of accumulated values, to generate activation values. In some implementations, the vector computation unit 507 generates normalized values, merged values, or both. In some implementations, the processed output vector can be used as an activation input to the operation circuit 503, for example for use in subsequent layers of the neural network.
The instruction fetch buffer 509 connected to the controller 504 stores instructions used by the controller 504.
The unified memory 506, the input memory 501, the weight memory 502, and the instruction fetch buffer 509 are all on-chip memories. The external memory is private to this NPU hardware architecture.
FIG. 8 is a schematic diagram of a skip-read/skip-write scenario in an operator computation process. A static computing unit in this application only computes data of a fixed data shape; the key to its implementation is slicing out, within the data space, the data needed for each computation. The data space refers to the logical space, defined by the data shape, used to hold data; its addresses are multidimensional and it actually resides in memory space. The memory space refers to the storage space in memory used to hold data; its addresses are one-dimensional. As shown in FIG. 8, the gray part is the data a static computing unit needs to compute in one invocation. When unrolled into memory space, the gray part is discontinuous, so skip-read/skip-write can be used to fetch from memory the data needed for each computation of a static computing unit and, after the computation, to store the computed output data back into memory.
FIG. 9 is a schematic diagram of a skip-read/skip-write support method. One way this application supports skip-read/skip-write is to add an interface, for example a bind_buffer(axis, stride, offset) interface, whose role is to establish a mapping between a static computing unit's Tensor and a buffer. As shown in FIG. 9, the Tensor corresponds to the data the static computing unit needs to process (the gray part in FIG. 9), and the buffer size matches the size of the data held in the data space. Axis refers to the dimensions the Tensor lies in; Offset is the Tensor's offset in the dimension from which data is read; Stride is the data length the Tensor must skip in the dimension from which data is read. As can be seen, by mapping multiple dimensions, this application achieves skip-read/skip-write in a multidimensional data space.
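The skip-read/skip-write idea — pulling a discontinuous multidimensional block out of a flat one-dimensional buffer and writing results back to the matching positions — can be sketched in a few lines. This is a pure-Python illustration for two dimensions; the function names and the row-major layout are assumptions for the example, not the patent's bind_buffer interface.

```python
def skip_read(flat, space_shape, offsets, lengths):
    """Extract a sub-block from a flat (1-D) buffer holding a 2-D data
    space: read `lengths[0]` rows of `lengths[1]` contiguous elements,
    skipping the rest of each row (the "skip" in skip-read)."""
    rows, cols = space_shape
    out = []
    for r in range(offsets[0], offsets[0] + lengths[0]):
        start = r * cols + offsets[1]          # jump to this row's slice
        out.extend(flat[start:start + lengths[1]])
    return out

def skip_write(flat, space_shape, offsets, block, lengths):
    """Write a computed block back to its position in the flat buffer."""
    rows, cols = space_shape
    for i in range(lengths[0]):
        start = (offsets[0] + i) * cols + offsets[1]
        flat[start:start + lengths[1]] = block[i * lengths[1]:(i + 1) * lengths[1]]

# A 4x4 data space stored flat; read the 2x2 block at offset (1, 2).
space = list(range(16))
print(skip_read(space, (4, 4), (1, 2), (2, 2)))  # → [6, 7, 10, 11]
```

Note how the block's elements (6, 7, 10, 11) are discontinuous in the flat buffer, which is exactly the situation depicted in FIG. 8.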
FIG. 10 is a schematic diagram of tail-data processing methods. Tail data refers to a segment obtained when a dimension is logically partitioned by static computing units and the remaining data length to be processed is smaller than the data length the static computing unit supports processing. As shown in FIG. 10, there are four ways to handle tail data:
Method 1: discard invalid data, that is, discard the portion by which the tail data is exceeded. Advantage: little scalar computation; disadvantage: much invalid computation.
Method 2: partial data overlap, that is, shift the excess portion of the tail data forward so that part of the data overlaps. Advantage: little scalar computation; disadvantage: much repeated computation. In other words, the excess portion of the tail data can be overlapped with data to be processed by another computing unit: when the static computing unit computing the tail data reads data, its start position can be set inside the data of another computing unit, causing partial data overlap.
Method 3: support variation over the full data range, that is, when the data length in each dimension is small, variation over the entire length can be supported. For example, a data length of 16 can support variation from 0 to 16.
Method 4: support variation over part of the data range, that is, when the data length in each dimension is large, only variation within a small range at the tail of the length is supported. For example, a data length of 100 can support variation from 90 to 100.
Methods 3 and 4 are both ways of handling data-shape variation. Advantage: little repeated computation; disadvantage: much scalar computation.
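The difference between Method 1 (discard) and Method 2 (overlap) comes down to where the fixed-length static unit starts reading when the tail is short. A minimal sketch, under the assumption of a single dimension; the function name and the mode strings are invented for the example.

```python
def tail_read_start(dim_len, unit_len, tile_offset, mode):
    """Choose where a fixed-length static unit starts reading when the
    remaining (tail) data is shorter than the unit's supported length.
    'discard' - read from the tile offset and later drop the excess;
    'overlap' - shift the start backward so the fixed-length read ends
                exactly at the dimension boundary, overlapping data
                already computed by the previous unit."""
    tail = dim_len - tile_offset
    if tail >= unit_len or mode == "discard":
        return tile_offset
    if mode == "overlap":
        return dim_len - unit_len   # re-reads already-computed data
    raise ValueError(mode)

# Dimension of length 11, unit length 5, last tile starts at 10 (tail of 1):
print(tail_read_start(11, 5, 10, "discard"))  # → 10 (compute 5, keep only 1)
print(tail_read_start(11, 5, 10, "overlap"))  # → 6  (re-computes positions 6..9)
```

The discard path wastes invalid computation beyond the boundary, while the overlap path repeats computation inside it, matching the trade-offs listed above.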
FIG. 11 is a schematic diagram of a parameter table structure; the parameter table is one form of the parameter data describing the data shapes the AI network supports processing. As shown in FIG. 11, the left side is one way of logically slicing the data shape of the target data the AI network supports processing (it can also be sliced into data shapes of different sizes); each data shape obtained by logical slicing corresponds to one static-computing-unit invocation. The right side is the data structure of the parameter table. The parameter table is a data structure describing the combination mode of static computing units; each entry in the table corresponds to one invocation of a static computing unit and the position, within the target data, of the data the invoked static computing unit needs to compute. The data parameters in an entry mainly include:
1) the ID of the static computing unit to be invoked;
2) the data length of the target data in each dimension; for example, in FIG. 11, data length 0 is the target data's data length in the width dimension and data length 1 is its data length in the length dimension;
3) the offset, in each dimension, of the data the invoked static computing unit needs to compute; for example, in FIG. 11, offset 0 is the offset in the width dimension and offset 1 is the offset in the length dimension;
4) the data length, in each dimension, of the data the invoked static computing unit needs to compute.
In addition, the parameter table has a common parameter area for storing parameters that are the same across entries, for example the data length of the target data in each dimension.
The parameter table is generated by the combination algorithm. For example, in two dimensions the combination algorithm tiles rectangular blocks; in three dimensions it amounts to tiling cubes.
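The table-entry fields listed above (unit ID, per-dimension offsets, per-dimension lengths) and the rectangular tiling can be sketched together. This is an illustrative data structure only — the class and function names are assumptions, and a single unit shape is tiled for simplicity rather than the optimal mix of units.

```python
from dataclasses import dataclass

@dataclass
class TableEntry:
    unit_id: int     # ID of the static computing unit to invoke
    offsets: tuple   # per-dimension offset of the block in the target data
    lengths: tuple   # per-dimension length of the block to compute

def build_param_table(target_shape, unit_shape, unit_id=0):
    """Tile a 2-D target shape with one fixed-shape unit, producing one
    table entry per invocation; tail blocks record their true, shorter
    lengths so the caller can apply a tail-handling method."""
    entries = []
    h, w = target_shape
    uh, uw = unit_shape
    for r in range(0, h, uh):
        for c in range(0, w, uw):
            entries.append(TableEntry(unit_id, (r, c),
                                      (min(uh, h - r), min(uw, w - c))))
    return entries

# A 6x6 target tiled by a 4x4 unit needs four invocations.
table = build_param_table((6, 6), (4, 4))
print(len(table))                          # → 4
print(table[3].offsets, table[3].lengths)  # → (4, 4) (2, 2)
```

Iterating such a table in order is what drives the static computing units to complete the computation of all the data.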
The parameter table can be a binned data table. Because a static computing unit can have variable length when handling tail data, each of the three examples shown in FIG. 12, illustratively, supports data shapes within a certain variation range, so a parameter table need not be generated for every data shape; in the form of binned parameter tables, the number of parameter tables in the cache can be effectively reduced. In FIG. 12, the variation range on the left is global variation, for example a data length of 16 in some dimension with a global variation range of 0 to 16; the middle variation range is partial variation, for example a data length of 100 with a partial variation range of 90 to 100; the variation range on the right is variation of the tail data only.
It is worth noting that the parameter tables corresponding to the data shapes the AI network must support are generated automatically, with no user configuration. The features of the parameter table may include: 1) it is a data structure that can describe the invocation order of static computing units; 2) it can also describe the position, within the original data, of the data block computed by each static computing unit (that is, the position of each static computing unit's block having the second data shape within the block having the first data shape); 3) the parameter table can describe the combination mode of static computing units and control them to complete the computation of all the data. The static computing units may be binary static computing units; that is, a static computing unit can be compiled into a binary file before a version is released and provided to users as a binary file at release.
Moreover, combining binary static computing units can support data-shape variation over an arbitrary range. The features of this approach may include: 1) one invocation completes the computation of only one block of data; 2) it supports extracting a block from the original data space by skip-read and, after computation, writing it to the corresponding position of the output data space by skip-write; 3) it supports data-driven computation: which static computing unit to invoke and which data block to compute can all be imported as data through the parameter table.
FIG. 13 is a schematic diagram of a fusion operator. A fusion operator fuses cascaded operators of different types into a single operator computed in one pass. The static computing units 4016 may include static computing units of different operator types, distinguished by ID when invoked. The parameter table 4015 then describes the invocation order and related parameters of the static computing units of the different operator types, so that the computation of the fusion operator can be completed. As shown in FIG. 13, the static computing units 4016 include static computing units of the conv, relu, abs, and exp operators, among others, distinguished by ID when invoked; the parameter table 4015 describes their invocation order and related parameters. The related parameters may include the size of the second data shape each static computing unit supports processing, and may also include position information of each static computing unit's second-shape data within the data having the first data shape.
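The fused-operator dispatch described above — walking the parameter table in order and selecting each static unit by its ID — can be sketched as follows. The registry, the stand-in kernels, and the table layout are assumptions for illustration; real static computing units are precompiled binaries, not Python lambdas.

```python
# Registry mapping unit IDs to fixed-shape kernels; both kernels are
# illustrative stand-ins for binary static computing units.
UNITS = {
    0: lambda xs: [max(x, 0.0) for x in xs],   # stands in for a relu unit
    1: lambda xs: [abs(x) for x in xs],        # stands in for an abs unit
}

def run_fused(param_table, data):
    """Run a fusion operator: walk the parameter table in order, look up
    each static unit by its ID, and apply it to the data in one pass,
    with no round-trip between separately invoked operators."""
    for unit_id in param_table:
        data = UNITS[unit_id](data)
    return data

# The table says: apply unit 1 (abs) and then unit 0 (relu).
print(run_fused([1, 0], [-2.0, 3.0]))  # → [2.0, 3.0]
```

Because the invocation order is data (the table), changing the fused pipeline means editing the table, not recompiling any operator.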
FIG. 14 is a schematic diagram of an operator computation process. When an operator is invoked in the AI network execution state, as shown in FIG. 14, the parameter table is passed as one of the static computing units' input parameters, and the operator invokes the static computing units in sequence according to the contents of the parameter table to complete the operator computation.
FIG. 15 is a schematic diagram of operator computation processes. As shown in FIG. 15, computation process diagram 1 on the left is an operator computation in a three-dimensional single-input scenario; computation process diagram 2 on the right has two operator inputs: one two-dimensional input and one one-dimensional input.
FIG. 16 is a schematic diagram of an operator computation process. As shown in FIG. 16, when traversing the data shapes each AI operator must support, the cache is first queried for a supporting parameter table; if one exists, the cached parameter table is used. If not, a parameter table is first generated with the simplified combination algorithm and placed in the cache, while a thread is started to generate a parameter table with the optimized combination algorithm and update the cache. In this way, the first launch of the AI network completes network initialization as quickly as possible, and later launches of the AI network can use the optimized parameter table, improving user experience. The simplified combination algorithm uses only identical static computing units and takes the least combination time; the optimized combination algorithm uses different static computing units for the optimal combination and takes longer.
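The two-stage caching just described — serve a quickly built simplified table immediately, then upgrade the cache in the background — can be sketched with a thread. This is an illustrative sketch only: the function names are invented, the table contents are placeholders, and the thread is joined inside the call purely to keep the example deterministic (a real implementation would let it run in the background).

```python
import threading

cache = {}

def simplified_table(shape):
    return ("simplified", shape)      # fast: identical units only

def optimized_table(shape):
    return ("optimized", shape)       # slower: best mix of different units

def get_param_table(shape):
    """Return a cached table if present; otherwise return a quickly built
    simplified table immediately and upgrade the cache for later calls."""
    if shape in cache:
        return cache[shape]
    table = simplified_table(shape)
    cache[shape] = table

    def upgrade():
        cache[shape] = optimized_table(shape)

    t = threading.Thread(target=upgrade)
    t.start()
    t.join()      # joined here only so the example is deterministic
    return table

first = get_param_table((8, 8))
print(first[0])              # → simplified (fast path on first launch)
print(cache[(8, 8)][0])      # → optimized (used on later launches)
```

The first launch pays only the simplified-algorithm cost, and every later launch gets the optimized table for free from the cache.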
接下来,请参阅图17,图17是本申请实施例提供的一种算子计算方法的流程示意图。其中,该算子计算方法可以用于AI网络。如图17所示,该算子计算方法可以包括以下步骤:
S171、获取AI网络的第一数据形状的参数数据,第一数据形状是AI网络支持处理的每个维度上的数据长度,参数数据包括至少两个计算单元的组合信息,每个计算单元支持处理的数据为具有第二数据形状的数据,每个计算单元的第二数据形状按照组合信息组合后在任一维度上的数据长度大于或等于第一数据形状在同一维度上的数据长度。
在一些实施例中,参数数据中的至少两个计算单元可以包括相同的计算单元;或不同的计算单元;或相同的计算单元和不同的计算单元;其中,相同的计算单元的第二数据形状, 在每个维度上的数据长度均相同;不同的计算单元的第二数据形状,在至少一个维度上的数据长度不同。
参数数据中的至少两个计算单元可以均为AI网络的计算单元。
参数数据中的至少两个计算单元的组合信息可以包括组合模式,这样每个计算单元的第二数据形状按照该组合模式组合后在任一维度上的数据长度大于或等于所述第一数据形状在同一维度上的数据长度。
参数数据中还可以包括针对指定计算单元的标识信息;其中,指定计算单元指的是所述至少两个计算单元中需要处理的数据为具有第三数据形状的数据的计算单元,第三数据形状在至少一个维度上的数据长度小于指定计算单元支持处理的第二数据形状在同一维度上的数据长度。示例性的,第三数据形状的数据可以是图10中涉及到的尾部数据。
The parameter data may further include a designated processing mode of the designated compute unit for the data having the third data shape. The designated processing mode may include: discarding invalid data, where the invalid data is the data in the second data shape the designated compute unit supports processing other than the data having the third data shape; or data overlapping, where the invalid data is overlapped with data to be processed by another compute unit. For example, discarding invalid data may be mode 1 in FIG. 10, and data overlapping may be mode 2 in FIG. 10.
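The two tail-data policies can be illustrated with a 1-D tiling sketch. This is hypothetical code: the patent describes the policies (discard invalid output vs. overlap with already-covered data), not this implementation, and `tile_offsets` is an assumed helper name.

```python
def tile_offsets(total, unit, policy):
    """Offsets at which a fixed-length unit is invoked over `total` elements."""
    offs = list(range(0, total - unit + 1, unit))
    tail = total - (offs[-1] + unit) if offs else total
    if tail > 0:
        if policy == "discard":    # run past the end; invalid output is dropped
            offs.append(offs[-1] + unit)
        elif policy == "overlap":  # shift back; some data is recomputed
            offs.append(total - unit)
    return offs

assert tile_offsets(10, 4, "discard") == [0, 4, 8]  # last call covers 8..11; 10..11 is invalid
assert tile_offsets(10, 4, "overlap") == [0, 4, 6]  # last call covers 6..9; 6..7 is recomputed
```

Discarding requires the unit to tolerate reading past valid data (or padded buffers), while overlapping re-reads valid data at the cost of redundant computation; which trade-off is acceptable depends on the operator.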
The parameter data may further include a designated variation range, supported by the designated compute unit, of the third data shape in each dimension. The designated variation range may be the data length in each dimension of the second data shape the designated compute unit supports processing, or a designated portion of that data length in each dimension. For example, the designated variation range may be modes 3 and 4 in FIG. 10.
The parameter data may include binned parameter data, which supports data shapes within a designated variation range. For example, each of the three examples shown in FIG. 12 supports data shapes within a certain variation range, so it is unnecessary to generate one parameter table per data shape; binned parameter tables effectively reduce the number of parameter tables in the cache.
The at least two compute units in the parameter data may belong to operators of different types, so that a fused operator can be computed by invoking these compute units of different operator types from the parameter data. This avoids finishing the computation of one operator type before invoking another, improving the computation efficiency of fused operators. For example, as shown in FIG. 13, the static compute units 4016 include static compute units for the conv, relu, abs, and exp operators, among others, distinguished by ID at invocation time, and the parameter table 4015 describes their invocation order and related parameters.
S172: Invoke the at least two compute units to compute first target data having the first data shape.
In some embodiments, the at least two compute units may be obtained from a compute unit operator library, and the first target data having the first data shape is computed through them. For example, as shown in FIG. 5, the execution host 4001 may import the static compute unit binary package 4020 released by the compilation host 4002, obtain the at least two compute units from the package, and compute the first target data having the first data shape through them.
The target position, within the first target data, of the second target data that any compute unit needs to process may be determined; the second target data is then fetched, according to the target position, from the memory space storing the first target data, and computed by that compute unit. For example, as shown in FIG. 11, the left side shows one way of logically partitioning the data shape of the target data the AI network supports processing (it may also be partitioned into data shapes of different sizes); each data shape obtained by the logical partition corresponds to one static compute unit invocation. The right side shows the data structure of the parameter table. The parameter table is a data structure describing the combination pattern of the static compute units; each entry corresponds to one static compute unit invocation and to the position, within the target data, of the data that the invoked unit computes, so that the data to be computed can be fetched by that position and the unit invoked on it.
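One possible encoding of such a target position, assuming a row-major original buffer and a per-dimension (offset, length) pair for the block, is sketched below; `block_indices` and the recursive walk are illustrative, not the patent's data structure.

```python
def block_indices(dims, pos):
    """Flat row-major indices of a block in an original tensor of shape `dims`.

    `pos` is [(offset, length), ...], one pair per dimension."""
    strides, stride = [], 1
    for d in reversed(dims):           # row-major strides, innermost last
        strides.insert(0, stride)
        stride *= d
    out = []
    def rec(dim, base):
        off, length = pos[dim]
        for i in range(off, off + length):
            nxt = base + i * strides[dim]
            if dim == len(dims) - 1:
                out.append(nxt)
            else:
                rec(dim + 1, nxt)
    rec(0, 0)
    return out

# the 2x2 block at offset (1, 1) of a 4x4 tensor
assert block_indices([4, 4], [(1, 2), (1, 2)]) == [5, 6, 9, 10]
```

With such an encoding, a parameter-table entry needs only the unit ID plus one (offset, length) pair per dimension to locate its block in memory.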
Thus, the above solution supports data shape variation over an arbitrary range, implements dynamic-shape AI operators, and improves the startup speed of AI networks.
Based on the methods in the foregoing embodiments, an embodiment of this application further provides an operator computation apparatus for an AI network. Refer to FIG. 18, which is a schematic structural diagram of an operator computation apparatus provided by an embodiment of this application. As shown in FIG. 18, the apparatus includes:
an obtaining module 181, configured to obtain parameter data for a first data shape of the AI network, where the first data shape is the data length in each dimension the AI network supports processing, the parameter data includes combination information of at least two compute units, the data each compute unit supports processing has a second data shape, and after the second data shapes of the compute units are combined according to the combination information, the data length in any dimension is greater than or equal to that of the first data shape in the same dimension; and
a computation module 182, configured to invoke the at least two compute units to compute first target data having the first data shape.
In a possible implementation, the at least two compute units include identical compute units; or different compute units; or both identical and different compute units;
where the second data shapes of identical compute units have the same data length in every dimension, and the second data shapes of different compute units differ in data length in at least one dimension.
In a possible implementation, the at least two compute units are all compute units of the AI network.
In a possible implementation, the combination information includes a combination pattern of the at least two compute units;
after each compute unit's second data shape is combined according to the combination pattern, the data length in any dimension is greater than or equal to that of the first data shape in the same dimension.
In a possible implementation, the parameter data further includes identification information for a designated compute unit;
where the designated compute unit is the compute unit, among the at least two compute units, whose data to be processed has a third data shape, the data length of the third data shape in at least one dimension being less than that of the second data shape the designated compute unit supports processing in the same dimension.
In a possible implementation, the parameter data further includes a designated processing mode of the designated compute unit for the data having the third data shape.
In a possible implementation, the designated processing mode includes:
discarding invalid data, the invalid data being the data in the second data shape the designated compute unit supports processing other than the data having the third data shape; or
data overlapping, the data overlapping being overlapping the invalid data with data to be processed by another compute unit.
In a possible implementation, the parameter data further includes a designated variation range, supported by the designated compute unit, of the third data shape in each dimension.
In a possible implementation, the designated variation range is the data length in each dimension of the second data shape the designated compute unit supports processing; or a designated portion of that data length in each dimension.
In a possible implementation, the parameter data includes binned parameter data, which supports data shapes within a designated variation range.
In a possible implementation, the computation module 182 includes:
a first obtaining submodule, configured to obtain the at least two compute units from a compute unit operator library; and
a first computation submodule, configured to compute, through the at least two compute units, the first target data having the first data shape.
In a possible implementation, the computation module 182 includes:
a determination submodule, configured to determine, for any compute unit, a target position, within the first target data, of second target data that the compute unit needs to process;
a second obtaining submodule, configured to obtain, according to the target position and from the memory space storing the first target data, the second target data that the compute unit needs to process; and
a second computation submodule, configured to compute the second target data through the compute unit.
In a possible implementation, the target position includes: the dimensions in which the second target data resides; and, for any dimension, an offset and a data length of the second target data in that dimension.
In a possible implementation, the at least two compute units belong to operators of different types.
In a possible implementation, the compute units are precompiled operators.
It should be understood that the above apparatus performs the methods of the foregoing embodiments. The implementation principles and technical effects of the corresponding program modules in the apparatus are similar to those described in the methods; for the working process of the apparatus, refer to the corresponding process in the methods, which is not repeated here.
Based on the methods in the foregoing embodiments, an embodiment of this application further provides an operator computation apparatus. Refer to FIG. 19, a schematic structural diagram of an operator computation apparatus provided by an embodiment of this application. As shown in FIG. 19, the apparatus may be used to implement the methods described in the foregoing method embodiments.
The operator computation apparatus includes at least one processor 1601, which can support the apparatus in implementing the control methods provided in the embodiments of this application.
The processor 1601 may be a general-purpose processor or a dedicated processor. For example, the processor 1601 may include a central processing unit (CPU) and/or a baseband processor, where the baseband processor may process communication data (for example, determining a target screen terminal) and the CPU may implement the corresponding control and processing functions, execute software programs, and process the data of the software programs.
Further, the operator computation apparatus may include a transceiver unit 1605 for signal input (reception) and output (transmission). For example, the transceiver unit 1605 may include a transceiver or a radio-frequency chip, and may further include a communication interface.
Optionally, the operator computation apparatus may further include an antenna 1606 to support the transceiver unit 1605 in implementing the transmit and receive functions of the apparatus.
Optionally, the operator computation apparatus may include one or more memories 1602 storing a program (which may also be instructions or code) 1604; the program 1604 can be run by the processor 1601 so that the processor 1601 performs the methods described in the foregoing method embodiments. Optionally, the memory 1602 may also store data. Optionally, the processor 1601 may read data stored in the memory 1602 (for example, pre-stored first feature information); the data may be stored at the same storage address as the program 1604 or at a different storage address.
The processor 1601 and the memory 1602 may be provided separately or integrated together, for example, on a single board or a system on chip (SOC).
For a detailed description of the operations the operator computation apparatus performs in the above possible designs, refer to the descriptions in the embodiments of the operator computation methods provided by the embodiments of this application, which are not repeated one by one here.
Based on the apparatuses in the foregoing embodiments, an embodiment of this application further provides an operator computation device, which contains any operator computation apparatus provided in the foregoing embodiments.
It can be understood that, in the embodiments of this application, the operator computation device may be a terminal device such as a mobile phone, a tablet computer, a digital camera, a personal digital assistant (PDA), a wearable device, a smart TV, or a Huawei smart screen. Exemplary embodiments of the terminal device include, but are not limited to, terminal devices running iOS, Android, Windows, HarmonyOS, or other operating systems. The terminal device may also be another terminal device, such as a laptop with a touch-sensitive surface (for example, a touch panel). The embodiments of this application do not specifically limit the type of the terminal device. A component structure diagram of the terminal device is shown in FIG. 6.
Based on the operator computation devices in the foregoing embodiments, an embodiment of this application further provides an operator computation system including an operator computation device and an operator compilation device, where the operator computation device contains any operator computation apparatus provided in the foregoing embodiments, the operator compilation device is configured to compile a releasable static compute unit binary package, and the operator computation device is configured to import the static compute unit binary package. For example, the operator computation device may be the execution host 4001 in FIG. 5 or the terminal device in FIG. 6, and the operator compilation device may be the compilation host 4002 in FIG. 5.
Based on the methods in the foregoing embodiments, an embodiment of this application further provides a chip. Refer to FIG. 20, a schematic structural diagram of a chip provided by an embodiment of this application. As shown in FIG. 20, the chip 1900 includes one or more processors 1901 and an interface circuit 1902. Optionally, the chip 1900 may further include a bus 1903. Specifically:
The processor 1901 may be an integrated circuit chip with signal processing capability. In implementation, the steps of the above methods may be completed by integrated logic circuits of hardware in the processor 1901 or by instructions in software form. The processor 1901 may be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and can implement or perform the methods and steps disclosed in the embodiments of this application. The general-purpose processor may be a microprocessor or any conventional processor.
The interface circuit 1902 may be used to send or receive data, instructions, or information. The processor 1901 may process the data, instructions, or other information received through the interface circuit 1902, and may send the processed information out through the interface circuit 1902.
Optionally, the chip further includes a memory, which may include a read-only memory and a random access memory and provides operating instructions and data to the processor. A portion of the memory may also include a non-volatile random access memory (NVRAM).
Optionally, the memory stores executable software modules or data structures, and the processor may perform corresponding operations by invoking the operating instructions stored in the memory (the operating instructions may be stored in an operating system).
Optionally, the interface circuit 1902 may be used to output the execution results of the processor 1901.
It should be noted that the functions corresponding to the processor 1901 and the interface circuit 1902 may each be implemented by hardware design, by software design, or by a combination of software and hardware; this is not limited here.
It should be understood that the steps of the above method embodiments may be completed by logic circuits in hardware form or instructions in software form in the processor.
It can be understood that the processor in the embodiments of this application may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. The general-purpose processor may be a microprocessor or any conventional processor.
The method steps in the embodiments of this application may be implemented by hardware or by a processor executing software instructions. The software instructions may consist of corresponding software modules, which may be stored in a random access memory (RAM), a flash memory, a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically EPROM (EEPROM), a register, a hard disk, a removable hard disk, a CD-ROM, or any other form of storage medium well known in the art. An exemplary storage medium is coupled to the processor so that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be a component of the processor. The processor and the storage medium may reside in an ASIC.
In the above embodiments, implementation may be wholly or partly by software, hardware, firmware, or any combination thereof. When software is used, implementation may be wholly or partly in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions described in the embodiments of this application are wholly or partly produced. The computer may be a general-purpose computer, a dedicated computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted through the computer-readable storage medium. The computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or wirelessly (for example, infrared, radio, or microwave). The computer-readable storage medium may be any usable medium accessible to a computer, or a data storage device such as a server or data center integrating one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state disk (SSD)).
It can be understood that the various numerals involved in the embodiments of this application are merely for ease of description and are not intended to limit the scope of the embodiments of this application.

Claims (36)

  1. An operator computation method, characterized in that the method comprises:
    obtaining parameter data for a first data shape of an artificial intelligence (AI) network, wherein the first data shape is a data length in each dimension that the AI network supports processing, the parameter data comprises combination information of at least two compute units, data that each compute unit supports processing has a second data shape, and after the second data shapes of the compute units are combined according to the combination information, a data length in any dimension is greater than or equal to a data length of the first data shape in the same dimension; and
    invoking the at least two compute units to compute first target data having the first data shape.
  2. The method according to claim 1, characterized in that the at least two compute units comprise identical compute units; or different compute units; or identical compute units and different compute units;
    wherein the second data shapes of identical compute units have the same data length in every dimension, and the second data shapes of different compute units differ in data length in at least one dimension.
  3. The method according to claim 1, characterized in that the at least two compute units are all compute units of the AI network.
  4. The method according to claim 1, characterized in that the combination information comprises a combination pattern of the at least two compute units; and
    after the second data shape of each compute unit is combined according to the combination pattern, the data length in any dimension is greater than or equal to the data length of the first data shape in the same dimension.
  5. The method according to claim 1, characterized in that the parameter data further comprises identification information for a designated compute unit;
    wherein the designated compute unit is a compute unit, among the at least two compute units, whose data to be processed has a third data shape, a data length of the third data shape in at least one dimension being less than a data length, in the same dimension, of the second data shape that the designated compute unit supports processing.
  6. The method according to claim 5, characterized in that the parameter data further comprises a designated processing mode of the designated compute unit for the data having the third data shape.
  7. The method according to claim 6, characterized in that the designated processing mode comprises:
    discarding invalid data, the invalid data being data, in the second data shape that the designated compute unit supports processing, other than the data having the third data shape; or
    data overlapping, the data overlapping being overlapping the invalid data with data to be processed by another compute unit.
  8. The method according to claim 5, characterized in that the parameter data further comprises a designated variation range, supported by the designated compute unit, of the third data shape in each dimension.
  9. The method according to claim 8, characterized in that the designated variation range is the data length in each dimension of the second data shape that the designated compute unit supports processing; or a designated portion of the data length in each dimension of the second data shape.
  10. The method according to claim 1, characterized in that the parameter data comprises binned parameter data for supporting data shapes within a designated variation range.
  11. The method according to claim 1, characterized in that the invoking the at least two compute units to compute first target data having the first data shape comprises:
    obtaining the at least two compute units from a compute unit operator library; and
    computing, through the at least two compute units, the first target data having the first data shape.
  12. The method according to claim 1, characterized in that the invoking the at least two compute units to compute first target data having the first data shape comprises:
    determining, for any compute unit, a target position, within the first target data, of second target data that the compute unit needs to process;
    obtaining, according to the target position and from a memory space storing the first target data, the second target data that the compute unit needs to process; and
    computing the second target data through the compute unit.
  13. The method according to claim 12, characterized in that the target position comprises: dimensions in which the second target data resides; and, for any dimension, an offset and a data length of the second target data in that dimension.
  14. The method according to claim 1, characterized in that the at least two compute units belong to operators of different types.
  15. The method according to any one of claims 1 to 14, characterized in that the compute units are precompiled operators.
  16. An operator computation apparatus, characterized in that the apparatus comprises:
    an obtaining module, configured to obtain parameter data for a first data shape of an artificial intelligence (AI) network, wherein the first data shape is a data length in each dimension that the AI network supports processing, the parameter data comprises combination information of at least two compute units, data that each compute unit supports processing has a second data shape, and after the second data shapes of the compute units are combined according to the combination information, a data length in any dimension is greater than or equal to a data length of the first data shape in the same dimension; and
    a computation module, configured to invoke the at least two compute units to compute first target data having the first data shape.
  17. The apparatus according to claim 16, characterized in that the at least two compute units comprise identical compute units; or different compute units; or identical compute units and different compute units;
    wherein the second data shapes of identical compute units have the same data length in every dimension, and the second data shapes of different compute units differ in data length in at least one dimension.
  18. The apparatus according to claim 16, characterized in that the at least two compute units are all compute units of the AI network.
  19. The apparatus according to claim 16, characterized in that the combination information comprises a combination pattern of the at least two compute units; and
    after the second data shape of each compute unit is combined according to the combination pattern, the data length in any dimension is greater than or equal to the data length of the first data shape in the same dimension.
  20. The apparatus according to claim 16, characterized in that the parameter data further comprises identification information for a designated compute unit;
    wherein the designated compute unit is a compute unit, among the at least two compute units, whose data to be processed has a third data shape, a data length of the third data shape in at least one dimension being less than a data length, in the same dimension, of the second data shape that the designated compute unit supports processing.
  21. The apparatus according to claim 20, characterized in that the parameter data further comprises a designated processing mode of the designated compute unit for the data having the third data shape.
  22. The apparatus according to claim 21, characterized in that the designated processing mode comprises:
    discarding invalid data, the invalid data being data, in the second data shape that the designated compute unit supports processing, other than the data having the third data shape; or
    data overlapping, the data overlapping being overlapping the invalid data with data to be processed by another compute unit.
  23. The apparatus according to claim 20, characterized in that the parameter data further comprises a designated variation range, supported by the designated compute unit, of the third data shape in each dimension.
  24. The apparatus according to claim 23, characterized in that the designated variation range is the data length in each dimension of the second data shape that the designated compute unit supports processing; or a designated portion of the data length in each dimension of the second data shape.
  25. The apparatus according to claim 16, characterized in that the parameter data comprises binned parameter data for supporting data shapes within a designated variation range.
  26. The apparatus according to claim 16, characterized in that the computation module comprises:
    a first obtaining submodule, configured to obtain the at least two compute units from a compute unit operator library; and
    a first computation submodule, configured to compute, through the at least two compute units, the first target data having the first data shape.
  27. The apparatus according to claim 16, characterized in that the computation module comprises:
    a determination submodule, configured to determine, for any compute unit, a target position, within the first target data, of second target data that the compute unit needs to process;
    a second obtaining submodule, configured to obtain, according to the target position and from a memory space storing the first target data, the second target data that the compute unit needs to process; and
    a second computation submodule, configured to compute the second target data through the compute unit.
  28. The apparatus according to claim 27, characterized in that the target position comprises: dimensions in which the second target data resides; and, for any dimension, an offset and a data length of the second target data in that dimension.
  29. The apparatus according to claim 16, characterized in that the at least two compute units belong to operators of different types.
  30. The apparatus according to any one of claims 16 to 29, characterized in that the compute units are precompiled operators.
  31. An operator computation apparatus, characterized by comprising:
    at least one memory, configured to store a program; and
    at least one processor, configured to execute the program stored in the memory, wherein when the program stored in the memory is executed, the processor is configured to perform the method according to any one of claims 1 to 15.
  32. An operator computation device, characterized by comprising the apparatus according to any one of claims 16 to 30.
  33. An operator computation system, characterized by comprising an operator computation device and an operator compilation device;
    wherein the operator computation device comprises the apparatus according to any one of claims 16 to 30;
    the operator compilation device is configured to compile a releasable compute unit package; and
    the operator computation device is configured to import the compute unit package.
  34. A computer storage medium storing instructions that, when run on a computer, cause the computer to perform the method according to any one of claims 1 to 15.
  35. A computer program product comprising instructions that, when run on a computer, cause the computer to perform the method according to any one of claims 1 to 15.
  36. A chip, characterized by comprising at least one processor and an interface;
    the interface is configured to provide program instructions or data to the at least one processor; and
    the at least one processor is configured to execute the program instructions to implement the method according to any one of claims 1 to 15.
PCT/CN2021/130883 2020-11-19 2021-11-16 Operator calculation method and apparatus, device, and system WO2022105743A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP21893890.0A EP4242880A1 (en) 2020-11-19 2021-11-16 Operator calculation method and apparatus, device, and system
US18/319,680 US20230289183A1 (en) 2020-11-19 2023-05-18 Operator calculation method, apparatus, device, and system

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011301935.5 2020-11-19
CN202011301935.5A CN114519167A (zh) Operator calculation method and apparatus, device, and system

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/319,680 Continuation US20230289183A1 (en) 2020-11-19 2023-05-18 Operator calculation method, apparatus, device, and system

Publications (1)

Publication Number Publication Date
WO2022105743A1 true WO2022105743A1 (zh) 2022-05-27



Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102722412A (zh) * 2011-03-31 2012-10-10 国际商业机器公司 组合计算装置和方法
CN109656623A (zh) * 2019-03-13 2019-04-19 北京地平线机器人技术研发有限公司 执行卷积运算操作的方法及装置、生成指令的方法及装置
CN110515626A (zh) * 2019-08-20 2019-11-29 Oppo广东移动通信有限公司 深度学习计算框架的代码编译方法及相关产品
US20200034698A1 (en) * 2017-04-20 2020-01-30 Shanghai Cambricon Information Technology Co., Ltd. Computing apparatus and related product
CN111563262A (zh) * 2020-04-15 2020-08-21 清华大学 一种基于可逆深度神经网络的加密方法及系统


Also Published As

Publication number Publication date
CN114519167A (zh) 2022-05-20
US20230289183A1 (en) 2023-09-14
EP4242880A1 (en) 2023-09-13


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 21893890; Country of ref document: EP; Kind code of ref document: A1)
ENP Entry into the national phase (Ref document number: 2021893890; Country of ref document: EP; Effective date: 20230606)
NENP Non-entry into the national phase (Ref country code: DE)