WO2023093724A1 - Method and apparatus for processing a neural network model - Google Patents

Method and apparatus for processing a neural network model

Info

Publication number
WO2023093724A1
Authority
WO
WIPO (PCT)
Prior art keywords: subgraphs, processors, subgraph, data, graph
Application number
PCT/CN2022/133524
Other languages
English (en)
French (fr)
Inventor
翟智强
凌乔敏
张学同
朱忠锐
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Application filed by Huawei Technologies Co., Ltd. (华为技术有限公司)
Publication of WO2023093724A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/0464: Convolutional networks [CNN, ConvNet]
    • G06N 3/08: Learning methods
    • G06N 3/084: Backpropagation, e.g. using gradient descent

Definitions

  • the embodiments of the present application relate to the field of artificial intelligence, and more specifically, to a method and device for processing a neural network model.
  • Artificial intelligence is a theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results.
  • artificial intelligence is the branch of computer science that attempts to understand the nature of intelligence and produce a new class of intelligent machines that respond in ways similar to human intelligence.
  • Artificial intelligence is to study the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
  • Research in the field of artificial intelligence includes robotics, natural language processing, computer vision, decision-making and reasoning, human-computer interaction, recommendation and search, basic AI theory, etc.
  • as neural network models become more complex and larger in scale, the latency of the operation process also grows.
  • operators in the neural network model are usually deployed according to the priority of the hardware.
  • the priority of the graphics processing unit (GPU) is higher than that of the central processing unit (CPU).
  • operators will be deployed on the GPU as much as possible, and operators not supported by the GPU will be deployed on the CPU.
  • this solution ensures that the model can run, but the processing speed of the CPU is lower than that of the GPU, which drags down the overall processing speed of the model; therefore, running the neural network model entirely on a neural network processor such as a GPU or NPU is an important optimization direction.
  • however, when a single neural network processor is used to run the model, the other hardware is idle, resulting in wasted resources.
  • Embodiments of the present application provide a processing method and device for a neural network model, which can distribute computing tasks to multiple processors for parallel execution through operator splitting, thereby improving the processing efficiency of the neural network model.
  • in a first aspect, a method for processing a neural network model is provided, including: acquiring the computing capabilities of m processors, where m is an integer greater than 1; and performing operator segmentation on a first subgraph in a first computation graph corresponding to the neural network model according to the computing capabilities of the m processors, to obtain a second computation graph.
  • the second computation graph includes n parallel subgraphs corresponding to the first subgraph.
  • in the solution of the embodiments of the present application, operator segmentation is performed on the first subgraph according to the computing capabilities of different processors, to obtain multiple parallel subgraphs matching those capabilities, so that the computing tasks of the multiple parallel subgraphs can be executed in parallel on processors with matching computing capabilities; this rational use of hardware resources helps improve the processing efficiency of the model.
  • the m processors are used to execute operations of the neural network model.
  • the n parallel subgraphs corresponding to the first subgraph refer to the n subgraphs obtained by performing operator segmentation on the first subgraph.
  • the cost of the first subgraph is greater than the cost of at least half of the subgraphs in the first computation graph.
  • the overhead of the first subgraph is greater than or equal to the second threshold.
  • the execution sequence of all operators in the first subgraph is serial execution.
  • the first subgraph may be a continuous convolution structure, that is, the first subgraph may include multiple continuous convolution operators.
  • the first subgraph includes a plurality of topologically ordered operators; performing operator segmentation on the first subgraph and assigning the computing tasks of the n resulting subgraphs to the corresponding processors means splitting the computing tasks of these operators and assigning the split tasks to the corresponding processors; after each processor completes its split computing tasks in topological order, the computation results of the processors are merged to obtain the computation result of the first subgraph, which reduces the communication overhead introduced by segmentation and ensures the processing efficiency of the model.
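  • The following is a minimal illustrative sketch (not part of the patent text) of the proportional-splitting idea described above: the output height of a convolution subgraph is divided among processors in proportion to their computing capabilities. All names, such as `split_by_capability`, are hypothetical.

```python
# Illustrative sketch: divide a subgraph's output height among processors
# in proportion to their computing capabilities. All names are hypothetical.

def split_by_capability(total_height, capabilities):
    """Return per-processor output-height shares that sum to total_height."""
    total_cap = sum(capabilities)
    shares = [int(total_height * c / total_cap) for c in capabilities]
    # Give any rounding remainder to the most capable processor.
    shares[capabilities.index(max(capabilities))] += total_height - sum(shares)
    return shares

# Example: a GPU roughly 3x faster than the CPU gets about 3/4 of the rows.
print(split_by_capability(224, [3.0, 1.0]))  # -> [168, 56]
```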
  • the second computation graph further includes a segmentation operator, which is used to perform data segmentation on the input data of the first subgraph to obtain the input data of the n subgraphs.
  • the dimension of data segmentation is determined according to the data arrangement of the input data of the first subgraph.
  • the input data of each sub-graph in the n sub-graphs is continuous data in the data arrangement of the input data of the first sub-graph.
  • the input data of the n subgraphs is obtained by segmenting the input data of the first subgraph along the height dimension.
  • for example, when the batch dimension of the input data is 1, the input data cannot be split along that dimension.
  • splitting along the H axis yields contiguous data, whereas splitting along the W axis or the C axis requires non-contiguous (strided) reads, i.e. there are gaps between the data being read, which increases the overhead of underlying data synchronization and prevents the processing performance of the model from improving.
  • in the solution of the embodiments of the present application, the input data is segmented along the H axis and the computing tasks are distributed to different processors for parallel processing, which effectively improves the processing performance of the model.
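  • As an illustration (not from the patent text), the following Python sketch uses NumPy to show why an H-axis split of NHWC data stays contiguous in memory while W-axis and C-axis splits do not; it assumes a single-image (N = 1) inference input.

```python
# Sketch: for NHWC data with N = 1, a slice along H is contiguous in memory,
# while slices along W or C require strided (non-contiguous) reads.
import numpy as np

x = np.zeros((1, 224, 224, 3), dtype=np.float32)   # N, H, W, C

h_slice = x[:, 0:112, :, :]   # split along the H axis
w_slice = x[:, :, 0:112, :]   # split along the W axis
c_slice = x[:, :, :, 0:2]     # split along the C axis

print(h_slice.flags['C_CONTIGUOUS'])  # True  -> one contiguous block
print(w_slice.flags['C_CONTIGUOUS'])  # False -> strided reads
print(c_slice.flags['C_CONTIGUOUS'])  # False -> strided reads
```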
  • when the operator in the first subgraph is a convolution operator and the sliding stride of the convolution operator is smaller than the height of the convolution kernel, part of the input data of at least two of the n subgraphs is the same.
  • the input data of at least two subgraphs in the n subgraphs overlap.
  • the input data of at least two subgraphs in the n subgraphs overlap in height dimension.
  • when the operator in the first subgraph is a convolution operator, overlapping the input data ensures that the output data of the first subgraph is the same as the result of merging the output data of the n subgraphs, thereby ensuring that the computation result of the second computation graph is the same as that of the first computation graph.
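  • The sketch below (illustrative only, assuming no padding) computes which input rows each output-row slice of a convolution needs; it shows that when the stride is smaller than the kernel height, the input ranges of adjacent slices overlap, which is the overlap described above. The function name `input_rows_for` is hypothetical.

```python
# Sketch: input row range needed by each output-row slice of a convolution.
# When stride < kernel height, the ranges of adjacent slices overlap.

def input_rows_for(out_start, out_end, kernel_h, stride):
    """Input row interval [start, end) needed for output rows [out_start, out_end)."""
    start = out_start * stride
    end = (out_end - 1) * stride + kernel_h
    return start, end

# 3x3 convolution, stride 1, output split into rows [0, 112) and [112, 224):
print(input_rows_for(0, 112, kernel_h=3, stride=1))    # (0, 114)
print(input_rows_for(112, 224, kernel_h=3, stride=1))  # (112, 226)
# Input rows 112 and 113 are read by both slices, i.e. the input data overlaps.
```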
  • the second computation graph further includes a merge operator, which is used to merge the output data of the n parallel subgraphs to obtain the output data of the first subgraph.
  • that the overheads of the n subgraphs and the computing capabilities of the n processors among the m processors satisfy the first matching relationship includes: the difference between the ratio of the overheads of the n subgraphs and the ratio of the computing capabilities of the n processors is less than or equal to a first threshold.
  • in this way, processors with stronger computing power execute the computing tasks of subgraphs with higher overhead, and processors with weaker computing power execute the computing tasks of subgraphs with lower overhead; improving the degree of match between computing capabilities and overheads helps further improve the processing efficiency of the model.
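  • One possible reading of the first matching relationship above, shown here only as a hedged sketch, is to compare the normalized proportions of subgraph overheads with the normalized proportions of processor computing capabilities and require the difference to stay within a threshold. The function name and the threshold value are assumptions.

```python
# Sketch: check that the proportions of subgraph overheads roughly track the
# proportions of the computing capabilities of the processors assigned to them.

def satisfies_matching(costs, capabilities, threshold=0.1):
    cost_ratio = [c / sum(costs) for c in costs]
    cap_ratio = [c / sum(capabilities) for c in capabilities]
    return all(abs(a - b) <= threshold for a, b in zip(cost_ratio, cap_ratio))

# Subgraph overheads 70/30 assigned to processors with capabilities 3:1 (75/25):
print(satisfies_matching([70, 30], [3, 1]))   # True  (|0.70 - 0.75| <= 0.1)
print(satisfies_matching([50, 50], [3, 1]))   # False (|0.50 - 0.75| >  0.1)
```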
  • the p subgraphs include q parallel subgraphs in the first computation graph, and the computing tasks of the q subgraphs are respectively assigned to q processors among the m processors; each of the q processors executes the computing task of one of the q subgraphs, the overheads of the q subgraphs and the computing capabilities of the q processors satisfy a second matching relationship, and q is an integer greater than 1.
  • the subgraphs that can be executed in parallel are deployed on multiple processors according to the computing capabilities of different processors, so that the computing tasks of the multiple parallel subgraphs can be executed in parallel on processors with matching computing capabilities; rational use of hardware resources helps improve the processing efficiency of the model.
  • assigning the computing tasks of the p subgraphs of the second computation graph to the m processors for execution includes: converting the p subgraphs into p operator Actors respectively; during the execution of the p Actors, the m processors are scheduled to execute the computing tasks of the p subgraphs.
  • the behavior of p Actors is defined according to the computation tasks of p subgraphs.
  • the Actor model is used as an asynchronous scheduling framework, which can schedule corresponding processors to execute computing tasks in parallel.
  • in a second aspect, an apparatus for processing a neural network model is provided, and the apparatus includes units for performing the method in the first aspect or any one of the implementation manners of the first aspect.
  • in a third aspect, a processing device for a neural network model is provided, comprising: a memory for storing a program; and a processor for executing the program stored in the memory, where, when the program stored in the memory is executed, the processor is configured to execute the method in the first aspect or any one of the implementation manners of the first aspect.
  • the processor in the third aspect above can be either a central processing unit (CPU), or a combination of a CPU and a neural network computing processor, where the neural network computing processor can include a graphics processing unit (GPU), a neural-network processing unit (NPU), a tensor processing unit (TPU), and so on.
  • TPU is an artificial intelligence accelerator ASIC fully customized by Google for machine learning.
  • in a fourth aspect, a computer-readable medium is provided, which stores program code for execution by a device, where the program code is used to execute the method in the first aspect or any one of the implementation manners of the first aspect.
  • in a fifth aspect, a computer program product including instructions is provided; when the computer program product is run on a computer, it causes the computer to execute the method in the first aspect or any one of the implementation manners of the first aspect.
  • in a sixth aspect, a chip is provided, which includes a processor and a data interface; the processor reads, through the data interface, instructions stored in a memory, and executes the method in the first aspect or any one of the implementation manners of the first aspect.
  • the chip may further include a memory, the memory stores instructions, and the processor is configured to execute the instructions stored in the memory; when the instructions are executed, the processor is configured to execute the method in the first aspect or any one of the implementation manners of the first aspect.
  • Fig. 1 is a schematic diagram of an artificial intelligence subject framework provided by an embodiment of the present application
  • FIG. 2 is a schematic structural diagram of a system architecture provided by an embodiment of the present application.
  • FIG. 3 is a schematic structural diagram of a convolutional neural network provided in an embodiment of the present application.
  • FIG. 4 is a schematic diagram of another system architecture provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of a device-side reasoning framework provided by an embodiment of the present application.
  • FIG. 6 is a schematic diagram of another device-side reasoning framework provided by the embodiment of the present application.
  • FIG. 7 is a schematic flowchart of a processing method of a neural network model provided in an embodiment of the present application.
  • Fig. 8 is a schematic flowchart of operator segmentation provided by the embodiment of the present application.
  • FIG. 9 is a schematic flowchart of a computation process provided by an embodiment of the present application.
  • FIG. 10 is a schematic flowchart of another neural network model processing method provided by the embodiment of the present application.
  • FIG. 11 is a schematic flowchart of a heterogeneous graph composition process provided by the embodiment of the present application.
  • Fig. 12 is a schematic flowchart of another operator segmentation process provided by the embodiment of the present application.
  • FIG. 13 is a schematic flowchart of the heterogeneous parallel graph composition process provided by the embodiment of the present application.
  • FIG. 14 is a schematic flowchart of the heterogeneous parallel execution stage provided by the embodiment of the present application.
  • Fig. 15 is a schematic block diagram of a processing device for a neural network model provided by an embodiment of the present application.
  • Fig. 16 is a schematic block diagram of another neural network model processing device provided by an embodiment of the present application.
  • Figure 1 shows a schematic diagram of an artificial intelligence main framework, which describes the overall workflow of an artificial intelligence system and is applicable to general artificial intelligence field requirements.
  • Intelligent information chain reflects a series of processes from data acquisition to processing. For example, it can be the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, intelligent execution and output. In this process, the data has undergone a condensed process of "data-information-knowledge-wisdom".
  • IT value chain reflects the value brought by artificial intelligence to the information technology industry from the underlying infrastructure of artificial intelligence, information (provided and processed by technology) to the systematic industrial ecological process.
  • the infrastructure provides computing power support for the artificial intelligence system, realizes communication with the outside world, and realizes support through the basic platform.
  • the infrastructure can communicate with the outside through sensors, and the computing power of the infrastructure can be provided by smart chips.
  • the smart chip here can be a central processing unit (CPU), a neural-network processing unit (NPU), a graphics processing unit (GPU), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA), or another hardware acceleration chip.
  • the basic platform of infrastructure can include related platform guarantees and supports such as distributed computing framework and network, and can include cloud storage and computing, interconnection and interworking network, etc.
  • data can be obtained through sensors and external communication, and then these data can be provided to smart chips in the distributed computing system provided by the basic platform for calculation.
  • Data from the upper layer of the infrastructure is used to represent data sources in the field of artificial intelligence.
  • the data involves graphics, images, voice, text, and IoT data of traditional equipment, including business data of existing systems and sensory data such as force, displacement, liquid level, temperature, and humidity.
  • the above data processing usually includes data training, machine learning, deep learning, search, reasoning, decision-making and other processing methods.
  • machine learning and deep learning can symbolize and formalize intelligent information modeling, extraction, preprocessing, training, etc. of data.
  • Reasoning refers to the process of simulating human intelligent reasoning in a computer or intelligent system, and using formalized information to carry out machine thinking and solve problems according to reasoning control strategies.
  • the typical functions are search and matching.
  • Decision-making refers to the process of decision-making after intelligent information is reasoned, and usually provides functions such as classification, sorting, and prediction.
  • some general capabilities can be formed based on the results of data processing, such as algorithms or a general system, such as translation, text analysis, computer vision processing, speech recognition, image processing identification, etc.
  • Intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields. It is the packaging of the overall solution of artificial intelligence, which commercializes intelligent information decision-making and realizes landing applications. Its application fields mainly include: intelligent manufacturing, intelligent transportation, Smart home, smart medical care, smart security, automatic driving, smart city, smart terminal, etc.
  • the embodiments of the present application can be applied in many fields of artificial intelligence, such as smart manufacturing, smart transportation, smart home, smart medical care, smart security, automatic driving, smart city and other fields.
  • the embodiments of the present application can be specifically applied in fields that require the use of (deep) neural networks, such as automatic driving, image classification, image retrieval, image semantic segmentation, image quality enhancement, image super-resolution, and natural language processing. It is especially suitable for mission scenarios that require low latency.
  • when a user stores a large number of pictures on a terminal device (for example, a mobile phone) or a cloud disk, identifying the images in the album allows the user or the system to classify and manage the album, thereby improving user experience.
  • Utilizing the processing method of the neural network model in the embodiments of the present application can improve the processing speed of the neural network model, that is, increase the speed of classifying pictures and reduce latency, and facilitates real-time labeling of pictures of different categories, which is convenient for users to view and find.
  • the classification tags of these pictures can also be provided to the album management system for classification management, which saves management time for users, improves the efficiency of album management, and improves user experience.
  • Surveillance scenarios include: smart city, field surveillance, indoor surveillance, outdoor surveillance, in-vehicle surveillance, etc.
  • multiple attribute recognition is required, such as pedestrian attribute recognition and riding attribute recognition.
  • Deep neural networks play an important role in multiple attribute recognition by virtue of their powerful capabilities.
  • the processing efficiency of the neural network model can be improved, which is beneficial to real-time processing of the input road picture, and faster identification of different attribute information in the road picture.
  • a neural network can be composed of neural units; a neural unit can refer to an operation unit that takes $x_s$ and an intercept 1 as inputs, and the output of the operation unit can be: $h_{W,b}(x) = f(W^{T}x) = f\left(\sum_{s=1}^{n} W_s x_s + b\right)$, where $s = 1, 2, \dots, n$, n is a natural number greater than 1, $W_s$ is the weight of $x_s$, and b is the bias of the neural unit.
  • f is the activation function of the neural unit, which is used to introduce nonlinear characteristics into the neural network to transform the input signal in the neural unit into an output signal.
  • the output signal of this activation function can be used as the input of the next layer.
  • the activation function can be a ReLU, tanh or sigmoid function.
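  • As a purely numeric illustration (not from the patent text), the snippet below evaluates the neural-unit formula above with a ReLU activation:

```python
# Numeric illustration of output = f(sum_s W_s * x_s + b), with f = ReLU.

def relu(z):
    return max(0.0, z)

x = [0.5, -1.0, 2.0]        # inputs x_s
w = [0.8, 0.3, -0.5]        # weights W_s
b = 0.1                     # bias

z = sum(ws * xs for ws, xs in zip(w, x)) + b   # 0.4 - 0.3 - 1.0 + 0.1 = -0.8
print(relu(z))                                 # 0.0 (ReLU clips negative values)
```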
  • a neural network is a network formed by connecting multiple above-mentioned single neural units, that is, the output of one neural unit can be the input of another neural unit.
  • the input of each neural unit can be connected with the local receptive field of the previous layer to extract the features of the local receptive field.
  • the local receptive field can be an area composed of several neural units.
  • A deep neural network (DNN), also known as a multi-layer neural network, can be understood as a neural network with multiple hidden layers.
  • DNN is divided according to the position of different layers, and the neural network inside DNN can be divided into three categories: input layer, hidden layer, and output layer.
  • the first layer is the input layer
  • the last layer is the output layer
  • the layers in the middle are all hidden layers.
  • the layers are fully connected, that is, any neuron in the i-th layer must be connected to any neuron in the i+1-th layer.
  • Although a DNN looks complicated, the work of each layer is actually not complicated; it is simply the following linear relationship expression: $\vec{y} = \alpha(W\vec{x} + \vec{b})$, where $\vec{x}$ is the input vector, $\vec{y}$ is the output vector, $\vec{b}$ is the bias vector, W is the weight matrix (also called coefficients), and $\alpha()$ is the activation function.
  • Each layer simply performs this operation on an input vector $\vec{x}$ to obtain the output vector $\vec{y}$. Since a DNN has many layers, the number of coefficient matrices W and bias vectors $\vec{b}$ is also large.
  • The definitions of these parameters in a DNN are as follows, taking the coefficient W as an example: assume that in a three-layer DNN, the linear coefficient from the fourth neuron of the second layer to the second neuron of the third layer is defined as $W_{24}^{3}$; the superscript 3 represents the layer number of the coefficient W, and the subscripts correspond to the output index 2 of the third layer and the input index 4 of the second layer.
  • In summary, the coefficient from the k-th neuron of the (L-1)-th layer to the j-th neuron of the L-th layer is defined as $W_{jk}^{L}$.
  • the input layer has no W parameter.
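  • To make the notation above concrete, here is a small illustrative forward-pass sketch (not from the patent text) in which each layer applies $\vec{y} = \alpha(W\vec{x} + \vec{b})$; layer sizes, the tanh activation, and the random weights are arbitrary assumptions.

```python
# Sketch of a DNN forward pass: each layer L computes y = alpha(W^L x + b^L),
# where entry (j, k) of W^L is the coefficient from neuron k of layer L-1
# to neuron j of layer L.
import numpy as np

def forward(x, weights, biases, alpha=np.tanh):
    for W, b in zip(weights, biases):
        x = alpha(W @ x + b)
    return x

rng = np.random.default_rng(0)
# Layer sizes 4 -> 5 -> 3 -> 2 (the input layer itself has no W parameter).
weights = [rng.standard_normal((5, 4)),
           rng.standard_normal((3, 5)),
           rng.standard_normal((2, 3))]
biases = [np.zeros(5), np.zeros(3), np.zeros(2)]
print(forward(rng.standard_normal(4), weights, biases))
```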
  • more hidden layers make the network more capable of describing complex situations in the real world. Theoretically speaking, a model with more parameters has a higher complexity and a greater "capacity", which means that it can complete more complex learning tasks.
  • Training the deep neural network is the process of learning the weight matrix, and its ultimate goal is to obtain the weight matrix of all layers of the trained deep neural network (the weight matrix formed by the vector W of many layers).
  • Convolutional neural network is a deep neural network with a convolutional structure.
  • the convolutional neural network contains a feature extractor composed of a convolutional layer and a subsampling layer, which can be regarded as a filter.
  • the convolutional layer refers to the neuron layer that performs convolution processing on the input signal in the convolutional neural network.
  • a neuron can only be connected to some adjacent neurons.
  • a convolutional layer usually contains several feature planes, and each feature plane can be composed of some rectangularly arranged neural units. Neural units of the same feature plane share weights, and the shared weights here are convolution kernels.
  • Shared weights can be understood as a way to extract image information that is independent of location.
  • the convolution kernel can be formalized as a matrix of random size, and the convolution kernel can obtain reasonable weights through learning during the training process of the convolutional neural network.
  • the direct benefit of sharing weights is to reduce the connections between the layers of the convolutional neural network, while reducing the risk of overfitting.
  • the calculation graph corresponding to the neural network model is used to indicate the calculation process of the neural network model.
  • the calculation graph corresponding to the neural network model can be understood as a form of expression of the calculation process of the neural network model.
  • Any part of a calculation graph can be used as a subgraph of the calculation graph, and the calculation graph itself can also be regarded as a subgraph of itself. Any part of the calculation graph can also be understood as any graph structure in the calculation graph.
  • the subgraph of the calculation graph can also be called the graph structure of the calculation graph.
  • the overhead of the graph may include the computational overhead of the graph.
  • the computational overhead of a graph refers to the computational overhead of the computational tasks corresponding to all operators in the graph.
  • Computational overhead can also be referred to as computational volume.
  • the overhead of the graph may also include the communication overhead of the graph.
  • the Actor Framework can also be called the Actor Model.
  • the Actor model is used to handle concurrent computations.
  • An actor refers to the most basic computing unit, which can receive a message and perform calculations based on it.
  • each Actor has an address and can communicate by sending messages.
  • Each Actor is completely independent and can perform its own operations concurrently.
  • Each Actor can run on the same machine or on different machines.
  • the Actor model usually has two task scheduling methods: thread-based scheduling and event-based scheduling.
  • Thread-based scheduling: a thread is assigned to each Actor; when receiving a message, if the current Actor's mailbox is empty, the current thread is blocked.
  • Event-based scheduling: an event can be understood as the arrival of a task or message, at which point a thread is allocated and the Actor's task is executed.
  • the input of an Actor is the received message; after receiving a message, the Actor processes the task defined in the message, and after finishing the task it can send messages to other Actors.
  • large-scale tasks can be decomposed into multiple small tasks, and these small tasks can be executed concurrently by multiple Actors, thereby reducing the task completion time.
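  • For illustration only, the following minimal actor-style sketch uses Python threads and mailbox queues to show how parallel subgraph tasks could be dispatched and their results collected; it is not the API of any actual inference framework, and all names are hypothetical.

```python
# Minimal actor-style sketch (threads + mailboxes): each actor waits on its
# mailbox, runs its behavior on each message, and reports results back.
import queue
import threading

class Actor:
    def __init__(self, behavior):
        self.mailbox = queue.Queue()
        self.behavior = behavior                        # what to do with a message
        threading.Thread(target=self._run, daemon=True).start()

    def send(self, msg):
        self.mailbox.put(msg)

    def _run(self):
        while True:
            msg = self.mailbox.get()                    # blocks until a message arrives
            self.behavior(msg)

results = queue.Queue()
# Two "subgraph" actors run their computing tasks concurrently and report back.
worker_a = Actor(lambda data: results.put(("cpu", sum(data))))
worker_b = Actor(lambda data: results.put(("gpu", sum(data))))
worker_a.send([1, 2, 3])
worker_b.send([4, 5, 6])
print(results.get(), results.get())   # order may vary
```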
  • the embodiment of the present application provides a system architecture 100 .
  • the data collection device 160 is used to collect training data.
  • the training data may include training images and processing results corresponding to the training images.
  • for example, the processing result may be a classification result corresponding to the training image, and the classification result of the training image may be a manually pre-labeled result.
  • the data collection device 160 After collecting the training data, the data collection device 160 stores the training data in the database 130 , and the training device 120 obtains the target model/rule 101 based on training data maintained in the database 130 .
  • the training device 120 obtains the target model/rule 101 based on the training data.
  • the training device 120 processes the input raw data and compares the output value with the target value until the difference between the value output by the training device 120 and the target value is less than a certain threshold, thereby completing the training of the target model/rule 101.
  • the target model/rule 101 in the embodiment of the present application may specifically be a neural network model.
  • For example, it may be a convolutional neural network or a residual network.
  • the training data maintained in the database 130 may not all be collected by the data collection device 160, but may also be received from other devices.
  • the training device 120 does not necessarily train the target model/rule 101 based entirely on the training data maintained by the database 130; it is also possible to obtain training data from the cloud or elsewhere for model training, and the above description should not be construed as a limitation on the embodiments of the present application.
  • the target model/rule 101 trained by the training device 120 can be applied to different systems or devices, such as the execution device 110 shown in FIG. 2, which may be a terminal device such as a laptop, an augmented reality (AR)/virtual reality (VR) device, or a vehicle-mounted terminal, or may be a server or a cloud, etc.
  • an execution device 110 is configured with an input/output (I/O) interface 112 for data interaction with external devices, and a user may input data to the I/O interface 112 through a client device 140 .
  • when the execution device 110 preprocesses the input data, or when the calculation module 111 of the execution device 110 performs calculations and other related processing, the execution device 110 can call data, code, etc. in the data storage system 150 for corresponding processing, and may also store the correspondingly processed data and instructions in the data storage system 150.
  • the I/O interface 112 returns the processing result, such as the processing result of the data obtained above, to the client device 140, thereby providing it to the user.
  • the training device 120 can generate corresponding target models/rules 101 based on different training data for different goals or different tasks, and the corresponding target models/rules 101 can be used to achieve the above-mentioned goals or complete the above-mentioned task, thereby providing the user with the desired result.
  • the user can manually specify the input data, and the manual specification can be operated through the interface provided by the I/O interface 112 .
  • the client device 140 can automatically send the input data to the I/O interface 112 . If the client device 140 is required to automatically send the input data to obtain the user's authorization, the user can set the corresponding authority in the client device 140 .
  • the user can view the results output by the execution device 110 on the client device 140, and the specific presentation form may be specific ways such as display, sound, and action.
  • the client device 140 can also be used as a data collection terminal, collecting the input data input to the I/O interface 112 as shown in the figure and the output results of the output I/O interface 112 as new sample data, and storing them in the database 130 .
  • alternatively, the I/O interface 112 may directly store, as new sample data, the input data input to the I/O interface 112 and the output results of the I/O interface 112 shown in the figure into the database 130.
  • FIG. 2 is only a schematic diagram of a system architecture provided by the embodiment of the present application, and the positional relationship between devices, devices, modules, etc. shown in the figure does not constitute any limitation.
  • for example, in FIG. 2 the data storage system 150 is an external memory relative to the execution device 110; in other cases, the data storage system 150 may also be placed in the execution device 110.
  • the target model/rule 101 is obtained by training with the training device 120; the target model/rule 101 in the embodiments of the present application may be the neural network model in the present application, and specifically, the neural network model in the embodiments of the present application may be a CNN, a residual network, or the like.
  • CNN is a very common neural network.
  • the structure of CNN will be introduced in detail below in conjunction with Figure 3.
  • the convolutional neural network is a deep neural network with a convolutional structure, and it is a deep learning architecture.
  • the deep learning architecture refers to the algorithm through machine learning. Multiple levels of learning are performed on the abstraction level.
  • CNN is a feed-forward artificial neural network in which individual neurons can respond to images input into it.
  • a convolutional neural network (CNN) 200 may include an input layer 210 , a convolutional layer/pooling layer 220 (where the pooling layer is optional), and a fully connected layer (fully connected layer) 230 .
  • the convolutional layer/pooling layer 220 may include layers 221 to 226. For example, in one implementation, layer 221 is a convolutional layer, layer 222 is a pooling layer, layer 223 is a convolutional layer, layer 224 is a pooling layer, layer 225 is a convolutional layer, and layer 226 is a pooling layer; in another implementation, layers 221 and 222 are convolutional layers, layer 223 is a pooling layer, layers 224 and 225 are convolutional layers, and layer 226 is a pooling layer. That is, the output of a convolutional layer can be used as the input of a subsequent pooling layer, or as the input of another convolutional layer to continue the convolution operation.
  • the convolution layer 221 may include many convolution operators, which are also called kernels, and their role in image processing is equivalent to a filter for extracting specific information from the input image matrix.
  • a convolution operator can essentially be a weight matrix, which is usually pre-defined; during a convolution operation on an image, the weight matrix is typically moved one pixel after another (or two pixels after two pixels, depending on the value of the stride) along the horizontal direction of the input image, so as to extract specific features from the image.
  • the size of the weight matrix should be related to the size of the image. It should be noted that the depth dimension of the weight matrix is the same as the depth dimension of the input image.
  • during the convolution operation, the weight matrix extends to the entire depth of the input image; therefore, convolution with a single weight matrix produces a convolutional output with a single depth dimension. However, in most cases, instead of a single weight matrix, multiple weight matrices of the same size (rows x columns), that is, multiple matrices of the same shape, are applied.
  • the output of each weight matrix is stacked to form the depth dimension of the convolution image, where the dimension can be understood as determined by the "multiple" mentioned above.
  • Different weight matrices can be used to extract different features in the image. For example, one weight matrix is used to extract image edge information, another weight matrix is used to extract specific colors of the image, and another weight matrix is used to filter unwanted noise in the image.
  • the multiple weight matrices have the same size (rows x columns), so the feature maps extracted by these weight matrices also have the same size, and the extracted feature maps of the same size are then combined to form the output of the convolution operation.
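  • The shape walk-through below is an illustrative sketch (not from the patent text) of how several same-sized kernels each produce one feature map and how those maps stack into the depth dimension of the output; sizes and the naive loop are arbitrary assumptions.

```python
# Sketch: several same-sized convolution kernels applied to one input; their
# per-kernel outputs are stacked to form the depth (channel) dimension.
import numpy as np

image = np.random.rand(32, 32, 3)          # H x W x C input
kernels = np.random.rand(8, 5, 5, 3)       # 8 kernels, each 5x5x3 (depth matches input)

out_h, out_w = 32 - 5 + 1, 32 - 5 + 1      # stride 1, no padding
output = np.zeros((out_h, out_w, 8))
for k in range(8):                         # one output channel per kernel
    for i in range(out_h):
        for j in range(out_w):
            output[i, j, k] = np.sum(image[i:i+5, j:j+5, :] * kernels[k])
print(output.shape)                        # (28, 28, 8)
```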
  • weight values in these weight matrices need to be obtained through a lot of training in practical applications, and each weight matrix formed by the weight values obtained through training can be used to extract information from the input image, so that the convolutional neural network 200 can make correct predictions .
  • the initial convolutional layer (such as 221) often extracts more general features, which can also be referred to as low-level features;
  • the features extracted by the later convolutional layers (such as 226) become more and more complex, such as features such as high-level semantics, and features with higher semantics are more suitable for the problem to be solved.
  • it can be one convolutional layer followed by one pooling layer, or multiple convolutional layers followed by one or more pooling layers.
  • the sole purpose of pooling layers is to reduce the spatial size of the image.
  • the pooling layer may include an average pooling operator and/or a maximum pooling operator for sampling an input image to obtain an image of a smaller size.
  • the average pooling operator can calculate the pixel values in the image within a specific range to generate an average value as the result of average pooling.
  • the maximum pooling operator can take the pixel with the largest value within a specific range as the result of maximum pooling. Also, just like the size of the weight matrix used in the convolutional layer should be related to the size of the image, the operators in the pooling layer should also be related to the size of the image.
  • the size of the image output after being processed by the pooling layer may be smaller than the size of the image input to the pooling layer, and each pixel in the image output by the pooling layer represents the average or maximum value of the corresponding sub-region of the image input to the pooling layer.
  • After the processing by the convolutional layer/pooling layer 220, the convolutional neural network 200 is not yet able to output the required output information, because, as mentioned earlier, the convolutional layer/pooling layer 220 only extracts features and reduces the parameters brought by the input image. To generate the final output information (the required class information or other relevant information), the convolutional neural network 200 needs to use the fully connected layer 230 to generate one output or a group of outputs whose number equals the number of required classes. Therefore, the fully connected layer 230 may include multiple hidden layers (231, 232 to 23n as shown in FIG. 3), and the parameters contained in these hidden layers may be obtained by pre-training based on relevant training data of a specific task type; for example, the task type may include image recognition, image classification, image super-resolution reconstruction, and so on.
  • the output layer 240 has a loss function similar to the classification cross entropy, and is specifically used to calculate the prediction error.
  • backpropagation (as shown in FIG. 3, propagation in the direction from 240 to 210 is backpropagation) starts to update the weight values and biases of the aforementioned layers, so as to reduce the loss of the convolutional neural network 200, that is, the error between the result output by the convolutional neural network 200 through the output layer and the ideal result.
  • the convolutional neural network 200 shown in FIG. 3 is only an example of a convolutional neural network.
  • the convolutional neural network may also exist in the form of other network models, for example, including only part of the network structure shown in FIG. 3.
  • the embodiment of the present application provides a system architecture 300 .
  • the system architecture includes a local device 301, a local device 302, an execution device 310, and a data storage system 350, wherein the local device 301 and the local device 302 are connected to the execution device 310 through a communication network.
  • Execution device 310 may be implemented by one or more servers.
  • the execution device 310 may be used in cooperation with other computing devices, such as data storage, routers, load balancers and other devices.
  • Execution device 310 may be arranged on one physical site, or distributed on multiple physical sites.
  • the execution device 310 may use the data in the data storage system 350 or call the program code in the data storage system 350 to implement the neural network model processing method of the embodiment of the present application.
  • the execution device 310 may perform the following process: acquiring the computing capabilities of m processors; and performing operator segmentation on a first subgraph in a first computation graph corresponding to the neural network model to obtain a second computation graph, where the second computation graph includes n parallel subgraphs corresponding to the first subgraph, and the overheads of the n subgraphs and the computing capabilities of n processors among the m processors satisfy a first matching relationship, 1 < n ≤ m, n is an integer.
  • Each local device can represent any computing device, such as a personal computer, computer workstation, smartphone, tablet, smart camera, smart car or other type of cellular phone, media consumption device, wearable device, set-top box, game console, etc.
  • Each user's local device can interact with the execution device 310 through any communication mechanism/communication standard communication network, and the communication network can be a wide area network, a local area network, a point-to-point connection, etc., or any combination thereof.
  • the local device 301 and the local device 302 obtain relevant parameters of the neural network model from the execution device 310, deploy the neural network model on the local device 301 and the local device 302, and use the neural network model of the embodiment of the application
  • the processing method of the model performs image classification, image processing, speech processing or text processing, and so on.
  • the neural network model can be directly deployed on the execution device 310, and the execution device 310 obtains the data to be processed from the local device 301 and the local device 302, and uses the processing method of the neural network model in the embodiment of this application to be processed The data is processed.
  • the aforementioned execution device 310 may also be a cloud device. In this case, the execution device 310 may be deployed on the cloud; or, the aforementioned execution device 310 may also be a terminal device. In this case, the execution device 310 may be deployed on the user terminal side. This is not limited.
  • data parallelism can be used to improve the processing efficiency of neural network models.
  • the server can evenly distribute a large amount of data to each distributed machine according to the number of distributed machines.
  • Each distributed machine processes the allocated data based on the calculation graph of the neural network model.
  • the data parallel method is usually applied in the training scenario of a large-scale neural network model; in an inference scenario with high real-time requirements, the amount of data processed each time is small, for example, a single image, so the data parallel method cannot evenly distribute the data, and it is difficult to improve the processing efficiency of the model.
  • the data parallel method will introduce additional data synchronization overhead and memory overhead.
  • hardware computing power and storage capacity are limited and cannot absorb the data synchronization overhead and memory overhead; especially in latency-sensitive scenarios, it is difficult to significantly improve the processing efficiency of the neural network model.
  • a multi-core processor can also be used to run the neural network model to improve the processing efficiency of the neural network model. That is, the operations of the neural network model are jointly executed by multiple cores in the multi-core processor.
  • this solution only utilizes the computing resources of a single processor, and the improvement of processing efficiency is overly dependent on the computing power of the processor, and other processors are idle, resulting in a waste of resources.
  • the embodiment of the present application provides a method for processing a neural network model, which can improve the processing efficiency of the neural network model.
  • the solutions in the embodiments of the present application can be applied to the training scenario of the neural network model, and can also be applied to the inference scenario of the neural network model.
  • inference scenarios have higher requirements on processing performance.
  • the application of the solutions in the embodiments of the application to inference scenarios is used as an example for illustration.
  • compared with deploying and running neural network models on the server side, the device side has the advantages of high real-time performance, data privacy protection, and saving cloud resources.
  • the application of deploying and running neural network models on the device side is becoming more and more extensive.
  • the embodiments of the present application mainly take the application of the solutions of the embodiments of the present application in a device-side reasoning scenario as an example for illustration.
  • FIG. 5 shows an overall architecture of device-side inference. A deep learning framework is used to convert deep learning tasks expressed by neural network models into instructions and data that can be executed on processors.
  • Each end-side frame has its own model representation.
  • inference runs are performed against the model.
  • An offline model is an offline file of the framework's model representation.
  • the end-side framework can convert third-party models built under other end-side frameworks through conversion tools to obtain an offline file of the intermediate representation (IR) of the current end-side framework.
  • the end-side framework can also obtain the calculation graph of the model built under the current end-side framework through its own composition interface.
  • runtime initialization phase: read the offline file, generate a runtime topology graph, and initialize each operator node in the graph to define the actual behavior of the operator at runtime.
  • This stage is used to prepare for the runtime scheduling stage and save the performance overhead of runtime scheduling. Specifically, this stage can perform operations such as topological sorting, data rearrangement, memory application, algorithm selection, preset backend, and subgraph division.
  • Preset backends can also be referred to as operator hardware deployment options.
  • the operator in the model can be preset with a backend according to user settings, operating environment, and operator support. For example, if the user has enabled the CPU, operators need to be deployed on the CPU. For another example, if the NPU is enabled by the user settings, but the running device does not have an NPU, the operator will be deployed on other hardware, such as a CPU. For another example, if the NPU is enabled by the user, and the running device includes the NPU, but the NPU does not support the operation of a certain operator, the operator needs to be deployed on other hardware, such as a CPU.
  • Algorithm selection refers to selecting a specific algorithm for an operator to implement the operation process.
  • the convolution operation can be realized through a 1*1 algorithm, a winograd algorithm, an Img2Col algorithm, or a sliding window algorithm.
  • after algorithm selection, the kernel object of each operator can be generated, and each subgraph saves the kernel objects of its operators in topological order.
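  • Purely as an illustration of the runtime-initialization idea above (not the framework's actual API), the sketch below picks an algorithm for each operator and keeps the resulting kernel list in topological order; the selection rules and all names are hypothetical.

```python
# Illustrative sketch: select an algorithm per operator and keep kernels in
# topological order so the scheduling stage only has to call run().

def select_algorithm(op):
    if op["type"] != "Conv2D":
        return "generic"
    if op["kernel"] == (1, 1):
        return "1x1"
    if op["kernel"] == (3, 3) and op["stride"] == 1:
        return "winograd"
    return "img2col"

graph = [
    {"name": "conv1", "type": "Conv2D", "kernel": (3, 3), "stride": 1},
    {"name": "conv2", "type": "Conv2D", "kernel": (1, 1), "stride": 1},
    {"name": "relu1", "type": "ReLU"},
]
kernels = [(op["name"], select_algorithm(op)) for op in graph]  # topological order
print(kernels)  # [('conv1', 'winograd'), ('conv2', '1x1'), ('relu1', 'generic')]
```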
  • runtime scheduling stage, that is, the device-side inference stage: according to the topology graph constructed in the runtime initialization stage and the operator information in the graph, the corresponding hardware, such as the CPU, NPU or GPU, is called to perform inference on the user's input data. That is to say, at runtime, the corresponding hardware is called in sequence to execute the run function of each kernel object to realize the operation of the operator.
  • the topology graph is divided into subgraph 1, subgraph 2, subgraph 3, and subgraph 4, which are executed sequentially on the corresponding hardware in that order to obtain the output results.
  • other pieces of hardware are idle, resulting in a waste of resources, and the reasoning performance cannot be improved.
  • FIG. 6 shows a device-side inference system architecture provided by an embodiment of the present application.
  • a graph composition operation is added in the runtime initialization phase to obtain heterogeneous graphs.
  • a heterogeneous graph is used to indicate subgraphs deployed on heterogeneous hardware.
  • the graph composition operation may include operations such as heterogeneous hardware capability calculation, operator segmentation, parallel graph search, cost model calculation, and heterogeneous graph composition.
  • Heterogeneous hardware capability computing refers to computing the computing capabilities of heterogeneous hardware.
  • the computing capability of the heterogeneous hardware can be calculated according to the characteristics of the heterogeneous hardware, for example, the calculation rate of multiplication, the bandwidth limit of data transmission, and the like.
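  • The snippet below is a hypothetical sketch of such a capability score combining a multiplication rate and a memory bandwidth with an arbitrary weighting; the patent does not specify the actual formula, so the function, weights, and numbers are all assumptions.

```python
# Hypothetical capability score mixing compute throughput and data-transfer
# bandwidth; the real calculation is not specified at this level of detail.

def capability_score(mul_rate_gops, bandwidth_gbps, alpha=0.7):
    """Weighted mix of multiplication rate and memory bandwidth."""
    return alpha * mul_rate_gops + (1 - alpha) * bandwidth_gbps

processors = {"CPU": capability_score(50, 30), "GPU": capability_score(400, 100)}
print(processors)  # {'CPU': 44.0, 'GPU': 310.0}
```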
  • For a specific description of heterogeneous hardware capability calculation, refer to step S710 of method 700 or step S810 of method 800; details are not repeated here.
  • Operator splitting refers to performing operator splitting on the first subgraph in the first computation graph corresponding to the model to obtain the second computation graph.
  • Cost model calculation refers to the overhead of calculating the graph.
  • For a specific description, refer to step S720 of method 700 or step S820 of method 800; details are not repeated here.
  • Parallel graph search refers to searching for combinations of subgraphs suitable for parallel execution in a computation graph.
  • For a specific description, refer to method 700 or step S830 of method 800; details are not repeated here.
  • parallel graph search is an optional operation.
  • Heterogeneous graph composition refers to constructing heterogeneous parallel graphs.
  • the heterogeneous parallel graph is used to indicate the hardware corresponding to each subgraph.
  • Constructing a heterogeneous parallel graph is to select the corresponding hardware for each subgraph and deploy it.
  • For a specific description, reference may be made to step S730 of method 700 or step S840 of method 800.
  • An asynchronous scheduling framework is introduced into the system architecture to enable parallel reasoning of heterogeneous graphs on heterogeneous hardware.
  • the asynchronous scheduling framework can be implemented through the Actor model.
  • the asynchronous scheduling framework may also be implemented in other forms, as long as the asynchronous scheduling can be implemented, which is not limited in this embodiment of the present application.
  • the operation initialization phase also includes orchestration operations.
  • the orchestration operation in the running initialization phase corresponds to the Actor model of the asynchronous scheduling framework. If the asynchronous scheduling framework is implemented in other forms, the orchestrated operations in the running initialization phase may be adjusted accordingly to adapt to the asynchronous scheduling framework, or the running initialization phase may not include orchestrated operations.
  • the calculation of the model can be performed based on the programmed actor list. For example, perform heterogeneous parallel operations based on Actor1, Actor2, Actor3, and Actor4 in FIG. 6 .
  • FIG. 7 shows a method 700 for processing a neural network model provided by an embodiment of the present application.
  • the method shown in FIG. 7 can be executed by an execution device of the neural network model, which can be a cloud service device or a terminal device, for example, a computer, a server and other devices with sufficient computing power to perform neural network model operations, It can also be a system composed of cloud service equipment and terminal equipment.
  • the method 700 can be applied not only in the inference scenario of the neural network model, but also in the training scenario of the neural network model.
  • the method 700 may be executed by the execution device 110 in FIG. 2 or the execution device 310 in FIG. 4 or a local device.
  • the method 700 may be executed by the training device 120 in FIG. 2 or the execution device 310 in FIG. 4 .
  • the method 700 includes step S710 to step S730. Step S710 to step S730 will be described in detail below.
  • the m processors are used to execute the operation of the neural network model, and m is an integer greater than 1.
  • m processors may be understood as m chips, or in other words, m pieces of hardware.
  • the m processors are m processors of different types.
  • the m processors are m heterogeneous hardware in the terminal device.
  • the m pieces of heterogeneous hardware are different types of hardware that can jointly execute operations of the neural network model.
  • the m processors may include a central processing unit (central processing unit, CPU) or a neural network computing processor, and the neural network computing processor may include a graphics processing unit (graphics processing unit, GPU), Neural-network processing unit (NPU) and tensor processing unit (TPU) and so on.
  • TPU is an artificial intelligence accelerator ASIC fully customized by Google for machine learning.
  • m may be 2, that is, the terminal device may include two processors, and the two processors may be a CPU and a GPU.
  • m may be 3, that is, the terminal device may include three processors, and the three processors may be a CPU, a GPU, and an NPU.
  • the m processors can also be understood as m working nodes.
  • the m working nodes may be of the same type, or may be of different types.
  • the m processors may include at least one of a single-core processor and a multi-core processor, which is not limited in this embodiment of the present application.
  • the calculation capability of a processor can be understood as the amount of calculation that the processor can carry, or the amount of calculation that the processor is suitable for carrying.
  • the computing capability of the processor may also be referred to as the carrying capacity of the processor.
  • the calculation capability of the processor can be calculated according to the characteristics of the processor, for example, the calculation rate of multiplication, the bandwidth limit of data transmission, and the like.
  • the neural network model in the embodiment of the present application may be an existing neural network model, for example, a CNN model, a residual network model, or a recurrent neural network model.
  • the neural network model in the embodiment of the present application may also be constructed by the user, which is not limited in the embodiment of the present application.
  • the solutions of the embodiments of the present application can be applied to multiple fields, and the types of application fields are related to the tasks of the neural network model.
  • the solutions of the embodiments of the present application can be applied to the field of computer vision.
  • image processing tasks include image classification, image detection, image segmentation, image recognition, or image generation.
  • text processing tasks include text recognition or text translation.
  • speech processing tasks include speech recognition and the like. This embodiment of the present application does not limit it.
  • Step S710 may be executed by one of the m processors, or may be executed by other devices other than the m processors, which is not limited in this embodiment of the present application.
  • S720: Perform operator segmentation on the first subgraph in the first computation graph corresponding to the neural network model to obtain a second computation graph.
  • the second computation graph includes n parallel subgraphs corresponding to the first subgraph, and the overheads of the n subgraphs and the computing capabilities of n processors among the m processors satisfy a first matching relationship, where 1 < n ≤ m and n is an integer.
  • the calculation graph corresponding to the neural network model is used to indicate the calculation process of the neural network model.
  • the calculation graph corresponding to the neural network model can be understood as a form of expression of the calculation process of the neural network model.
  • the first computation graph corresponding to the neural network model may be an initial computation graph.
  • the neural network model may be a model identified by the inference framework.
  • an operator can be understood as a network layer in a neural network model.
  • a convolution operator can be understood as a convolutional layer in a neural network model.
  • Operator segmentation can be understood as the segmentation of computing tasks corresponding to operators.
  • Parallel n subgraphs refer to n subgraphs that can be executed in parallel. In other words, computing tasks corresponding to n parallel subgraphs can be executed in parallel.
  • in other words, a subgraph containing operators is divided into multiple subgraphs that can be executed in parallel.
  • the parallel n subgraphs corresponding to the first subgraph refer to n subgraphs obtained by performing operator segmentation on the first subgraph.
  • the cost of the sub-graph can be calculated through the costmodel in the architecture shown in FIG. 6 .
  • the overhead of the subgraph may include the calculation overhead of the subgraph.
  • the computational cost of a subgraph refers to the computational cost of all operators in the subgraph.
  • the overhead of the subgraph may also include the communication overhead of the subgraph.
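  • the following is a minimal sketch of such a cost model, assuming the computation cost of an operator is approximated by its multiply-accumulate count and that communication overhead is charged per byte moved between processors; both assumptions are illustrative and are not the framework's costmodel implementation.

```python
# Illustrative cost model: subgraph overhead = sum of operator computation
# costs, plus an optional communication term for data crossing processors.

def operator_cost(macs: int) -> float:
    # Assumed proxy for computation cost: multiply-accumulate count.
    return float(macs)

def subgraph_overhead(operator_macs, comm_bytes: int = 0,
                      cost_per_byte: float = 0.001) -> float:
    compute_cost = sum(operator_cost(m) for m in operator_macs)
    communication_cost = comm_bytes * cost_per_byte
    return compute_cost + communication_cost

# Example: three consecutive convolution operators plus data sent to another processor.
print(subgraph_overhead([1_000_000, 800_000, 600_000], comm_bytes=240 * 240 * 128))
```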
  • the first subgraph is divided into n parallel subgraphs by means of operator segmentation.
  • the computation amount of the first subgraph may be equal to the sum of the computation amounts of the n subgraphs.
  • the overheads of the n subgraphs and the computing capabilities of the n processors among the m processors satisfy the first matching relationship; this can also be described as the overheads of the n subgraphs matching the computing capabilities of the n processors among the m processors.
  • the first subgraph is divided into n parallel subgraphs according to the computing capabilities of n processors among the m processors.
  • the solution of the embodiment of the present application can perceive the difference in the computing capabilities of different processors, and based on this, perform the segmentation of the first subgraph.
  • the overhead of the n subgraphs matches the computing power of the n processors.
  • subgraph A in the n subgraphs matches the computing capability of processor A in the n processors, and there is a corresponding relationship between subgraph A and processor A. That is to say, there is a one-to-one correspondence between the n subgraphs and the n processors.
  • the corresponding relationship can be used to determine the manner of deploying the computing tasks of the n subgraphs in step S730.
  • n and m may be the same or different.
  • the number of parallel subgraphs and the number of processors may be the same or different.
  • each first subgraph may be segmented respectively to obtain multiple parallel subgraphs corresponding to each first subgraph.
  • the numbers of parallel subgraphs obtained by segmenting different first subgraphs can be the same or different. This embodiment of the present application does not limit it.
  • Step S720 may be executed by one of the m processors, or may be executed by other devices other than the m processors, which is not limited in this embodiment of the present application.
  • S730: Allocate the computing tasks of the p subgraphs in the second computation graph to the m processors for execution, where the p subgraphs include the n subgraphs, the computing tasks of the n subgraphs are respectively allocated to the n processors for execution, and p is an integer greater than or equal to n.
  • each subgraph in the n subgraphs can be used as one of the p subgraphs.
  • Assigning the computing tasks of the p subgraphs of the second computation graph to the m processors for execution can also be understood as deploying the p subgraphs on the m processors, that is, from the m processors for the The p subgraphs select processors for executing corresponding computing tasks, and deploy the p subgraphs on the m processors.
  • the m processors are scheduled to execute the computing tasks of the p sub-graphs of the second computing graph.
  • the m processors perform calculation tasks of the p subgraphs to complete the operation of the neural network model, or in other words, the m processors can complete the operation of the neural network model based on the p subgraphs.
  • the process of selecting and deploying hardware for the sub-graph may also be referred to as a heterogeneous graph composition process.
  • the heterogeneous graph is used to indicate the heterogeneous hardware corresponding to the p subgraphs.
  • a heterogeneous graph is used to indicate individual subgraphs deployed on heterogeneous hardware.
  • the n subgraphs in the p subgraphs are subgraphs executed in parallel.
  • a heterogeneous graph may also be called a heterogeneous parallel graph.
  • One subgraph is deployed on one processor, and multiple subgraphs can be deployed on the same processor.
  • the second calculation graph can be divided into p subgraphs, and the specific division method can be set as required.
  • topologically ordered operators can form a subgraph.
  • a subgraph is executed by one hardware device.
  • multiple topologically ordered operators can be executed on the same processor, reducing communication overhead.
  • a single operator can also form a subgraph.
  • the overhead of the n subgraphs matches the computing power of the n processors.
  • there is a corresponding relationship between the n subgraphs and the n processors. That is to say, the overheads of the n subgraphs match the computing capabilities of the n processors, and the n subgraphs correspond to the n processors one to one.
  • the computing tasks of the n sub-graphs are respectively assigned to the n processors for execution, specifically, the computing tasks of the n sub-graphs are respectively assigned to the processors whose computing capabilities match the overhead of the n sub-graphs, that is, the The computing tasks of the n sub-graphs are respectively allocated to the processors corresponding to the n sub-graphs for execution. That is, the computing tasks of the n subgraphs are deployed according to the deployment mode defined when the n subgraphs are divided into operators.
  • step S730 may be performed by one of the m processors, or may be performed by other devices other than the m processors, which is not limited in this embodiment of the present application.
  • in the solution of the embodiment of the present application, the first subgraph is segmented at operator level according to the computing capabilities of different processors to obtain multiple parallel subgraphs matching the computing capabilities of the different processors, so that the computing tasks of the multiple parallel subgraphs can be executed in parallel on processors with matching computing capabilities; the rational use of hardware resources is conducive to improving the processing efficiency of the model.
  • the matching of the overheads of the n subgraphs with the computing capabilities of the n processors may include: the overheads of the n subgraphs are less than or equal to the computing capabilities of the n processors.
  • the matching between the overhead and the computing capability may be that the overhead is less than or equal to the computing capability.
  • the calculation result of the first sub-graph can only be obtained after the calculation tasks of the n sub-graphs are all executed.
  • a subgraph whose calculation has finished needs to wait for the other parallel subgraphs to complete; that is, the calculation time of the n subgraphs depends on the longest calculation time among the n subgraphs.
  • if the overheads are not matched, a processor with low computing power may process the computing task of a subgraph with a large overhead, resulting in a long execution time, which in turn lengthens the computing time of the n subgraphs.
  • the first matching relationship between the overheads of the n subgraphs and the computing capabilities of the n processors may include: the ratio of the overheads of the n subgraphs to the ratio of the computing capabilities of the n processors The difference is less than or equal to the first threshold.
  • the n subgraphs are two subgraphs, and the ratio of the overhead of the two subgraphs to the computing power ratio of the two processors among the m processors is less than or equal to the first threshold, then the overhead of the two subgraphs Match the computing power of the two processors.
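  • as an illustrative sketch for the two-subgraph case described above, the check below compares the overhead ratio with the capability ratio against a first threshold; the threshold value of 0.1 is an assumption for illustration only.

```python
# Sketch of the first matching relationship for two subgraphs: the overheads
# match if the overhead ratio differs from the capability ratio by at most
# the first threshold (threshold value assumed here).

def overheads_match(costs, capabilities, first_threshold: float = 0.1) -> bool:
    cost_ratio = costs[0] / costs[1]
    capability_ratio = capabilities[0] / capabilities[1]
    return abs(cost_ratio - capability_ratio) <= first_threshold

# Two subgraphs with overheads 83 and 157 on processors whose capability ratio is 1:2.
print(overheads_match([83, 157], [1, 2]))  # True: |0.529 - 0.5| <= 0.1
```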
  • the n subgraphs belong to one combination among multiple candidate subgraph combinations, and each candidate subgraph combination includes n candidate subgraphs corresponding to the first subgraph. If, among the multiple candidate subgraph combinations, the difference between the ratio of the overheads of the n subgraphs and the ratio of the computing capabilities of the n processors is the smallest, then the overheads of the n subgraphs and the computing capabilities of the n processors satisfy the first matching relationship.
  • in this way, processors with stronger computing power can execute the computing tasks of subgraphs with higher overhead, and processors with weaker computing power can execute the computing tasks of subgraphs with lower overhead; improving the degree of adaptation between overheads and computing capabilities is conducive to further improving the processing efficiency of the model.
  • the cost of the first subgraph is greater than the cost of at least half of the subgraphs in the first computation graph.
  • the cost of the first subgraph is greater than or equal to the second threshold.
  • the second threshold may be determined according to the computing capability of the processor.
  • the execution sequence of all operators in the first subgraph is serial execution.
  • the execution sequence of the multiple operators is serial execution, that is, the multiple operators in the first subgraph are topologically ordered.
  • the first subgraph may be a graph structure that has relatively high overhead and does not include operators executed in parallel.
  • the first subgraph may be a continuous convolution structure, that is, the first subgraph may include multiple continuous convolution operators.
  • the convolution operator may include various types of convolution operators such as a conventional convolution operator or a depthwise convolution (Convolution Depthwise) operator.
  • when the first subgraph includes a plurality of topologically ordered operators, the first subgraph is segmented at operator level and the computing tasks of the n subgraphs obtained after segmentation are assigned to the corresponding processors; that is, after the computing tasks of the multiple operators are split, the split computing tasks are assigned to the corresponding processors, each processor completes its split computing task according to the topological order, and the calculation results of the processors are then combined to obtain the calculation result of the first subgraph. This can reduce the communication overhead introduced by segmentation and ensure the processing efficiency of the model.
  • the input data of the n sub-graphs is obtained by performing data segmentation on the input data of the first sub-graph.
  • the segmentation operation on the input data can be realized by introducing a segmentation operator.
  • the second computation graph further includes a split operator, which is used to split the input data of the first sub-graph to obtain the input data of the n sub-graphs.
  • the weights of operators in the n subgraphs and the first subgraph may be the same. In other words, operator splitting does not change the weights of operators in the first subgraph.
  • Different processors can perform different computing tasks in parallel based on the same weight.
  • the difference in computing tasks is due to the segmentation of the input data. In other words, the difference in the computing task is caused by the difference in the input data.
  • the common data in the neural network model can be represented as 4-dimensional data, and the 4 dimensions are batch size (batch, N), number of channels (channel, C), height (height, H) and width (width, W).
  • the input data of the n subgraphs is obtained by dividing the input data of the first subgraph according to any of the following dimensions: the batch size N of the input data, the number of channels C of the input data, and the height H of the input data Or the width W of the input data.
  • Segmenting according to any of the above dimensions may also be referred to as performing segmentation according to any of the above axes.
  • splitting according to the height dimension of the input data may also be referred to as splitting according to the H axis of the input data.
  • the dimension of data segmentation is determined according to the data arrangement of the input data of the first subgraph.
  • the input data of each sub-graph in the n sub-graphs is continuous data in the data arrangement of the input data of the first sub-graph.
  • for example, when the data arrangement of the input data is NHWC, the input data of the n subgraphs is obtained by segmenting the input data of the first subgraph according to the height dimension.
  • a data arrangement of batch size N, height H, width W, number of channels C (NHWC) means that the data is arranged continuously in the order of the C axis, W axis, H axis, and N axis. That is to say, the data along the C axis is continuous; after the data of the C axis is arranged, the data of the W axis is arranged; after the data of the W axis is arranged, the data of the H axis is arranged; and after the data of the H axis is arranged, the data of the N axis is arranged.
  • for example, 1 x 2 x 2 x 3 data refers to data with a batch size of 1, a height of 2, a width of 2, and a number of channels of 3; the data of the 3 channels can be expressed in the form of three 2 x 2 matrices.
  • the data is divided according to the H axis, which can ensure that the divided data is still continuous data.
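  • the following numpy sketch illustrates this point with the 1 x 2 x 2 x 3 example: splitting NHWC data along the H axis keeps every piece contiguous in the underlying buffer, whereas splitting along the W or C axis would require strided reads; the element values are arbitrary.

```python
import numpy as np

# NHWC layout: C is the innermost (fastest-varying) axis, then W, then H, then N.
data = np.arange(1 * 2 * 2 * 3).reshape(1, 2, 2, 3)  # N=1, H=2, W=2, C=3

# Splitting along the H axis yields slices that are contiguous runs of the buffer.
top, bottom = data[:, 0:1, :, :], data[:, 1:2, :, :]
print(top.flatten())     # [0 1 2 3 4 5]   -> one contiguous run
print(bottom.flatten())  # [6 7 8 9 10 11] -> one contiguous run

# Splitting along the W axis picks interleaved elements of the original buffer,
# i.e. strided ("jumpy") reads: here elements 0..2 and 6..8.
print(data[:, :, 0:1, :].flatten())  # [0 1 2 6 7 8]
```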
  • the batch size N of the input data is usually 1, and in this case, the input data cannot be split in this dimension.
  • if the data is split according to the H axis, the split data is still continuous. Splitting according to the W axis or the C axis requires jumpy (strided) reading of data, that is, there are intervals between the data being read, which increases the overhead of underlying data synchronization and therefore prevents the processing performance of the model from being improved.
  • the input data is segmented through the H axis, and computing tasks are distributed to different processors for parallel processing, which effectively improves the processing performance of the model.
  • the above is only an example, and if the data is arranged in other ways, other dimensions may also be used for the segmentation.
  • the dimension of segmentation can also be the width dimension.
  • when the operator in the first subgraph is a convolution operator and the sliding stride of the convolution operator is smaller than the height of the convolution kernel, part of the input data of at least two subgraphs in the n subgraphs is the same.
  • similarly, when the operator in the first subgraph is a convolution operator and the sliding stride of the convolution operator is smaller than the width of the convolution kernel, the input data of at least two subgraphs in the n subgraphs overlap.
  • the input data of at least two subgraphs in the n subgraphs overlap in height dimension. That is to say, overlapping segmentation is performed on the input data of the first sub-graph to obtain the input data of the n sub-graphs.
  • the height of the divided input data is greater than or equal to the height of the convolution kernel.
  • the output data of the first sub-graph is obtained by merging the output data of the n sub-graphs.
  • the merging operation on the output data can be realized by introducing a merging operator.
  • the second computation graph further includes a merge operator, which is used to merge the output data of the n subgraphs to obtain the output data of the first subgraph.
  • step S720 can be implemented in the following manner: perform operator segmentation on the first subgraph in the first computation graph to obtain n parallel subgraphs, and replace the first subgraph in the first computation graph with the n subgraphs, and introduce a segmentation operator before the input of the n subgraphs, and introduce a merge operator after the output of the n subgraphs, and the obtained calculation graph is the second calculation graph.
  • the split operator may be a split with overlap (SplitWithOverlap) operator.
  • the SplitWithOverlap operator is used to split the input data with overlap.
  • the merging operator may be a Concat operator.
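  • a hedged sketch of this rewrite is given below on a toy list-of-nodes graph representation; the node dictionaries, key names and the two-branch example are illustrative assumptions and not the inference framework's actual data structures.

```python
# Hedged sketch of the step S720 rewrite on a toy list-of-nodes graph.

def rewrite_first_subgraph(nodes, first_subgraph, n, ratio, extend_top, extend_bottom):
    """Replace the nodes of first_subgraph with a split node, n parallel
    copies of those nodes, and a concat node."""
    start = nodes.index(first_subgraph[0])
    end = nodes.index(first_subgraph[-1])
    split = {"type": "SplitWithOverlap", "split_dim": "H", "number_split": n,
             "ratio": ratio, "extend_top": extend_top, "extend_bottom": extend_bottom}
    branches = [[dict(op, branch=i) for op in first_subgraph] for i in range(n)]
    concat = {"type": "Concat", "axis": "H"}
    flattened = [op for branch in branches for op in branch]
    return nodes[:start] + [split] + flattened + [concat] + nodes[end + 1:]

# Example: three consecutive convolution operators split into two parallel branches.
first = [{"type": "Conv2DFusion", "name": f"conv{i}"} for i in range(3)]
graph = [{"type": "Input"}] + first + [{"type": "MaxPoolFusion"}]
new_graph = rewrite_first_subgraph(graph, first, n=2, ratio=[83, 157],
                                   extend_top=[0, 6], extend_bottom=[0, 0])
```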
  • Fig. 8 shows a first computation graph of a neural network model.
  • the first calculation graph includes three continuous convolution operators, MaxPoolFusion operator, Reshape operator, two continuous FullConnection operators and Classification (Softmax) operator .
  • the convolution operator is the Conv2DFusion operator in Figure 8.
  • the Reshape operator is used to change the dimension information of the input data of the Reshape operator.
  • the Reshape operator does not change the total amount of input data or the value of the input data.
  • the data in Figure 8 is expressed according to the dimension data of NHWC.
  • 1 x 240 x 240 x 128 means that the batch of the data is 1, the height is 240, the width is 240, and the number of channels is 128.
  • the size of the weight (weight) of the Conv2DFusion operator is 128 x 3 x 3 x 128, and the size of the bias (bias) is 128.
  • One of the FullConnection operators has a weight size of 128 x 35840 and a bias size of 128.
  • Another FullConnection operator has a weight size of 10 x 128 and a bias size of 128.
  • the parameter of the Reshape operator is expressed as shape(2), that is, the size of the shape parameter is 2, that is, 2 dimensions, and is used to indicate the dimension information of the output data.
  • the dimension of the input data of the Reshape operator in Figure 8 is 1 x 40 x 7 x 128, a total of 35840 data, and the shape (2) is (1,35840), that is, the dimension of the output data is 1 x35840.
  • the first sub-graph is shown in the dotted line box in (a) of FIG. 8 , and the first sub-graph includes 3 continuous convolution operators.
  • the SplitWithOverlap operator may include the following parameters: split dimension (split dimension, slip_dim), number of splits (number_split), split ratio (ratio), and overlapping parameters.
  • the overlapping parameters may specifically include: the size of the upward extension (extend_top) of the divided input data and the size of the downward extension (extend_bottom) of the divided input data.
  • split_dim is used to indicate along which axis to split; number_split is used to indicate the value of n; ratio is used to indicate the ratio of the calculation amounts of the n parallel subgraphs; extend_top is used to indicate the size by which each piece of data split according to the ratio is extended forward (upward) along the split axis; and extend_bottom is used to indicate the size by which each piece of data split according to the ratio is extended backward (downward) along the split axis.
  • split_dim can be the H axis, that is, split the H axis.
  • number_split is 2, that is, the first subgraph is divided into two parallel subgraphs.
  • the ratio is 83:157, that is, the data with a height of 240 is divided into two pieces of data with heights of 83 and 157, which are the heights of the two output data segments.
  • the height ranges of the two input data after segmentation are [0,82] and [83,239] respectively.
  • extend_top is (0,6). The first 0 indicates that the first piece of split input data, that is, the data with a height range of [0,82], is extended forward on the H axis by 0, so its height range is still [0,82]; the 6 indicates that the second piece of input data, that is, the data with a height range of [83,239], is extended forward on the H axis by 6, so its height range becomes [77,239].
  • extend_bottom is (0,0). The first 0 indicates that the first piece of split input data, that is, the data with a height range of [0,82], is extended backward on the H axis by 0, so its height range is still [0,82]; the second 0 indicates that the second piece of input data, that is, the data with a height range of [77,239], is extended backward on the H axis by 0, so its height range remains [77,239].
  • the heights of the two pieces of input data after splitting are 83 and 163 respectively, corresponding to the data with a height range of [0,82] and the data with a height range of [77,239] in the original input data.
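  • the small sketch below reproduces the height ranges above from the SplitWithOverlap parameters; the parameter semantics follow the description above, while the function itself is an assumed illustrative implementation rather than the operator's actual code.

```python
# Sketch: compute the per-branch height ranges produced by SplitWithOverlap
# from ratio, extend_top and extend_bottom (illustrative implementation).

def split_with_overlap_ranges(height, ratio, extend_top, extend_bottom):
    total = sum(ratio)
    ranges, start = [], 0
    for r, top, bottom in zip(ratio, extend_top, extend_bottom):
        length = height * r // total
        lo = max(0, start - top)                     # extend forward along H
        hi = min(height - 1, start + length - 1 + bottom)  # extend backward along H
        ranges.append((lo, hi))
        start += length
    return ranges

# Height 240, ratio 83:157, extend_top (0, 6), extend_bottom (0, 0)
print(split_with_overlap_ranges(240, [83, 157], [0, 6], [0, 0]))
# -> [(0, 82), (77, 239)]
```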
  • the SplitWithOverlap operator may also include more or fewer parameters.
  • the SplitWithOverlap operator may not include the split_dim parameter and the like.
  • the dimension of the input data of the first subgraph in Figure 8 is 1 x 240 x 240 x 128.
  • the dimensions of the two data are 1 x 163 x 240 x 128 and 1 x 83 x 240 x 128.
  • the heights of the two data are 163 and 83 respectively, and the sum of the two is greater than 240. It can be seen that the two data are overlapped in the height dimension.
  • the two data are respectively processed by three convolutional operators in two parallel subgraphs.
  • the dimensions of the data obtained after processing the two parallel subgraphs are 1 x 160 x 240 x 128 and 1 x 80 x 240 x 128 respectively.
  • the two data are combined through the concat operator, and the dimension of the obtained data is 1 x 240 x 240 x 128. This data is passed to the MaxPoolFusion operator.
  • the combined output data is consistent with the output data of the first sub-graph in (a) of FIG. 8 .
  • the height range of the input data of the first subgraph is [0,239], and the dimensions of the two data after segmentation are 1 x 83 x 240 x 128 and 1 x 163 x 240 x 128 respectively.
  • the heights of the two data are 83 and 163 respectively, and the corresponding height ranges are [0,82] and [77,239] respectively, that is to say, the data in the range of 77-82 is required for both the split data, That is, the two data after splitting overlap in the height dimension.
  • because convolution operators are included in the first subgraph, if overlapping segmentation is not performed, the merged calculation result after segmentation may not be consistent with the calculation result of the original first subgraph.
  • the dimension of the output result of the convolution operator can be determined according to the relevant parameters of the convolution operator.
  • the relevant parameters of the convolution operator may include: the dimension of the input data of the convolution operator, the dimension of the convolution kernel, the sliding stride, and the padding (pad) parameters.
  • once the dimension of the input data of the convolution operator, the dimension of the convolution kernel, the sliding stride, and the pad parameters are determined, the dimension of the output data of the convolution operator is determined.
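  • for reference, the standard output-height relation of a strided convolution is sketched below; it is the usual floor-division formula rather than anything specific to this embodiment, and it also indicates why split input segments must overlap when the sliding stride is smaller than the kernel height.

```python
# Standard output-height relation of a convolution (floor division).

def conv_out_height(in_h: int, kernel_h: int, stride_h: int,
                    pad_top: int = 0, pad_bottom: int = 0) -> int:
    return (in_h + pad_top + pad_bottom - kernel_h) // stride_h + 1

print(conv_out_height(in_h=240, kernel_h=3, stride_h=1))  # 238 without padding

# When stride_h < kernel_h, adjacent output rows reuse kernel_h - stride_h input
# rows, so height segments computed independently must overlap by that amount
# per convolution layer to reproduce the unsplit result.
```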
  • the height of the input data of the n sub-graphs is determined according to the ratio between the overheads of the n sub-graphs.
  • for example, the ratio between the overheads of the two parallel subgraphs in (b) of Figure 8 is 83:157. If the input data were segmented in the conventional way, the height ranges of the two segmented pieces of data would be [0,82] and [83,239] respectively. In order to ensure that the input data of the subgraphs in (b) of Figure 8 and the input data of the MaxPoolFusion operator are consistent with those in (a) of Figure 8, the input data needs to be split with overlap. According to the ratio and the relevant parameters of the convolution operators in the first subgraph, it can be deduced that the height ranges of the two pieces of data after segmentation are [0,82] and [77,239].
  • the extend_top and extend_bottom in the SplitWithOverlap operator are determined according to the ratio between the costs of the n parallel subgraphs and the relevant parameters of the convolution operator.
  • when the operator in the first subgraph is a convolution operator, splitting the input data with overlap can ensure that the output data of the first subgraph and the result of merging the output data of the n subgraphs are the same, thereby ensuring that the calculation result of the second computation graph is the same as the calculation result of the first computation graph.
  • the p subgraphs in the second computation graph include q parallel subgraphs in the first computation graph.
  • the calculation tasks of the q sub-graphs are assigned to q processors among the m processors, each of the q processors executes one of the calculation tasks of the q sub-graphs, and the overhead of the q sub-graphs Computing capabilities of the q processors satisfy the second matching relationship, and q is an integer greater than 1.
  • the combination of q subgraphs is called the first subgraph combination.
  • the p subgraphs may include one or more combinations of subgraphs. Multiple subgraphs in each subgraph combination are subgraphs that can be executed in parallel.
  • the overheads of the q subgraphs and the computing capabilities of the q processors satisfy the second matching relationship, which can also be called that the overheads of the q subgraphs match the computing capabilities of the q processors.
  • q subgraphs are deployed according to the computing capabilities of the m processors.
  • the parallel n subgraphs obtained by performing operator segmentation on the first subgraph are a combination of subgraphs.
  • each subgraph in the n subgraphs can be used as a subgraph in the combination of subgraphs.
  • the one or more parallel subgraph combinations may be obtained through parallel graph search.
  • a parallel graph search is performed to obtain subgraphs that can be executed in parallel, and then one or more combinations of subgraphs that can be executed in parallel are obtained based on the subgraphs that can be executed in parallel.
  • the n subgraphs corresponding to the first subgraph are not included in the combination of one or more parallel subgraphs.
  • a parallel graph search is performed in the second computation graph to obtain subgraphs that can be executed in parallel, and then obtain one or more parallel subgraph combinations.
  • the one or more parallel subgraph combinations include n subgraphs corresponding to the first subgraph.
  • the parallel graph search can be implemented by using the existing directed graph search method, which is not limited in this embodiment of the present application.
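  • as an illustration of such a directed-graph search, the sketch below marks two operators as parallel-executable when neither can reach the other in the computation graph; the adjacency list used in the example is an assumed connectivity, not the exact graph of Figure 12.

```python
from itertools import combinations

# Two nodes can run in parallel when neither is reachable from the other.

def reachable(adj, src, dst):
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(adj.get(node, []))
    return False

def parallel_pairs(adj, nodes):
    return [(a, b) for a, b in combinations(nodes, 2)
            if not reachable(adj, a, b) and not reachable(adj, b, a)]

# Assumed connectivity: op3 fans out into the three out-op operators.
adj = {"op1": ["op2"], "op2": ["op3"], "op3": ["out-op1", "out-op2", "out-op3"]}
print(parallel_pairs(adj, ["out-op1", "out-op2", "out-op3"]))
# [('out-op1', 'out-op2'), ('out-op1', 'out-op3'), ('out-op2', 'out-op3')]
```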
  • the overheads of the parallel-executable subgraphs in the first computation graph may not be able to precisely match the computation capabilities of each processor. In this way, the matching degree between the overhead of q subgraphs in the first subgraph combination and the computing capabilities of q processors may be lower than the matching degree between the overhead of n subgraphs corresponding to the first subgraph and the computing capabilities of n processors .
  • multiple subgraphs that can be executed in parallel can be combined to obtain multiple candidate combinations, and a candidate combination is selected from the multiple candidate combinations according to the computing capability of each processor as the first subgraph combination.
  • the candidate combination whose cost ratio of q subgraphs is closest to the ratio of computing capabilities of q processors is taken as the first subgraph combination.
  • for example, three subgraphs that can be executed in parallel are found in the first computation graph: subgraph 1, subgraph 2 and subgraph 3; m is 2, and the ratio of the computing power of the two processors is 1:2. Subgraph 1 and subgraph 2 can be merged to form one candidate subgraph, subgraph 3 forms another candidate subgraph, and these two candidate subgraphs can be used as one candidate combination; by analogy, the 3 subgraphs can form multiple candidate combinations. Among the multiple candidate combinations, the candidate combination whose subgraph overhead ratio is closest to 1:2 is taken as the first subgraph combination.
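  • a minimal sketch of this selection for the two-processor case is shown below; the subgraph costs are assumed values chosen only to illustrate picking the combination whose cost ratio is closest to the capability ratio.

```python
# Pick the candidate combination whose cost ratio is closest to the
# capability ratio of the two processors (sketch for the two-way case).

def pick_first_combination(candidates, capabilities):
    """candidates: list of (cost_a, cost_b); capabilities: (cap_a, cap_b)."""
    target = capabilities[0] / capabilities[1]
    return min(candidates, key=lambda c: abs(c[0] / c[1] - target))

# Assumed costs of subgraph 1, 2 and 3.
cost1, cost2, cost3 = 4, 5, 6
candidates = [(cost1 + cost2, cost3), (cost1 + cost3, cost2), (cost2 + cost3, cost1)]
print(pick_first_combination(candidates, capabilities=(2, 1)))  # (10, 5): ratio 2:1
```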
  • in the solution of the embodiment of the present application, the subgraphs that can be executed in parallel are deployed on multiple processors according to the computing capabilities of different processors, so that the computing tasks of the multiple parallel subgraphs can be executed in parallel on processors with matching computing capabilities; the rational use of hardware resources is conducive to improving the processing efficiency of the model.
  • the plurality of processors being able to execute the plurality of subgraphs in parallel means that the execution processes of the plurality of processors for the plurality of subgraphs are independent of each other, and the plurality of processors can simultaneously execute the plurality of subgraphs.
  • parallel execution does not limit the start time or end time of "execution”. That is to say, the multiple processors executing the multiple sub-graphs in parallel does not mean that the multiple processors need to start executing the multiple sub-graphs at the same time, or finish executing the multiple sub-graphs at the same time.
  • corresponding processors can be scheduled to execute calculation tasks of each subgraph in the combination of subgraphs in parallel.
  • the asynchronous scheduling framework can adopt existing schemes.
  • an asynchronous scheduling framework can adopt the Actor model.
  • a specific implementation manner of step S730 will be described below by taking the Actor model as an example.
  • step S730 includes: converting the p subgraphs into p Actors, and scheduling the m processors to execute calculation tasks of the p subgraphs during the execution of the p Actors.
  • the behavior of the p Actors is defined according to the computing tasks of the p subgraphs.
  • the processors corresponding to the p subgraphs are the processors allocated for computing tasks of the p subgraphs.
  • the calculation of the model can be performed.
  • Actors interact through messages, and the messages transmit data information between subgraphs. After the Actor receives the message, it will trigger the execution of its own behavior. Each Actor is executed independently. When multiple Actors are triggered synchronously, parallel execution can be achieved.
  • Actor's behavior can include: pre-behavior, execution behavior and post-behavior.
  • Pre-behavior refers to the operations that the Actor needs to perform before performing the calculation task.
  • Execution behavior refers to the operations that Actors need to perform to perform computing tasks.
  • Post-behavior refers to the operations that the Actor needs to perform after performing the calculation task.
  • the pre-behavior, execution behavior and post-behavior of the p Actors can be defined respectively.
  • the pre-behavior, execution behavior and post-behavior of an Actor can be defined as follows.
  • the processor on which the subgraph corresponding to the Actor is deployed is determined, and that processor is scheduled to execute the subgraph corresponding to the Actor, that is, to execute the computing tasks in the subgraph.
  • Actor can define information such as hardware for subgraph execution and execution operators.
  • the defined behavior can be run on the preset hardware, that is, the processor selected in step S730.
  • if the output data is the input data of other Actors, a message is sent to those Actors to trigger them. If the output data is the output of the whole graph, a promise of the whole graph is set, that is, the value of the promise is set to the completed state. When all the promises of the whole graph have been set, the execution ends and the operation result of the model is obtained.
  • the whole graph refers to the complete computation graph corresponding to the model, for example, the first computation graph or the second computation graph.
  • the m processors can implement calculation graph scheduling execution according to the programmed Actor list, that is, perform model calculation.
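  • the sketch below is a compact illustration of these behaviors (pre-behavior: collect messages and check that all inputs have arrived; execution behavior: run the subgraph's computing task; post-behavior: forward the output to successor Actors or complete a whole-graph promise). The class, the thread-pool scheduling and the toy subgraph functions are assumptions for illustration, not the asynchronous scheduling framework's actual API.

```python
from concurrent.futures import Future, ThreadPoolExecutor

class SubgraphActor:
    """Illustrative Actor; not the scheduling framework's real implementation."""

    def __init__(self, name, run_subgraph, expected_inputs, successors=None, promise=None):
        self.name = name
        self.run_subgraph = run_subgraph    # executes the subgraph on its processor
        self.expected_inputs = expected_inputs
        self.successors = successors or []  # Actors that consume this Actor's output
        self.promise = promise              # Future set when the output is a graph output
        self.inputs = {}

    def receive(self, sender, data, pool):
        # Pre-behavior: record the message and check whether all inputs arrived.
        self.inputs[sender] = data
        if len(self.inputs) == self.expected_inputs:
            pool.submit(self.execute, pool)

    def execute(self, pool):
        # Execution behavior: run the subgraph's computing task.
        result = self.run_subgraph(self.inputs)
        # Post-behavior: trigger successors, or complete the whole-graph promise.
        for actor in self.successors:
            actor.receive(self.name, result, pool)
        if self.promise is not None:
            self.promise.set_result(result)

# Minimal usage: Actor "a" fans out to two Actors that run in parallel.
promise_b, promise_c = Future(), Future()
actor_b = SubgraphActor("b", lambda ins: sum(ins.values()) + 1, 1, promise=promise_b)
actor_c = SubgraphActor("c", lambda ins: sum(ins.values()) * 2, 1, promise=promise_c)
actor_a = SubgraphActor("a", lambda ins: sum(ins.values()), 1, successors=[actor_b, actor_c])

pool = ThreadPoolExecutor(max_workers=4)
actor_a.receive("graph-input", 3, pool)
print(promise_b.result(), promise_c.result())  # 4 6
pool.shutdown()
```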
  • FIG. 10 shows a processing method 800 of a neural network model provided by the embodiment of the present application.
  • the method 800 can be understood as a specific implementation of the method 700.
  • the specific description can refer to the method 700.
  • some descriptions of the method 800 are omitted below where appropriate to avoid repetition.
  • the method 800 can be run on the end-side device.
  • the m processors in method 700 are heterogeneous hardware.
  • the method 800 can be implemented using the framework shown in FIG. 6 .
  • the method 800 includes step S810 to step S860. Steps S810 to S860 will be described below.
  • the method 800 can be performed in the initialization phase and the runtime phase within the framework shown in FIG. 6 . Specifically, steps S810 to S850 are executed in the initialization stage, and step S860 is executed in the runtime stage.
  • Step S810 to step S840 can be understood as a composition stage of a heterogeneous graph, and FIG. 11 shows a schematic diagram of a composition process of a heterogeneous parallel graph.
  • Steps S850 to S860 can be understood as heterogeneous parallel execution stages.
  • Fig. 14 shows a schematic flowchart of the heterogeneous parallel execution phase.
  • S810: Calculate the amount of calculation that the heterogeneous hardware can carry, that is, the computing capability of the heterogeneous hardware.
  • Step S810 corresponds to step S710 in method 700, and for a specific description, reference may be made to step S710.
  • the method 800 only uses two heterogeneous hardware as an example for illustration.
  • the two heterogeneous hardware may be a CPU and a GPU, and the ratio of computing capabilities is about 1:2.
  • operators in the first sub-graph in the first computation graph are segmented based on computing capabilities of each heterogeneous hardware to obtain a second computation graph.
  • the first computation graph is the initial computation graph of the model.
  • the first calculation graph may be a calculation graph corresponding to the model structure identified by the end-to-side reasoning framework in FIG. 10 .
  • specifically, step S820 can be implemented in the following manner: search the first computation graph for a graph structure suitable for operator segmentation, that is, the first subgraph; then segment the operators in the first subgraph to construct a graph structure suitable for parallel execution (that is, the parallel subgraphs in method 700), so as to obtain the second computation graph. That is to say, the second computation graph includes a graph structure suitable for parallel execution.
  • a graph structure suitable for operator segmentation needs to meet preset conditions, for example, the overhead is greater than or equal to the second threshold, and the execution sequence of multiple operators is serial execution.
  • the original graph structure, that is, the first computation graph, includes 6 nodes, that is, 6 operators: op1, op2, op3, out-op1, out-op2, and out-op3.
  • the three operators op1, op2, and op3 are topologically ordered and must be executed in series, and the overhead of the three operators is greater than the second threshold, which meets the above conditions; that is, the three operators are suitable for operator segmentation, and the graph structure formed by the three operators is used as the first subgraph.
  • the three operators out-op1, out-op2, and out-op3 have no topological order between them in the first computation graph, that is, these three operators can be computed in parallel in the first computation graph, which does not meet the above conditions; that is to say, the three operators are not suitable for operator segmentation.
  • the graph structure that is not suitable for operator segmentation keeps the original graph structure unchanged in the second calculation graph.
  • the method of determining the first subgraph in FIG. 12 is only an example. In practical applications, the first subgraph can be determined as required, for example, according to the hardware configuration.
  • this is not limited in this embodiment of the present application. For example, when the number of hardware units is large but their computing power is small, one or more operators with a large amount of calculation among the three operators out-op1, out-op2, and out-op3 can also be split.
  • the heterogeneous hardware includes CPU and GPU.
  • the first subgraph is divided into two parts (two parallel subgraphs), namely op1.1, op2.1, op3.1 and op1.2, op2.2, op3.2.
  • the sum of the overhead of these two parts is the same as the calculation amount of the three operators op1, op2 and op3 in the first calculation graph.
  • the overhead of op1.1, op2.1, and op3.1 and the overhead of op1.2, op2.2, and op3.2 match the computing capabilities of heterogeneous hardware.
  • since the ratio of the computing power of the CPU and the GPU obtained in step S810 is about 1:2, the ratio between the sum of the overheads of op1.1, op2.1, and op3.1 and the sum of the overheads of op1.2, op2.2, and op3.2 is 1:2.
  • a split operator is added before the parallel subgraphs obtained after splitting, and a merge operator is added after the parallel subgraphs obtained after splitting.
  • the second calculation graph obtained after segmentation is equivalent to the first calculation graph, which is beneficial to ensure that the calculation results of the second calculation graph and the first calculation graph are completely consistent.
  • it can be ensured that the input and output of other parts other than the first sub-graph in the first calculation graph remain unchanged in the second calculation graph, that is, the calculation of other parts is not affected.
  • for example, the operators in the first subgraph shown in Figure 12 may be convolution operators; the SplitWithOverlap operator is added before the two parallel subgraphs, and the Concat operator is added after the two parallel subgraphs.
  • the SplitWithOverlap operator is used to split the input data with overlap. For a specific description of the operator, reference may be made to FIG. 8 above, which will not be repeated here.
  • Step S820 corresponds to step S720 in method 700, and for a specific description, reference may be made to step S720.
  • combination 1 is obtained through step S820, and combination 2 is a combination of subgraphs suitable for parallel execution found in the first computation graph.
  • a heterogeneous parallel graph also known as a heterogeneous graph, refers to a subgraph used to implement heterogeneous parallelism.
  • the heterogeneous parallel graph is used to indicate the heterogeneous hardware used for the execution of each subgraph.
  • the process of constructing a heterogeneous parallel graph can also be understood as the process of hardware selection and deployment, that is, selecting hardware for each sub-graph to perform the computing tasks of the corresponding sub-graph.
  • the subgraphs in the combination are deployed on the heterogeneous hardware according to the computing capability of the heterogeneous hardware, that is, the deployment method should be as suitable as possible for the computing capability of the heterogeneous hardware.
  • Combination 1 is obtained by splitting the operator in step S820, and the two parallel subgraphs obtained after splitting can match the computing capabilities of heterogeneous hardware.
  • the ratio of the computing power of the CPU to the GPU is about 1:2, and the ratio between the sum of the overhead of op1.1, op2.1, and op3.1 and the sum of the overhead of op1.2, op2.2, and op3.2 is 1:2.
  • Combination 2 was obtained by searching.
  • the overhead of subgraphs in Combination 2 may not fully match the computing power of heterogeneous hardware.
  • the subgraphs in the combination can be deployed in a way that matches the computing capabilities of the heterogeneous hardware with the highest degree of matching.
  • the ratio of the overhead of subgraph5 composed of out-op1 and out-op2 to the overhead of subgraph6 composed of out-op3 is closest to 2:1.
  • subgraphs other than the combinations searched in step S830, that is, subgraphs that cannot be executed in parallel, can be deployed using existing solutions, which is not limited in this embodiment of the present application. For example, subgraph1 and subgraph4 in Figure 13 are deployed on the CPU.
  • Step S840 corresponds to step S730 in method 700, and for specific description, reference may be made to step S730, which will not be repeated here.
  • Actors shown in Figure 14 correspond to the subgraphs in Figure 13, that is, Actor1, Actor2, Actor3, Actor4, Actor5, and Actor6 in Figure 14 correspond to subgraph1, subgraph2, subgraph3, subgraph4, subgraph5, and subgraph6 in Figure 13, respectively.
  • Actor1, Actor2, Actor4 and Actor6 in Figure 14 schedule the CPU to execute computing tasks during execution, and Actor3 and Actor5 schedule GPU to execute computing tasks during execution, which is consistent with the hardware indicated by the heterogeneous diagram in Figure 13.
  • Step S860 may be performed through the following steps.
  • after Actor1 receives the input of the computation graph (graph-input), the pre-behavior of Actor1 is triggered, that is, checking whether all inputs have been received. After verifying that all inputs have been received, the execution behavior of Actor1 is triggered, and the corresponding subgraph, namely subgraph1, is executed on the CPU. After the execution is complete, the post-behavior of Actor1 is triggered.
  • the post-behavior of Actor1 is defined during the graph orchestration process, that is, the output of Actor1 is preset to be sent to the subsequent Actors, namely Actor2 and Actor3 in Figure 14. That is, after execution is complete, Actor1 is triggered to send its output to Actor2 and Actor3.
  • Actor2 and Actor3 respectively receive the data sent by Actor1 and execute their respective tasks independently. Since the actions performed by Actor2 and Actor3 are similar, in order to avoid repetition, the following only uses Actor2 as an example for illustration.
  • after Actor2 receives the data sent by Actor1, the pre-behavior of Actor2 is triggered, that is, checking whether Actor2 has received all inputs. All input to Actor2 is the data sent by Actor1.
  • after checking that all inputs have been received, the execution behavior of Actor2 is triggered, and the corresponding subgraph, namely subgraph2, is executed on the CPU.
  • the actions performed by Actor2 and Actor3 are similar, but Actor2 is executed on the CPU, while Actor3 is executed on the GPU. These two Actors are executed on different hardware, independent of each other and do not affect each other, that is, heterogeneous parallelism is realized.
  • after the execution is complete, the post-behavior of Actor2 is triggered, that is, the output is sent to the subsequent Actor.
  • Actor2 and Actor3 each send their output to Actor4.
  • after Actor4 receives the output from Actor2 or Actor3, its pre-behavior is triggered.
  • Actor2 and Actor3 are executed concurrently, and the running time may be different, that is, Actor2 and Actor3 do not necessarily complete the operation at the same time.
  • after Actor2 finishes running, it sends its output to Actor4, and Actor4 triggers its pre-behavior after receiving the output of Actor2, that is, checking whether all inputs have been received.
  • all inputs to Actor4 include the data sent by Actor2 and the data sent by Actor3. If Actor4 checks and finds that not all inputs have been received, Actor4 continues to wait until it receives the data sent by Actor3, at which point Actor4's pre-behavior is triggered again.
  • Actor4 then checks and finds that all inputs have been received, which triggers the execution behavior of Actor4, that is, running the corresponding subgraph, namely subgraph4, on the CPU. After the execution is completed, the post-behavior of Actor4 is triggered, that is, the output is sent to the subsequent Actors, namely Actor5 and Actor6 in Figure 14.
  • Actor5 and Actor6 respectively receive the data sent by Actor4 and execute their respective tasks independently.
  • the execution process of Actor5 and Actor6 is similar to Actor2 and Actor3. In order to avoid repetition, the following only takes Actor5 as an example for illustration.
  • after Actor5 receives the data sent by Actor4, the pre-behavior of Actor5 is triggered, that is, checking whether Actor5 has received all inputs. All input to Actor5 is the data sent by Actor4.
  • after checking that all inputs have been received, the execution behavior of Actor5 is triggered, and the corresponding subgraph, namely subgraph5, is executed on the GPU.
  • the actions performed by Actor5 and Actor6 are similar, but Actor5 is executed on the GPU, while Actor6 is executed on the CPU.
  • the output of Actor5 and Actor6 is the output of the whole graph.
  • the output of Actor5 and Actor6 is defined as the output of the whole graph in the post-behavior of Actor5 and Actor6.
  • the output of Actor5 is used as the output of the whole graph, and the value (value) of the promise of the output (output) of the whole graph is set, that is, the value of the promise corresponding to Actor5 is set to a completed state.
  • when both Actor5 and Actor6 have set the value of the promise of the whole-graph output, all the outputs of the whole graph are obtained, that is, the reasoning process is completed and the reasoning result is obtained.
  • the solution of the embodiment of the present application can perceive the capabilities of heterogeneous hardware when performing graph composition, obtain heterogeneous graphs suitable for heterogeneous parallelism, and perform model reasoning on heterogeneous hardware, which can improve the performance of device-side reasoning.
  • for some models, for example, MobileNetV1, performance is improved by 5%-10%.
  • the device of the embodiment of the present application will be described below with reference to FIG. 15 to FIG. 16 . It should be understood that the device described below can execute the method of the aforementioned embodiment of the present application. In order to avoid unnecessary repetition, repeated descriptions are appropriately omitted when introducing the device of the embodiment of the present application below.
  • Fig. 15 is a schematic block diagram of a processing device for a neural network model according to an embodiment of the present application.
  • the neural network model processing device 3000 shown in FIG. 15 includes an acquisition unit 3010 and a processing unit 3020 .
  • the acquisition unit 3010 and the processing unit 3020 may be used to execute the neural network model processing method of the embodiment of the present application, specifically, may be used to execute the method 700 or the method 800 .
  • the acquiring unit 3010 is configured to acquire the computing capabilities of m processors, where m is an integer greater than 1.
  • the processing unit 3020 is configured to: perform operator segmentation on the first subgraph in the first computation graph corresponding to the neural network model to obtain a second computation graph, where the second computation graph includes n parallel subgraphs corresponding to the first subgraph, the overheads of the n subgraphs and the computing capabilities of n processors among the m processors satisfy the first matching relationship, 1 < n ≤ m, and n is an integer; and allocate the computing tasks of p subgraphs in the second computation graph to the m processors for execution, where the p subgraphs include the n subgraphs, the computing tasks of the n subgraphs are respectively allocated to the n processors for execution, each of the n processors executes one of the computing tasks of the n subgraphs, and p is an integer greater than or equal to n.
  • the cost of the first subgraph is greater than the cost of at least half of the subgraphs in the first computation graph.
  • the execution sequence of all operators in the first subgraph is serial execution.
  • the input data of the n subgraphs is obtained by segmenting the input data of the first subgraph, and the dimension of data segmentation is determined according to the data arrangement of the input data of the first subgraph.
  • when the operator in the first subgraph is a convolution operator and the sliding stride of the convolution operator is smaller than the height of the convolution kernel, part of the input data of at least two subgraphs in the n subgraphs is the same.
  • that the overheads of the n subgraphs and the computing capabilities of the n processors among the m processors satisfy the first matching relationship includes: the difference between the ratio of the overheads of the n subgraphs and the ratio of the computing capabilities of the n processors is less than or equal to the first threshold.
  • the p subgraphs include q parallel subgraphs in the first computation graph, the computing tasks of the q subgraphs are allocated to q processors among the m processors for execution, the overheads of the q subgraphs and the computing capabilities of the q processors satisfy the second matching relationship, and q is an integer greater than 1.
  • the processing unit 3020 is specifically configured to: convert the p subgraphs into p Actors respectively; and schedule m processors to execute calculation tasks of the p subgraphs during the execution of the p Actors.
  • processing device 3000 is embodied in the form of functional units.
  • unit here may be implemented in the form of software and/or hardware, which is not specifically limited.
  • a "unit” may be a software program, a hardware circuit or a combination of both to realize the above functions.
  • the hardware circuit may include an application specific integrated circuit (ASIC), an electronic circuit, a processor for executing one or more software or firmware programs (for example, a shared processor, a dedicated processor, or a group processor) and memory, a merged logic circuit, and/or other suitable components that support the described functions.
  • the units of each example described in the embodiments of the present application can be realized by electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are executed by hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may use different methods to implement the described functions for each specific application, but such implementation should not be regarded as exceeding the scope of the present application.
  • FIG. 16 is a schematic diagram of a hardware structure of a processing device for a neural network model provided by an embodiment of the present application.
  • the neural network model processing apparatus 5000 shown in FIG. 16 includes a memory 5001 , a processor 5002 , a communication interface 5003 and a bus 5004 .
  • the memory 5001 , the processor 5002 , and the communication interface 5003 are connected to each other through a bus 5004 .
  • the memory 5001 may be a read only memory (read only memory, ROM), a static storage device, a dynamic storage device or a random access memory (random access memory, RAM).
  • the memory 5001 may store a program, and when the program stored in the memory 5001 is executed by the processor 5002, the processor 5002 is configured to execute each step of the neural network model processing method of the embodiment of the present application. For example, processor 5002 may execute method 700 shown in FIG. 7 or method 800 shown in FIG. 10 above.
  • the processor 5002 may be a general-purpose central processing unit (central processing unit, CPU), a microprocessor, an application specific integrated circuit (application specific integrated circuit, ASIC), a graphics processing unit (graphics processing unit, GPU) or one or more integrated circuits, and is configured to execute related programs, so as to implement the processing method of the neural network model in the method embodiment of the present application.
  • the processor 5002 may also be an integrated circuit chip with signal processing capabilities.
  • each step of the neural network model processing method of the present application can be completed by an integrated logic circuit of hardware in the processor 5002 or instructions in the form of software.
  • the above-mentioned processor 5002 may also be a general-purpose processor, a digital signal processor (digital signal processor, DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (field programmable gate array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • Various methods, steps, and logic block diagrams disclosed in the embodiments of the present application may be implemented or executed.
  • a general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.
  • the steps of the method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor.
  • the software module may be located in a storage medium mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register.
  • the storage medium is located in the memory 5001, and the processor 5002 reads the information in the memory 5001 and, in combination with its hardware, completes the functions required by the units included in the processing device shown in Figure 15, or executes the processing method of the neural network model shown in Figure 7 or Figure 10 of the method embodiment of the present application.
  • the communication interface 5003 implements communication between the apparatus 5000 and other devices or communication networks by using a transceiver device such as but not limited to a transceiver.
  • the first calculation graph corresponding to the neural network model can be acquired through the communication interface 5003 .
  • the bus 5004 may include a pathway for transferring information between various components of the device 5000 (eg, memory 5001, processor 5002, communication interface 5003).
  • The embodiment of the present application further provides a computer-readable medium, where the computer-readable medium stores program code for execution by a device, and the program code includes instructions for executing the neural network model processing method in the embodiments of the present application.
  • the embodiment of the present application also provides a computer program product containing instructions, and when the computer program product is run on a computer, it causes the computer to execute the neural network model processing method in the embodiment of the present application.
  • the embodiment of the present application also provides a chip, the chip includes a processor and a data interface, and the processor reads the instructions stored in the memory through the data interface, and executes the neural network model processing method in the embodiment of the present application.
  • Optionally, the chip may further include a memory, where the memory stores instructions, and the processor is configured to execute the instructions stored in the memory; when the instructions are executed, the processor is configured to execute the neural network model processing method in the embodiments of the present application.
  • the processor in the embodiment of the present application may be a central processing unit (central processing unit, CPU), and the processor may also be other general-purpose processors, digital signal processors (digital signal processor, DSP), application specific integrated circuits (application specific integrated circuit, ASIC), off-the-shelf programmable gate array (field programmable gate array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc.
  • a general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.
  • the memory in the embodiments of the present application may be a volatile memory or a nonvolatile memory, or may include both volatile and nonvolatile memories.
  • The non-volatile memory may be a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), or a flash memory.
  • Volatile memory can be random access memory (RAM), which acts as external cache memory.
  • By way of example rather than limitation, many forms of RAM are available, such as a static random access memory (SRAM), a dynamic random access memory (DRAM), a synchronous dynamic random access memory (SDRAM), a double data rate synchronous dynamic random access memory (DDR SDRAM), an enhanced synchronous dynamic random access memory (ESDRAM), a synchlink dynamic random access memory (SLDRAM), and a direct rambus random access memory (DR RAM).
  • the above-mentioned embodiments may be implemented in whole or in part by software, hardware, firmware or other arbitrary combinations.
  • the above-described embodiments may be implemented in whole or in part in the form of computer program products.
  • the computer program product comprises one or more computer instructions or computer programs. When the computer instructions or computer programs are loaded or executed on the computer, the processes or functions according to the embodiments of the present application will be generated in whole or in part.
  • the computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable devices.
  • The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired or wireless (such as infrared, radio, or microwave) manner.
  • the computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or a data center that includes one or more sets of available media.
  • the available media may be magnetic media (eg, floppy disk, hard disk, magnetic tape), optical media (eg, DVD), or semiconductor media.
  • the semiconductor medium may be a solid state drive.
  • "At least one" means one or more, and "multiple" means two or more.
  • "At least one of the following" or a similar expression refers to any combination of these items, including any combination of single or plural items.
  • For example, at least one item (piece) of a, b, or c may represent: a, b, c, a-b, a-c, b-c, or a-b-c, where a, b, and c may each be single or multiple.
  • The sequence numbers of the above-mentioned processes do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation processes of the embodiments of the present application.
  • the disclosed systems, devices and methods may be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the units is only a logical function division. In actual implementation, there may be other division methods.
  • For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or units may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.
  • the functions described above are realized in the form of software function units and sold or used as independent products, they can be stored in a computer-readable storage medium.
  • the technical solution of the present application is essentially or the part that contributes to the prior art or the part of the technical solution can be embodied in the form of a software product, and the computer software product is stored in a storage medium, including Several instructions are used to make a computer device (which may be a personal computer, a server, or a network device, etc.) execute all or part of the steps of the methods described in the various embodiments of the present application.
  • The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or other media that can store program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)
  • Apparatus For Radiation Diagnosis (AREA)

Abstract

本申请实施例提供了人工智能领域中的一种神经网络模型的处理方法,该方法包括:对神经网络模型对应的第一计算图中的第一子图进行算子切分,以得到第二计算图,第二计算图包括第一子图对应的并行的多个子图,该多个子图的开销与多个处理器的计算能力匹配,该多个处理器分别处理该多个子图的计算任务。本申请的方法能够使得多个处理器并行执行该多个子图的计算任务,有利于提高模型的处理效率。

Description

神经网络模型的处理方法及装置
本申请要求于2021年11月24日提交中国专利局、申请号为202111405768.3、申请名称为“神经网络模型的处理方法及装置”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请实施例涉及人工智能领域,并且更具体地,涉及一种神经网络模型的处理方法及装置。
背景技术
人工智能(artificial intelligence,AI)是利用数字计算机或者数字计算机控制的机器模拟、延伸和扩展人的智能,感知环境、获取知识并使用知识获得最佳结果的理论、方法、技术及应用系统。换句话说,人工智能是计算机科学的一个分支,它企图了解智能的实质,并生产出一种新的能以人类智能相似的方式作出反应的智能机器。人工智能也就是研究各种智能机器的设计原理与实现方法,使机器具有感知、推理与决策的功能。人工智能领域的研究包括机器人,自然语言处理,计算机视觉,决策与推理,人机交互,推荐与搜索,AI基础理论等。
随着神经网络模型的发展,模型的复杂度越来越高,模型的规模也逐渐增大,运算过程的时延较长。以端侧推理场景为例,目前通常是根据硬件的优先级部署神经网络模型中的算子,例如,图形处理器(graphics processing unit,GPU)的优先级高于中央处理器(central processing unit,CPU),则会将算子尽可能部署到GPU上,GPU不支持的算子部署到CPU上,这样的方案能够保证模型的运行。但CPU的处理速度低于GPU的处理速度,影响了模型整体的处理速度。因此,将神经网络模型能够整体运行在GPU或NPU等神经网络处理器上是一个重要的优化方向。然而,采用单一的神经网络处理器运行模型时,其他硬件处于闲置状态,导致了资源的浪费。
如何提高神经网络模型的处理效率成为一个亟待解决的问题。
发明内容
本申请实施例提供一种神经网络模型的处理方法及装置,能够通过算子拆分将计算任务分配至多个处理器并行执行,提高了神经网络模型的处理效率。
第一方面,提供了一种神经网络模型的处理方法,包括:获取m个处理器的计算能力,m为大于1的整数;对神经网络模型对应的第一计算图中的第一子图进行算子切分,以得到第二计算图,第二计算图包括第一子图对应的并行的n个子图,n个子图的开销与m个处理器中的n个处理器的计算能力满足第一匹配关系,1<n≤m,n为整数;将第二计算图的p个子图的计算任务分配给m个处理器执行,其中,该p个子图包括该n个子 图,n个子图的计算任务分别分配给n个处理器执行,n个处理器中的每个处理器执行n个子图的计算任务中的一个,p为大于或等于n的整数。
在本申请实施例中,根据不同处理器的计算能力对第一子图进行算子切分,以得到匹配不同处理器的计算能力的多个并行的子图,使得该多个并行的子图的计算任务可以并行执行于计算能力相匹配的处理器上,合理利用硬件资源,有利于提高模型的处理效率。
该m个处理器用于执行神经网络模型的运算。
第一子图对应的并行的n个子图指的是通过对第一子图进行算子切分得到的n个子图。
结合第一方面,在第一方面的某些实现方式中,第一子图的开销大于第一计算图中的至少一半的子图的开销。
结合第一方面,在第一方面的某些实现方式中,第一子图的开销大于或等于第二阈值。
这样,仅对开销较大的子图进行算子切分,有利于保证并行执行带来的性能的提升超过引入的通信开销,进而提高模型的处理效率。同时,在该多个处理器无法共享内存的情况下,处理器需要存储所需要的执行的子图中的算子的参数以执行该子图的计算任务。采用本申请实施例的方案,仅对开销较大的子图进行算子切分,能够避免由于算子切分而引入过多的内存开销。
结合第一方面,在第一方面的某些实现方式中,在第一子图包括多个算子的情况下,第一子图中的所有算子的执行顺序为串行执行。
卷积类算子的计算量通常较大。例如,第一子图可以为连续卷积结构,即第一子图中可以包括多个连续的卷积类算子。
在本申请实施例的方案中,第一子图中包括拓扑有序的多个算子,对该第一子图进行算子切分,将切分后得到的n个子图的计算任务分配给对应的处理器,也就是,将该多个算子的计算任务进行切分后,将切分后的计算任务分配给对应的处理器,在各个处理器按照拓扑序完成该切分后的计算任务后,再将各个处理器的计算结果进行合并,得到该第一子图的计算结果,这样能够减少切分引入的通信开销,保证了模型的处理效率。
结合第一方面,在第一方面的某些实现方式中,第二计算图还包括切分算子,该切分算子用于将第一子图的输入数据进行数据切分,以得到该n个子图的输入数据。
结合第一方面,在第一方面的某些实现方式中,数据切分的维度是根据第一子图的输入数据的数据排布情况确定的。
具体地,该n个子图中的各个子图的输入数据在第一子图的输入数据的数据排布中为连续数据。
根据数据排布情况确定数据切分的维度,有利于保证切分后的输入数据为连续数据,即切分后的输入数据在读取时是连续读取的,避免出现跳跃读取数据的情况导致开销过大,进而有利于保证模型的处理性能。
结合第一方面,在第一方面的某些实现方式中,在输入数据的数据排布为批量N、高度H、宽度W、通道数C情况下,n个子图的输入数据是将第一子图的输入数据按照高度维度进行切分得到的。
在输入数据的批量大小N为1的情况下,无法在该维度上对输入数据进行切分。按照H轴进行切分,切分后的数据为连续数据。而按照W轴和按照C轴进行切分,则需要跳 跃的读取数据,也就是说读取数据有间隔,会增加底层数据同步的开销,进而导致模型的处理性能无法提升。本申请实施例的方案中,通过H轴对输入数据进行切分将计算任务分配至不同的处理器并行处理,有效提高了模型的处理性能。
结合第一方面,在第一方面的某些实现方式中,在第一子图中的算子为卷积类算子,且卷积类算子的滑动步长小于卷积核的高度的情况下,n个子图中至少两个子图的输入数据中的部分数据相同。
也就是说,该n个子图中至少两个子图的输入数据有重叠。具体地,该n个子图中至少两个子图的输入数据在高度维度上有重叠。
在本申请实施例中,在第一子图中的算子为卷积类算子的情况下,通过有重叠的切分输入数据,能够保证第一子图的输出数据与该n个子图的输出数据合并后的结果是相同的,进而保证第二计算图的计算结果与第一计算图的计算结果相同。
结合第一方面,在第一方面的某些实现方式中,第二计算图还包括合并算子,该合并算子用于将该并行的n个子图的输出数据进行合并,以得到第一子图的输出数据。
结合第一方面,在第一方面的某些实现方式中,n个子图的开销与m个处理器中的n个处理器的计算能力满足第一匹配关系,包括:n个子图的开销的比值与n个处理器的计算能力的比值之间的差值小于或等于第一阈值。
这样,能够保证计算能力较强的处理器执行开销较大的子图的计算任务,计算能力较弱的处理器执行开销较小的子图的计算任务,提高处理器的计算能力与子图的开销之间的适配度,有利于进一步提高模型的处理效率。
结合第一方面,在第一方面的某些实现方式中,p个子图包括第一计算图中的并行的q个子图,q个子图的计算任务分别分配给m个处理器中的q个处理器执行,q个处理器中的每个处理器执行q个子图的计算任务中的一个,q个子图的开销与q个处理器的计算能力满足第二匹配关系,q为大于1的整数。
在本申请实施例中,根据不同处理器的计算能力将能够并行执行的子图部署于多个处理器上,使得该多个并行的子图的计算任务可以并行执行于计算能力相匹配的处理器上,合理利用硬件资源,有利于提高模型的处理效率。
结合第一方面,在第一方面的某些实现方式中,将第二计算图的p个子图的计算任务分配给m个处理器执行,包括:将p个子图分别转换为p个操作者Actor,在p个Actor在执行过程中调度该m个处理器执行该p个子图的计算任务。
p个Actor的行为是根据p个子图的计算任务定义的。
在本申请实施例的方案中,采用Actor模型作为异步调度框架,能够调度相应的处理器并行执行计算任务。
第二方面,提供了一种神经网络模型的处理装置,该装置包括用于执行上述第一方面以及第一方面的任意一种实现方式的方法的单元。
应理解,在上述第一方面中对相关内容的扩展、限定、解释和说明也适用于第二方面中相同的内容。
第三方面,提供了一种神经网络模型的处理装置,该装置包括:存储器,用于存储程序;处理器,用于执行所述存储器存储的程序,当所述存储器存储的程序被执行时,所述处理器用于执行第一方面以及第一方面的任意一种实现方式中的方法。
上述第三方面中的处理器既可以是中央处理器(central processing unit,CPU),也可以是CPU与神经网络运算处理器的组合,这里的神经网络运算处理器可以包括图形处理器(graphics processing unit,GPU)、神经网络处理器(neural-network processing unit,NPU)和张量处理器(tensor processing unit,TPU)等等。其中,TPU是谷歌(google)为机器学习全定制的人工智能加速器专用集成电路。
第四方面,提供一种计算机可读介质,该计算机可读介质存储用于设备执行的程序代码,该程序代码包括用于执行第一方面以及第一方面的任意一种实现方式中的方法。
第五方面,提供一种包含指令的计算机程序产品,当该计算机程序产品在计算机上运行时,使得计算机执行上述第一方面以及第一方面的任意一种实现方式中的方法。
第六方面,提供一种芯片,所述芯片包括处理器与数据接口,所述处理器通过所述数据接口读取存储器上存储的指令,执行上述第一方面以及第一方面的任意一种实现方式中的方法。
可选地,作为一种实现方式,所述芯片还可以包括存储器,所述存储器中存储有指令,所述处理器用于执行所述存储器上存储的指令,当所述指令被执行时,所述处理器用于执行第一方面以及第一方面的任意一种实现方式中的方法。
附图说明
图1是本申请实施例提供的一种人工智能主体框架示意图;
图2为本申请实施例提供的一种系统架构的结构示意图;
图3为本申请实施例提供的一种卷积神经网络的结构示意图;
图4为本申请实施例提供的另一种系统架构的示意图;
图5为本申请实施例提供的一种端侧推理框架的示意图;
图6为本申请实施例提供的另一种端侧推理框架的示意图;
图7为本申请实施例提供的一种神经网络模型的处理方法的示意性流程图;
图8为本申请实施例提供的一种算子切分的示意性流程图;
图9为本申请实施例提供的一种卷积运算过程的示意性流程图；
图10为本申请实施例提供的另一种神经网络模型的处理方法的示意性流程图;
图11为本申请实施例提供的一种异构图构图过程的示意性流程图;
图12为本申请实施例提供的另一种算子切分过程的示意性流程图;
图13为本申请实施例提供的异构并行图构图过程的示意性流程图;
图14为本申请实施例提供的异构并行执行阶段的示意性流程图;
图15是本申请实施例提供的一种神经网络模型的处理装置的示意性框图;
图16是本申请实施例提供的另一种神经网络模型的处理装置的示意性框图。
具体实施方式
下面将结合附图,对本申请实施例中的技术方案进行描述。
图1示出一种人工智能主体框架示意图,该主体框架描述了人工智能系统总体工作流程,适用于通用的人工智能领域需求。
下面从“智能信息链”(水平轴)和“信息技术(information technology,IT)价值 链”(垂直轴)两个维度对上述人工智能主题框架进行详细的阐述。
“智能信息链”反映从数据的获取到处理的一列过程。举例来说,可以是智能信息感知、智能信息表示与形成、智能推理、智能决策、智能执行与输出的一般过程。在这个过程中,数据经历了“数据—信息—知识—智慧”的凝练过程。
“IT价值链”从人智能的底层基础设施、信息(提供和处理技术实现)到系统的产业生态过程,反映人工智能为信息技术产业带来的价值。
(1)基础设施:
基础设施为人工智能系统提供计算能力支持,实现与外部世界的沟通,并通过基础平台实现支撑。
基础设施可以通过传感器与外部沟通,基础设施的计算能力可以由智能芯片提供。
这里的智能芯片可以是中央处理器(central processing unit,CPU)、神经网络处理器(neural-network processing unit,NPU)、图形处理器(graphics processing unit,GPU)、专门应用的集成电路(application specific integrated circuit,ASIC)以及现场可编程门阵列(field programmable gate array,FPGA)等硬件加速芯片。
基础设施的基础平台可以包括分布式计算框架及网络等相关的平台保障和支持,可以包括云存储和计算、互联互通网络等。
例如,对于基础设施来说,可以通过传感器和外部沟通获取数据,然后将这些数据提供给基础平台提供的分布式计算系统中的智能芯片进行计算。
(2)数据:
基础设施的上一层的数据用于表示人工智能领域的数据来源。该数据涉及到图形、图像、语音、文本,还涉及到传统设备的物联网数据,包括已有系统的业务数据以及力、位移、液位、温度、湿度等感知数据。
(3)数据处理:
上述数据处理通常包括数据训练,机器学习,深度学习,搜索,推理,决策等处理方式。
其中,机器学习和深度学习可以对数据进行符号化和形式化的智能信息建模、抽取、预处理、训练等。
推理是指在计算机或智能系统中,模拟人类的智能推理方式,依据推理控制策略,利用形式化的信息进行机器思维和求解问题的过程,典型的功能是搜索与匹配。
决策是指智能信息经过推理后进行决策的过程,通常提供分类、排序、预测等功能。
(4)通用能力:
对数据经过上面提到的数据处理后,进一步基于数据处理的结果可以形成一些通用的能力,比如可以是算法或者一个通用系统,例如,翻译,文本的分析,计算机视觉的处理,语音识别,图像的识别等等。
(5)智能产品及行业应用:
智能产品及行业应用指人工智能系统在各领域的产品和应用,是对人工智能整体解决方案的封装,将智能信息决策产品化、实现落地应用,其应用领域主要包括:智能制造、智能交通、智能家居、智能医疗、智能安防、自动驾驶,智慧城市,智能终端等。
本申请实施例可以应用在人工智能中的很多领域,例如,智能制造、智能交通、智能 家居、智能医疗、智能安防、自动驾驶,智慧城市等领域。
具体地,本申请实施例可以具体应用在自动驾驶、图像分类、图像检索、图像语义分割、图像质量增强、图像超分辨率和自然语言处理等需要使用(深度)神经网络的领域。尤其适用于要求低时延的任务场景中。
下面对图片分类和监控这两种应用场景进行简单的介绍。
图片分类:
当用户在终端设备(例如,手机)或者云盘上存储了大量的图片时,通过对相册中图像进行识别可以方便用户或者系统对相册进行分类管理,提升用户体验。
利用本申请实施例的神经网络模型的处理方法,能够提高该神经网络模型的处理速度,即提高对图片进行分类的速度,降低时延,有利于实时为不同的类别的图片打上标签,便于用户查看和查找。另外,这些图片的分类标签也可以提供给相册管理系统进行分类管理,节省用户的管理时间,提高相册管理的效率,提升用户体验。
监控:
监控场景包括:智慧城市、野外监控、室内监控、室外监控、车内监控等。其中,智慧城市场景下,需要进行多种属性识别,例如行人属性识别和骑行属性识别,深度神经网络凭借着其强大的能力在多种属性识别中发挥着重要的作用。
通过采用本申请实施例的神经网络模型的处理方法,能够提高神经网络模型的处理效率,有利于对输入的道路画面进行实时处理,更快地识别出道路画面中的不同的属性信息。
由于本申请实施例涉及大量神经网络的应用,为了便于理解,下面先对本申请实施例可能涉及的神经网络的相关术语和概念进行介绍。
(1)神经网络
神经网络可以是由神经单元组成的,神经单元可以是指以x s和截距1为输入的运算单元,该运算单元的输出可以为:
$$h_{W,b}(x)=f\left(W^{T}x\right)=f\left(\sum_{s=1}^{n}W_{s}x_{s}+b\right)$$
其中,s=1、2、……n,n为大于1的自然数,W s为x s的权重,b为神经单元的偏置。
f为神经单元的激活函数(activation functions),用于将非线性特性引入神经网络中,来将神经单元中的输入信号变换为输出信号。该激活函数的输出信号可以作为下一层的输入。例如,激活函数可以是ReLU,tanh或sigmoid函数。
神经网络是将多个上述单一的神经单元联结在一起形成的网络,即一个神经单元的输出可以是另一个神经单元的输入。每个神经单元的输入可以与前一层的局部接受域相连,来提取局部接受域的特征,局部接受域可以是由若干个神经单元组成的区域。
(2)深度神经网络
深度神经网络(deep neural network,DNN),也称多层神经网络,可以理解为具有多层隐含层的神经网络。按照不同层的位置对DNN进行划分,DNN内部的神经网络可以分为三类:输入层,隐含层,输出层。一般来说第一层是输入层,最后一层是输出层,中间的层数都是隐含层。层与层之间是全连接的,也就是说,第i层的任意一个神经元一定与第i+1层的任意一个神经元相连。
虽然DNN看起来很复杂，但是就每一层的工作来说，其实并不复杂，简单来说就是如下线性关系表达式：$\vec{y}=\alpha(W\vec{x}+\vec{b})$，其中，$\vec{x}$是输入向量，$\vec{y}$是输出向量，$\vec{b}$是偏移向量，W是权重矩阵（也称系数），α()是激活函数。每一层仅仅是对输入向量$\vec{x}$经过如此简单的操作得到输出向量$\vec{y}$。由于DNN层数多，系数W和偏移向量$\vec{b}$的数量也比较多。这些参数在DNN中的定义如下所述：以系数W为例：假设在一个三层的DNN中，第二层的第4个神经元到第三层的第2个神经元的线性系数定义为$W_{24}^{3}$，上标3代表系数W所在的层数，而下标对应的是输出的第三层索引2和输入的第二层索引4。综上，第L-1层的第k个神经元到第L层的第j个神经元的系数定义为$W_{jk}^{L}$。
需要注意的是,输入层是没有W参数的。在深度神经网络中,更多的隐含层让网络更能够刻画现实世界中的复杂情形。理论上而言,参数越多的模型复杂度越高,“容量”也就越大,也就意味着它能完成更复杂的学习任务。训练深度神经网络的也就是学习权重矩阵的过程,其最终目的是得到训练好的深度神经网络的所有层的权重矩阵(由很多层的向量W形成的权重矩阵)。
(3)卷积神经网络
卷积神经网络（convolutional neural network，CNN）是一种带有卷积结构的深度神经网络。卷积神经网络包含了一个由卷积层和子采样层构成的特征抽取器，该特征抽取器可以看作是滤波器。卷积层是指卷积神经网络中对输入信号进行卷积处理的神经元层。在卷积神经网络的卷积层中，一个神经元可以只与部分邻层神经元连接。一个卷积层中，通常包含若干个特征平面，每个特征平面可以由一些矩形排列的神经单元组成。同一特征平面的神经单元共享权重，这里共享的权重就是卷积核。共享权重可以理解为提取图像信息的方式与位置无关。卷积核可以以随机大小的矩阵的形式初始化，在卷积神经网络的训练过程中卷积核可以通过学习得到合理的权重。另外，共享权重带来的直接好处是减少卷积神经网络各层之间的连接，同时又降低了过拟合的风险。
(4)计算图
神经网络模型对应的计算图用于指示神经网络模型的计算过程。或者说,神经网络模型对应的计算图可以理解为神经网络模型的计算过程的一种表现形式。
一个计算图的任一局部都可以作为该计算图的子图,计算图本身也可以视为自身的子图。计算图的任一局部也可以理解为计算图中的任一图结构。
计算图的子图也可以称为计算图的图结构。
图的开销可以包括图的计算开销。图的计算开销指的是图中的所有算子对应的计算任务的计算开销。计算开销也可以称为计算量。
进一步地,图的开销还可以包括图的通信开销。
(5)操作者(Actor)框架
Actor框架也可以称为Actor模型。Actor模型用于处理并发计算。
一个Actor指的是一个最基本的计算单元,它能接收一个消息并基于其执行计算。在Actor模型里面,每个Actor都有地址,能够通过发送消息来通信。每个Actor是完全独立的,可以同时执行各自的操作。各个Actor可以运行于同一个机器上,也可以运行于不同的机器上。
Actor模型通常有两种任务调度方式:基于线程的调度和基于事件的调度。
基于线程的调度:为每个Actor分配一个线程,接收一个消息时,如果当前Actor的邮箱(mail box)为空,会阻塞当前线程。
基于事件的调度:事件可以理解为任务或事件的到来,此时才会为Actor的任务分配线程并执行。
Actor的输入是接收到的消息,Actor接收到消息后处理消息中定义的任务,Actor处理完任务后可以发送给其他的Actor。
在一个系统中,可以将大规模的任务分解为多个小任务,这些小任务可以由多个Actor并发执行,从而减少任务的完成时间。
如图2所示,本申请实施例提供了一种系统架构100。在图2中,数据采集设备160用于采集训练数据。例如,若训练数据为图像数据,则训练数据可以包括训练图像以及训练图像对应的处理结果。例如,训练图像对应的分类结果,训练图像的分类结果可以是人工预先标注的结果。
在采集到训练数据之后,数据采集设备160将这些训练数据存入数据库130,训练设备120基于数据库130中维护的训练数据训练得到目标模型/规则101。
下面对训练设备120基于训练数据得到目标模型/规则101进行描述,训练设备120对输入的原始数据进行处理,将输出值与目标值进行对比,直到训练设备120输出的值与目标值的差值小于一定的阈值,从而完成目标模型/规则101的训练。
本申请实施例中的目标模型/规则101具体可以为神经网络模型。例如,卷积神经网络或残差网络。需要说明的是,在实际的应用中,所述数据库130中维护的训练数据不一定都来自于数据采集设备160的采集,也有可能是从其他设备接收得到的。另外需要说明的是,训练设备120也不一定完全基于数据库130维护的训练数据进行目标模型/规则101的训练,也有可能从云端或其他地方获取训练数据进行模型训练,上述描述不应该作为对本申请实施例的限定。
根据训练设备120训练得到的目标模型/规则101可以应用于不同的系统或设备中,如应用于图2所示的执行设备110,所述执行设备110可以是终端,如手机终端,平板电脑,笔记本电脑,增强现实(augmented reality,AR)AR/虚拟现实(virtual reality,VR),车载终端等,还可以是服务器或者云端等。在图2中,执行设备110配置输入/输出(input/output,I/O)接口112,用于与外部设备进行数据交互,用户可以通过客户设备140向I/O接口112输入数据。
在执行设备110对输入数据进行预处理,或者在执行设备110的计算模块111执行计算等相关的处理过程中,执行设备110可以调用数据存储系统150中的数据、代码等以用于相应的处理,也可以将相应处理得到的数据、指令等存入数据存储系统150中。
最后,I/O接口112将处理结果,如上述得到的数据的处理结果返回给客户设备140,从而提供给用户。
值得说明的是,训练设备120可以针对不同的目标或不同的任务,基于不同的训练数据生成相应的目标模型/规则101,该相应的目标模型/规则101即可以用于实现上述目标或完成上述任务,从而为用户提供所需的结果。
在图2中所示情况下,用户可以手动给定输入数据,该手动给定可以通过I/O接口112提供的界面进行操作。另一种情况下,客户设备140可以自动地向I/O接口112发送输入数据,如果要求客户设备140自动发送输入数据需要获得用户的授权,则用户可以在客户设备140中设置相应权限。用户可以在客户设备140查看执行设备110输出的结果,具体 的呈现形式可以是显示、声音、动作等具体方式。客户设备140也可以作为数据采集端,采集如图所示输入I/O接口112的输入数据及输出I/O接口112的输出结果作为新的样本数据,并存入数据库130。当然,也可以不经过客户设备140进行采集,而是由I/O接口112直接将如图所示输入I/O接口112的输入数据及输出I/O接口112的输出结果,作为新的样本数据存入数据库130。
值得注意的是,图2仅是本申请实施例提供的一种系统架构的示意图,图中所示设备、器件、模块等之间的位置关系不构成任何限制,例如,在图2中,数据存储系统150相对执行设备110是外部存储器,在其它情况下,也可以将数据存储系统150置于执行设备110中。
如图2所示,根据训练设备120训练得到目标模型/规则101,该目标模型/规则101在本申请实施例中可以是本申请中的神经网络模型,具体的,本申请实施例的神经网络模型可以为CNN或残差网络等。
CNN是一种非常常见的神经网络,下面结合图3重点对CNN的结构进行详细的介绍。如前文的基础概念介绍所述,卷积神经网络是一种带有卷积结构的深度神经网络,是一种深度学习(deep learning)架构,深度学习架构是指通过机器学习的算法,在不同的抽象层级上进行多个层次的学习。作为一种深度学习架构,CNN是一种前馈(feed-forward)人工神经网络,该前馈人工神经网络中的各个神经元可以对输入其中的图像作出响应。
如图3所示,卷积神经网络(CNN)200可以包括输入层210,卷积层/池化层220(其中池化层为可选的),以及全连接层(fully connected layer)230。
卷积层/池化层220:
卷积层:
如图3所示卷积层/池化层220可以包括如示例221-226层,举例来说:在一种实现中,221层为卷积层,222层为池化层,223层为卷积层,224层为池化层,225为卷积层,226为池化层;在另一种实现方式中,221、222为卷积层,223为池化层,224、225为卷积层,226为池化层。即卷积层的输出可以作为随后的池化层的输入,也可以作为另一个卷积层的输入以继续进行卷积操作。
下面将以卷积层221为例,介绍一层卷积层的内部工作原理。
卷积层221可以包括很多个卷积算子,卷积算子也称为核,其在图像处理中的作用相当于一个从输入图像矩阵中提取特定信息的过滤器,卷积算子本质上可以是一个权重矩阵,这个权重矩阵通常被预先定义,在对图像进行卷积操作的过程中,权重矩阵通常在输入图像上沿着水平方向一个像素接着一个像素(或两个像素接着两个像素……这取决于步长stride的取值)的进行处理,从而完成从图像中提取特定特征的工作。该权重矩阵的大小应该与图像的大小相关,需要注意的是,权重矩阵的纵深维度(depth dimension)和输入图像的纵深维度是相同的,在进行卷积运算的过程中,权重矩阵会延伸到输入图像的整个深度。因此,和一个单一的权重矩阵进行卷积会产生一个单一纵深维度的卷积化输出,但是大多数情况下不使用单一权重矩阵,而是应用多个尺寸(行×列)相同的权重矩阵,即多个同型矩阵。每个权重矩阵的输出被堆叠起来形成卷积图像的纵深维度,这里的维度可以理解为由上面所述的“多个”来决定。不同的权重矩阵可以用来提取图像中不同的特征,例如一个权重矩阵用来提取图像边缘信息,另一个权重矩阵用来提取图像的特定颜色, 又一个权重矩阵用来对图像中不需要的噪点进行模糊化等。该多个权重矩阵尺寸(行×列)相同,经过该多个尺寸相同的权重矩阵提取后的特征图的尺寸也相同,再将提取到的多个尺寸相同的特征图合并形成卷积运算的输出。
这些权重矩阵中的权重值在实际应用中需要经过大量的训练得到,通过训练得到的权重值形成的各个权重矩阵可以用来从输入图像中提取信息,从而使得卷积神经网络200进行正确的预测。
当卷积神经网络200有多个卷积层的时候,初始的卷积层(例如221)往往提取较多的一般特征,该一般特征也可以称之为低级别的特征;随着卷积神经网络200深度的加深,越往后的卷积层(例如226)提取到的特征越来越复杂,比如高级别的语义之类的特征,语义越高的特征越适用于待解决的问题。
池化层:
由于常常需要减少训练参数的数量,因此卷积层之后常常需要周期性的引入池化层,在如图3中220所示例的221-226各层,可以是一层卷积层后面跟一层池化层,也可以是多层卷积层后面接一层或多层池化层。在图像处理过程中,池化层的唯一目的就是减少图像的空间大小。池化层可以包括平均池化算子和/或最大池化算子,以用于对输入图像进行采样得到较小尺寸的图像。平均池化算子可以在特定范围内对图像中的像素值进行计算产生平均值作为平均池化的结果。最大池化算子可以在特定范围内取该范围内值最大的像素作为最大池化的结果。另外,就像卷积层中用权重矩阵的大小应该与图像尺寸相关一样,池化层中的运算符也应该与图像的大小相关。通过池化层处理后输出的图像尺寸可以小于输入池化层的图像的尺寸,池化层输出的图像中每个像素点表示输入池化层的图像的对应子区域的平均值或最大值。
全连接层230:
在经过卷积层/池化层220的处理后,卷积神经网络200还不足以输出所需要的输出信息。因为如前所述,卷积层/池化层220只会提取特征,并减少输入图像带来的参数。然而为了生成最终的输出信息(所需要的类信息或其他相关信息),卷积神经网络200需要利用全连接层230来生成一个或者一组所需要的类的数量的输出。因此,在全连接层230中可以包括多层隐含层(如图3所示的231、232至23n),该多层隐含层中所包含的参数可以根据具体的任务类型的相关训练数据进行预先训练得到,例如该任务类型可以包括图像识别,图像分类,图像超分辨率重建等。
在全连接层230中的多层隐含层之后,也就是整个卷积神经网络200的最后层为输出层240,该输出层240具有类似分类交叉熵的损失函数,具体用于计算预测误差,一旦整个卷积神经网络200的前向传播(如图3由210至240方向的传播为前向传播)完成,反向传播(如图3由240至210方向的传播为反向传播)就会开始更新前面提到的各层的权重值以及偏差,以减少卷积神经网络200的损失,及卷积神经网络200通过输出层输出的结果和理想结果之间的误差。
需要说明的是,如图3所示的卷积神经网络200仅作为一种卷积神经网络的示例,在具体的应用中,卷积神经网络还可以以其他网络模型的形式存在,例如,仅包括图3中所示的网络结构的一部分。
如图4所示,本申请实施例提供了一种系统架构300。该系统架构包括本地设备301、 本地设备302以及执行设备310和数据存储系统350,其中,本地设备301和本地设备302通过通信网络与执行设备310连接。
执行设备310可以由一个或多个服务器实现。可选的,执行设备310可以与其它计算设备配合使用,例如:数据存储器、路由器、负载均衡器等设备。执行设备310可以布置在一个物理站点上,或者分布在多个物理站点上。执行设备310可以使用数据存储系统350中的数据,或者调用数据存储系统350中的程序代码来实现本申请实施例的神经网络模型的处理方法。
具体地,在一种实现方式中,执行设备310可以执行以下过程:
获取m个处理器的计算能力,m为大于1的整数;
对神经网络模型对应的第一计算图中的第一子图进行算子切分,以得到第二计算图,第二计算图包括第一子图对应的并行的n个子图,n个子图的开销与m个处理器中的n个处理器的计算能力满足第一匹配关系,1<n≤m,n为整数;
将第二计算图的p个子图的计算任务分配给m个处理器执行,其中,该p个子图包括该n个子图,将n个子图的计算任务分别分配给n个处理器执行,n个处理器中的每个处理器执行该n个子图的计算任务中的一个,p为大于或等于n的整数。
用户可以操作各自的用户设备(例如本地设备301和本地设备302)与执行设备310进行交互。每个本地设备可以表示任何计算设备,例如个人计算机、计算机工作站、智能手机、平板电脑、智能摄像头、智能汽车或其他类型蜂窝电话、媒体消费设备、可穿戴设备、机顶盒、游戏机等。
每个用户的本地设备可以通过任何通信机制/通信标准的通信网络与执行设备310进行交互,通信网络可以是广域网、局域网、点对点连接等方式,或它们的任意组合。
在一种实现方式中,本地设备301、本地设备302从执行设备310获取到神经网络模型的相关参数,将神经网络模型部署在本地设备301、本地设备302上,利用本申请实施例的神经网络模型的处理方法进行图像分类、进行图像处理、语音处理或者文本处理等等。
在另一种实现中,执行设备310上可以直接部署神经网络模型,执行设备310通过从本地设备301和本地设备302获取待处理数据,并利用本申请实施例的神经网络模型的处理方法对待处理数据进行处理。
上述执行设备310也可以为云端设备,此时,执行设备310可以部署在云端;或者,上述执行设备310也可以为终端设备,此时,执行设备310可以部署在用户终端侧,本申请实施例对此并不限定。
目前,可以采用数据并行的方式提高神经网络模型的处理效率。例如,服务端可以按照分布式机器的数量将大批量的数据平均分配给各个分布式机器。各个分布式机器分别基于神经网络模型的计算图对分配的数据进行处理。然而数据并行的方式通常应用于大规模的神经网络模型的训练场景中,而在实时性要求较高的推理场景中,每次处理的数据规模较小,例如,单张图像,数据并行的方式无法实现数据的平均分配,难以提高模型的处理效率。而且,数据并行的方式会引入额外的数据同步开销以及内存的开销,尤其对于端侧来说,硬件的计算能力以及存储能力有限,无法满足数据同步的开销以及内存的开销,尤其是对时延要求较高的场景下,难以显著提高神经网络模型的处理效率。
此外,还可以采用多核处理器运行神经网络模型以提高神经网络模型的处理效率。即 通过多核处理器中的多个核共同执行神经网络模型的运算。但该方案仅利用了单个处理器的计算资源,处理效率的提升过度依赖该处理器的运算能力,其他处理器处于闲置状态,造成了资源的浪费。
本申请实施例提供了一种神经网络模型的处理方法,能够提高神经网络模型的处理效率。
本申请实施例的方案可以应用于神经网络模型的训练场景,也可以应用于神经网络模型的推理场景。通常,推理场景对处理性能的要求更高,本申请实施例中主要以本申请实施例的方案应用于推理场景为例进行说明。
相较于在服务端部署运行神经网络模型,端侧具有实时性高、保护数据隐私以及节省云端资源等优势。在端侧部署运行神经网络模型的应用越来越广泛。为了便于描述和理解,本申请实施例主要以本申请实施例的方案应用于端侧推理场景为例进行说明。
图5示出了一种端侧推理的整体架构。深度学习框架用于将神经网络模型所表达的深度学习任务转换为能够在处理器上执行的执行和数据。
各个端侧的框架具有各自的模型表达。在推理过程中,按照模型进行推理运行。离线模型是框架的模型表达的离线文件。
如图5所示,端侧框架可以将其他端侧框架下构建的三方模型通过转换工具进行转换,得到当前端侧框架的中间表达(intermediate representation,IR)的离线文件。或者,端侧框架也可以通过自有构图接口获取当前端侧框架下构建的模型的计算图。
在运行时初始化阶段,读取该离线文件,生成运行时的拓扑图,并初始化图中的每个算子节点,定义算子在运行时的实际行为。该阶段用于为运行时调度阶段做准备,节约运行时调度的性能开销。具体地,该阶段可以执行拓扑排序、数据重排、内存申请、算法选择、预设后端以及子图划分等操作。
预设后端也可以称为算子硬件部署选择。具体地,可以根据用户设置、运行环境以及算子的支持情况等对模型中的算子预设后端。例如,用户设置使能了CPU,则需要将算子部署于CPU上。再如,用户设置使能了NPU,但运行设备不具备NPU,则将算子部署在其他硬件上,比如CPU上。再如,用户设置使能了NPU,且运行设备包括NPU,但NPU不支持某个算子的运算,则需要将该该算子部署于其他硬件上,比如CPU上。
算法选择指的是为算子选择具体的算法实现运算过程。例如,对于卷积算子,可以通过1*1算法、winograd算法、Img2Col算法或滑窗算法等实现卷积运算。
具体地,在初始化阶段的各项操作后可以生成算子的核(kernel)对象,各个子图按照拓扑序保存子图内的算子的kernel对象。
在运行时调度阶段,即端侧推理阶段,根据运行时初始化阶段构造的拓扑图以及图中的算子信息,调用相应的硬件,例如,CPU、NPU或GPU等,对用户的输入数据进行推理。也就是说,在运行时按照顺序调用相应的硬件依次执行kernel对象的运行(run)函数,实现算子的运算。例如,拓扑图被划分为子图1、子图2、子图3和子图4,按照子图1、子图2、子图3和子图4的顺序在相应的硬件上依次执行,得到输出结果。一个硬件执行的过程中,其他硬件处于闲置状态,造成了资源的浪费,推理性能无法得到提升。
图6示出了本申请实施例提供的一种端侧推理的系统架构。
相较于图5所示的系统架构,图6所示的系统架构中,在运行时初始化阶段增加了构 图的操作,以得到异构图。异构图用于指示异构硬件上部署的子图。
示例性地,如图6所示,构图操作可以包括异构硬件能力计算、算子切分、并行图搜索、代价模型(costmodel)计算以及异构图构图等操作。
异构硬件能力计算指的是计算异构硬件的计算能力。
例如,可以根据异构硬件的特性,例如,乘法的计算速率、数据传输的带宽限制等,计算异构硬件的计算能力。
异构硬件能力计算的具体描述可以参见方法700的步骤S710或方法800的步骤S810,此处不展开描述。
算子切分指的是对模型对应的第一计算图中的第一子图进行算子切分,以得到第二计算图。
代价模型计算指的是计算图的开销。
具体描述可以参见方法700的步骤S720或方法800的步骤S820,此处不展开描述。
并行图搜索指的是在计算图中搜索适合并行执行的子图的组合。
具体描述可以参见方法700或方法800的步骤S830,此处不展开描述。
需要说明的是,并行图搜索为可选操作。
异构图构图指的是构造异构并行图。异构并行图用于指示各个子图对应的硬件。
构造异构并行图也就是为各个子图选择相应的硬件,并进行部署。
具体描述可以参见方法700的步骤S730或方法800的步骤S840。
在系统架构中引入异步调度框架,以实现异构图在异构硬件上的并行推理。
例如,如图6所示,异步调度框架可以通过Actor模型实现。或者,异步调度框架也可以采用其他形式实现,只要能够实现异步调度即可,本申请实施例对此不做限定。
如图6所示,在异步调度框架采用Actor模型的情况下,运行初始化阶段还包括编排的操作。
具体地,将异构图中的子图(subgraph)转换为Actor,并定义Actor行为。Actor的行为包括前置行为、执行行为和后置行为。编排的具体实现方式可以参见后文中的方法700或方法800中的步骤S850。
需要说明的是,运行初始化阶段的编排操作是与异步调度框架的Actor模型对应的。若异步调度框架采用其他形式实现,则运行初始化阶段的编排的操作可以进行相应的调整,以适应异步调度框架,或者,运行初始化阶段也可以不包括编排的操作。
基于编排好的actor列表即可执行模型的运算。例如,基于图6中的Actor1、Actor2、Actor3和Actor4执行异构并行运算。
应理解,图6中仅以端侧推理的系统架构为例进行说明,不对本申请实施例的方案的应用场景构成限定。
下面结合图7对本申请实施例中的神经网络模型的处理方法进行详细的描述。
图7示出了本申请实施例提供的神经网络模型的处理方法700。图7所示的方法可以由神经网络模型的执行装置来执行,该装置可以是云服务设备,也可以是终端设备,例如,电脑、服务器等运算能力足以用来执行神经网络模型运算的装置,也可以是由云服务设备和终端设备构成的系统。
方法700既可以应用于神经网络模型的推理场景中,也可以应用于神经网络模型的训 练场景中。
示例性地,若方法700应用于神经网络模型的推理场景中,方法700可以由图2中的执行设备110或图4中的执行设备310或本地设备执行。示例性地,若方法700应用于神经网络模型的训练场景中,方法700可以由图2中的训练设备120或图4中的执行设备310执行。
方法700包括步骤S710至步骤S730。下面对步骤S710至步骤S730进行详细介绍。
S710,获取m个处理器的计算能力。该m个处理器用于执行神经网络模型的运算,m为大于1的整数。
示例性地,当方法700应用于端侧场景中时,即方法700由终端设备执行时,m个处理器可以理解为m个芯片,或者说,m个硬件。在终端设备中,该m个处理器是m个不同类型的处理器。换言之,该m个处理器为终端设备中的m个异构硬件。该m个异构硬件即为能够共同执行神经网络模型的运算的不同类型的硬件。
示例性地,该m个处理器既可以包括中央处理器(central processing unit,CPU),也可以包括神经网络运算处理器,神经网络运算处理器可以包括图形处理器(graphics processing unit,GPU)、神经网络处理器(neural-network processing unit,NPU)和张量处理器(tensor processing unit,TPU)等等。其中,TPU是谷歌(google)为机器学习全定制的人工智能加速器专用集成电路。例如,m可以为2,即终端设备中可以包括两个处理器,这两个处理器可以为CPU和GPU。再如,m可以为3,即终端设备中可以包括三个处理器,这三个处理器可以为CPU、GPU和NPU。
在分布式系统中,该m个处理器也可以理解为m个工作节点。该m个工作节点可以是相同类型的工作节点,也可以是不同类型的工作节点。该m个处理器可以包括单核处理器和多核处理器中的至少一种,本申请实施例对此不做限定。
处理器的计算能力,可以理解为,处理器能够承载的计算量,或者,处理器适合承载的计算量。处理器的计算能力也可以称为处理器的承载能力。
具体地,可以根据处理器的特性,例如,乘法的计算速率、数据传输的带宽限制等,计算处理器的计算能力。
该m个处理器的计算能力的具体计算方式可以根据实际需要设置,本申请实施例对此不作限定。
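作为一种示意，处理器计算能力的估算可以参考如下Python草图（假设性示例：函数名、加权系数与具体数值均为示意，并非本申请限定的计算方式）。

```python
def estimate_capability(mac_per_s, bandwidth_bytes_per_s, alpha=0.7, beta=0.3):
    """示意：根据乘法的计算速率与数据传输带宽估算处理器的计算能力得分。"""
    # alpha、beta 为示意性的加权系数，可按硬件特性调整
    return alpha * mac_per_s + beta * bandwidth_bytes_per_s

# 示例（数值仅为示意）：据此可以得到各处理器计算能力的相对比例
cap_cpu = estimate_capability(mac_per_s=2.0e11, bandwidth_bytes_per_s=3.0e10)
cap_gpu = estimate_capability(mac_per_s=4.0e11, bandwidth_bytes_per_s=6.0e10)
print(cap_cpu / cap_gpu)  # 约为 1:2
```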
本申请实施例中的神经网络模型可以是现有的神经网络模型,例如,CNN模型、残差网络模型或循环神经网络模型等。或者,本申请实施例中的神经网络模型也可以用户自行构建的,本申请实施例对此不做限定。
本申请实施例的方案能够应用于多个领域,应用领域的类型与神经网络模型的任务相关。例如,神经网络模型用于图像处理任务,则本申请实施例的方案可以应用于计算机视觉领域。具体地,图像处理任务包括图像分类、图像检测、图像分割、图像识别或图像生成等。再如,神经网络模型用于文本处理任务,则本申请实施例的方案可以应用于自然语言处理领域。具体地,文本处理任务包括文本识别或文本翻译等。再如,神经网络模型用于语音处理任务,则本申请实施例的方案可以应用于语音处理领域。具体地,语音处理任务包括语音识别等。本申请实施例对此不做限定。
步骤S710可以由该m个处理器中的一个处理器执行,也可以由该m个处理器之外的 其他装置执行,本申请实施例对此不做限定。
S720,对神经网络模型对应的第一计算图中的第一子图进行算子切分,以得到第二计算图,第二计算图包括第一子图对应的并行的n个子图,该n个子图的开销与该m个处理器中的n个处理器的计算能力满足第一匹配关系,1<n≤m,n为整数。
神经网络模型对应的计算图用于指示神经网络模型的计算过程。或者说,神经网络模型对应的计算图可以理解为神经网络模型的计算过程的一种表现形式。
神经网络模型对应的第一计算图可以为初始计算图。
示例性地,在方法700应用于推理场景中时,该神经网络模型可以为推理框架识别出的模型。
算子用于执行神经网络模型中的计算任务。示例性地,算子可以理解为神经网络模型中的网络层。例如,卷积算子可以理解为神经网络模型中的卷积层。
算子切分可以理解为算子对应的计算任务的切分。
并行的n个子图指的是能够并行执行的n个子图。换言之,并行的n个子图对应的计算任务能够并行执行。即将包含算子的子图切分为包含多个可并行执行的算子的子图,或者说,切分为多个可以并行执行的子图。
第一子图对应的并行的n个子图指的是通过对第一子图进行算子切分得到的n个子图。
也就是说,对第一子图进行算子切分得到并行的n个子图,将第一计算图中的第一子图替换为该n个子图,进而得到第二计算图。
示例性地,子图的开销可以通过图6所示的架构中的costmodel计算。
示例性地,子图的开销可以包括子图的计算开销。子图的计算开销指的是子图中的所有算子的计算开销。
进一步地,子图的开销还可以包括子图的通信开销。
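下面给出一种子图开销计算的示意性草图（假设性示例，算子计算量的估算公式与字段名称均为示意）。

```python
def op_cost(op):
    """示意：按算子类型估算计算量，例如卷积按输出元素数与卷积核体积的乘积估算。"""
    if op["type"] == "Conv2D":
        n, h, w, c_out = op["output_shape"]
        kh, kw, c_in = op["kernel_shape"]
        return 2 * n * h * w * c_out * kh * kw * c_in
    # 其他类型算子按各自的估算公式处理，此处简化为输出元素数
    n, h, w, c = op["output_shape"]
    return n * h * w * c

def subgraph_cost(ops, comm_bytes=0, bandwidth=1.0):
    # 子图开销 = 子图内所有算子的计算开销之和，必要时再叠加通信开销
    return sum(op_cost(op) for op in ops) + comm_bytes / bandwidth
```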
通过算子切分的方式将第一子图切分为并行的n个子图。第一子图的计算量可以等于该n个子图的计算量之和。
该n个子图的开销与该m个处理器中的n个处理器的计算能力满足第一匹配关系,也可以称为,该n个子图的开销与该m个处理器中的n个处理器的计算能力匹配。
具体地,根据该m个处理器中的n个处理器的计算能力将该第一子图切分为并行的n个子图。
换言之,本申请实施例的方案能够感知不同处理器的计算能力的差异,并基于此进行第一子图的切分。
该n个子图的开销与该n个处理器的计算能力匹配。示例性地,该n个子图中的子图A与该n个处理器中的处理器A的计算能力匹配,则子图A和处理器A之间具有对应关系。也就是说,该n个子图与该n个处理器之间是一一对应的。该对应关系可以用于确定步骤S730中该n个子图的计算任务的部署的方式。
n与m可以相同,也可以不同。换言之,并行的子图的数量与处理器的数量可以相同,也可以不同。
需要说明的是,第一子图可以为一个,也可以为多个。若第一子图为多个,则可以分别对每个第一子图进行切分,得到每个第一子图对应的多个并行的子图。不同的第一子图 切分后得到的并行的子图的数量可以相同,也可以不同。本申请实施例对此不做限定。
步骤S720可以由该m个处理器中的一个处理器执行,也可以由该m个处理器之外的其他装置执行,本申请实施例对此不做限定。
S730,将第二计算图的p个子图的计算任务分配给该m个处理器执行,其中,该p个子图包括该n个子图,该n个子图的计算任务分别分配给该n个处理器执行,n个处理器中的每个处理器执行n个子图的计算任务中的一个,p为大于或等于n的整数。
将第二计算图划分为p个子图时,该n个子图中的每个子图均可以作为该p个子图中的一个。
将第二计算图的p个子图的计算任务分配给该m个处理器执行,也可以理解为,将该p个子图部署于该m个处理器上,即从该m个处理器中为该p个子图选择用于执行对应计算任务的处理器,并将该p个子图部署于该m个处理器上。在运行时调度该m个处理器执行第二计算图的p个子图的计算任务。该m个处理器执行该p个子图的计算任务以完成神经网络模型的运算,或者说,该m个处理器可以基于该p个子图完成神经网络模型的运算。
在该m个处理器为异构硬件的情况下,为子图进行硬件选择和部署的过程也可以称为异构图构图过程。异构图用于指示该p个子图对应的异构硬件。或者说,异构图用于指示异构硬件上部署的各个子图。该p个子图中的n个子图为并行执行的子图,在该情况下,异构图,也可以称为,异构并行图。
一个子图部署于一个处理器上,一个处理器可以部署多个子图。
第二计算图可以划分为p个子图,具体的划分方式可以根据需要设定。
例如,拓扑有序的多个算子可以构成一个子图。一个子图由一个硬件设备完成。这样,能够使得拓扑有序的多个算子在同一个处理器上执行,减少了通信开销。再如,单一算子也可以构成一个子图。
如前所述,该n个子图的开销与该n个处理器的计算能力匹配。该n个子图和该n个处理器之间存在对应关系。也就是说,该n个子图的开销与该n个处理器的计算能力匹配,该n个子图和该n个处理器之间存在对应关系。该n个子图的计算任务分别分配给该n个处理器执行,具体指的是,将该n个子图的计算任务分别分配给计算能力与该n个子图的开销匹配的处理器执行,即将该n个子图的计算任务分别分配给该n个子图对应的处理器执行。即根据该n个子图在进行算子切分时所定义的部署方式部署该n个子图的计算任务。
示例性地,步骤S730可以由该m个处理器中的一个处理器执行,也可以由该m个处理器之外的其他装置执行,本申请实施例对此不做限定。
在本申请实施例中，根据不同处理器的计算能力对第一子图进行算子切分，以得到匹配不同处理器的计算能力的多个并行的子图，使得该多个并行的子图的计算任务可以并行执行于计算能力相匹配的处理器上，合理利用硬件资源，有利于提高模型的处理效率。
示例性地,该n个子图的开销与该n个处理器的计算能力匹配,可以包括:该n个子图的开销小于或等于该n个处理器的计算能力。
也就是说,开销与计算能力匹配可以为开销小于或等于计算能力。
然而,该n个子图的计算任务均执行完后,才能得到第一子图的计算结果。已经执行完的子图需要等待其他并行的子图计算完成,也就是说,该n个子图的计算时间取决于该 n个子图中最长的计算时间。在上述匹配方式下,计算能力较低的处理器可能会处理开销较大的子图的计算任务,导致耗时较长,进而影响该n个子图的计算时间。
可选地,该n个子图的开销与该n个处理器的计算能力满足第一匹配关系,可以包括:该n个子图的开销的比值与该n个处理器的计算能力的比值之间的差值小于或等于第一阈值。
例如,该n个子图为两个子图,该两个子图的开销之比与该m个处理器中的两个处理器的计算能力之比小于或等于第一阈值,则该两个子图的开销与该两个处理器的计算能力匹配。
或者，该n个子图属于多个候选子图的组合中的一个组合，每个候选子图的组合包括第一子图对应的n个候选子图，在该多个候选子图的组合中，该n个子图的开销的比值与该n个处理器的计算能力的比值之间的差值最小，则该n个子图的开销与该n个处理器的计算能力满足第一匹配关系。
这样,能够保证计算能力较强的处理器执行开销较大的子图的计算任务,计算能力较弱的处理器执行开销较小的子图的计算任务,提高处理器的计算能力与子图的开销之间的适配度,有利于进一步提高模型的处理效率。
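上述第一匹配关系的判断可以用如下示意性代码表示（假设性示例）。

```python
def ratios(values):
    total = sum(values)
    return [v / total for v in values]

def satisfy_first_relation(subgraph_costs, capabilities, first_threshold):
    """示意：逐项比较 n 个子图开销的占比与 n 个处理器计算能力的占比之差。"""
    return all(abs(c - p) <= first_threshold
               for c, p in zip(ratios(subgraph_costs), ratios(capabilities)))

# 例如两个子图开销之比为 83:157，两个处理器计算能力之比约为 1:2
print(satisfy_first_relation([83, 157], [1, 2], first_threshold=0.05))  # True
```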
可选地,第一子图的开销大于第一计算图中的至少一半的子图的开销。
可选地,该第一子图的开销大于或等于第二阈值。
示例性地,第二阈值可以根据处理器的计算能力确定。
这样,仅对开销较大的子图进行算子切分,有利于保证并行执行带来的性能的提升超过引入的通信开销,进而提高模型的处理效率。同时,在该多个处理器无法共享内存的情况下,处理器需要存储所需要的执行的子图中的算子的参数以执行该子图的计算任务。采用本申请实施例的方案,仅对开销较大的子图进行算子切分,能够避免由于算子切分而引入过多的内存开销。
进一步地,在第一子图包括多个算子的情况下,第一子图中的所有算子的执行顺序为串行执行。
该多个算子的执行顺序为串行执行,即第一子图中的多个算子是拓扑有序的。
或者说,第一子图可以为开销较大,且不包括并行执行的算子的图结构。
卷积类算子的计算量通常较大。例如,第一子图可以为连续卷积结构,即第一子图中可以包括多个连续的卷积类算子。
示例性地,卷积类算子可以包括:常规的卷积(convolution)算子或者深度卷积(Convolution Depthwise)算子等各种类型的卷积算子。
应理解,此处仅以卷积类算子为例进行说明,第一子图中的多个算子也可以包括其他类型的算子,例如,全连接算子等。
示例性地,可以在第一计算图中搜索适合进行算子切分的图结构,即第一子图,进而对第一子图进行算子切分,以得到第二计算图。
在对单个算子进行切分后,也就是将该算子的计算任务进行切分后,需要将该算子的切分后的计算任务分配给对应的处理器执行,各个处理器执行完切分后的计算任务后,将各个处理器的计算结果合并才能得到该算子的计算结果。单个算子切分引入的通信开销较大。
在本申请实施例的方案中,第一子图中包括拓扑有序的多个算子,对该第一子图进行算子切分,将切分后得到的n个子图的计算任务分配给对应的处理器,也就是,将该多个算子的计算任务进行切分后,将切分后的计算任务分配给对应的处理器,在各个处理器按照拓扑序完成该切分后的计算任务后,再将各个处理器的计算结果进行合并,得到该第一子图的计算结果,这样能够减少切分引入的通信开销,保证了模型的处理效率。换言之,处理器在执行该多个算子的计算任务时,无需对每个算子的计算任务进行分配,也无需将每个算子的拆分后的计算结果进行合并,而是按拓扑序执行切分后的多个算子的计算任务后对计算结果进行合并,相较于单个算子的切分,该方案能够减少通信开销,提高模型的处理效率。
该n个子图的输入数据是通过对第一子图的输入数据进行数据切分得到的。
对输入数据的切分操作可以通过引入切分算子实现。
可选地,第二计算图还包括切分算子,该切分算子用于将第一子图的输入数据进行切分,以得到该n个子图的输入数据。
该n个子图与第一子图中的算子的权重可以是相同的。也就是说,算子切分并没有改变第一子图中的算子的权重。不同的处理器可以基于相同的权重并行执行不同的计算任务。该计算任务的不同是由于输入数据的切分导致的。或者说,该计算任务的不同是由于输入数据的不同导致的。
例如,对算子A进行切分后得到算子B和算子C,算子A、算子B和算子C的权重是相同的。对算子A的输入数据A进行切分,得到算子B的输入数据B和算子C的输入数据C。算子A、算子B和算子C的计算任务的不同是由于三者的输入数据不同导致的。
神经网络模型中常见的数据可以表示为4维数据,该4个维度分别为批量大小(batch,N),通道数(channel,C),高度(height,H)以及宽度(width,W)。
具体地,该n个子图的输入数据是将第一子图的输入数据按照以下任一维度进行数据切分得到的:输入数据的批量大小N,输入数据的通道数C,输入数据的高度H或输入数据的宽度W。
按照上述任一维度进行切分也可以称为按照上述任一轴进行切分。例如,按照输入数据的高度维度进行切分,也可以称为按照输入数据的H轴进行切分。
可选地,数据切分的维度是根据第一子图的输入数据的数据排布情况确定的。具体地,该n个子图中的各个子图的输入数据在第一子图的输入数据的数据排布中为连续数据。
根据数据排布情况确定数据切分的维度,有利于保证切分后的输入数据为连续数据,即切分后的输入数据在读取时是连续读取的,避免出现跳跃读取数据的情况导致开销过大,进而有利于保证模型的处理性能。
进一步地,在输入数据的数据排布为批量N、高度H、宽度W、通道数C情况下,该n个子图的输入数据是该第一子图的输入数据按照高度维度进行切分得到的。
数据排布为批量N、高度H、宽度W、通道数C指的是,数据是按照C轴、W轴、H轴和N轴的顺序依次连续排布的。也就是说,C轴的数据为连续数据,C轴数据排布完成后,排布W轴的数据,C轴的数据排布完成后,排布H轴的数据,H轴的数据排布完成后,排布N轴的数据。
例如,1x 2x 2x 3的数据指的是批量为1,高度为2,宽度为2,通道数为3的数据, 3个通道的数据以如下三个矩阵的形式表示:
$$\begin{bmatrix}a&b\\c&d\end{bmatrix},\qquad\begin{bmatrix}a'&b'\\c'&d'\end{bmatrix},\qquad\begin{bmatrix}a''&b''\\c''&d''\end{bmatrix}$$
在数据排布为NHWC的情况下,上述1 x 2 x 2 x 3的数据可以表示为:
[a,a',a”,b,b',b”,c,c',c”,d,d',d”];
在该情况下,按照H轴进行数据切分,能够保证切分后的数据仍然为连续的数据。
在方法700应用于推理场景的情况下,输入数据的批量大小N通常为1,在该情况下,无法在该维度上对输入数据进行切分。按照H轴进行切分,切分后的数据为连续数据。而按照W轴和按照C轴进行切分,则需要跳跃的读取数据,也就是说读取数据有间隔,会增加底层数据同步的开销,进而导致模型的处理性能无法提升。本申请实施例的方案中,通过H轴对输入数据进行切分将计算任务分配至不同的处理器并行处理,有效提高了模型的处理性能。
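以下给出按H轴切分NHWC排布数据的示意性代码（假设性示例，使用numpy说明无重叠切分时各份数据对应连续的内存区域）。

```python
import numpy as np

def split_on_h(x, ratio):
    """示意：将 NHWC 排布的张量 x 按高度维度按 ratio 比例切分（无重叠情形）。"""
    h = x.shape[1]
    h1 = h * ratio[0] // sum(ratio)
    return x[:, :h1], x[:, h1:]

x = np.arange(1 * 240 * 240 * 128, dtype=np.float32).reshape(1, 240, 240, 128)
a, b = split_on_h(x, ratio=(83, 157))
# 由于数据按 C、W、H、N 的顺序连续排布且 N 为 1，切分得到的两份数据
# 分别对应原数据中一段连续的内存区域，读取时无需跳跃访问
assert a.shape == (1, 83, 240, 128) and b.shape == (1, 157, 240, 128)
```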
应理解,以上仅为示例,若数据排布为其他方式的排布,切分的维度也可以采用其他维度。例如,输入数据的数据排布为NWHC,则切分的维度也可以为宽度维度。
可选地,在第一子图中的算子为卷积类算子,且卷积类算子的滑动步长小于卷积核的高度的情况下,该n个子图中至少两个子图的输入数据中的部分数据相同。
或者说,在第一子图中的算子为卷积类算子,且卷积类算子的滑动步长小于卷积核的宽度的情况下,该n个子图中至少两个子图的输入数据中的部分数据相同。
也就是说,该n个子图中至少两个子图的输入数据有重叠。具体地,该n个子图中至少两个子图的输入数据在高度维度上有重叠。也就是说,对第一子图的输入数据进行有重叠的切分,以得到该n个子图的输入数据。
需要说明的是,输入数据按照H轴进行切分时,切分后的输入数据的高度大于或等于卷积核的高度。
第一子图的输出数据是通过对该n个子图的输出数据进行合并得到的。
对输出数据的合并操作可以通过引入合并算子实现。
可选地,第二计算图还包括合并算子,该合并算子用于将该n个子图的输出数据进行合并,以得到第一子图的输出数据。
示例性地,步骤S720具体可以通过以下方式实现:对第一计算图中的第一子图进行算子切分得到并行的n个子图,将第一计算图中的第一子图替换为该n个子图,并在该n个子图输入之前引入切分算子,在该n个子图的输出之后引入合并算子,得到的计算图即为第二计算图。
例如,切分算子可以为有重叠的切分(SplitWithOverlap)算子。SplitWithOverlap算子用于对输入数据进行有重叠的切分。例如,合并算子可以为Concat算子。
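作为示意，步骤S720中对计算图的改写可以表示为如下伪代码草图（假设性示例，其中graph.add_node、graph.clone_subgraph等接口均为示意，并非某个现有框架的实际API）。

```python
def rewrite_with_operator_split(graph, first_subgraph, ratio, extend_top, extend_bottom):
    """示意：将第一子图替换为 切分算子 -> n 个并行子图 -> 合并算子 的结构。"""
    split = graph.add_node("SplitWithOverlap", inputs=first_subgraph.inputs,
                           attrs={"split_dim": "H", "number_split": len(ratio),
                                  "ratio": ratio, "extend_top": extend_top,
                                  "extend_bottom": extend_bottom})
    branch_outputs = []
    for i in range(len(ratio)):
        # 复制第一子图中的算子（权重保持不变），各分支处理切分后的一份输入数据
        branch = graph.clone_subgraph(first_subgraph, input_tensor=split.output(i))
        branch_outputs.append(branch.output)
    concat = graph.add_node("Concat", inputs=branch_outputs, attrs={"axis": 1})  # 按 H 轴合并
    graph.replace_subgraph(first_subgraph, new_output=concat.output)
    return graph
```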
下面结合图8对输入数据的切分操作以及输出数据的合并操作进行说明。
图8示出了一种神经网络模型的第一计算图。第一计算图包括三个连续的卷积类算子、最大池化(MaxPoolFusion)算子、重排(Reshape)算子、两个连续的全连接(FullConnection)算子以及分类(Softmax)算子。卷积类算子即图8中的Conv2DFusion算子。Reshape算子用于更改Reshape算子的输入数据的维度信息,Reshape算子不会改变输入数据的数据总量,也不改变输入数据的值。
图8中的数据是按照NHWC的维度数据表示的,例如,1 x 240 x 240 x 128表示该数据的批量为1,高度为240,宽度为240,通道数为128。Conv2DFusion算子的权重(weight)的尺寸为128 x 3 x 3 x 128,偏置(bias)的尺寸为128。其中一个FullConnection算子的权重的尺寸为128 x 35840,偏置的尺寸为128。另一个FullConnection算子的权重的尺寸为10 x 128,偏置的尺寸为128。Reshape算子的参数表示为shape(2),即shape参数的尺寸为2,即2个维度,用于指示输出数据的维度信息。图8中的Reshape算子的输入数据的维度为1 x 40 x 7 x 128,共35840个数据,shape(2)为(1,35840),即输出数据的维度为1 x35840。
第一子图如图8的(a)图的虚线框所示,该第一子图中包括3个连续的卷积算子。
对第一子图进行算子切分，以得到并行的两个子图，并在并行的两个子图的输入之前引入SplitWithOverlap算子，以便对第一子图的输入数据进行切分。在该两个并行的子图的输出之后引入Concat算子，以便对这两个并行的子图的计算结果进行合并，得到的第二计算图如图8的(b)图所示。
示例性地,SplitWithOverlap算子可以包括以下参数:切分维度(split dimention,slip_dim)、切分的数量(number_split)、切分比例(ratio)、重叠参数。
例如,重叠参数具体可以包括:切分后的输入数据向上拓展的大小(extend_top)和切分后的输入数据向下拓展(extend_bottom)的大小。
split_dim用于指示按照哪个轴进行切分,number_split用于指示n的值,ratio用于指示该n个并行的子图的计算量的比例。extend_top用于指示按照ratio切分后的数据向上拓展的大小,extend_bottom用于指示按照ratio切分后的数据向下拓展的大小。
例如,split_dim可以为H轴,即对H轴进行切分。number_split为2,即将第一子图切分为两个并行的子图。Ratio为83:157,即将高度为240的数据切分为两个高度为83和157的两个数据,即两个输出数据的高度。若按照常规的切分方式,则两个切分后的输入数据的高度范围分别为[0,82]和[83,239]。extend_top为(0,6),第一个0表示该两个切分后的输入数据中第一份输入数据,即高度范围为[0,82]的数据,在H轴上向前拓展的大小为0,即高度范围依然为[0,82];第二个6表示第二份输入数据,即高度范围为[83,239]的数据,在H轴上向前拓展的大小为6,即高度范围变为[77,239]。extend_bottom为(0,0),第一个0表示该两个切分后的输入数据中第一份输入数据,即高度范围为[0,82]的数据,在H轴上向后拓展的大小为0,即高度范围依然为[0,82];第二个0表示第二份输入数据,即高度范围为[77,239]的数据,在H轴上向后拓展的大小为0,即高度范围变为[77,239]。这样,切分后的两个输入数据的切分长度分别为83和163,分别占用原输入数据中的高度范围为[0,82]的数据和高度范围为[77,239]的数据。
应理解,以上仅为示例,SplitWithOverlap算子还可以包括更多或更少的参数。例如,SplitWithOverlap算子可以不包括split_dim参数等。
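根据上述参数计算各份切分数据在H轴上的范围，可以用如下示意性代码表示（假设性示例）。

```python
def split_ranges(h, ratio, extend_top, extend_bottom):
    """示意：返回每份切分数据在 H 轴上的范围 [begin, end]（闭区间）。"""
    total = sum(ratio)
    ranges, cursor = [], 0
    for i, r in enumerate(ratio):
        size = h * r // total
        begin = max(cursor - extend_top[i], 0)
        end = min(cursor + size - 1 + extend_bottom[i], h - 1)
        ranges.append((begin, end))
        cursor += size
    return ranges

# 文中示例：H=240，ratio=(83, 157)，extend_top=(0, 6)，extend_bottom=(0, 0)
print(split_ranges(240, (83, 157), (0, 6), (0, 0)))  # [(0, 82), (77, 239)]
```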
图8中的第一子图的输入数据的维度为1 x 240 x 240 x 128,经过上述SplitWithOverlap算子后切分为两个数据,两个数据的维度分别为1 x 163 x 240 x 128和1 x 83 x 240 x 128。两个数据的高度分别为163和83,两者之和大于240,由此可以看出这两个数据在高度维度上是由重叠的部分的。该两个数据分别经由并行的两个子图中的三个卷积类算子进行处理。并行的两个子图处理后得到的数据的维度分别为1 x 160 x 240 x 128与1 x 80 x 240 x  128。通过concat算子将该两个数据进行合并,得到的数据的维度为1 x 240 x 240 x 128。该数据传递给MaxPoolFusion算子。合并后的输出数据和图8的(a)图中第一子图的输出数据是一致的。
第一子图的输入数据的高度范围为[0,239],切分后的两个数据的维度分别为1 x 83 x 240 x 128和1 x 163 x 240 x 128。两个数据的高度分别为83和163,对应的高度范围分别为[0,82]和[77,239],也就是说,77-82范围内的数据是切分后的两个数据均需要的,即切分后的两个数据在高度维度上是有重叠的部分的。
在第一子图中包括卷积算子的情况下,若不进行有重叠的切分,则可能会导致切分后的计算结果合并后无法与原第一子图的计算结果一致。
下面结合图9中的卷积算子的计算过程进行说明。如图9所示,计算结果矩阵中的R1位置的值是根据输入数据中实线框中的数据计算得到的,计算结果矩阵中的R2位置的值是根据输入数据中虚线框中的数据计算得到的,实线框中的数据和虚线框中的数据是有部分重叠的。若按照常规的切分方式对输入数据进行切分,则会导致切分后的计算结果合并后无法与该算子原本的计算结果一致。
卷积算子的输出结果的维度可以根据卷积算子的相关参数确定,卷积算子的相关参数可以包括:卷积算子的输入数据的维度、卷积核的维度、滑动步长以及填充(pad)参数确定。或者说,在卷积算子的输入数据的维度、卷积核的维度、滑动步长以及填充(pad)参数确定的情况下,卷积算子的数据结果的维度是确定的。
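作为补充说明，对于常规卷积，输出高度与上述相关参数之间通常满足如下关系（示意性公式，具体取决于算子的填充方式）：

$$H_{out}=\left\lfloor\frac{H_{in}+2\cdot pad-K_{h}}{stride}\right\rfloor+1$$

其中$K_{h}$为卷积核的高度，$pad$为高度方向的填充大小，$stride$为滑动步长。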
具体地,该n个子图的输入数据的高度的值是根据该n个子图的开销之间的比值确定的。
例如,图8的(b)中的两个并行的子图的开销之间的比值为83:157,如果按照常规的切分方式对输入数据进行切分,得到的两个切分后的数据的高度范围分别为[0,82]和[83,239]。为了保证图8的(b)中的子图的输入数据以及MaxPoolFusion算子的输入数据与图8的(a)中一致,需要对输入数据进行有重叠的切分,根据该比值以及第一子图中的卷积类算子的相关参数即可反推出切分后的两个数据的高度范围分别为[0,82]和[77,239]。
也就是说,SplitWithOverlap算子中的extend_top和extend_bottom是根据该n个并行的子图的开销之间的比值以及卷积类算子的相关参数确定的。
应理解,以上数值仅为示例,不对本申请实施例的方案构成限定。
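有重叠的切分范围可以由输出范围与卷积参数反推得到，下面给出一个简化情形的示意性代码（假设性示例：单个常规卷积、pad为0）。

```python
def input_range_for_output(out_begin, out_end, kernel_h, stride):
    """示意：计算输出行 [out_begin, out_end] 所依赖的输入行范围（pad 为 0）。"""
    return out_begin * stride, out_end * stride + kernel_h - 1

# 若两个分支分别负责输出行 [0, 79] 与 [80, 237]，kernel_h=3，stride=1，
# 则所需输入行范围分别为 [0, 81] 与 [80, 239]，两者在第 80~81 行上重叠
print(input_range_for_output(0, 79, 3, 1))    # (0, 81)
print(input_range_for_output(80, 237, 3, 1))  # (80, 239)
```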
在本申请实施例中,在第一子图中的算子为卷积类算子的情况下,通过有重叠的切分输入数据,能够保证第一子图的输出数据与该n个子图的输出数据合并后的结果是相同的,进而保证第二计算图的计算结果与第一计算图的计算结果相同。
可选地,第二计算图的p个子图包括第一计算图中的并行的q个子图。该q个子图的计算任务分别分配给该m个处理器中的q个处理器执行,q个处理器中的每个处理器执行q个子图的计算任务中的一个,该q个子图的开销与该q个处理器的计算能力满足第二匹配关系,q为大于1的整数。
为了便于描述,将该q个子图构成的组合称为第一子图组合。
该p个子图可以包括一个或多个子图组合。每个子图组合中的多个子图为能够并行执行的子图。
该q个子图的开销与该q个处理器的计算能力满足第二匹配关系,也可以称为该q个 子图的开销与该q个处理器的计算能力匹配。
具体地,根据该m个处理器的计算能力部署q个子图。
一个第一子图进行算子切分得到的并行的n个子图即为一个子图组合。相应地,该n个子图中的每个子图均可以作为该子图组合中的一个子图。
示例性地,该一个或多个并行的子图组合可以是通过并行图搜索得到的。
例如,在第一计算图,即神经网络模型的初始计算图,执行并行图搜索,以得到能够并行执行的子图,进而根据能够并行执行的子图得到一个或多个并行的子图组合。在该情况下,该一个或多个并行的子图组合中不包括第一子图对应的该n个子图。
再如,在第二计算图中执行并行图搜索,以得到能够并行执行的子图,进而得到一个或多个并行的子图组合。在该情况下,该一个或多个并行的子图组合中包括第一子图对应的n个子图。
为了便于描述,下面以第一子图组合不是由该n个子图构成的子图组合为例进行说明。
并行图搜索可以采用现有的有向图的搜索方式实现,本申请实施例对此不做限定。
第一计算图中的能够并行执行的子图的开销可能无法精确匹配各个处理器的计算能力。这样,第一子图组合中的q个子图的开销与q个处理器的计算能力的匹配度可能低于第一子图对应的n个子图的开销与n个处理器的计算能力的匹配度。
在该情况下,可以将多个能够并行执行的子图进行组合以得到多个候选组合,根据各个处理器的计算能力从多个候选组合中选择一个候选组合作为第一子图组合。
示例性地,将多个候选组合中的q个子图的开销之比与q个处理器的计算能力之比最接近的候选组合作为第一子图组合。
例如，在第一计算图中搜索到三个能够并行执行的子图：子图1、子图2和子图3，m为2，两个处理器的计算能力之比为1:2。将子图1和子图2作为一组，子图3单独作为另一组，这两组即可作为一个候选组合，以此类推，这3个子图可以构成多个候选组合，在该多个候选组合中，两组子图的开销之比与1:2最接近的候选组合即为第一子图组合。
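从多个候选组合中选出开销比值与处理器计算能力比值最接近的组合，可参考如下示意性代码（假设性示例，以两个处理器的情形为例）。

```python
from itertools import combinations

def best_combination(costs, cap_ratio):
    """示意：将可并行子图划分为两组，选取两组开销之比最接近 cap_ratio 的划分。"""
    idx = range(len(costs))
    target = cap_ratio[0] / cap_ratio[1]
    best, best_gap = None, float("inf")
    for k in range(1, len(costs)):
        for group_a in combinations(idx, k):
            group_b = tuple(i for i in idx if i not in group_a)
            gap = abs(sum(costs[i] for i in group_a) / sum(costs[i] for i in group_b) - target)
            if gap < best_gap:
                best, best_gap = (group_a, group_b), gap
    return best

# 例如三个子图的开销为 [5, 3, 4]，两个处理器的计算能力之比为 1:2
print(best_combination([5, 3, 4], (1, 2)))  # ((2,), (0, 1))，两组开销之比 4:8 恰为 1:2
```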
在本申请实施例中,根据不同处理器的计算能力将能够并行执行的子图部署于多个处理器上,使得该多个并行的子图的计算任务可以并行执行于计算能力相匹配的处理器上,合理利用硬件资源,有利于提高模型的处理效率。
需要说明的是,该多个处理器能够并行执行该多个子图指的是,该多个处理器对于该多个子图的执行过程是相互独立的,该多个处理器可以同时执行该多个子图,“并行执行”并非对“执行”的起始时间或结束时间进行限定。也就是说,该多个处理器并行执行该多个子图,并不意味着,该多个处理器需要同时开始执行该多个子图,或者,同时结束执行该多个子图。
利用异步调度框架执行该p个子图,即可调度相应的处理器并行执行子图组合中的各个子图的计算任务。
异步调度框架可以采用现有的方案。例如,异步调度框架可以采用Actor模型。
下面以Actor模型为例对步骤S730的一种具体实现方式进行说明。
可选地,步骤S730包括:将该p个子图转换为p个Actor,在p个Actor的执行过程中调度该m个处理器执行该p个子图的计算任务。
该p个Actor的行为是根据该p个子图的计算任务定义的。该p个子图对应的处理器即为为该p个子图的计算任务分配的处理器。
也就是说,使用Actor框架对该p个子图进行图编排,将该p个子图转换为p个Actor,定义Actor的行为,在定义Actor的行为的过程中,定义执行计算任务的处理器。子图和Actor是一一对应的。
基于编排好的Actor列表即可执行模型的运算。
Actor之间通过消息来交互,消息中传递的是子图之间的数据信息。在Actor接收到消息后,会触发执行自身的行为。各个Actor是独立执行的,当多个Actor被同步触发时,即可实现并行执行。
Actor的行为可以包括:前置行为、执行行为和后置行为。
前置行为指的是Actor在执行计算任务之前所需要执行的操作。
执行行为指的是Actor执行计算任务所需要的执行的操作。
后置行为指的是Actor在执行计算任务之后所需要的执行的操作。
在图编排过程中,可以分别定义该p个Actor的前置行为、执行行为和后置行为。
示例性地,Actor的前置行为、执行行为和后置行为可以按照如下方式定义。
前置行为:
(1)在Actor触发后,接收到的数据储存到Actor内。数据承载于Actor接收的消息内。
(2)校验是否收到了所有输入数据。如果检验发现只收到一部分输入数据,那么说明Actor还需要等待其他Actor的输出消息,将会继续等待。如果校验发现Actor的所有输入数据均已收到,继续执行后续的行为,即执行行为。
由于子图和Actor是一一对应的。一个Actor执行所需要的所有输入数据可以根据子图确定。
执行行为:
当Actor的前置行为的检验通过后,解析该Actor对应的子图所部署的处理器,调度该处理器执行Actor对应的子图,即执行该子图中的计算任务。
Actor中可以定义子图执行的硬件以及执行算子等信息。在Actor执行时,即可将定义的行为运行到预设硬件上,即步骤S730所选择的处理器上。
后置行为:
当Actor对应的子图执行完成后,发送输出数据。
若输出数据是其他Actor的输入数据,则会将消息发送至其他Actor,以触发相应Actor。若输出数据是整图的输出,设置整图的一个承诺(promise),或者说将该promise的值(value)设置为完成状态。当整图的所有promise设置完成,即执行结束,得到模型的运算结果。整图指的是模型对应的完整的计算图,例如,第一计算图或第二计算图。
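与上述前置行为、执行行为和后置行为对应，Actor调度的触发逻辑可以用如下示意性代码表示（假设性示例，省略了消息队列、线程与异常处理等细节，仅示意触发关系）。

```python
class SubgraphActor:
    """示意：一个 Actor 对应一个子图，记录其部署的处理器与后继 Actor。"""
    def __init__(self, name, device, num_inputs, successors, run_subgraph):
        self.name, self.device = name, device          # device 如 "CPU" 或 "GPU"
        self.num_inputs = num_inputs
        self.successors = successors                    # 为空表示其输出即整图输出
        self.run_subgraph = run_subgraph                # 在指定处理器上执行子图的回调
        self.inbox = []

    def on_message(self, data, graph_outputs):
        # 前置行为：缓存收到的数据，并校验是否已收到所有输入
        self.inbox.append(data)
        if len(self.inbox) < self.num_inputs:
            return
        # 执行行为：调度该子图所部署的处理器执行子图的计算任务
        output = self.run_subgraph(self.device, list(self.inbox))
        self.inbox = []
        # 后置行为：将输出发送给后继 Actor，或设置整图输出的 promise
        if self.successors:
            for nxt in self.successors:
                nxt.on_message(output, graph_outputs)
        else:
            graph_outputs[self.name] = output           # 相当于将 promise 置为完成状态
```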
该m个处理器可以根据编排后的Actor列表实现计算图调度执行,即执行模型计算。
应理解,以上仅以Actor框架为例对本申请实施例的方案进行说明,本申请实施例的异步调度框架还可以采用其他框架,本申请实施例对此不作限定。
图10示出了本申请实施例提供的一种神经网络模型的处理方法800,方法800可以理解为方法700的一种具体实现方式,具体描述可以参见方法700,为了避免重复,在描 述方法800时适当省略部分描述。
示例性地,方法800可以运行于端侧设备上。在该情况下,方法700中的m个处理器为异构硬件。例如,方法800可以采用图6所示的框架执行。
为了便于理解和描述,以采用异构硬件执行方法800为例对本申请实施例的方案进行说明,不对本申请实施例的方案构成限定。
方法800包括步骤S810至步骤S860。下面对步骤S810至步骤S860进行说明。
方法800可以在图6所示的框架内初始化阶段和运行时阶段执行。具体地,初始化阶段执行步骤S810至步骤S850,运行时阶段执行步骤S860。
步骤S810至步骤S840可以理解为异构图的构图阶段,图11示出了一种异构并行图的构图过程的示意图。
步骤S850至步骤S860可以理解为异构并行执行阶段。图14示出了异构并行执行阶段的示意性流程图。
S810,获取各个异构硬件的计算能力。
可以根据异构硬件自身的特性计算得到异构硬件适合承载的计算量,即计算能力。
步骤S810与方法700中的步骤S710对应,具体描述可以参考步骤S710。
为了便于描述,方法800中仅以两个异构硬件为例进行说明。例如,该两个异构硬件可以为CPU和GPU,计算能力之比约为1:2。
S820,对第一计算图中的第一子图进行算子切分。
具体地,基于各个异构硬件的计算能力对第一计算图中的第一子图中的算子进行切分,以得到第二计算图。
第一计算图即为模型的初始计算图。例如,第一计算图可以为图10中的端侧推理框架识别的模型结构对应的计算图。
示例性地,步骤S820可以通过以下方式实现:在第一计算图中搜索适合进行算子切分的图结构,即第一子图;基于各个异构硬件的计算能力对第一子图中的算子进行切分,构造适合并行执行的图结构(即方法700中的并行的子图),以得到第二计算图。也就是说,第二计算图中包括适合并行的图结构。
适合进行算子切分的图结构需要满足预设条件,例如,开销大于或等于第二阈值,多个算子的执行顺序为串行执行。
如图12所示，原始图结构，即第一计算图，包括6个节点，即6个算子：op1、op2、op3、out-op1、out-op2和out-op3。根据算子的拓扑序搜索得到op1、op2和op3这三个算子是拓扑有序的，必须串行执行，且该三个算子的开销大于第二阈值，满足上述条件，也就是说，该三个算子适合进行算子切分，即将该三个算子构成的图结构作为第一子图。
根据算子的拓扑序搜索得到out-op1、out-op2、out-op3三个算子在第一计算图中没有拓扑先后顺序,即这三个算子在第一计算图中即可实现并行计算,不满足上述条件,也就是说,该三个算子不适合进行算子切分。不适合算子切分的图结构在第二计算图中保持原图结构保持不变。
需要说明的是,图12中的第一子图的确定方式仅为示例,在实际应用中,可以根据需要确定第一子图,例如,根据硬件配置情况确定第一子图,本申请实施例对此不做限定。例如,在硬件的数量较多,但计算能力较小的情况下,也可以将out-op1、out-op2、out-op3 三个算子中计算量较大的一个或多个算子进行切分。
对第一计算图的第一子图进行算子切分,对不适合进行算子切分的部分保持不变,得到第二计算图。
示例性地,异构硬件包括CPU和GPU,如图12所示,将第一子图切分为两个部分(两个并行的子图),即op1.1、op2.1、op3.1和op1.2、op2.2、op3.2。这两部分的开销之和与第一计算图中的op1、op2和op3这三个算子的计算量相同。op1.1、op2.1、op3.1的开销与op1.2、op2.2、op3.2的开销与异构硬件的计算能力匹配。例如,根据步骤S810得到的CPU和GPU的计算能力之比约为1:2,op1.1、op2.1、op3.1的开销之和与op1.2、op2.2、op3.2的开销之和之间的比值为1:2。
为了保证与第一计算图的计算结果完全一致，需要对输入数据进行切分，并对切分后得到的并行的子图的输出数据进行合并。
具体地,在切分后得到的并行的子图之前增加切分算子,在切分后得到的并行的子图之后增加合并算子。这样,切分后得到的第二计算图等价于第一计算图,有利于保证第二计算图和第一计算图的计算结果完全一致。此外,能够保证第一计算图中的第一子图以外的其他部分的输入和输出在第二计算图中保持不变,即不影响其他部分的计算。
在端侧模型中,计算量较大的算子通常为卷积类算子,适合进行算子切分的子图通常会包括卷积类算子。对卷积类的算子的输入数据进行有重叠的切分,能够保证切分后的并行的子图的计算结果的准确性。具体描述可以参见前文中的图8,此处不再赘述。
例如,如图12所示的第一子图中的算子可以为卷积类算子,在该两个并行的子图之前增加SplitWithOverlap算子,在该两个并行的子图之后增加Concat算子。SplitWithOverlap算子用于对输入数据进行有重叠的切分。该算子的具体描述可以参考前文中的图8,此处不再赘述。
步骤S820与方法700中的步骤S720对应,具体描述可以参考步骤S720。
S830,搜索并行图。
示例性地,在第二计算图中搜索适合并行执行的子图的组合。
例如,在如图13所示的第二计算图中搜索得到两个组合,即组合1和组合2。组合1是通过步骤S820得到的,组合2为第一计算图中适合并行执行的子图的组合。
S840,构造异构并行图。
构造异构并行图,即实现异构并行图的构图。异构并行图,也可以称为异构图,指的是用于实现异构并行的子图。或者说,异构并行图用于指示各个子图执行所采用的异构硬件。
构造异构并行图的过程也可以理解为硬件选择与部署的过程,即为各个子图选择硬件以执行对应的子图的计算任务。
对于步骤S830搜索到的组合,根据异构硬件的计算能力将组合中的子图部署于异构硬件上,也就是说,部署方式要尽可能适合异构硬件的计算能力。
下面以组合1和组合2为例进行说明。
组合1是通过步骤S820中对算子进行切分得到的,切分后得到的两个并行的子图能够匹配异构硬件的计算能力。CPU和GPU的计算能力之比约为1:2,op1.1、op2.1、op3.1的开销之和与op1.2、op2.2、op3.2的开销之和之间的比值为1:2。
对于组合1中的两个并行的子图,即subgraph2和subgraph3,根据两个并行的子图的开销和异构硬件的计算能力进行部署即可,或者说,根据算子切分时定义的部署方式进行部署。即将subgraph2部署于CPU上,将subgraph3部署于GPU上。
组合2是通过搜索得到的。组合2中的子图的开销可能无法完全匹配异构硬件的计算能力。在该情况下,可以按照异构硬件的计算能力的匹配度最高的方式部署组合中的子图。
例如,out-op1、out-op2构成的subgraph5的开销与out-op3构成的subgraph6的开销之比最接近2:1,在该情况下,将subgraph5部署于GPU上,将subgraph6构成的子图部署于CPU上。
对于第二计算图中,除了步骤S830搜索到的组合以外的其他子图,即无法并行执行的子图,可以采用现有方案部署,本申请实施例对此不做限定。例如,将图13中的subgraph1和subgraph4部署于CPU上。
步骤S840与方法700中的步骤S730对应,具体描述可以参考步骤S730,此处不再赘述。
S850,执行异构图编排。
利用异步调度框架对异构图进行编排,将子图转换为Actor。
具体地,在异构图的编排过程中,定义Actor的行为。具体描述可以参见步骤S730中的描述,此处不再赘述。
S860,执行异构图调度运行。
也就是说,基于编排好的Actor列表,执行模型的运算。
图14所示的Actor分别与图13中的subgraph对应,即图14中的Actor1、Actor2、Actor3、Actor4、Actor5和Actor6分别与图13中的subgraph1、subgraph2、subgraph3、subgraph4、subgraph5和subgraph6对应。
图14中的Actor1、Actor2、Actor4和Actor6在执行过程中调度CPU执行计算任务上,Actor3和Actor5在执行过程中调度GPU执行计算任务,与图13中的异构图所指示的硬件一致。
图14中的Actor之间通过异步数据(async data)进行交互。相应地,Actor的输出为异步输出(async output)。
步骤S860可以通过以下步骤执行。
S1,Actor1接收到计算图的输入(graph-input)后,触发Actor1的前置行为,即校验是否已经接收到了所有输入。在校验收到所有输入后,触发Actor1的执行行为,在CPU上执行对应的子图,即subgraph1。在执行完成后,触发Actor1的后置行为。在图编排过程中定义了Actor1的后置行为,即预设了actor1的输出将发送给后续的Actor,即图14中的Actor2和Actor3。也就是说,在执行完成后,触发Actor1将输出发送给Actor2和Actor3。
S2,Actor2和Actor3分别接收Actor1发送的数据,独立执行各自的任务。由于Actor2和Actor3执行的动作类似,为了避免重复,下面仅以Actor2为例进行说明。
Actor2收到Actor1发送的数据后,触发Actor2的前置行为,即校验Actor2是否已经接收了所有输入。Actor2的所有输入即为Actor1发送的数据。
在检验收到所有输入后,触发Actor2的执行行为,在CPU上执行对应的子图,即 subgraph2。Actor2和Actor3执行的动作类似,但Actor2执行在CPU上,而Actor3执行在GPU上。这两个Actor分别执行于不同的硬件上,相互独立互不影响,即实现了异构并行。
在执行完成后,触发Actor2的后置行为,即将输出发送给后续的Actor。具体地,Actor2和Actor3分别将各自的输出发送给Actor4。
S3,Actor4接收到Actor2或Actor3的输出后,触发前置行为。
Actor2和Actor3是并发执行,运行时间可能是不同的,即Actor2和Actor3不一定是同时完成运算。在其中一个,例如,Actor2,运行完成后,会将输出发送至Actor4,Actor4在接收到Actor2的输出后,触发Actor4的前置行为,即校验是否已经接收到了所有输入。Actor4的所有输入包括Actor2发送的数据和Actor3发送的数据。Actor4检验发现没有接收到所有输入,Actor4会继续等待,直至接收到Actor3发送的数据,此时再次触发Actor4的前置行为。Actor4检验发现已经接收到了所有的输入,触发Actor4的执行行为,即在CPU上运行对应的子图,即subgraph4。在执行完成后,触发Actor4的后置行为,即将输出发送给后续的Actor,即图14中的Actor5和Actor6。
S4,Actor5和Actor6分别接收到Actor4发送的数据,独立执行各自的任务。Actor5和Actor6的执行过程与Actor2和Actor3类似。为了避免重复,下面仅以Actor5为例进行说明。
Actor5收到Actor4发送的数据后,触发Actor5的前置行为,即校验Actor5是否已经接收了所有输入。Actor5的所有输入即为Actor4发送的数据。
在检验收到所有输入后,触发Actor5的执行行为,在GPU上执行对应的子图,即subgraph5。Actor5和Actor6执行的动作类似,但Actor5执行在GPU上,而Actor6执行在CPU上。
在执行完成后,触发Actor5的后置行为。Actor5和Actor6的输出是整图的输出,如前所述,即在Actor5和Actor6的后置行为中定义了Actor5和Actor6的输出为整图的输出。Actor5在执行完成后,将Actor5的输出作为整图的输出,并设置整图输出(output)的promise的值(value),即将Actor5对应的promise的值设置为已完成状态。在Actor5和Actor6均设置了整图输出的promise的值后,即得到整图的所有输出,即完成推理过程,得到推理结果。
采用本申请实施例的方案能够感知异构硬件的能力进行构图操作,得到适合异构并行的异构图,在异构硬件上进行模型推理,能够提升端侧推理的性能,部分模型,例如,MobilenetV1,性能提升了5%-10%。
下面结合图15至图16对本申请实施例的装置进行说明。应理解,下面描述的装置能够执行前述本申请实施例的方法,为了避免不必要的重复,下面在介绍本申请实施例的装置时适当省略重复的描述。
图15是本申请实施例的神经网络模型的处理装置的示意性框图。图15所示的神经网络模型的处理装置3000包括获取单元3010和处理单元3020。
获取单元3010和处理单元3020可以用于执行本申请实施例的神经网络模型的处理方法,具体地,可以用于执行方法700或方法800。
获取单元3010用于获取m个处理器的计算能力,m为大于1的整数。
处理单元3020用于对神经网络模型对应的第一计算图中的第一子图进行算子切分,以得到第二计算图,第二计算图包括第一子图对应的并行的n个子图,n个子图的开销与m个处理器中的n个处理器的计算能力满足第一匹配关系,1<n≤m,n为整数;将第二计算图的p个子图的计算任务分配给m个处理器执行,其中,p个子图包括n个子图,n个子图的计算任务分别分配给n个处理器执行,n个处理器中的每个处理器执行n个子图的计算任务中的一个,p为大于或等于n的整数。
可选地,第一子图的开销大于第一计算图中的至少一半的子图的开销。
可选地,在第一子图包括多个算子的情况下,第一子图中的所有算子的执行顺序为串行执行。
可选地,n个子图的输入数据是将第一子图的输入数据进行数据切分得到的,数据切分的维度是根据第一子图的输入数据的数据排布情况确定的。
可选地,在第一子图中的算子为卷积类算子,且卷积类算子的滑动步长小于卷积核的高度的情况下,n个子图中至少两个子图的输入数据中的部分数据相同。
可选地,n个子图的开销与m个处理器中的n个处理器的计算能力满足第一匹配关系,包括:n个子图的开销的比值与n个处理器的计算能力的比值之间的差值小于或等于第一阈值。
可选地,p个子图包括第一计算图中的并行的q个子图,q个子图的计算任务分别分配给m个处理器中的q个处理器执行,q个子图的开销与q个处理器的计算能力满足第二匹配关系,q为大于1的整数。
可选地,处理单元3020具体用于:将p个子图分别转换为p个操作者Actor;在p个Actor的执行过程中调度m个处理器执行该p个子图的计算任务。
需要说明的是,上述处理装置3000以功能单元的形式体现。这里的术语“单元”可以通过软件和/或硬件形式实现,对此不作具体限定。
例如,“单元”可以是实现上述功能的软件程序、硬件电路或二者结合。所述硬件电路可能包括应用特有集成电路(application specific integrated circuit,ASIC)、电子电路、用于执行一个或多个软件或固件程序的处理器(例如共享处理器、专有处理器或组处理器等)和存储器、合并逻辑电路和/或其它支持所描述的功能的合适组件。
因此,在本申请的实施例中描述的各示例的单元,能够以电子硬件、或者计算机软件和电子硬件的结合来实现。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。
图16是本申请实施例提供的神经网络模型的处理装置的硬件结构示意图。图16所示的神经网络模型的处理装置5000(该装置5000具体可以是一种计算机设备)包括存储器5001、处理器5002、通信接口5003以及总线5004。其中,存储器5001、处理器5002、通信接口5003通过总线5004实现彼此之间的通信连接。
存储器5001可以是只读存储器(read only memory,ROM),静态存储设备,动态存储设备或者随机存取存储器(random access memory,RAM)。存储器5001可以存储程序,当存储器5001中存储的程序被处理器5002执行时,处理器5002用于执行本申请实施例的神经网络模型的处理方法的各个步骤。例如,处理器5002可以执行上文中图7所示的 方法700或图10所示的方法800。
处理器5002可以采用通用的中央处理器(central processing unit,CPU),微处理器,应用专用集成电路(application specific integrated circuit,ASIC),图形处理器(graphics processing unit,GPU)或者一个或多个集成电路,用于执行相关程序,以实现本申请方法实施例的神经网络模型的处理方法。
处理器5002还可以是一种集成电路芯片,具有信号的处理能力。在实现过程中,本申请的神经网络模型的处理方法的各个步骤可以通过处理器5002中的硬件的集成逻辑电路或者软件形式的指令完成。
上述处理器5002还可以是通用处理器、数字信号处理器(digital signal processing,DSP)、专用集成电路(ASIC)、现成可编程门阵列(field programmable gate array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。可以实现或者执行本申请实施例中的公开的各方法、步骤及逻辑框图。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。结合本申请实施例所公开的方法的步骤可以直接体现为硬件译码处理器执行完成,或者用译码处理器中的硬件及软件模块组合执行完成。软件模块可以位于随机存储器,闪存、只读存储器,可编程只读存储器或者电可擦写可编程存储器、寄存器等本领域成熟的存储介质中。该存储介质位于存储器5001,处理器5002读取存储器5001中的信息,结合其硬件完成图15所示的处理装置中包括的单元所需执行的功能,或者,执行本申请方法实施例的图7或图10所示的神经网络模型的处理方法。
通信接口5003使用例如但不限于收发器一类的收发装置,来实现装置5000与其他设备或通信网络之间的通信。例如,可以通过通信接口5003获取神经网络模型对应的第一计算图。
总线5004可包括在装置5000各个部件(例如,存储器5001、处理器5002、通信接口5003)之间传送信息的通路。
本申请实施例还提供一种计算机可读介质,该计算机可读介质存储用于设备执行的程序代码,该程序代码包括用于执行本申请实施例中的神经网络模型的处理方法。
本申请实施例还提供一种包含指令的计算机程序产品,当该计算机程序产品在计算机上运行时,使得计算机执行本申请实施例中的神经网络模型的处理方法。
本申请实施例还提供一种芯片,该芯片包括处理器与数据接口,该处理器通过该数据接口读取存储器上存储的指令,执行本申请实施例中的神经网络模型的处理方法。
可选地,作为一种实现方式,该芯片还可以包括存储器,该存储器中存储有指令,该处理器用于执行该存储器上存储的指令,当该指令被执行时,该处理器用于执行本申请实施例中的神经网络模型的处理方法。
应理解,本申请实施例中的处理器可以为中央处理单元(central processing unit,CPU),该处理器还可以是其他通用处理器、数字信号处理器(digital signal processor,DSP)、专用集成电路(application specific integrated circuit,ASIC)、现成可编程门阵列(field programmable gate array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。
还应理解,本申请实施例中的存储器可以是易失性存储器或非易失性存储器,或可包 括易失性和非易失性存储器两者。其中,非易失性存储器可以是只读存储器(read-only memory,ROM)、可编程只读存储器(programmable ROM,PROM)、可擦除可编程只读存储器(erasable PROM,EPROM)、电可擦除可编程只读存储器(electrically EPROM,EEPROM)或闪存。易失性存储器可以是随机存取存储器(random access memory,RAM),其用作外部高速缓存。通过示例性但不是限制性说明,许多形式的随机存取存储器(random access memory,RAM)可用,例如静态随机存取存储器(static RAM,SRAM)、动态随机存取存储器(DRAM)、同步动态随机存取存储器(synchronous DRAM,SDRAM)、双倍数据速率同步动态随机存取存储器(double data rate SDRAM,DDR SDRAM)、增强型同步动态随机存取存储器(enhanced SDRAM,ESDRAM)、同步连接动态随机存取存储器(synchlink DRAM,SLDRAM)和直接内存总线随机存取存储器(direct rambus RAM,DR RAM)。
上述实施例,可以全部或部分地通过软件、硬件、固件或其他任意组合来实现。当使用软件实现时,上述实施例可以全部或部分地以计算机程序产品的形式实现。所述计算机程序产品包括一个或多个计算机指令或计算机程序。在计算机上加载或执行所述计算机指令或计算机程序时,全部或部分地产生按照本申请实施例所述的流程或功能。所述计算机可以为通用计算机、专用计算机、计算机网络、或者其他可编程装置。所述计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一个计算机可读存储介质传输,例如,所述计算机指令可以从一个网站站点、计算机、服务器或数据中心通过有线(例如红外、无线、微波等)方式向另一个网站站点、计算机、服务器或数据中心进行传输。所述计算机可读存储介质可以是计算机能够存取的任何可用介质或者是包含一个或多个可用介质集合的服务器、数据中心等数据存储设备。所述可用介质可以是磁性介质(例如,软盘、硬盘、磁带)、光介质(例如,DVD)、或者半导体介质。半导体介质可以是固态硬盘。
应理解,本文中术语“和/或”,仅仅是一种描述关联对象的关联关系,表示可以存在三种关系,例如,A和/或B,可以表示:单独存在A,同时存在A和B,单独存在B这三种情况,其中A,B可以是单数或者复数。另外,本文中字符“/”,一般表示前后关联对象是一种“或”的关系,但也可能表示的是一种“和/或”的关系,具体可参考前后文进行理解。
本申请中,“至少一个”是指一个或者多个,“多个”是指两个或两个以上。“以下至少一项(个)”或其类似表达,是指的这些项中的任意组合,包括单项(个)或复数项(个)的任意组合。例如,a,b,或c中的至少一项(个),可以表示:a,b,c,a-b,a-c,b-c,或a-b-c,其中a,b,c可以是单个,也可以是多个。
应理解,在本申请的各种实施例中,上述各过程的序号的大小并不意味着执行顺序的先后,各过程的执行顺序应以其功能和内在逻辑确定,而不应对本申请实施例的实施过程构成任何限定。
本领域普通技术人员可以意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,能够以电子硬件、或者计算机软件和电子硬件的结合来实现。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本 申请的范围。
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的系统、装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
在本申请所提供的几个实施例中,应该理解到,所揭露的系统、装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。
所述功能如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(read-only memory,ROM)、随机存取存储器(random access memory,RAM)、磁碟或者光盘等各种可以存储程序代码的介质。
以上所述,仅为本申请的具体实施方式,但本申请的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本申请揭露的技术范围内,可轻易想到变化或替换,都应涵盖在本申请的保护范围之内。因此,本申请的保护范围应以所述权利要求的保护范围为准。

Claims (17)

  1. 一种神经网络模型的处理方法,其特征在于,包括:
    获取m个处理器的计算能力,m为大于1的整数;
    对所述神经网络模型对应的第一计算图中的第一子图进行算子切分,以得到第二计算图,所述第二计算图包括所述第一子图对应的并行的n个子图,所述n个子图的开销与所述m个处理器中的n个处理器的计算能力满足第一匹配关系,1<n≤m,n为整数;
    将所述第二计算图的p个子图的计算任务分配给所述m个处理器执行,其中,所述p个子图包括所述n个子图,所述n个子图的计算任务分别分配给所述n个处理器执行,所述n个处理器中的每个处理器执行所述n个子图的计算任务中的一个,p为大于或等于n的整数。
  2. 根据权利要求1所述的方法,其特征在于,所述第一子图的开销大于所述第一计算图中的至少一半的子图的开销。
  3. 根据权利要求1或2所述的方法,其特征在于,在所述第一子图包括多个算子的情况下,所述第一子图中的所有算子的执行顺序为串行执行。
  4. 根据权利要求1至3中任一项所述的方法,其特征在于,所述n个子图的输入数据是将所述第一子图的输入数据进行数据切分得到的,在所述第一子图中的算子为卷积类算子,且所述卷积类算子的滑动步长小于卷积核的高度的情况下,所述n个子图中至少两个子图的输入数据中的部分数据相同。
  5. 根据权利要求1至4中任一项所述的方法,其特征在于,所述n个子图的开销与所述m个处理器中的n个处理器的计算能力满足第一匹配关系,包括:
    所述n个子图的开销的比值与所述n个处理器的计算能力的比值之间的差值小于或等于第一阈值。
  6. 根据权利要求1至5中任一项所述的方法,其特征在于,所述p个子图包括所述第一计算图中的并行的q个子图,所述q个子图的计算任务分配给所述m个处理器中的q个处理器执行,所述q个处理器中的每个处理器执行所述q个子图的计算任务中的一个,所述q个子图的开销与所述q个处理器的计算能力满足第二匹配关系,q为大于1的整数。
  7. 根据权利要求1至6中任一项所述的方法,其特征在于,所述将所述第二计算图的p个子图的计算任务分配给所述m个处理器执行,包括:
    将所述p个子图分别转换为p个操作者Actor;
    在所述p个Actor的执行过程中调度所述m个处理器执行所述p个子图的计算任务。
  8. 一种神经网络模型的处理装置,其特征在于,包括:
    获取单元,用于获取m个处理器的计算能力,m为大于1的整数;
    处理单元,用于对所述神经网络模型对应的第一计算图中的第一子图进行算子切分,以得到第二计算图,所述第二计算图包括所述第一子图对应的并行的n个子图,所述n个子图的开销与所述m个处理器中的n个处理器的计算能力满足第一匹配关系,1<n≤m,n为整数;
    将所述第二计算图的p个子图的计算任务分配给所述m个处理器执行，其中，所述p个子图包括所述n个子图，所述n个子图的计算任务分配给所述n个处理器执行，所述n个处理器中的每个处理器执行所述n个子图的计算任务中的一个，p为大于或等于n的整数。
  9. 根据权利要求8所述的装置,其特征在于,所述第一子图的开销大于所述第一计算图中的至少一半的子图的开销。
  10. 根据权利要求8或9所述的装置,其特征在于,在所述第一子图包括多个算子的情况下,所述第一子图中的所有算子的执行顺序为串行执行。
  11. 根据权利要求8至10中任一项所述的装置,其特征在于,所述n个子图的输入数据是将所述第一子图的输入数据进行数据切分得到的,在所述第一子图中的算子为卷积类算子,且所述卷积类算子的滑动步长小于卷积核的高度的情况下,所述n个子图中至少两个子图的输入数据中的部分数据相同。
  12. 根据权利要求8至11中任一项所述的装置,其特征在于,所述n个子图的开销与所述m个处理器中的n个处理器的计算能力满足第一匹配关系,包括:
    所述n个子图的开销的比值与所述n个处理器的计算能力的比值之间的差值小于或等于第一阈值。
  13. 根据权利要求8至12中任一项所述的装置,其特征在于,所述p个子图包括所述第一计算图中的并行的q个子图,所述q个子图的计算任务分配给所述m个处理器中的q个处理器执行,所述q个处理器中的每个处理器执行所述q个子图的计算任务中的一个,q为大于1的整数。
  14. 根据权利要求8至13中任一项所述的装置,其特征在于,所述处理单元具体用于:
    将所述p个子图分别转换为p个操作者Actor;
    在所述p个Actor的执行过程中调度所述m个处理器执行所述p个子图的计算任务。
  15. 一种神经网络模型的处理装置,其特征在于,包括:包括处理器和存储器,所述存储器用于存储程序指令,所述处理器用于调用所述程序指令来执行如权利要求1至7中任一项所述的方法。
  16. 一种计算机可读存储介质,其特征在于,所述计算机可读介质存储用于设备执行的程序代码,该程序代码包括用于执行如权利要求1至7中任一项所述的方法。
  17. 一种包含指令的计算机程序产品,其特征在于,当所述计算机程序产品在计算机上运行时,使得所述计算机执行如权利要求1至7中任一项所述的方法。
PCT/CN2022/133524 2021-11-24 2022-11-22 神经网络模型的处理方法及装置 WO2023093724A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111405768.3A CN116187391A (zh) 2021-11-24 2021-11-24 神经网络模型的处理方法及装置
CN202111405768.3 2021-11-24

Publications (1)

Publication Number Publication Date
WO2023093724A1 true WO2023093724A1 (zh) 2023-06-01

Family

ID=86436937

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/133524 WO2023093724A1 (zh) 2021-11-24 2022-11-22 神经网络模型的处理方法及装置

Country Status (2)

Country Link
CN (1) CN116187391A (zh)
WO (1) WO2023093724A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116954721A (zh) * 2023-09-20 2023-10-27 天津南大通用数据技术股份有限公司 一种执行器多模态算子异步非阻塞分裂方法
CN117155791A (zh) * 2023-10-31 2023-12-01 浪潮电子信息产业股份有限公司 基于集群拓扑结构的模型部署方法、系统、设备及介质

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116629308A (zh) * 2023-07-24 2023-08-22 科大讯飞股份有限公司 一种神经网络模型的推理方法、装置、设备及存储介质
CN117056068B (zh) * 2023-08-08 2024-03-19 杭州观远数据有限公司 ETL中JobEngine任务拆分方法
CN117576125B (zh) * 2024-01-16 2024-04-16 芯瞳半导体技术(山东)有限公司 一种神经网络计算图的分割方法、装置、设备及存储介质

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109918199A (zh) * 2019-02-28 2019-06-21 中国科学技术大学苏州研究院 基于gpu的分布式图处理系统
CN111738434A (zh) * 2020-06-03 2020-10-02 中国科学院计算技术研究所 在异构处理单元上执行深度神经网络的方法
CN112598112A (zh) * 2020-12-04 2021-04-02 深圳大学 一种基于图神经网络的资源调度方法
CN113159285A (zh) * 2021-04-14 2021-07-23 广州放芯科技有限公司 神经网络加速器

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109918199A (zh) * 2019-02-28 2019-06-21 中国科学技术大学苏州研究院 基于gpu的分布式图处理系统
CN111738434A (zh) * 2020-06-03 2020-10-02 中国科学院计算技术研究所 在异构处理单元上执行深度神经网络的方法
CN112598112A (zh) * 2020-12-04 2021-04-02 深圳大学 一种基于图神经网络的资源调度方法
CN113159285A (zh) * 2021-04-14 2021-07-23 广州放芯科技有限公司 神经网络加速器

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116954721A (zh) * 2023-09-20 2023-10-27 天津南大通用数据技术股份有限公司 一种执行器多模态算子异步非阻塞分裂方法
CN116954721B (zh) * 2023-09-20 2023-12-15 天津南大通用数据技术股份有限公司 一种执行器多模态算子异步非阻塞分裂方法
CN117155791A (zh) * 2023-10-31 2023-12-01 浪潮电子信息产业股份有限公司 基于集群拓扑结构的模型部署方法、系统、设备及介质
CN117155791B (zh) * 2023-10-31 2024-02-13 浪潮电子信息产业股份有限公司 基于集群拓扑结构的模型部署方法、系统、设备及介质

Also Published As

Publication number Publication date
CN116187391A (zh) 2023-05-30

Similar Documents

Publication Publication Date Title
WO2020221200A1 (zh) 神经网络的构建方法、图像处理方法及装置
WO2023093724A1 (zh) 神经网络模型的处理方法及装置
WO2021120719A1 (zh) 神经网络模型更新方法、图像处理方法及装置
WO2021238366A1 (zh) 一种神经网络构建方法以及装置
WO2021233342A1 (zh) 一种神经网络构建方法以及系统
WO2021190127A1 (zh) 一种数据处理方法和数据处理设备
WO2021057056A1 (zh) 神经网络架构搜索方法、图像处理方法、装置和存储介质
WO2022052601A1 (zh) 神经网络模型的训练方法、图像处理方法及装置
CN112990211B (zh) 一种神经网络的训练方法、图像处理方法以及装置
WO2021218517A1 (zh) 获取神经网络模型的方法、图像处理方法及装置
WO2021022521A1 (zh) 数据处理的方法、训练神经网络模型的方法及设备
WO2022001805A1 (zh) 一种神经网络蒸馏方法及装置
CN112418392A (zh) 一种神经网络构建方法以及装置
WO2021008206A1 (zh) 神经网络结构的搜索方法、图像处理方法和装置
CN110222718B (zh) 图像处理的方法及装置
WO2021164750A1 (zh) 一种卷积层量化方法及其装置
WO2022007867A1 (zh) 神经网络的构建方法和装置
EP4235506A1 (en) Neural network model training method, image processing method, and apparatus
WO2021051987A1 (zh) 神经网络模型训练的方法和装置
WO2022267036A1 (zh) 神经网络模型训练方法和装置、数据处理方法和装置
CN113505883A (zh) 一种神经网络训练方法以及装置
WO2024067884A1 (zh) 一种数据处理方法及相关装置
WO2022156475A1 (zh) 神经网络模型的训练方法、数据处理方法及装置
WO2022063076A1 (zh) 对抗样本的识别方法及装置
WO2021136058A1 (zh) 一种处理视频的方法及装置

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22897807

Country of ref document: EP

Kind code of ref document: A1