WO2021134519A1 - Apparatus and method for implementing data synchronization in neural network inference - Google Patents

Apparatus and method for implementing data synchronization in neural network inference

Info

Publication number
WO2021134519A1
Authority
WO
WIPO (PCT)
Prior art keywords: neural network, blocks, inference, memory, data
Application number: PCT/CN2019/130638
Other languages: English (en), French (fr)
Inventor
王岩岩
冯源
吴祖光
周鹏
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by 华为技术有限公司 filed Critical 华为技术有限公司
Priority to CN201980051147.4A priority Critical patent/CN113169989A/zh
Priority to PCT/CN2019/130638 priority patent/WO2021134519A1/zh
Priority to EP19958452.5A priority patent/EP4075343A4/en
Publication of WO2021134519A1 publication Critical patent/WO2021134519A1/zh

Classifications

    • G06N3/045 — Combinations of networks
    • G06N3/0464 — Convolutional networks [CNN, ConvNet]
    • G06N3/063 — Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G06N3/084 — Backpropagation, e.g. using gradient descent
    • H04L67/1095 — Replication or mirroring of data, e.g. scheduling or transport for data synchronisation between network nodes
    • H04L67/568 — Storing data temporarily at an intermediate stage, e.g. caching

Definitions

  • the embodiments of the present invention relate to neural network inference technology, and in particular to a device and method for realizing data synchronization in neural network inference.
  • Artificial Intelligence is a theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
  • artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a similar way to human intelligence.
  • Artificial intelligence is to study the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
  • Research in the field of artificial intelligence includes robotics, natural language processing, computer vision, decision-making and reasoning, human-computer interaction, recommendation and search, and basic AI theories.
  • Computer vision is an inseparable part of various intelligent/autonomous systems in application fields such as manufacturing, inspection, document analysis, medical diagnosis, and the military. It is the study of how to use cameras/video cameras and computers to obtain the data and information about a photographed subject that we need. Vividly speaking, it means installing eyes (a camera/video camera) and a brain (algorithms) on a computer so that it can identify, track, and measure targets in place of the human eye, enabling the computer to perceive its environment. Because perception can be seen as extracting information from sensory signals, computer vision can also be regarded as the science of how to make artificial systems "perceive" from images or multi-dimensional data.
  • in general, computer vision uses various imaging systems in place of the visual organs to obtain input information, and then uses the computer in place of the brain to process and interpret that input information.
  • the ultimate research goal of computer vision is to enable computers to observe and understand the world through vision as humans do, and to adapt to the environment autonomously.
  • as an important method in computer vision, neural networks are widely used in fields such as object classification and detection. Image data is input and, through a trained neural network, the required semantic information, such as the category of an object, is computed; this process is neural network inference (Neural Network Inference).
  • Figure 1 is an architecture diagram for performing neural network inference on image data.
  • the architecture includes a camera, an image signal processor (ISP), and a neural network processing unit (NPU); specifically, inference by the NPU on the raw data collected by the camera includes the following steps (an illustrative sketch of this pipeline follows the steps):
  • Step 1: The camera collects raw data (Raw Data). Because the raw data depends on the camera's filter, its format is friendly to neither the naked eye nor the neural network, and it contains considerable noise and unnecessary information. Therefore, the camera sends the raw data to the ISP for image processing.
  • Step 2: After receiving the raw data, the ISP performs related processing on it, including denoising, color gamut conversion, sharpening, and compression, to convert the raw data into image data.
  • Step 3: The NPU reads the image data processed by the ISP and loads the trained neural network model to perform neural network inference, thereby obtaining the inference result.
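  • As an illustrative sketch of this camera → ISP → NPU pipeline (the function names and the simplified ISP step are assumptions for illustration, not part of the original disclosure):

```python
import numpy as np

def camera_capture():
    """Stand-in for the camera: returns raw sensor data."""
    return np.random.randint(0, 1024, size=(720, 1280), dtype=np.uint16)

def isp_process(raw):
    """Stand-in for the ISP: denoising, color-gamut conversion, sharpening and
    compression are reduced here to a simple normalization to 8-bit data."""
    img = (raw.astype(np.float32) / 1023.0 * 255.0).astype(np.uint8)
    return np.stack([img, img, img], axis=-1)   # fake 3-channel image

def npu_infer(image, model):
    """Stand-in for the NPU: run the trained model on the processed image."""
    return model(image)

model = lambda x: np.array([x.mean()])          # placeholder "trained model"
result = npu_infer(isp_process(camera_capture()), model)
```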
  • in the process of executing neural network inference, the NPU needs to synchronize multiple pieces of intermediate data generated during inference and then continue the subsequent inference based on the synchronized data.
  • so-called data synchronization means that the next operation is not performed on each piece of generated data individually; instead, the process waits until multiple pieces of data have all arrived and then performs the next operation on them as a whole. Because the choice of data synchronization points affects the data relocation overhead, when the NPU performs neural network inference, the choice of when to synchronize data affects the data relocation overhead and, in turn, the NPU's performance in executing neural network inference.
  • the present application provides a device and method for realizing data synchronization in neural network reasoning, which can reduce the overhead of data migration in memory.
  • an embodiment of the present application provides a device for realizing data synchronization in neural network inference, including: a memory for storing a first feature map; and a neural network processor NPU, configured to: obtain the first feature map from the memory, where the first feature map contains M blocks and M is a positive integer; perform, in a first asynchronous manner, at least two layers of inference calculations of the neural network model on each of the M blocks to obtain M inference results, where the first asynchronous manner means that the intermediate result obtained after performing the inference calculation of one layer of the neural network model on a block is not data-synchronized, and the inference calculation of the next layer continues to be performed on that intermediate result; and perform data synchronization on the M inference results to obtain the synchronized data.
  • the first feature map may be an input image or other initial input, or may be a feature map generated during a neural network inference process, and the block is a part of the first feature map.
  • the NPU performs data synchronization only after performing at least two layers of inference calculations of the neural network model, which reduces the number of data synchronizations during neural network inference and thus incurs less data relocation overhead; a minimal sketch of this block-wise asynchronous execution follows.
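  • A minimal sketch of this idea (assuming a generic per-layer callable interface; NPU instruction-level details are not described in the original text): each block is pushed through several layers without intermediate synchronization, and only the per-block results are synchronized at the end.

```python
import numpy as np

def run_stage_async(blocks, layers):
    """Run every layer of one stage on each block independently (no data
    synchronization between layers), then synchronize the M block results."""
    results = []
    for block in blocks:                  # each block is carried through all layers
        x = block
        for layer in layers:              # intermediate results are NOT synchronized
            x = layer(x)
        results.append(x)
    # data synchronization: assemble the M inference results into one feature map
    # (block overlap/halo handling is omitted in this toy example)
    return np.concatenate(results, axis=0)

# toy usage: M = 4 blocks, a "stage" made of two element-wise layers
blocks = np.array_split(np.random.rand(16, 8), 4, axis=0)
layers = [lambda x: np.maximum(x, 0.0), lambda x: x * 2.0]
synced = run_stage_async(blocks, layers)      # shape (16, 8)
```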
  • the first feature map is an input image, and the device further includes: a digital image signal processor ISP, which is used to perform image processing on the original image collected by the camera and store the image processing result in the memory as the input image;
  • correspondingly, when the NPU obtains the first feature map from the memory, it is specifically used to: obtain the input image from the memory.
  • because the original image contains considerable noise and unnecessary information, the ISP is required to perform image processing on the original image to obtain an input image suitable for neural network inference.
  • when the ISP performs image processing on the original image collected by the camera and stores the image processing result in the memory as the input image, it is specifically used to: divide the original image into M original image blocks; sequentially perform image processing on the M original image blocks to obtain M image blocks; and asynchronously store the M image blocks in the memory as the M blocks, where the M image blocks constitute the image processing result.
  • in this case, each time the ISP generates an image block it stores that image block in the memory; that is, the ISP does not perform a data synchronization step on the image blocks (asynchronous processing), which can reduce the data relocation overhead introduced by data synchronization and improve data processing efficiency.
  • alternatively, when the ISP performs image processing on the original image collected by the camera and stores the image processing result in the memory as the input image, it is specifically used to: perform image processing on the original image block by block to obtain multiple image blocks; and perform data synchronization on the multiple image blocks to obtain the image processing result, and store the image processing result in the memory.
  • in this case, the ISP performs data synchronization, that is, synchronous processing, on the multiple generated image blocks.
  • when the ISP performs the aforementioned synchronous processing, the NPU is further used to: divide the acquired first feature map into the M blocks.
  • the NPU includes multiple processor cores that share a cache, and the NPU is further used to store the synchronized data in one of the memory and the cache.
  • when storing the synchronized data in one of the memory and the cache, the NPU is specifically used to: compare the size of the synchronized data with the size of the cache; when the size of the synchronized data is greater than the size of the cache, store the synchronized data in the memory; and when the size of the synchronized data is not greater than the size of the cache, store the synchronized data in the cache.
  • in this way, the synchronized data generated by the NPU may be stored in the L2 cache instead of the memory, which avoids storing all the synchronized data generated by the NPU in the memory and reduces the overhead of data relocation in the memory (see the placement sketch below).
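  • A minimal sketch of this placement decision (the 2 MiB cache size and the dictionary-style storage interface are assumptions for illustration):

```python
import numpy as np

L2_CACHE_BYTES = 2 * 1024 * 1024              # assumed 2 MiB shared L2 cache

def place_synchronized_data(data, l2_cache, memory):
    """Keep the synchronized stage output in the shared L2 cache when it fits,
    otherwise fall back to the external memory (e.g. DDR SDRAM / HBM)."""
    if data.nbytes <= L2_CACHE_BYTES:
        l2_cache["stage_output"] = data       # stays on-chip, no relocation to memory
        return "l2_cache"
    memory["stage_output"] = data             # off-chip relocation
    return "memory"

l2_cache, memory = {}, {}
print(place_synchronized_data(np.zeros((256, 256), dtype=np.float16), l2_cache, memory))
```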
  • the NPU is further used to: fetch the synchronized data from the memory or the cache; and use the synchronized data as a second feature map on which to perform the inference calculation of one or more layers of the neural network model, where the one or more layers are layers subsequent to the at least two layers in the neural network model.
  • the NPU is further used to perform, in the asynchronous manner, the inference calculation of one or more layers of the neural network model on each of the multiple blocks contained in the second feature map.
  • the value of M is different from the number of multiple blocks included in the second feature map.
  • an embodiment of the present application provides an apparatus for realizing data synchronization in neural network inference, including a calculation engine that includes at least one processor core and is configured to: obtain a first feature map, where the first feature map includes M blocks and M is a positive integer; perform, in an asynchronous manner, at least two layers of inference calculations of the neural network model on each of the M blocks to obtain M inference results, where the asynchronous manner means that the intermediate result obtained after performing the inference calculation of one layer on a block is not data-synchronized, and the inference calculation of the next layer continues to be performed on that intermediate result; and perform data synchronization on the M inference results to obtain the synchronized data.
  • the device further includes a cache, the at least one processor core shares the cache, and the calculation engine is further configured to store the synchronized data in the cache.
  • the calculation engine is specifically configured to store the synchronized data in the cache when the size of the synchronized data is not greater than the size of the cache.
  • the NPU is further used to: fetch the synchronized data from the cache; and use the synchronized data as a second feature map on which to perform the inference calculation of one or more layers of the neural network model, where the one or more layers are layers subsequent to the at least two layers in the neural network model.
  • an embodiment of the present application provides a method for realizing data synchronization in neural network inference.
  • the method includes: a neural network processor NPU obtains a first feature map, where the first feature map includes M blocks and M is a positive integer; the NPU performs, in an asynchronous manner, at least two layers of inference calculations of the neural network model on each of the M blocks to obtain M inference results, where the asynchronous manner means that the intermediate result obtained after performing the inference calculation of one layer of the neural network model on a block is not data-synchronized, and the inference calculation of the next layer continues to be performed on that intermediate result; and the NPU performs data synchronization on the M inference results to obtain synchronized data.
  • the first feature map is an input image.
  • the method further includes: a digital signal processor ISP performs image processing on the original image collected by the camera, and The image processing result is stored in the memory as the input image, and the memory is the external memory of the NPU;
  • acquiring the first feature map by the NPU includes: acquiring the input image from the memory by the NPU.
  • the ISP performing image processing on the original image collected by the camera and storing the image processing result in the memory as the input image includes: the ISP divides the original image into M original image blocks; the ISP sequentially performs image processing on the M original image blocks to obtain M image blocks; and the ISP asynchronously stores the M image blocks in the memory as the M blocks, where the M image blocks constitute the image processing result.
  • the ISP performs image processing on the original image collected by the camera, and stores the image processing result as the input image in the memory, including: the ISP performs image processing on the original image block by block to obtain multiple images. Image blocks; data synchronization is performed on the multiple image blocks to obtain the image processing result, and the image processing result is stored in the memory.
  • the NPU includes multiple processor cores, and the multiple processor cores share a cache.
  • the method further includes: the NPU stores the synchronized data in one of the memory and the cache.
  • the memory is the external memory of the NPU.
  • the method further includes: the NPU retrieves the synchronized data from the memory or the cache; the NPU uses the synchronized data as the second feature map to execute one of the neural network models Layer or multi-layer reasoning calculation; wherein one or more layers in the neural network model are subsequent layers of the at least two layers in the neural network model.
  • the method for realizing data synchronization provided in the third aspect can be regarded as the method executed by the device for realizing data synchronization provided in the first aspect; for the method executed by that device, refer to the related description in the first aspect, which is not repeated here.
  • an embodiment of the present application provides a device for realizing data synchronization in neural network reasoning, and the device includes a module for executing the method in the third aspect.
  • a computer-readable medium is provided, which stores program code for execution by a device, and the program code includes instructions for executing the method in the third aspect.
  • a computer program product containing instructions is provided, when the computer program product runs on a computer, the computer is caused to execute the method in the third aspect.
  • in a seventh aspect, a chip is provided, which includes a processor and a data interface.
  • the processor reads instructions through the data interface and executes the method in the third aspect.
  • Figure 1 is an architecture diagram for performing neural network inference on image data
  • FIG. 2 is a schematic structural diagram of a convolutional neural network CNN provided by an embodiment of the present application
  • FIG. 3 is a system architecture diagram for realizing data synchronization in neural network reasoning provided by an embodiment of the present application
  • FIG. 4 is a flowchart of data synchronization in neural network reasoning provided by an embodiment of the present application.
  • Fig. 5 is a data dependency graph of adjacent blocks in an embodiment of the present application.
  • FIG. 6 is a schematic block diagram of a processor core provided by an embodiment of the present application.
  • FIG. 7 is a flow chart for realizing data synchronization in neural network inference according to an embodiment of the present application.
  • Fig. 8 is a structural diagram of a device for realizing data synchronization in neural network reasoning provided by an embodiment of the present application.
  • a neural network can be composed of neural units.
  • a neural unit can refer to an arithmetic unit that takes x_s and an intercept of 1 as inputs.
  • the output of the arithmetic unit can be: $f\left(\sum_{s=1}^{n} W_s x_s + b\right)$, where s = 1, 2, ..., n, n is a natural number greater than 1, W_s is the weight of x_s, and b is the bias of the neural unit.
  • f is the activation function of the neural unit, which is used to introduce nonlinear characteristics into the neural network to convert the input signal in the neural unit into an output signal.
  • the output signal of the activation function can be used as the input of the next convolutional layer.
  • the activation function can be a sigmoid function.
  • a neural network is a network formed by connecting many of the above-mentioned single neural units together, that is, the output of one neural unit can be the input of another neural unit.
  • the input of each neural unit can be connected with the local receptive field of the previous layer to extract the characteristics of the local receptive field.
  • the local receptive field can be a region composed of several neural units.
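  • As a small illustrative example of such a neural unit (the sigmoid activation and the numeric values are assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neural_unit(x, w, b):
    """Output of a single neural unit: f(sum_s W_s * x_s + b)."""
    return sigmoid(np.dot(w, x) + b)

x = np.array([0.5, -1.2, 3.0])    # inputs x_s
w = np.array([0.8, 0.1, -0.4])    # weights W_s
print(neural_unit(x, w, b=0.2))   # scalar output of the unit
```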
  • a deep neural network (Deep Neural Network, DNN) is also known as a multi-layer neural network.
  • the layers inside a DNN can be divided into three categories: the input layer, hidden layers, and the output layer.
  • the first layer is the input layer
  • the last layer is the output layer
  • the number of layers in the middle are all hidden layers.
  • the layers are fully connected, that is to say, any neuron in the i-th layer must be connected to any neuron in the i+1th layer.
  • the coefficient from the k-th neuron of the (L-1)-th layer to the j-th neuron of the L-th layer is defined as $W_{jk}^{L}$. It should be noted that the input layer has no W parameter.
  • more hidden layers make the network more capable of portraying complex situations in the real world.
  • a model with more parameters is more complex and has a greater "capacity", which means that it can complete more complex learning tasks.
  • Training the deep neural network is also the process of learning the weight matrix, and its ultimate goal is to obtain the weight matrix of all layers of the trained deep neural network (the weight matrix formed by the vector W of many layers).
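  • A brief sketch of a fully connected DNN forward pass matching this description (the layer sizes and tanh activation are arbitrary example choices):

```python
import numpy as np

def dnn_forward(x, weights, biases):
    """Forward pass through fully connected layers: a_L = f(W_L · a_{L-1} + b_L)."""
    a = x
    for W, b in zip(weights, biases):
        a = np.tanh(W @ a + b)                 # activation applied layer by layer
    return a

rng = np.random.default_rng(0)
sizes = [8, 16, 16, 4]                         # input, two hidden layers, output
weights = [rng.standard_normal((o, i)) for i, o in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(o) for o in sizes[1:]]
y = dnn_forward(rng.standard_normal(8), weights, biases)   # shape (4,)
```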
  • Convolutional Neural Network (CNN, Convolutional Neuron Network) is a deep neural network with a convolutional structure.
  • the convolutional neural network contains a feature extractor composed of a convolutional layer and a sub-sampling layer.
  • the feature extractor can be regarded as a filter, and the convolution process can be regarded as convolution using a trainable filter and an input image or feature map.
  • the convolutional layer refers to the neuron layer that performs convolution processing on the input signal in the convolutional neural network.
  • a neuron can be connected to only part of the neighboring neurons.
  • a convolutional layer usually contains several feature planes, and each feature plane can be composed of some rectangularly arranged neural units.
  • Neural units in the same feature plane share weights, and the shared weights here are the convolution kernel. Weight sharing can be understood as meaning that the way image information is extracted is independent of position. The underlying principle is that the statistics of one part of an image are the same as those of other parts, which means that image information learned in one part can also be used in another part, so the same learned image information can be used for all positions on the image. In the same convolutional layer, multiple convolution kernels can be used to extract different image information; generally, the more convolution kernels there are, the richer the image information reflected by the convolution operation.
  • the convolution kernel can be initialized in the form of a matrix of random size.
  • the convolution kernel can obtain reasonable weights through learning.
  • the direct benefit of sharing weights is to reduce the connections between the layers of the convolutional neural network, and at the same time reduce the risk of overfitting.
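  • To illustrate weight sharing with a single shared convolution kernel, a toy 2-D convolution sketch (not the patent's implementation) is:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide one shared kernel over the image ('valid' padding): the same
    weights are applied at every spatial position (weight sharing)."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for r in range(oh):
        for c in range(ow):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

feature_map = conv2d_valid(np.random.rand(8, 8), np.random.rand(3, 3))  # shape (6, 6)
```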
  • a recurrent neural network (Recurrent Neural Network, RNN) is used to process sequence data.
  • in a traditional neural network model, the layers are fully connected with one another, while the nodes within each layer are unconnected.
  • although this ordinary neural network solves many problems, it is still powerless for many others. For example, to predict the next word of a sentence, the previous words are generally needed, because the preceding and following words in a sentence are not independent. RNN is called a recurrent neural network because the current output of a sequence is also related to the previous outputs.
  • the specific form of expression is that the network memorizes previous information and applies it to the calculation of the current output; that is, the nodes between hidden layers are no longer unconnected but connected, and the input of a hidden layer includes not only the output of the input layer but also the output of the hidden layer at the previous moment.
  • RNN can process sequence data of any length.
  • the training of RNN is the same as the training of traditional CNN or DNN.
  • the error back-propagation algorithm is also used, but with one difference: if the RNN is unrolled, the parameters, such as W, are shared, whereas this is not the case for the traditional neural network described above.
  • the output of each step depends not only on the network of the current step but also on the network states of the previous steps; this learning algorithm is called Backpropagation Through Time (BPTT).
  • the neural network provided in the embodiment of the present application may be a convolutional neural network CNN.
  • a convolutional neural network is a deep neural network with a convolutional structure, and a deep learning architecture.
  • the deep learning architecture means that multiple levels of learning are performed at different levels of abstraction by means of machine learning algorithms.
  • as a deep learning architecture, a CNN is a feed-forward artificial neural network in which each neuron can respond to the image input into it.
  • the CNN 200 may include an input layer 210, a convolutional layer/pooling layer 220 (the pooling layer is optional), and a neural network layer 230.
  • the convolutional layer/pooling layer 220 shown in FIG. 2 may include layers 221 to 226. In one example, layer 221 is a convolutional layer, layer 222 is a pooling layer, layer 223 is a convolutional layer, layer 224 is a pooling layer, layer 225 is a convolutional layer, and layer 226 is a pooling layer; in another implementation, layers 221 and 222 are convolutional layers, layer 223 is a pooling layer, layers 224 and 225 are convolutional layers, and layer 226 is a pooling layer. That is, the output of a convolutional layer can be used as the input of a subsequent pooling layer, or as the input of another convolutional layer to continue the convolution operation.
  • a convolutional layer may be followed by one pooling layer, or multiple convolutional layers may be followed by one or more pooling layers.
  • the purpose of the pooling layer is to reduce the size of the image space.
  • the pooling layer may include an average pooling operator and/or a maximum pooling operator for sampling the input image to obtain an image with a smaller size.
  • the average pooling operator can calculate the pixel values in the image within a specific range to generate an average value as the result of the average pooling.
  • the maximum pooling operator can take the pixel with the largest value within a specific range as the result of the maximum pooling.
  • the operators in the pooling layer should also be related to the image size.
  • the size of the image output after processing by the pooling layer can be smaller than the size of the image of the input pooling layer, and each pixel in the image output by the pooling layer represents the average value or the maximum value of the corresponding sub-region of the image input to the pooling layer.
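  • A toy sketch of the average and maximum pooling operators described above, using a non-overlapping 2x2 window (the window size is an illustrative assumption):

```python
import numpy as np

def pool2d(image, size=2, mode="max"):
    """Downsample with a non-overlapping size x size window (stride = size)."""
    h, w = image.shape[0] // size, image.shape[1] // size
    tiles = image[:h * size, :w * size].reshape(h, size, w, size)
    return tiles.max(axis=(1, 3)) if mode == "max" else tiles.mean(axis=(1, 3))

img = np.arange(16, dtype=float).reshape(4, 4)
print(pool2d(img, mode="max"))   # 2x2 result: max of each 2x2 window
print(pool2d(img, mode="avg"))   # 2x2 result: average of each 2x2 window
```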
  • after processing by the convolutional layer/pooling layer 220, the convolutional neural network 200 is not yet able to output the required output information, because, as mentioned above, the convolutional layer/pooling layer 220 only extracts features and reduces the parameters brought by the input image. However, in order to generate the final output information (the required class information or other related information), the convolutional neural network 200 needs to use the neural network layer 230 to generate the output of one or a group of required classes. Therefore, the neural network layer 230 may include multiple hidden layers (231, 232 to 23n as shown in FIG. 2) and an output layer 240. The parameters contained in the multiple hidden layers can be obtained by pre-training based on relevant training data of a specific task type; for example, the task type can include image recognition, image classification, image super-resolution reconstruction, and so on.
  • after the multiple hidden layers in the neural network layer 230 comes the output layer 240, that is, the final layer of the entire convolutional neural network 200.
  • the output layer 240 has a loss function similar to the classification cross entropy, which is specifically used to calculate the prediction error.
  • the convolutional neural network 200 shown in FIG. 2 is only used as an example of a convolutional neural network. In specific applications, the convolutional neural network may also exist in the form of other network models.
  • the main idea of this application is that the NPU performs, in an asynchronous manner, inference calculations of at least two layers of the neural network model on multiple blocks of a feature map and then synchronizes the multiple inference results; that is, it addresses the question of when the NPU should synchronize data during inference of the neural network model. A specific scenario of neural network inference on images is described below in conjunction with Figure 3.
  • it should be understood that the inference calculation of the neural network model performed by the NPU in this application is not limited to image scenarios; any scenario in which a neural network model can be used for inference calculation is applicable to this application.
  • an embodiment of the present invention provides a system architecture 100.
  • the system architecture 100 includes a camera 101, an ISP 102, a memory 103, and an NPU 104.
  • the NPU 104 includes a calculation engine 1041 and a secondary cache (L2 cache) 1042.
  • the L2 cache 1042 is shared by multiple processor cores included in the calculation engine 1041.
  • the size of the L2 cache is generally much smaller than the size of the memory 103.
  • the camera 101 collects original images and sends the collected original images to the ISP for image processing.
  • specifically, the camera 101 may send the original image to the ISP 102 sequentially in a row/column manner. For example, the camera 101 collects an original image with a size of 720*1280 and sends it in row/column portions of 20*1280 each time, sending 36 original image blocks in total to complete the transmission of the original image.
  • after receiving the original image, the ISP 102 performs image processing on it and then stores the image processing result in the memory 103; the NPU will use the image processing result as the input image for the inference calculation of the neural network model, where the input image can also be understood as a feature map.
  • specifically, the ISP can process the original image in the following two ways:
  • in the first way, after receiving the original image, the ISP divides it into M_1 original image blocks and sequentially performs image processing on the M_1 original image blocks to obtain the corresponding M_1 image blocks, which together constitute the image processing result; these M_1 image blocks are then stored in the memory 103 asynchronously. This means that every time the ISP generates an image block, it stores that image block in the memory 103; that is, the ISP does not need to perform a data synchronization step on these M_1 image blocks. In this case, the M_1 image blocks may be distributed in different storage areas of the memory 103 or in a continuous storage area of the memory 103.
  • the NPU sequentially obtains these M_1 image blocks, that is, the image processing result, from the memory and uses them as the input image to perform inference calculation; each image block can be regarded as one block of the input image. Therefore, the input image contains M_1 blocks; that is, the number of original image blocks into which the ISP divides the original image is the same as the number of blocks contained in the input image on which the NPU performs inference calculation, both being M_1.
  • in the second way, after receiving the original image, the ISP performs image processing on it block by block to obtain multiple image blocks, performs data synchronization on the multiple image blocks, and then stores the synchronized data in the memory 103; in this case, the multiple image blocks are distributed in a continuous storage area of the memory 103. Here, the ISP synchronizes the multiple image blocks to obtain the image processing result and stores that result in the memory 103; the NPU then obtains the image processing result from the memory 103 as the input image and, when performing inference calculation, divides the input image into M_1 blocks. In this case, there is no correlation between the number of blocks into which the ISP divides the original image and the number M_1 of blocks into which the NPU divides the input image for inference calculation; they may or may not be equal.
  • in either way, the input image on which the NPU performs inference calculation contains M_1 blocks.
  • the NPU 104 obtains the input image, which contains M_1 blocks, from the memory 103, performs the inference calculation of all the layers of the first stage (stage 1) on each of the M_1 blocks to obtain the corresponding M_1 inference results, and then performs data synchronization on the M_1 inference results to obtain the synchronized data of the first stage; the inference calculation of the first stage includes the inference calculation of one or more layers of the neural network model, and performing the first-stage inference calculation on each block yields one corresponding inference result.
  • the synchronized data of the first stage is a feature map, on which the inference calculation of the second stage will subsequently be performed; in fact, each of the above inference results can also be understood as a feature map or part of a feature map.
  • after obtaining the synchronized data of the first stage, the NPU 104 compares its size with the size of the L2 cache: if the size of the synchronized data of the first stage is greater than the size of the L2 cache, the synchronized data of the first stage is stored in the memory 103; if it is not greater than the size of the L2 cache, the synchronized data of the first stage is stored in the L2 cache. When the synchronized data of the first stage is stored in the L2 cache, the overhead of data relocation in the memory can be reduced.
  • the NPU 104 fetches the synchronized data of the first stage from the memory 103 or the L2 cache and uses it as a feature map on which to perform the inference calculation of all the layers of the second stage (stage 2). Similarly to the first stage, during the second-stage inference calculation the feature map is also processed block by block: the feature map is divided into M_2 blocks, the second-stage inference calculation is performed on each of the M_2 blocks to obtain the corresponding M_2 inference results, and the M_2 inference results are then data-synchronized to obtain the synchronized data of the second stage; likewise, the second-stage inference calculation includes the inference calculation of one or more layers of the neural network model.
  • after obtaining the synchronized data of the second stage, the NPU 104 also compares its size with the size of the L2 cache to determine whether the second-stage synchronized data is stored in the memory 103 or in the L2 cache; for details, refer to the above description of how the synchronized data of the first stage is stored, which is not repeated here.
  • the NPU 104 uses the same inference calculation method as in the first and second stages above to execute the inference calculation of each subsequent stage, thereby completing the inference calculation of N stages (stage N), that is, completing the inference calculation of all the layers of the neural network model.
  • the result obtained from the inference calculation of the N-th stage is the semantic information, for example, the result of image recognition.
  • the inference calculation of each stage includes the inference calculation of one or more layers of the neural network model, and the number of neural network layers included in the inference calculation of one stage may be the same as or different from the number included in the inference calculation of the other stages.
  • the memory 103 is the external memory of the NPU 104, and the memory 103 can be a double data rate synchronous dynamic random access memory (Double Data Rate Synchronous Dynamic Random Access Memory, DDR SDRAM), a high bandwidth memory (High Bandwidth Memory, HBM), or another readable and writable memory.
  • FIG. 4 shows a process for realizing data synchronization in neural network inference according to an embodiment of this application; the execution subject is the NPU, and the process includes the following steps:
  • perform the i-th-stage inference calculation on each of the M_i blocks contained in the feature map obtained at the i-th stage to obtain M_i inference results; the initial value of i is 1, that is, the inference calculation starts from the first stage and continues until the inference calculation of the N-th stage is completed.
  • S406: Judge whether i equals N; if i is not equal to N, that is, i is less than N, execute S407; if i is equal to N, execute S408.
  • the NPU performs data synchronization after executing the inference calculation of each stage, that is, synchronizes the data according to the inference result of the stage.
  • the inference calculation of each stage includes the multi-layer inference calculation of the neural network model; that is, the NPU performs data synchronization only after the multi-layer inference of the neural network model has been executed. This reduces the number of data synchronizations during neural network inference and generates less data relocation overhead, so that the NPU achieves high performance in executing neural network inference.
  • the i-th-stage inference calculation can include the inference calculation of one or more layers of the neural network model. If the i-th stage includes the inference calculation of one layer of the neural network model, one layer of neural network inference calculation is performed on the feature map obtained at the i-th stage; specifically, one layer of neural network inference calculation is performed in turn on each of the M_i blocks contained in that feature map to obtain M_i inference results, and the M_i inference results are then data-synchronized to obtain the synchronized data of the i-th stage.
  • if the i-th stage includes the inference calculation of multiple layers of the neural network model, multi-layer neural network inference calculation is performed on the feature map obtained at the i-th stage. Specifically, the M_i blocks contained in that feature map complete the i-th-stage inference calculation in turn: after the neural network inference of all the layers of the i-th stage has been performed on one block, the neural network inference of all the layers of the i-th stage is performed on the next block adjacent to it, until all the M_i blocks contained in the feature map have been run through all the layers of the i-th stage, yielding M_i inference results; the M_i inference results are then data-synchronized to obtain the synchronized data of the i-th stage.
  • in other words, the asynchronous manner is used to perform the inference calculation of all the layers of the i-th stage on each block of the feature map, where the asynchronous manner means that the intermediate result obtained after performing the inference calculation of one layer of the neural network model on a block is not data-synchronized, and the inference calculation of the next layer continues to be performed on that intermediate result, until the inference calculation of all the layers included in the i-th stage has been executed. The intermediate result generated by each block at each layer can be stored in the L2 cache and then retrieved from the L2 cache to perform the inference calculation of the next layer; since the intermediate results are stored in the L2 cache rather than in the memory, data relocation in the memory is avoided.
  • Figure 5 shows three adjacent blocks A, B, and C: block A contains the data of rows 0 to 4, block B contains the data of rows 4 to 8, and block C contains the data of rows 8 to 12.
  • the data of row 4 is the edge part of block A and also lies in block B; therefore, there is a data dependency between block A and block B, and the data of row 4 is the coincident (overlapping) data.
  • similarly, the data of row 8 is the edge part of block B and also lies in block C; therefore, there is a data dependency between block B and block C, and the data of row 8 is the coincident data.
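  • A small worked sketch of such overlapping rows between adjacent blocks, assuming each block covers 4 rows and one halo row is needed on each side (for example, for a 3x3 convolution); these are illustrative assumptions, not the patent's numbers:

```python
def block_input_rows(block_id, rows_per_block, halo, total_rows):
    """Input row range a block needs, including `halo` rows shared with its
    neighbours (the data dependency / coincident data between adjacent blocks)."""
    start = max(block_id * rows_per_block - halo, 0)
    stop = min((block_id + 1) * rows_per_block + halo, total_rows)
    return start, stop

# three blocks of 4 rows each over a 12-row feature map, 1 halo row
for b, name in enumerate("ABC"):
    print(f"block {name} reads rows {block_input_rows(b, 4, 1, 12)}")
# block A -> (0, 5), block B -> (3, 9), block C -> (7, 12)
```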
  • in addition, the size of the L2 cache is generally small. Within the inference calculation of a stage, the size of the feature map obtained at that stage is fixed. If the number of blocks M contained in the feature map is relatively large, each block is smaller, and correspondingly the intermediate result produced by each block at each layer of inference is relatively small and can be stored in the L2 cache. Conversely, if M is relatively small, each block is larger, and correspondingly the intermediate result produced by each block at each layer of inference is relatively large and cannot be stored in the L2 cache. Therefore, in order to ensure that the intermediate results produced at each layer of inference by every block of the feature map obtained at that stage can be stored in the L2 cache, an appropriate value must be selected for the number of blocks M.
  • moreover, the sizes of the feature maps obtained at different stages are different. Therefore, in order to ensure that the intermediate results generated by each block at each layer of inference at every stage can be stored in the L2 cache, the number M of blocks into which the feature map obtained at each stage is divided must be set to an appropriate value, and the value of M may be the same or different in different stages. How to determine the value of M at each stage is explained further below.
  • the L2 cache can store two types of data:
  • the synchronized data of a stage, when the size of that synchronized data is not greater than the size of the L2 cache. As the number of neural network layers increases, the synchronized data generated in later stages becomes smaller, so the synchronized data generated after the inference calculations of several stages can be stored in the L2 cache;
  • the intermediate results produced at each layer of inference by the blocks into which the feature map obtained at each stage is divided.
  • the L2 cache can store the above two types of data, reducing the storage burden of the memory, and thereby reducing the overhead of data migration in the memory.
  • for a given neural network model, the number of neural network layers it contains is fixed. Assuming these layers are divided into N stages, that is, N inference stages, the synchronized data of each stage may be stored in the memory, while the intermediate results generated within each inference stage are not stored in the memory. Therefore, the smaller the value of N, the less synchronized data is generated during the entire inference calculation and the fewer data synchronizations occur, so the data relocation overhead is lower. However, the smaller the value of N, the more layers of the neural network model each stage includes, which causes the data dependency between the M blocks of the feature map obtained at each stage to grow, even exponentially, and thus increases the amount of computation of the inference calculation at each stage. Therefore, a smaller N is not always better, and an appropriate value of N must be determined to strike a balance between data relocation overhead and computation.
  • to this end, this application proposes establishing a time-related cost function J, which is used to indicate the sum of the time taken by the ISP to perform image processing and the time taken by the NPU to perform the inference calculation of the entire neural network model, that is, the total end-to-end time, which can also be called the total end-to-end delay.
  • one term of the cost function represents the time for the ISP to process the entire original image; this time can be determined by calculation or according to the measured running time of the ISP.
  • Cycles(l, m, i) represents the theoretical number of cycles required for the m-th block of the feature map obtained at the i-th stage to perform the convolution operation of the l-th layer of the entire neural network model. Further, Cycles(l, m, i) is determined by the batch size, the size of the m-th block, the size of the convolution kernel, the number of computing units on the NPU, and so on; the specific calculation method of Cycles(l, m, i) is not expanded on further in this application.
  • T3(M_i, N, L_i) is used to indicate the time spent by the NPU on data transfers in the storage devices during the N stages of inference calculation from stage 1 (Stage 1) to stage N (Stage N); T3(M_i, N, L_i) is affected by M_i, N, and L_i and is further determined by the batch size, the size of the m-th block, the size of the convolution kernel, the number of computing units on the NPU, and so on. The calculation method of T3 is likewise not expanded on further in this application.
  • the above cost function J with respect to time is a nonlinear function containing multiple unknown variables, such as N, M_i, and L_i. These unknown variables must satisfy certain constraints, for example: 1 ≤ i ≤ N; 1 ≤ N ≤ the total number of downsampling operations in the neural network model; and the value of M_i must guarantee that the size of the intermediate result of each block at each layer of inference does not exceed the size of the L2 cache. Therefore, by solving for a set of variable values (N, M_i, L_i) that satisfies the above constraints and minimizes the value of the objective function J, that is, minimizes the end-to-end delay, the set of values (N, M_i, L_i) obtained is one that guarantees optimal performance of the inference of the entire neural network model.
  • a suitable set of values (N, M_i, L_i) thus determines the number of stages into which the entire neural network model is divided, the number of blocks into which the feature map obtained at each stage is divided, and which layers of the neural network model are included in each stage; the ISP and the NPU then perform the ISP processing and the entire neural network inference, respectively, according to the set of values obtained.
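  • A highly simplified sketch of such a search over candidate configurations (the callables t_isp, cycles_fn and t3_fn are placeholders standing in for the quantities described above, not the patent's actual formulas):

```python
def total_delay(config, t_isp, cycles_fn, t3_fn, freq_hz):
    """Cost J of one candidate (N, M_i, L_i):
    ISP processing time + NPU compute time + data-transfer time."""
    n_stages, blocks_per_stage, layers_per_stage = config
    compute_cycles = sum(
        cycles_fn(l, m, i)
        for i in range(n_stages)                 # stages 1..N
        for m in range(blocks_per_stage[i])      # the M_i blocks of stage i
        for l in layers_per_stage[i]             # the layers L_i of stage i
    )
    return t_isp + compute_cycles / freq_hz + t3_fn(blocks_per_stage, n_stages, layers_per_stage)

def search_best(candidates, **cost_kwargs):
    """Enumerate feasible (N, M_i, L_i) candidates and keep the one with the
    smallest end-to-end delay J."""
    return min(candidates, key=lambda cfg: total_delay(cfg, **cost_kwargs))
```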
  • for example, when the ISP performs asynchronous processing and divides the original image into 10 image blocks, the feature map acquired by the NPU at the first stage also contains 10 blocks; when the ISP performs synchronous processing, the number of blocks into which the NPU divides the input image is not restricted by the 10 image blocks divided by the ISP, and the feature map may, for example, be divided into 15 blocks.
  • the NPU 104 includes a calculation engine 1041 and an L2 cache 1042.
  • the calculation engine 1041 can include one or more processor cores ( Figure 3 uses multiple processor cores as an example).
  • the calculation engine 1041 is used to execute the inference calculation of the neural network model, and the core part of that inference calculation is realized by the processor cores.
  • a processor core is responsible for performing calculations related to scalars, vectors, and tensors, and can also be called an AI processor core, or AI core for short.
  • Figure 6 shows an example of the implementation architecture of a processor core.
  • the processor core includes a matrix calculation unit, a vector calculation unit, a scalar calculation unit and an accumulator.
  • the matrix calculation unit and the accumulator mainly perform matrix-related operations; the vector calculation unit is responsible for vector operations such as vector multiplication, vector addition, exponential operations, and logarithmic operations; and the scalar calculation unit is mainly responsible for various scalar data operations and program flow control.
  • in addition, a series of on-chip buffers, such as an input buffer and an output buffer, are distributed within the processor core, and registers, such as general-purpose registers and special-purpose registers, are arranged around the scalar calculation unit.
  • the embodiment of the application provides a method for realizing data synchronization in neural network reasoning, which specifically includes:
  • the NPU obtains a first feature map, where the first feature map includes M blocks, and M is a positive integer;
  • the NPU uses an asynchronous manner to perform inference calculations of at least two layers in the neural network model on the M blocks respectively to obtain M inference results;
  • the asynchronous mode means that the intermediate result obtained by performing the inference calculation of one layer in the neural network model for each block does not perform data synchronization, and the inference calculation of the next layer is continued to be performed on the intermediate result;
  • the NPU performs data synchronization on the M inference results to obtain synchronized data.
  • the above method further includes:
  • the digital signal processor ISP performs image processing on the original image collected by the camera, and stores the image processing result as the input image in a memory; where the memory is an external memory of the NPU;
  • the ISP can use the following two methods to perform image processing on the original image:
  • the ISP divides the original image into M original image blocks, sequentially performs image processing on the M original image blocks to obtain M image blocks, and then asynchronously stores the M image blocks in the memory as the M blocks, where these M image blocks constitute the image processing result.
  • the ISP performs image processing on the original image block by block to obtain multiple image blocks, performs data synchronization on the multiple image blocks to obtain an image processing result, and stores the image processing result in the memory.
  • acquiring the first feature map by the NPU in S701 includes: the NPU acquiring the input image from the memory.
  • the NPU includes multiple processor cores, and the multiple processor cores share a cache, and the above method further includes:
  • the NPU stores the synchronized data in one of a memory and the cache, where the memory is an external memory of the NPU.
  • the above method also includes:
  • the NPU fetches the synchronized data from the memory or the cache.
  • the NPU uses the synchronized data as a second feature map on which to perform the inference calculation of one or more layers of the neural network model, where the one or more layers are layers subsequent to the at least two layers in the neural network model.
  • the NPU uses the aforementioned asynchronous manner to perform one or more layers of inference calculations in the neural network model on the multiple blocks included in the second feature map.
  • the above value of M is different from the number of blocks contained in the second feature map; there is a data dependency between two adjacent blocks among the blocks contained in the second feature map; and there is a data dependency between two adjacent blocks among the M blocks.
  • the embodiment of the present application provides another device for realizing data synchronization in neural network reasoning, which specifically includes:
  • the obtaining module 801 is used to obtain a first feature map, the first feature map includes M blocks, and M is a positive integer;
  • the inference module 802 is used to perform, in an asynchronous manner, at least two layers of inference calculations of the neural network model on each of the M blocks to obtain M inference results, where the asynchronous manner means that the intermediate result obtained after performing the inference calculation of one layer of the neural network model on a block is not data-synchronized, and the inference calculation of the next layer continues to be performed on that intermediate result;
  • the synchronization module 803 is configured to perform data synchronization on the M inference results to obtain synchronized data.
  • the device for realizing data synchronization further includes:
  • the image processing module 800 is configured to perform image processing on the original image collected by the camera, and use the image processing result as the input image.
  • reasoning module 802 is also used for:
  • use the synchronized data as a second feature map to perform the inference calculation of one or more layers of the neural network model, where the one or more layers are layers subsequent to the at least two layers in the neural network model.
  • the above-mentioned asynchronous manner is used to perform inference calculations of one or more layers in the neural network model on the multiple blocks included in the second feature map, respectively.
  • the disclosed system, device, and method can be implemented in other ways.
  • the device embodiments described above are merely illustrative.
  • the division of the units is only a logical function division, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • if the functions are implemented in the form of a software functional unit and sold or used as an independent product, they can be stored in a computer-readable storage medium.
  • the technical solution of the present application essentially or the part that contributes to the existing technology or the part of the technical solution can be embodied in the form of a software product, and the computer software product is stored in a storage medium, including Several instructions are used to make a computer device (which may be a personal computer, a server, or a network device, etc.) execute all or part of the steps of the methods described in the various embodiments of the present application.
  • the aforementioned storage media include: U disk, mobile hard disk, read-only memory (Read-Only Memory, ROM), random access memory (Random Access Memory, RAM), magnetic disks or optical disks and other media that can store program codes. .

Abstract

This application provides an apparatus and method for implementing data synchronization in neural network inference. It relates to the field of artificial intelligence (AI), and in particular to neural network inference technology. The apparatus includes: a memory for storing a first feature map; and a neural network processing unit (NPU) configured to: obtain the first feature map from the memory, the first feature map including M blocks, M being a positive integer; perform, in an asynchronous manner, inference calculations of at least two layers of a neural network model on each of the M blocks to obtain M inference results, where the asynchronous manner means that the intermediate result obtained after performing the inference calculation of one layer of the neural network model on a block is not synchronized, and the inference calculation of the next layer continues to be performed on that intermediate result; and perform data synchronization on the M inference results to obtain synchronized data. Because the NPU performs data synchronization only after completing the inference calculations of at least two layers of the neural network model, data synchronization occurs fewer times during neural network inference, which in turn incurs less data-movement overhead.

Description

在神经网络推理中实现数据同步的装置和方法 技术领域
本发明实施例涉及神经网络推理技术,尤其涉及一种在神经网络推理中实现数据同步的装置和方法。
背景技术
人工智能(Artificial Intelligence,AI)是利用数字计算机或者数字计算机控制的机器模拟、延伸和扩展人的智能,感知环境、获取知识并使用知识获得最佳结果的理论、方法、技术及应用系统。换句话说,人工智能是计算机科学的一个分支,它企图了解智能的实质,并生产出一种新的能以人类智能相似的方式作出反应的智能机器。人工智能也就是研究各种智能机器的设计原理与实现方法,使机器具有感知、推理与决策的功能。人工智能领域的研究包括机器人,自然语言处理,计算机视觉,决策与推理,人机交互,推荐与搜索,AI基础理论等。
计算机视觉是各个应用领域,如制造业、检验、文档分析、医疗诊断,和军事等领域中各种智能/自主系统中不可分割的一部分,它是一门关于如何运用照相机/摄像机和计算机来获取我们所需的,被拍摄对象的数据与信息的学问。形象地说,就是给计算机安装上眼睛(照相机/摄像机)和大脑(算法)用来代替人眼对目标进行识别、跟踪和测量等,从而使计算机能够感知环境。因为感知可以看作是从感官信号中提取信息,所以计算机视觉也可以看作是研究如何使人工系统从图像或多维数据中“感知”的科学。总的来说,计算机视觉就是用各种成像系统代替视觉器官获取输入信息,再由计算机来代替大脑对这些输入信息完成处理和解释。计算机视觉的最终研究目标就是使计算机能像人那样通过视觉观察和理解世界,具有自主适应环境的能力。
神经网络作为计算机视觉的一个重要方法,在目标分类、检测等领域应用广泛。输入图像数据,通过训练好的神经网络,计算出所需要的语义信息,例如物体的类别等,这个过程为神经网络推理(Neural Network Inference)。
图1为一种对图像数据执行神经网络推理的架构图,该架构图包括相机(或称摄像头)、图像信号处理器(Image Signal Processor,ISP)和神经网络处理器(Neural network Processing Unit,NPU),具体地,NPU对相机采集的原始数据进行推理包括如下步骤:
步骤1:相机采集原始数据(Raw Data),由于原始数据和相机滤镜相关,数据格式对肉眼和神经网络皆不友好,且包含较多的噪声数据和不必要的信息,因此,相机将原始数据发给ISP以进行图像处理。
步骤2:ISP在接收到原始数据之后,对原始数据进行相关处理,包括去噪、色域转换、锐化和压缩等步骤,将原始数据转换为图像数据;
步骤3:NPU读取ISP处理后的图像数据,并且加载训练好的神经网络模型,进行神经网络推理,从而得到推理结果。
在上述步骤3中,NPU在执行神经网络推理的过程中,需要对推理过程中产生的多个中间数据进行数据同步,然后基于同步后的数据继续执行后面的推理过程,所谓数据同步是指对产生的数据不继续执行下一步的操作,而是等多个数据到齐之后,将这个多个数据 作为整体再执行下一步的操作。由于数据同步点的选择会影响到数据搬迁开销,因此,NPU在执行神经网络推理的过程中,选择何时进行数据同步将会影响到数据搬迁开销,进而影响NPU执行神经网络推理的性能。
发明内容
本申请提供一种在神经网络推理中实现数据同步的装置和方法,能够降低在存储器中进行数据搬迁的开销。
第一方面,本申请实施例提供一种在神经网络推理中实现数据同步的装置,包括:存储器,用于存储第一特征图;神经网络处理器NPU,用于:从该存储器中获取所述第一特征图,该第一特征图包含M个分块,M为正整数;利用第一异步方式对该M个分块分别执行神经网络模型中至少两层的推理计算以得到M个推理结果,该第一异步方式是指对每个分块执行完神经网络模型中一层的推理计算所得到的中间结果不进行数据同步,并且继续对该中间结果执行下一层的推理计算;将M个推理结果进行数据同步以得到同步后的数据。其中,第一特征图可以是输入图像或者其他的初始输入,也可以是神经网络推理过程中产生的特征图,分块是第一特征图中的一部分。
由上述第一方面可知,该NPU在执行完神经网络模型中至少两层的推理计算之后才进行数据同步,这使得在神经网络推理过程中进行数据同步的次数较少,进而产生较少的数据搬迁开销。
在一种可能的实现方式中,该第一特征图是输入图像,该装置还包括:数字图像信号处理器ISP,用于对摄像头采集的原始图像进行图像处理,并将图像处理结果作为该输入图像存储在该存储器;
相应地,该NPU从该存储器中获取该第一特征图时,具体用于:从该存储器中获取该输入图像。
上述的实现方式中,由于原始图像包含较多的噪声数据和不必要的信息,因此需要ISP对原始图像执行图像处理以得到适用于进行神经网络推理的输入图像。
在一种可能的实现方式中,该ISP对摄像头采集的原始图像进行图像处理,并将图像处理结果作为该输入图像存储在该存储器时,具体用于:将该原始图像划分成M个原始图像块;对该M个原始图像块依次进行图像处理以得到M个图像块;将该M个图像块作为该M个分块异步存入该存储器,其中,该M个图像块为该图像处理结果。由上述实现方式可知,ISP每生成一个图像块,则将生成的图像块存入该存储器,因此,ISP不对图像块执行数据同步过程,即异步处理,这可以降低因为数据同步而引入的数据搬迁开销,提高数据处理效率。
在一种可能的实现方式中,该ISP对摄像头采集的原始图像进行图像处理,并将图像处理结果作为该输入图像存储在该存储器时,具体用于:将该原始图像按块进行图像处理以得到多个图像块;将该多个图像块进行数据同步以得到该图像处理结果,并将该图像处理结果存储在该存储器。
由上述实现方式可知,该ISP对生成的多个图像块进行数据同步,即同步处理。
在一种可能的实现方式中,针对该ISP执行上述同步处理,在该NPU还用于:将获取的第一特征图划分成该M个分块。
在一种可能的实现方式中,该NPU包括多个处理器核,该多个处理器核共享一个缓存, 该NPU还用于:将该同步后的数据存储在该存储器和该缓存中的一个。
在一种可能的实现方式中,该NPU将该同步后的数据存储在该存储器和该缓存中的一个,具体用于:将该同步后的数据的大小和该缓存的大小进行比较;在该同步后的数据的大小大于该缓存的大小时,将该同步后的数据存储在该存储器;在该同步后的数据的大小不大于该缓存的大小时,将该同步后的数据存储在该缓存。
由上述实现方式可知,NPU生成的同步后的数据可能存储在L2 cache而非存储器,这避免了NPU生成的所有同步后的数据全部存储在存储器,减少了在存储器中进行数据搬迁的开销。
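The size comparison described above can be sketched in a few lines of Python. The function below stores the synchronized data in the shared L2 cache only when it fits, and otherwise falls back to the NPU's external memory; the callback names (`store_to_cache`, `store_to_memory`) and the `l2_cache_bytes` parameter are assumptions made for this illustration, not interfaces defined by the patent.

```python
def place_synchronized_data(sync_data: bytes, l2_cache_bytes: int,
                            store_to_cache, store_to_memory) -> str:
    """Store synchronized data in the shared L2 cache when it fits,
    otherwise fall back to external memory (e.g. DDR/HBM).

    `store_to_cache` / `store_to_memory` are caller-supplied callbacks that
    stand in for the NPU's actual data paths.
    """
    if len(sync_data) > l2_cache_bytes:
        store_to_memory(sync_data)      # too large: keep it in external memory
        return "memory"
    store_to_cache(sync_data)           # fits: keep it on-chip to avoid DDR traffic
    return "cache"
```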
在一种可能的实现方式中,该NPU还用于:从该存储器或该缓存中取出该同步后的数据;将该同步后的数据作为第二特征图执行该神经网络模型中一层或多层的推理计算;其中,该神经网络模型中一层或多层是该神经网络模型中所述至少两层的后续层。
在一种可能的实现方式中,该NPU还用于:利用该异步方式对该第二特征图中所包含的多个分块分别执行该神经网络模型中一层或多层的推理计算。
在一种可能的实现方式中,M的取值和第二特征图中所包含的多个分块的数量不同。
在一种可能的实现方式中,第二特征图所包含的多个分块中相邻的两个分块之间存在数据依赖。
在一种可能的实现方式中,该M个分块中相邻的两个分块之间存在数据依赖。
第二方面,本申请实施例提供一种在神经网络推理中实现数据同步的装置,包括:包括至少一个处理器核的计算引擎,用于:获取第一特征图,该第一特征图包含M个分块,M为正整数;利用异步方式对该M个分块分别执行神经网络模型中至少两层的推理计算以得到M个推理结果,该异步方式是指对每个分块执行完该神经网络模型中一层的推理计算所得到的中间结果不进行数据同步,并且继续对该中间结果执行下一层的推理计算;将该M个推理结果进行数据同步以得到同步后的数据。
在一种可能的实现方式中,该装置还包括缓存,该至少一个处理器核共享该缓存,该计算引擎还用于:将该同步后的数据存储在该缓存。
在一种可能的实现方式中,该计算引擎具体用于:在该同步后的数据的大小不大于该缓存的大小时,将该同步后的数据存储在该缓存。
在一种可能的实现方式中,该NPU还用于:从该缓存中取出该同步后的数据;将该同步后的数据作为第二特征图执行该神经网络模型中一层或多层的推理计算;其中,该神经网络模型中一层或多层是该神经网络模型中所述至少两层的后续层。
在一种可能的实现方式中,该M个分块中相邻的两个分块之间存在数据依赖。
第三方面,本申请实施例提供一种在神经网络推理中实现数据同步的方法,该方法包括:神经网络处理器NPU获取第一特征图,该第一特征图包含M个分块,M为正整数;该NPU利用异步方式对该M个分块分别执行神经网络模型中至少两层的推理计算以得到M个推理结果,该异步方式是指对每个分块执行完该神经网络模型中一层的推理计算所得到的中间结果不进行数据同步,并且继续对该中间结果执行下一层的推理计算;该NPU将该M个推理结果进行数据同步以得到同步后的数据。
在一种可能的实现方式中,该第一特征图是输入图像,在该NPU获取第一特征图之前,该方法还包括:数字信号处理器ISP对摄像头采集的原始图像进行图像处理,并将图像处理结果作为该输入图像存储在存储器,该存储器为该NPU的外部存储器;
相应地,该NPU获取第一特征图,包括:该NPU从该存储器中获取该输入图像。
在一种可能的实现方式中,该ISP对摄像头采集的原始图像进行图像处理,并将图像处理结果作为该输入图像存储在存储器,包括:该ISP将该原始图像划分成M个原始图像块;该ISP对该M个原始图像块依次进行图像处理以得到M个图像块;该ISP将该M个图像块作为该M个分块异步存入该存储器,其中,该M个图像块为该图像处理结果。
在一种可能的实现方式中,该ISP对摄像头采集的原始图像进行图像处理,并将图像处理结果作为该输入图像存储在存储器,包括:该ISP将该原始图像按块进行图像处理以得到多个图像块;将该多个图像块进行数据同步以得到该图像处理结果,并将该图像处理结果存储在该存储器。
在一种可能的实现方式中,该NPU包括多个处理器核,该多个处理器核共享一个缓存,该方法还包括:该NPU将该同步后的数据存储在存储器和该缓存中的一个,该存储器为该NPU的外部存储器。
在一种可能的实现方式中,该方法还包括:该NPU从该存储器或该缓存中取出该同步后的数据;该NPU将该同步后的数据作为第二特征图执行该神经网络模型中一层或多层的推理计算;其中,该神经网络模型中一层或多层是该神经网络模型中所述至少两层的后续层。
需要说明的是,第三方面提供的实现数据同步的方法可以视为第一方面提供的实现数据同步的装置所执行的方法,第三方面提供的方法中的具体实现方式及相应技术效果可以参见第一方面中的相关描述,此处不再赘述。
第四方面,本申请实施例提供一种在神经网络推理中实现数据同步的装置,该装置包括用于执行第三方面中的方法的模块。
第五方面,提供一种计算机可读介质,该计算机可读介质存储用于设备执行的程序代码,该程序代码包括用于执行第三方面中的方法。
第六方面,提供一种包含指令的计算机程序产品,当该计算机程序产品在计算机上运行时,使得计算机执行上述第三方面中的方法。
第七方面,提供一种芯片,该芯片包括处理器与数据接口,该处理器通过该数据接口读取指令,执行第三方面中的方法。
附图说明
图1是一种对图像数据执行神经网络推理的架构图;
图2是本申请实施例提供的一种卷积神经网络CNN的结构示意图;
图3是本申请实施例提供的一种在神经网络推理中实现数据同步的系统架构图;
图4是本申请实施例提供的在神经网络推理中实现数据同步的流程图;
图5是本申请实施例的相邻分块的数据依赖图;
图6是本申请实施例提供的处理器核的示意性框图;
图7是本申请实施例提供的一种在神经网络推理中实现数据同步的流程图;
图8是本申请实施例提供的一种在神经网络推理中实现数据同步的装置结构图。
具体实施方式
下面将结合附图,对本申请中的技术方案进行描述。
由于本申请实施例涉及大量神经网络的应用,为了便于理解,下面先对本申请实施例涉及的相关术语及神经网络等相关概念进行介绍。
(1)神经网络
神经网络可以是由神经单元组成的，神经单元可以是指以x_s和截距1为输入的运算单元，该运算单元的输出可以为：
$$h_{W,b}(x)=f(W^{T}x)=f\left(\sum_{s=1}^{n}W_{s}x_{s}+b\right)$$
其中，s=1、2、……n，n为大于1的自然数，W_s为x_s的权重，b为神经单元的偏置。f为神经单元的激活函数（activation functions），用于将非线性特性引入神经网络中，来将神经单元中的输入信号转换为输出信号。该激活函数的输出信号可以作为下一层卷积层的输入。激活函数可以是sigmoid函数。神经网络是将许多个上述单一的神经单元联结在一起形成的网络，即一个神经单元的输出可以是另一个神经单元的输入。每个神经单元的输入可以与前一层的局部接受域相连，来提取局部接受域的特征，局部接受域可以是由若干个神经单元组成的区域。
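As a small numerical illustration of the neuron output formula above, the following Python snippet evaluates f(∑ W_s·x_s + b) with a sigmoid activation; the weights, inputs, and bias are arbitrary example values chosen for the illustration.

```python
import math

def neuron_output(x, w, b):
    """Single neuron: apply the activation f to the weighted sum of inputs plus bias."""
    z = sum(w_s * x_s for w_s, x_s in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))   # sigmoid activation f

# Example with arbitrary values: 3 inputs, 3 weights, one bias.
print(neuron_output([0.5, -1.0, 2.0], [0.1, 0.4, -0.3], b=0.2))
```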
(2)深度神经网络
深度神经网络(Deep Neural Network,DNN),也称多层神经网络,可以理解为具有很多层隐含层的神经网络,这里的“很多”并没有特别的度量标准。从DNN按不同层的位置划分,DNN内部的神经网络可以分为三类:输入层,隐含层,输出层。一般来说第一层是输入层,最后一层是输出层,中间的层数都是隐含层。层与层之间是全连接的,也就是说,第i层的任意一个神经元一定与第i+1层的任意一个神经元相连。虽然DNN看起来很复杂,但是就每一层的工作来说,其实并不复杂,简单来说就是如下线性关系表达式:
$$y=\alpha(Wx+b)$$
其中，x是输入向量，y是输出向量，b是偏移向量，W是权重矩阵（也称系数），α()是激活函数。每一层仅仅是对输入向量x经过如此简单的操作得到输出向量y。由于DNN层数多，则系数W和偏移向量b的数量也就很多了。这些参数在DNN中的定义如下所述：以系数W为例：假设在一个三层的DNN中，第二层的第4个神经元到第三层的第2个神经元的线性系数定义为$W_{24}^{3}$，上标3代表系数W所在的层数，而下标对应的是输出的第三层索引2和输入的第二层索引4。总结就是：第L-1层的第k个神经元到第L层的第j个神经元的系数定义为$W_{jk}^{L}$。
需要注意的是,输入层是没有W参数的。在深度神经网络中,更多的隐含层让网络更能够刻画现实世界中的复杂情形。理论上而言,参数越多的模型复杂度越高,“容量”也就越大,也就意味着它能完成更复杂的学习任务。训练深度神经网络的也就是学习权重矩阵的过程,其最终目的是得到训练好的深度神经网络的所有层的权重矩阵(由很多层的向量W形成的权重矩阵)。
(3)卷积神经网络
卷积神经网络(CNN,Convolutional Neuron Network)是一种带有卷积结构的深度神经网络。卷积神经网络包含了一个由卷积层和子采样层构成的特征抽取器。该特征抽取器可以看作是滤波器,卷积过程可以看作是使用一个可训练的滤波器与一个输入的图像或者特征图(feature map)做卷积。卷积层是指卷积神经网络中对输入信号进行卷积处理的神经元层。在卷积神经网络的卷积层中,一个神经元可以只与部分邻层神经元连接。一个卷积层中,通常包含若干个特征平面,每个特征平面可以由一些矩形排列的神经单元组成。同一特征平面的神经单元共享权重,这里共享的权重就是卷积核。共享权重可以理解为提取图像信息的方式与位置无关。这其中隐含的原理是:图像的某一部分的统计信息与其他 部分是一样的。即意味着在某一部分学习的图像信息也能用在另一部分上。所以对于图像上的所有位置,都能使用同样的学习得到的图像信息。在同一卷积层中,可以使用多个卷积核来提取不同的图像信息,一般地,卷积核数量越多,卷积操作反映的图像信息越丰富。
卷积核可以以随机大小的矩阵的形式初始化,在卷积神经网络的训练过程中卷积核可以通过学习得到合理的权重。另外,共享权重带来的直接好处是减少卷积神经网络各层之间的连接,同时又降低了过拟合的风险。
(4)循环神经网络(RNN,Recurrent Neural Networks)
RNN是用来处理序列数据的。在传统的神经网络模型中,是从输入层到隐含层再到输出层,层与层之间是全连接的,而对于每一层层内之间的各个节点是无连接的。这种普通的神经网络虽然解决了很多难题,但是却仍然对很多问题却无能无力。例如,你要预测句子的下一个单词是什么,一般需要用到前面的单词,因为一个句子中前后单词并不是独立的。RNN之所以称为循环神经网路,即一个序列当前的输出与前面的输出也有关。具体的表现形式为网络会对前面的信息进行记忆并应用于当前输出的计算中,即隐含层本层之间的节点不再无连接而是有连接的,并且隐含层的输入不仅包括输入层的输出还包括上一时刻隐含层的输出。理论上,RNN能够对任何长度的序列数据进行处理。对于RNN的训练和对传统的CNN或DNN的训练一样。同样使用误差反向传播算法,不过有一点区别:即,如果将RNN进行网络展开,那么其中的参数,如W,是共享的;而如上举例上述的传统神经网络却不是这样。并且在使用梯度下降算法中,每一步的输出不仅依赖当前步的网络,还依赖前面若干步网络的状态。该学习算法称为基于时间的反向传播算法Back propagation Through Time(BPTT)。
(5)损失函数
在训练深度神经网络的过程中,因为希望深度神经网络的输出尽可能的接近真正想要预测的值,所以可以通过比较当前网络的预测值和真正想要的目标值,再根据两者之间的差异情况来更新每一层神经网络的权重向量(当然,在第一次更新之前通常会有初始化的过程,即为深度神经网络中的各层预先配置参数),比如,如果网络的预测值高了,就调整权重向量让它预测低一些,不断的调整,直到深度神经网络能够预测出真正想要的目标值或与真正想要的目标值非常接近的值。因此,就需要预先定义“如何比较预测值和目标值之间的差异”,这便是损失函数(loss function)或目标函数(objective function),它们是用于衡量预测值和目标值的差异的重要方程。其中,以损失函数举例,损失函数的输出值(loss)越高表示差异越大,那么深度神经网络的训练就变成了尽可能缩小这个loss的过程。
本申请实施例提供的神经网络可以是卷积神经网络CNN。如前文的基础概念介绍所述,卷积神经网络是一种带有卷积结构的深度神经网络,是一种深度学习(deep learning)架构,深度学习架构是指通过机器学习的算法,在不同的抽象层级上进行多个层次的学习。作为一种深度学习架构,CNN是一种前馈(feed-forward)人工神经网络,该前馈人工神经网络中的各个神经元可以对输入其中的图像作出响应。
如图2所示,CNN200可以包括输入层210,卷积层/池化层220(其中池化层为可选的),以及神经网络层230。
卷积层/池化层220:
卷积层:
如图2所示卷积层/池化层220可以包括如示例221-226层,举例来说:在一种实现中,221层为卷积层,222层为池化层,223层为卷积层,224层为池化层,225为卷积层,226为池化层;在另一种实现方式中,221、222为卷积层,223为池化层,224、225为卷积层,226为池化层。即卷积层的输出可以作为随后的池化层的输入,也可以作为另一个卷积层的输入以继续进行卷积操作。
池化层:
由于常常需要减少训练参数的数量,因此卷积层之后常常需要周期性的引入池化层,在如图2中220所示例的221-226各层,可以是一层卷积层后面跟一层池化层,也可以是多层卷积层后面接一层或多层池化层。在图像处理过程中,池化层的目的就是减少图像的空间大小。池化层可以包括平均池化算子和/或最大池化算子,以用于对输入图像进行采样得到较小尺寸的图像。平均池化算子可以在特定范围内对图像中的像素值进行计算产生平均值作为平均池化的结果。最大池化算子可以在特定范围内取该范围内值最大的像素作为最大池化的结果。另外,就像卷积层中用权重矩阵的大小应该与图像尺寸相关一样,池化层中的运算符也应该与图像的大小相关。通过池化层处理后输出的图像尺寸可以小于输入池化层的图像的尺寸,池化层输出的图像中每个像素点表示输入池化层的图像的对应子区域的平均值或最大值。
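To make the average-pooling and max-pooling operators described above concrete, here is a minimal NumPy sketch of non-overlapping pooling on a single-channel feature map; it illustrates the general operation only and is not code from the patent.

```python
import numpy as np

def pool2d(feature_map: np.ndarray, size: int = 2, mode: str = "max") -> np.ndarray:
    """Non-overlapping pooling: each size x size window is reduced to its max or mean."""
    h, w = feature_map.shape
    fm = feature_map[:h - h % size, :w - w % size]          # drop rows/cols that do not fill a window
    windows = fm.reshape(h // size, size, w // size, size)  # group pixels into size x size windows
    return windows.max(axis=(1, 3)) if mode == "max" else windows.mean(axis=(1, 3))

x = np.arange(16, dtype=float).reshape(4, 4)
print(pool2d(x, 2, "max"))    # 2x2 max pooling
print(pool2d(x, 2, "avg"))    # 2x2 average pooling
```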
神经网络层230:
在经过卷积层/池化层220的处理后,卷积神经网络200还不足以输出所需要的输出信息。因为如前所述,卷积层/池化层220只会提取特征,并减少输入图像带来的参数。然而为了生成最终的输出信息(所需要的类信息或其他相关信息),卷积神经网络200需要利用神经网络层230来生成一个或者一组所需要的类的数量的输出。因此,在神经网络层230中可以包括多层隐含层(如图2所示的231、232至23n)以及输出层240,该多层隐含层中所包含的参数可以根据具体的任务类型的相关训练数据进行预先训练得到,例如该任务类型可以包括图像识别,图像分类,图像超分辨率重建等等。
在神经网络层230中的多层隐含层之后,也就是整个卷积神经网络200的最后层为输出层240,该输出层240具有类似分类交叉熵的损失函数,具体用于计算预测误差,一旦整个卷积神经网络200的前向传播(如图2由210至240方向的传播为前向传播)完成,反向传播(如图2由240至210方向的传播为反向传播)就会开始更新前面提到的各层的权重值以及偏差,以减少卷积神经网络200的损失,及卷积神经网络200通过输出层输出的结果和理想结果之间的误差。
需要说明的是,如图2所示的卷积神经网络200仅作为一种卷积神经网络的示例,在具体的应用中,卷积神经网络还可以以其他网络模型的形式存在。
本申请的主要思想在于NPU利用异步方式对特征图中多个分块分别执行神经网络模型中至少两层的推理计算,然后再对多个推理结果进行数据同步,即解决NPU在神经网络模型推理过程中何时进行数据同步的问题,如下将结合图3说明一个针对图像进行神经网络推理的具体场景,需要说明的是,本申请NPU执行神经网络模型的推理计算并不限定在图像场景,凡是可以应用神经网络模型进行推理计算的场景都可以适用于本申请。
下面将结合图像处理场景介绍本申请实施例提供的系统架构。
参见图3,本发明实施例提供了一种系统架构100。该系统架构100包括相机101、 ISP102、存储器103和NPU104,其中,NPU104包括计算引擎1041和二级缓存(L2 cache)1042,L2 cache 1042为计算引擎1041中所包含的多个处理器核所共享的cache,L2 cache的大小一般远小于存储器103的大小。
相机101(或称摄像头)采集原始图像,并将采集的原始图像发送给ISP进行图像处理。
具体地,相机101可以将原始图像按照行/列的方式依次发送给ISP102,例如,相机101采集了一个大小为720*1280p的原始图像,按照行/列的方式每次发送大小为20*1280p的原始图像块,共发送36个原始图像块,进而完成原始图像的发送。
ISP102在接收到原始图像之后,对原始图像进行图像处理,然后将图像处理结果存储在存储器103中,并且NPU会将该图像处理结果作为输入图像以执行神经网络模型的推理计算,其中,该输入图像也可以理解成是特征图(feature map)。
具体地,ISP可以按照如下两种方式来处理:
1)ISP在接收到原始图像之后，将原始图像划分成M_1个原始图像块，对M_1个原始图像块依次进行图像处理以得到对应的M_1个图像块，这M_1个图像块组成图像处理结果，然后将这M_1个图像块异步存入存储器103，这意味着ISP每生成一个图像块，则将生成的图像块存入存储器103，即ISP无需对这M_1个图像块进行数据同步过程，此时M_1个图像块可能分布在存储器103的不同存储区域，也可能分布在存储器103的连续存储区域。
NPU从存储器中依次获取这M_1个图像块，即图像处理结果，并将M_1个图像块作为输入图像以执行推理计算，并且每个图像块可以看作是该输入图像的分块，因此，该输入图像包含M_1个分块，即ISP中原始图像所划分的原始图像块的数量和NPU执行推理计算的输入图像所包含的分块的数量相同，都是M_1。
2)ISP在接收到原始图像之后，对该原始图像按块进行图像处理以得到多个图像块，然后将这多个图像块进行数据同步，再将同步后的数据存入存储器103，此时这多个图像块分布在存储器103的连续存储区域。该种情况下，ISP将多个图像块进行数据同步以得到图像处理结果，并将该图像处理结果存储在存储器103，然后NPU从存储器103中获取该图像处理结果作为输入图像，NPU在执行推理计算时再将该输入图像划分成M_1个分块。此时，ISP中原始图像划分的块的数量和NPU执行推理计算的输入图像所划分的分块的数量M_1之间不相关，即可以相等也可以不相等。
由上可知，在上述两种方式中，NPU执行推理计算的输入图像都包括M_1个分块。
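The two ISP behaviours described above can be contrasted with a small Python sketch: in the asynchronous mode each processed block is written to memory as soon as it is produced, while in the synchronous mode all blocks are synchronized first and written as one image-processing result. The helper names (`process_block`, the `memory` list) are illustrative assumptions rather than interfaces defined by the patent.

```python
def isp_write_async(raw_blocks, process_block, memory: list):
    """Mode 1: each processed image block is stored immediately, with no synchronization."""
    for raw in raw_blocks:
        memory.append(process_block(raw))   # blocks may end up scattered in memory

def isp_write_sync(raw_blocks, process_block, memory: list):
    """Mode 2: all blocks are processed first, then written as one synchronized result."""
    processed = [process_block(raw) for raw in raw_blocks]
    memory.append(processed)                # one contiguous image-processing result
```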
NPU104从存储器103中获取该输入图像，该输入图像包含M_1个分块，对这M_1个分块分别进行第1阶段（stage1）所有层的推理计算以得到对应的M_1个推理结果，然后将这M_1个推理结果进行数据同步，以得到第1阶段同步后的数据；其中，第1阶段的推理计算包含神经网络模型中一层或多层的推理计算，每个分块进行第1阶段的推理计算得到一个对应的推理结果。
需要说明的是,第1阶段同步后的数据是一个特征图,后面会对该特征图执行第2阶段的推理计算,当然上述每个推理结果其实也可以理解成是特征图或者特征图的一部分。
NPU104在获取第1阶段同步后的数据之后,将第1阶段同步后的数据的大小和L2cache的大小进行比较,如果第1阶段同步后的数据的大小大于L2 cache的大小,则将第1阶段同步后的数据存放在存储器103;如果第1阶段同步后的数据的大小不大于L2 cache的大小,则将第1阶段同步后的数据存放在L2 cache。当第1阶段同步后的数据存放在 L2 cache时,可以减少在存储器中进行数据搬迁的开销。
NPU104从存储器103或者L2 cache中取出第1阶段同步后的数据，将第1阶段同步后的数据作为特征图进行第2阶段（stage2）所有层的推理计算，与上述第1阶段的推理计算方式相同，在执行第2阶段的推理计算过程中，特征图也按照分块的方式进行推理计算，具体地，该特征图被划分为M_2个分块，对这M_2个分块分别进行第2阶段的推理计算以得到对应的M_2个推理结果，然后将这M_2个推理结果进行数据同步，以得到第2阶段同步后的数据；同样地，第2阶段的推理计算也包含该神经网络模型中一层或多层的推理计算。
NPU104在获取第2阶同步后的数据之后,也将其大小和L2 cache的大小进行比较,从而确定将第2阶段同步后的数据存放在存储器103或者L2 cache,具体可以参考上述对第1阶段同步后的数据的存放方式,此处不再赘述。
以此类推,NPU104采用与上述第1阶段和上述第2阶段相同的推理计算方式执行后续每个阶段的推理计算,从而完成N个阶段(stageN)的推理计算,即完成神经网络模型中所有层的推理计算,其中,第N个阶段的推理计算的输出是语义信息,比如图像识别的结果等。
需要说明的是,上述N个阶段中,每个阶段的推理计算包含神经网络模型中一层或多层的推理计算,且每个阶段的推理计算所包含的神经网络的层数和其他阶段的推理计算所包含的神经网络的层数可以相同,也可以不同。
存储器103为NPU104的外部存储器,存储器103可以为双倍数据率同步动态随机存储器(Double Data Rate Synchronous Dynamic Random Access Memory,DDR SDRAM)、高带宽存储器(High Bandwidth Memory,HBM)或其他可读可写的存储器。
为了更清楚的说明NPU104在N个阶段执行推理计算的过程,图4为本申请实施例提供的一种在神经网络推理中实现数据同步的流程,其执行主体为NPU,具体地,该流程包括:
S401、对第i阶段的特征图所包含的M_i个分块分别执行第i阶段的推理计算，以得到M_i个推理结果；
其中,1≤i≤N,i的初始值为1,即从第1阶段开始进行推理计算,直至执行完第N阶段的推理计算。
S402、对M_i个推理结果进行数据同步以得到同步后的数据；
S403、将该同步后的数据的大小和L2 cache进行比较,如果该同步后的数据的大小大于L2 cache的大小,则执行S404;如果该同步后的数据的大小不大于L2 cache的大小,则执行S405;
S404、将该同步后的数据存入存储器;
S405、将该同步后的数据存入L2 cache;
S406、判断i是否为N,如果i不等于N,即i小于N,则执行S407;如果i等于N,则执行S408;
S407、设置i=i+1,并且转到S401;
S408、推理计算结束。
需要说明的是,在不影响方案实现的前提下,上述步骤之间的执行顺序可以适当的调换,本申请对此不做限制。
由上可知，NPU在执行完每个阶段的推理计算之后才进行数据同步，即按照阶段的推理结果来进行数据同步，由于一般每个阶段的推理计算包含神经网络模型中多层的推理计算，即NPU在每执行完神经网络模型中多层的推理之后才进行一次数据同步，这使得在神经网络推理过程中进行数据同步的次数较少，进而产生较少的数据搬迁开销，从而使得NPU执行神经网络推理的性能较高。
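A minimal Python sketch of the stage-by-stage flow S401–S408 described above: each stage runs all of its layers on every block without intermediate synchronization, synchronizes the per-block results once, and places the synchronized data in the L2 cache when it fits and in external memory otherwise. All helper names (`run_stage_on_block`, `concat_blocks`, `split_into_blocks`, `sizeof`, `storage`) are assumptions introduced for this illustration.

```python
def run_inference(input_blocks, stages, split_into_blocks, run_stage_on_block,
                  concat_blocks, sizeof, l2_cache_bytes, storage: dict):
    """Stage-wise inference with one data synchronization per stage (cf. S401-S408)."""
    blocks = input_blocks
    synced = None
    for i, stage_layers in enumerate(stages):                            # stage i+1 of N
        results = [run_stage_on_block(b, stage_layers) for b in blocks]  # S401: per block, asynchronously
        synced = concat_blocks(results)                                  # S402: synchronize the M_i results
        where = "l2_cache" if sizeof(synced) <= l2_cache_bytes else "memory"  # S403
        storage[where] = synced                                          # S404 / S405
        if i + 1 < len(stages):                                          # S406 / S407: move on to the next stage
            blocks = split_into_blocks(synced, stage_index=i + 1)        # re-split with that stage's block count
    return synced                                                        # S408: inference finished
```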
如下将对每个阶段的推理计算过程进行具体描述,以N个阶段中第i阶段为例来进行说明。
第i阶段推理计算可以包含神经网络模型中一层或多层的推理计算，如果第i阶段包含神经网络模型中一层的推理计算，则对第i阶段获取的特征图进行一层神经网络推理计算，具体地，将第i阶段获取的特征图所包含的M_i个分块依次进行一层神经网络推理计算以得到M_i个推理结果，然后对这M_i个推理结果进行数据同步以得到第i阶段同步的数据；
如果第i阶段包含神经网络模型中多层的推理计算，则对第i阶段获取的特征图进行多层神经网络推理计算，具体地，将第i阶段获取的特征图所包含的M_i个分块依次完成第i阶段的推理计算，即对其中一个分块执行第i阶段所有层的神经网络推理之后，再对与该分块相邻的下一个块执行第i阶段所有层的神经网络推理，直至特征图中所包含的M_i个分块都执行第i阶段所有层的神经网络计算，进而得到M_i个推理结果，然后对这M_i个推理结果进行数据同步以得到第i阶段同步的数据；
进一步,利用异步方式对特征图中每个分块执行第i阶段所有层的推理计算,其中,该异步方式是指对每个分块执行完神经网络模型中一层的推理计算所得到的中间结果不进行数据同步,并且继续对该中间结果执行下一层的推理计算,直至执行完第i阶段所包含的所有层的推理计算;其中,每个分块在每层产生的中间结果可以被存储在L2 cache中,再从L2 cache中取出该中间结果以执行下一层的推理计算。由于中间结果没有存储在存储器中,而是存储在L2 cache中,因此,避免了在存储器中进行数据搬迁。
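As an illustration of the asynchronous manner applied to a single block, the Python sketch below runs every layer of one stage back-to-back on the block, keeping each layer's intermediate result in the L2 cache and reading it back as the next layer's input, with no data synchronization in between. The `apply_layer` callback and the `l2_cache` dict are illustrative stand-ins, not APIs defined by the patent.

```python
def infer_block_async(block, stage_layers, apply_layer, l2_cache: dict):
    """Run all layers of one stage on a single block without any data synchronization."""
    current = block
    for layer in stage_layers:
        current = apply_layer(layer, current)   # intermediate result of this layer
        l2_cache["intermediate"] = current      # kept on-chip, not moved to external memory
        current = l2_cache["intermediate"]      # fetched again as the next layer's input
    return current                              # this block's inference result for the stage
```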
针对上述第i阶段推理计算过程的具体实现,需要说明如下几点:
1)数据依赖:在第i阶段的推理计算过程中,特征图会被划分为M i个分块,然后按照分块依次完成第i阶段中所包含的各层的推理计算,其中,相邻的两个分块之间存在数据依赖,即存在重合数据。
比如，图5示意了三个相邻的分块A、B和C，分块A包含第0行到第4行数据，分块B包含第4行到第8行数据，分块C包含第8行到第12行数据，分块A的边缘部分，即第4行数据在分块B中，因此，分块A和分块B之间存在数据依赖，第4行数据是重合数据；分块B的边缘部分，即第8行数据在分块C中，因此，分块B和分块C之间存在数据依赖，第8行数据是重合数据（相邻分块行范围的一种划分示例可参见下文第3)点之后的代码草图）；
2)L2 cache的大小一般较小,在执行一个阶段的推理计算中,该阶段获取的特征图的大小一定,如果该特征图所包含的分块的数量M比较大,即该特征图所包含的分块的数量比较多,则每个分块的大小较小,相应地,每个分块在每层推理的中间结果比较小,则中间结果可以存放在L2 cache中;反之,如果M比较小,即该特征图所包含的分块的数量比较少,则每个分块的大小较大,相应地,每个分块在每层推理的中间结果比较大,则中间结果无法存放在L2 cache中。因此,为了保证该阶段获取的特征图中每个分块在每层推理的中间结果能够存放在L2 cache中,需要对分块数量M选取合适的值;
3)在不同的阶段，获取的特征图的大小不同，因此，为了确保在每个阶段，每个分块在每层推理所产生的中间结果都可以存放在L2 cache中，需要对每个阶段获取的特征图所划分的分块的数量M分别设置一个合适的值，并且，在不同的阶段，M的取值可以相同，也可以不同。关于如何确定每个阶段M的取值，后文将会做进一步说明。
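Returning to point 1) above: the row ranges of the adjacent blocks A, B, and C in Figure 5 (rows 0-4, 4-8, and 8-12, sharing one boundary row) can be generated with a simple helper. This Python sketch only illustrates how such overlapping ranges might be computed; it is not code from the patent.

```python
def split_with_overlap(height: int, block_rows: int):
    """Return (start_row, end_row) ranges where each range shares its last row with the next."""
    ranges, start = [], 0
    while start < height - 1:
        end = min(start + block_rows, height - 1)
        ranges.append((start, end))   # rows start..end inclusive
        start = end                   # the end row is reused as the next block's first row
    return ranges

# Matches Figure 5: rows 0-4, 4-8 and 8-12 of a 13-row feature map (rows 0..12).
print(split_with_overlap(height=13, block_rows=4))   # [(0, 4), (4, 8), (8, 12)]
```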
由上可知,NPU在执行上述整个推理计算过程中,L2 cache可以存放两类数据:
1)一个阶段同步后的数据,并且该同步后的数据的大小不大于L2 cache。随着神经网络层数的递增,越往后的阶段所产生的同步后的数据的大小越小,因此,在若干个阶段的推理计算后产生的同步后的数据可以存放在L2 cache;
2)在执行一个阶段推理计算过程中,获取的特征图所划分的分块在每层推理所产生的中间结果。
由于NPU在执行整个推理计算过程中,L2 cache可以存放上述两类数据,减少了存储器的存储负担,进而减少在存储器中进行数据搬迁的开销。
针对一个具体的神经网络模型,其所包含的神经网络层数是固定的,假设将这些神经网络层数分成N个阶段,即N个推理阶段,由于每个阶段同步后的数据才可能存放在存储器,每个推理阶段内部产生的中间结果不会存放在存储器,因此,N的取值越小,整个推理计算过程中产生的同步后的数据越少,即同步数据的次数越少,引起数据搬迁的开销越小,但是,由于N的取值越小,则每个阶段所包含的神经网络模型的层数越多,这会导致在每个阶段获取的特征图中M个分块之间的数据依赖增加,甚至是呈指数增加,进而增加每个阶段神经网络推理计算的计算量。因此,N的取值并非越小越好,需要确定一个合适的N值来实现数据搬迁开销和计算量之间的平衡。
由上可知,M和N的取值会影响到整个神经网络模型推理的性能,因此,如何确定N的值以及在每个阶段获取的特征图所划分的分块数量M的值是重要的。
针对一个确定的神经网络模型，为了求取适合该神经网络模型的M值（即M_1到M_N这N个值）和N值，本申请提出建立一个关于时间的代价函数J，该代价函数J用于指示ISP进行图像处理的时间和NPU执行整个神经网络模型推理计算的时间的总和，即端到端的总时间，也可以称作端到端总时延。
具体地，J=T1(M_1)+T2(M_i,N,L_i)+T3(M_i,N,L_i)，其中：
T1(M_1)用于指示ISP对1个原始图像块进行图像处理的时间，且T1(M_1)受原始图像所划分的原始图像块的数量M_1影响，M_1也是第1阶段获取的特征图（即输入图像）所包含的分块的数量。
例如，T1(M_1)=τ/M_1，其中，τ表示ISP处理整个原始图像的时间，这个时间可以通过计算或者按照ISP流水的时间来确定。
T2(M_i,N,L_i)用于指示NPU执行从第1阶段（Stage1）到第N阶段（StageN）这N个阶段的推理计算的时间的总和，其中，第i个阶段的推理计算时间是指第i个阶段获取的特征图所有分块执行该阶段神经网络模型所有层数的计算时间的和，1≤i≤N；且T2(M_i,N,L_i)受M_i、N和L_i影响，其中，M_i用于指示第i阶段获取的特征图所划分的分块的数量，L_i用于指示第i阶段中首层神经网络在整个神经网络模型中的层数，比如，L_2=3，L_3=6，则表明第2阶段中首层神经网络在整个神经网络模型中处于第3层，这意味着第1阶段包含2层神经网络，并且是整个神经网络模型的前两层网络；第3阶段中首层神经网络在整个神经网络模型中处于第6层，这意味着第1阶段和第2阶段共包含5层神经网络，由于第1阶段包含2层神经网络，则第2阶段包含3层神经网络。因此，由N和L_i即可确定N个阶段中每个阶段所包含的神经网络层数，即N个阶段神经网络层数的分布情况；
例如，
$$T2(M_i,N,L_i)=\sum_{i=1}^{N}\sum_{m=1}^{M_i}\sum_{l=L_i}^{L_{i+1}-1}\frac{Cycles(l,m,i)}{P}$$
其中，P表示NPU的功率，Cycles(l,m,i)表示第i阶段获取的特征图的第m个分块在整个神经网络模型第l层进行卷积运算所需要的理论周期（cycle）数，进一步，Cycles(l,m,i)由批处理大小（batch size）、第m个分块的大小、卷积核大小、NPU上计算单元的数量等确定，本申请对Cycles(l,m,i)具体计算方式不再进一步展开。
T3(M_i,N,L_i)用于指示NPU执行从第1阶段（Stage1）到第N阶段（StageN）这N个阶段的推理计算过程中在存储设备中进行数据搬运所产生的时间，且T3(M_i,N,L_i)也受M_i、N和L_i影响，进一步，T3由批处理大小（batch size）、第m个分块的大小、卷积核大小、NPU上计算单元的数量等确定，本申请对于T3的计算方式不再进一步展开；
上述关于时间的代价函数J是一个非线性函数，该J函数包含多个未知变量，比如N、M_i和L_i等，设置这些未知变量满足一定的约束条件，比如，1≤i≤N，1≤N≤神经网络模型中总的下采样次数，M_i的取值保证每个分块在每层推理的中间结果的大小≤L2 cache的大小等，因此，通过求解一组满足上述约束条件的变量(N,M_i,L_i)的值，使得该目标函数J的值最小，即端到端的时延最低，此时求得的该组变量(N,M_i,L_i)的值即是能够保证整个神经网络模型推理的性能最优的一组值。
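The paragraph above describes choosing (N, M_i, L_i) by minimizing J under constraints. As the patent does not give a concrete solver, the following Python sketch shows one straightforward (brute-force) way such a search could be organized; `cost_J` and `fits_in_l2_cache` are caller-supplied placeholders standing in for the T1+T2+T3 evaluation and the L2-cache constraint, and `max_stages` corresponds to the bound on N (e.g. the total number of downsampling steps).

```python
import itertools

def search_schedule(num_layers, max_stages, block_count_choices,
                    cost_J, fits_in_l2_cache):
    """Exhaustive search for (N, L_i, M_i) minimizing the end-to-end cost J.

    cost_J(stage_first_layers, block_counts) -> float   (T1 + T2 + T3)
    fits_in_l2_cache(stage_index, block_count) -> bool  (intermediate-result constraint)
    """
    best = None
    for n in range(1, max_stages + 1):                                   # candidate N
        for cut in itertools.combinations(range(2, num_layers + 1), n - 1):
            stage_first_layers = [1, *cut]                               # candidate L_1..L_N
            for block_counts in itertools.product(block_count_choices, repeat=n):  # candidate M_i
                if not all(fits_in_l2_cache(i, m) for i, m in enumerate(block_counts)):
                    continue
                j = cost_J(stage_first_layers, block_counts)
                if best is None or j < best[0]:
                    best = (j, n, stage_first_layers, block_counts)
    return best
```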
因此，针对一个确定的神经网络模型，基于上述非线性函数J求取一组合适的(N,M_i,L_i)的值，即确定了该神经网络模型总共划分的阶段数、每个阶段中获取的特征图所划分的分块的数量和每个阶段包含该神经网络模型的哪几层，然后ISP和NPU按照求取的该组值分别执行ISP处理和整个神经网络的推理过程。
例如，当确定M_1=10，在ISP按照异步方式处理中，ISP将原始图像划分成10个图像块，并且NPU在第1阶段获取的特征图也包含10个分块；在ISP按照同步方式处理中，ISP将原始图像划分的图像块的数量不受10的约束，NPU将第1阶段获取的特征图划分成10个分块；当确定M_2=15，则NPU将第2阶段获取的特征图划分成15个分块。
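Once a set of values (N, M_i, L_i) has been chosen as above, the number of layers each stage contains follows directly from the L_i values. The Python sketch below recovers the per-stage layer counts from the first-layer indices; the 10-layer total used in the example is an assumed value, while L_2 = 3 and L_3 = 6 come from the earlier illustration.

```python
def layers_per_stage(first_layer_index, total_layers):
    """first_layer_index[i] is L_{i+1}: the 1-based index of the first layer of stage i+1.
    The stage boundaries plus the total layer count give each stage's layer count."""
    boundaries = first_layer_index + [total_layers + 1]          # sentinel after the last layer
    return [boundaries[i + 1] - boundaries[i] for i in range(len(first_layer_index))]

# Example from the text: L_1 = 1, L_2 = 3, L_3 = 6, with an assumed 10-layer model (N = 3).
print(layers_per_stage([1, 3, 6], total_layers=10))   # [2, 3, 5] -> stage 1 has 2 layers, etc.
```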
由图3所示可知,NPU104包括计算引擎1041和L2 cache 1042,计算引擎1041中可以包括一个或者多个处理器核(图3以多个处理器核为示例),计算引擎1041用于执行神经网络模型的推理计算,而神经网络模型的推理计算的核心又是由处理器核来实现,处理器核负责执行标量、向量和张量相关的计算,处理器核也可以称作AI处理器核,简称AI core。
图6示例的给出了一个处理器核的实现架构,处理器核包括矩阵计算单元、向量计算单元、标量计算单元和累加器,矩阵计算单元和累加器主要完成与矩阵相关的运算;向量计算单元负责执行向量运算,例如向量乘、向量加、指数运算和对数运算等;标量计算单元主要负责各类型的标量数据运算和程序的流程控制。
进一步,为了配合处理器核中数据的传输和搬运,处理器核中还分布式地设置一系列的片上缓冲区和寄存器,比如输入缓冲区和输出缓冲区,标量计算单元周围配置寄存器,比如通用寄存器和专用寄存器。
本申请实施例提供一种在神经网络推理中实现数据同步的方法,具体包括:
S701、NPU获取第一特征图,该第一特征图包含M个分块,M为正整数;
S702、NPU利用异步方式对该M个分块分别执行神经网络模型中至少两层的推理计算 以得到M个推理结果;
其中,该异步方式是指对每个分块执行完该神经网络模型中一层的推理计算所得到的中间结果不进行数据同步,并且继续对该中间结果执行下一层的推理计算;
S703、NPU将该M个推理结果进行数据同步以得到同步后的数据。
进一步,在第一特征图是输入图像时,上述方法还包括:
S700、数字信号处理器ISP对摄像头采集的原始图像进行图像处理,并将图像处理结果作为该输入图像存储在存储器;其中,该存储器为该NPU的外部存储器;
具体地,ISP可以采用如下两种方式来对原始图像进行图像处理:
1)该ISP将该原始图像划分成M个原始图像块,该ISP对该M个原始图像块依次进行图像处理以得到M个图像块,然后该ISP将这M个图像块作为M个分块异步存入该存储器,其中,这M个图像块为图像处理结果。
2)该ISP将该原始图像按块进行图像处理以得到多个图像块,将该多个图像块进行数据同步以得到图像处理结果,并将该图像处理结果存储在该存储器。
对应地,S701中NPU获取该第一特征图包括:NPU从该存储器中获取该输入图像。
进一步,该NPU包括多个处理器核,该多个处理器核共享一个缓存,上述方法还包括:
S704、该NPU将该同步后的数据存储在存储器和该缓存中的一个,其中,该存储器为该NPU的外部存储器。
进一步,上述方法还包括:
S705、该NPU从该存储器或该缓存中取出该同步后的数据;
S706、该NPU将该同步后的数据作为第二特征图执行该神经网络模型中一层或多层的推理计算;其中,该神经网络模型中一层或多层是该神经网络模型中所述至少两层的后续层。
具体地,该NPU利用上述异步方式对该第二特征图中所包含的多个分块分别执行该神经网络模型中一层或多层的推理计算。
上述M的取值和第二特征图中所包含的多个分块的数量不同;第二特征图所包含的多个分块中相邻的两个分块之间存在数据依赖;该M个分块中相邻的两个分块之间存在数据依赖。
需要说明的是,上述数据同步的方法的具体实现可以参考前述装置实施例的相关实现,为描述方便,不再对数据同步的方法做进一步描述。
本申请实施例提供还一种在神经网络推理中实现数据同步的装置,具体包括:
获取模块801,用于获取第一特征图,该第一特征图包含M个分块,M为正整数;推理模块802,用于利用异步方式对该M个分块分别执行神经网络模型中至少两层的推理计算以得到M个推理结果;其中,该异步方式是指对每个分块执行完该神经网络模型中一层的推理计算所得到的中间结果不进行数据同步,并且继续对该中间结果执行下一层的推理计算;
同步模块803,用于将该M个推理结果进行数据同步以得到同步后的数据。
进一步,在第一特征图是输入图像时,实现数据同步的装置还包括:
图像处理模块800,用于对摄像头采集的原始图像进行图像处理,并将图像处理结果作为该输入图像。
进一步,该推理模块802,还用于:
将该同步后的数据作为第二特征图执行该神经网络模型中一层或多层的推理计算;其中,该神经网络模型中一层或多层是该神经网络模型中所述至少两层的后续层。具体地,利用上述异步方式对该第二特征图中所包含的多个分块分别执行该神经网络模型中一层或多层的推理计算。
需要说明的是,上述数据同步的装置的具体实现可以参考前述装置实施例的相关实现,为描述方便,不再对数据同步的装置做进一步描述。
需要说明的是,上述装置类示意图,如图3、图6和图8等,仅是本发明实施例提供的一种结构示意图,图中所示设备、器件、模块等之间的位置关系不构成任何限制。
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的系统、装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
在本申请所提供的几个实施例中,应该理解到,所揭露的系统、装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。
所述功能如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(Read-Only Memory,ROM)、随机存取存储器(Random Access Memory,RAM)、磁碟或者光盘等各种可以存储程序代码的介质。
以上所述,仅为本申请的具体实施方式,但本申请的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本申请揭露的技术范围内,可轻易想到变化或替换,都应涵盖在本申请的保护范围之内。因此,本申请的保护范围应以所述权利要求的保护范围为准。

Claims (25)

  1. 一种在神经网络推理中实现数据同步的装置,其特征在于,包括:
    存储器,用于存储第一特征图;
    神经网络处理器NPU,用于:
    从所述存储器中获取所述第一特征图,所述第一特征图包含M个分块,M为正整数;
    利用异步方式对所述M个分块分别执行神经网络模型中至少两层的推理计算以得到M个推理结果,所述异步方式是指对每个分块执行完所述神经网络模型中一层的推理计算所得到的中间结果不进行数据同步,并且继续对所述中间结果执行下一层的推理计算;
    将所述M个推理结果进行数据同步以得到同步后的数据。
  2. 根据权利要求1所述的装置,其特征在于,所述第一特征图是输入图像,所述装置还包括:
    数字图像信号处理器ISP,用于对摄像头采集的原始图像进行图像处理,并将图像处理结果作为所述输入图像存储在所述存储器;
    所述NPU从所述存储器中获取所述第一特征图时,具体用于:
    从所述存储器中获取所述输入图像。
  3. 根据权利要求2所述的装置,其特征在于,所述ISP对摄像头采集的原始图像进行图像处理,并将图像处理结果作为所述输入图像存储在所述存储器时,具体用于:
    将所述原始图像划分成M个原始图像块;
    对所述M个原始图像块依次进行图像处理以得到M个图像块;
    将所述M个图像块作为所述M个分块异步存入所述存储器,其中,所述M个图像块为所述图像处理结果。
  4. 根据权利要求2所述的装置,其特征在于,所述ISP对摄像头采集的原始图像进行图像处理,并将图像处理结果作为所述输入图像存储在所述存储器时,具体用于:
    将所述原始图像按块进行图像处理以得到多个图像块;
    将所述多个图像块进行数据同步以得到所述图像处理结果,并将所述图像处理结果存储在所述存储器。
  5. 根据权利要求4所述的装置,其特征在于,所述NPU还用于:
    将获取的所述第一特征图划分成所述M个分块。
  6. 根据权利要求1-5任一所述的装置,其特征在于,所述NPU包括多个处理器核,所述多个处理器核共享一个缓存,所述NPU还用于:
    将所述同步后的数据存储在所述存储器和所述缓存中的一个。
  7. 根据权利要求6所述的装置,其特征在于,所述NPU将所述同步后的数据存储在所述存储器和所述缓存中的一个,具体用于:
    将所述同步后的数据的大小和所述缓存的大小进行比较;
    在所述同步后的数据的大小大于所述缓存的大小时,将所述同步后的数据存储在所述存储器;
    在所述同步后的数据的大小不大于所述缓存的大小时,将所述同步后的数据存储在所述缓存。
  8. 根据权利要求6或7所述的装置,其特征在于,所述NPU还用于:
    从所述存储器或所述缓存中取出所述同步后的数据;
    将所述同步后的数据作为第二特征图执行所述神经网络模型中一层或多层的推理计算。
  9. 根据权利要求8所述的装置,其特征在于,所述NPU在将所述同步后的数据作为第二特征图执行所述神经网络中一层或多层的推理计算时,用于:
    利用所述异步方式对所述第二特征图中所包含的多个分块分别执行所述神经网络模型中一层或多层的推理计算。
  10. 根据权利要求1-9任一所述的装置,其特征在于,所述M个分块中相邻的两个分块之间存在数据依赖。
  11. 一种在神经网络推理中实现数据同步的装置,其特征在于,包括:
    包括至少一个处理器核的计算引擎,用于:
    获取第一特征图,所述第一特征图包含M个分块,M为正整数;
    利用异步方式对所述M个分块分别执行神经网络模型中至少两层的推理计算以得到M个推理结果,所述异步方式是指对每个分块执行完所述神经网络模型中一层的推理计算所得到的中间结果不进行数据同步,并且继续对所述中间结果执行下一层的推理计算;
    将所述M个推理结果进行数据同步以得到同步后的数据。
  12. 根据权利要求11所述的装置,其特征在于,所述装置还包括缓存,所述至少一个处理器核共享所述缓存,所述计算引擎还用于:
    将所述同步后的数据存储在所述缓存。
  13. 根据权利要求12所述的装置,其特征在于,所述计算引擎具体用于:
    在所述同步后的数据的大小不大于所述缓存的大小时,将所述同步后的数据存储在所述缓存。
  14. 根据权利要求12或13所述的装置,其特征在于,所述NPU还用于:
    从所述缓存中取出所述同步后的数据;
    将所述同步后的数据作为第二特征图执行所述神经网络模型中一层或多层的推理计算。
  15. 根据权利要求11-14任一所述的装置,其特征在于,所述M个分块中相邻的两个分块之间存在数据依赖。
  16. 一种在神经网络推理中实现数据同步的方法,其特征在于,包括:
    神经网络处理器NPU获取第一特征图,所述第一特征图包含M个分块,M为正整数;
    所述NPU利用异步方式对所述M个分块分别执行神经网络模型中至少两层的推理计算以得到M个推理结果,所述异步方式是指对每个分块执行完所述神经网络模型中一层的推理计算所得到的中间结果不进行数据同步,并且继续对所述中间结果执行下一层的推理计算;
    所述NPU将所述M个推理结果进行数据同步以得到同步后的数据。
  17. 根据权利要求16所述的方法,其特征在于,所述第一特征图是输入图像,在所述NPU获取第一特征图之前,所述方法还包括:
    数字信号处理器ISP对摄像头采集的原始图像进行图像处理,并将图像处理结果作为所述输入图像存储在存储器,所述存储器为所述NPU的外部存储器;
    所述NPU获取第一特征图,包括:
    所述NPU从所述存储器中获取所述输入图像。
  18. 根据权利要求17所述的方法,其特征在于,所述ISP对摄像头采集的原始图像进行图像处理,并将图像处理结果作为所述输入图像存储在存储器,包括:
    所述ISP将所述原始图像划分成M个原始图像块;
    所述ISP对所述M个原始图像块依次进行图像处理以得到M个图像块;
    所述ISP将所述M个图像块作为所述M个分块异步存入所述存储器,其中,所述M个图像块为所述图像处理结果。
  19. 根据权利要求17所述的方法,其特征在于,所述ISP对摄像头采集的原始图像进行图像处理,并将图像处理结果作为所述输入图像存储在存储器,包括:
    所述ISP将所述原始图像按块进行图像处理以得到多个图像块;
    将所述多个图像块进行数据同步以得到所述图像处理结果,并将所述图像处理结果存储在所述存储器。
  20. 根据权利要求16-19任一所述的方法,其特征在于,所述NPU包括多个处理器核,所述多个处理器核共享一个缓存,所述方法还包括:
    所述NPU将所述同步后的数据存储在存储器和所述缓存中的一个,所述存储器为所述NPU的外部存储器。
  21. 根据权利要求20所述的方法,其特征在于,还包括:
    所述NPU从所述存储器或所述缓存中取出所述同步后的数据;
    所述NPU将所述同步后的数据作为第二特征图执行所述神经网络模型中一层或多层的推理计算。
  22. 根据权利要求16-21任一所述的方法,其特征在于,所述M个分块中相邻的两个分块之间存在数据依赖。
  23. 一种计算机可读介质,该计算机可读介质存储用于设备执行的程序代码,该程序代码包括用于执行权利要求16-22任一所述的方法。
  24. 一种包含指令的计算机程序产品,当该计算机程序产品在计算机上运行时,使得计算机执行权利要求16-22任一所述的方法。
  25. 一种芯片,该芯片包括处理器与数据接口,该处理器通过该数据接口读取指令,执行权利要求16-22任一所述的方法。
PCT/CN2019/130638 2019-12-31 2019-12-31 在神经网络推理中实现数据同步的装置和方法 WO2021134519A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN201980051147.4A CN113169989A (zh) 2019-12-31 2019-12-31 在神经网络推理中实现数据同步的装置和方法
PCT/CN2019/130638 WO2021134519A1 (zh) 2019-12-31 2019-12-31 在神经网络推理中实现数据同步的装置和方法
EP19958452.5A EP4075343A4 (en) 2019-12-31 2019-12-31 DEVICE AND METHOD FOR SYNCHRONIZING DATA IN A LEAD OF A NEURAL NETWORK

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/130638 WO2021134519A1 (zh) 2019-12-31 2019-12-31 在神经网络推理中实现数据同步的装置和方法

Publications (1)

Publication Number Publication Date
WO2021134519A1 true WO2021134519A1 (zh) 2021-07-08

Family

ID=76686046

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/130638 WO2021134519A1 (zh) 2019-12-31 2019-12-31 在神经网络推理中实现数据同步的装置和方法

Country Status (3)

Country Link
EP (1) EP4075343A4 (zh)
CN (1) CN113169989A (zh)
WO (1) WO2021134519A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113658602A (zh) * 2021-08-16 2021-11-16 广州大彩光电科技有限公司 一种实时混音方法及装置

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10108850B1 (en) * 2017-04-24 2018-10-23 Intel Corporation Recognition, reidentification and security enhancements using autonomous machines
CN109376763A (zh) * 2018-09-13 2019-02-22 山东师范大学 基于多样本推理神经网络的样本分类方法、系统及介质
CN110163370B (zh) * 2019-05-24 2021-09-17 上海肇观电子科技有限公司 深度神经网络的压缩方法、芯片、电子设备及介质

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107704924A (zh) * 2016-07-27 2018-02-16 中国科学院自动化研究所 同步自适应时空特征表达学习模型的构建方法及相关方法
CN106875012A (zh) * 2017-02-09 2017-06-20 武汉魅瞳科技有限公司 一种基于fpga的深度卷积神经网络的流水化加速系统
US20190012559A1 (en) * 2017-07-06 2019-01-10 Texas Instruments Incorporated Dynamic quantization for deep neural network inference system and method
CN109543754A (zh) * 2018-11-23 2019-03-29 中山大学 基于端对端深度学习的目标检测与语义分割的并行方法

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4075343A4 *

Also Published As

Publication number Publication date
CN113169989A (zh) 2021-07-23
EP4075343A1 (en) 2022-10-19
EP4075343A4 (en) 2023-01-25

Similar Documents

Publication Publication Date Title
WO2021120719A1 (zh) 神经网络模型更新方法、图像处理方法及装置
WO2021018163A1 (zh) 神经网络的搜索方法及装置
CN112651511B (zh) 一种训练模型的方法、数据处理的方法以及装置
WO2022001805A1 (zh) 一种神经网络蒸馏方法及装置
CN109993707B (zh) 图像去噪方法和装置
WO2021022521A1 (zh) 数据处理的方法、训练神经网络模型的方法及设备
US20220215227A1 (en) Neural Architecture Search Method, Image Processing Method And Apparatus, And Storage Medium
WO2022052601A1 (zh) 神经网络模型的训练方法、图像处理方法及装置
CN111291809B (zh) 一种处理装置、方法及存储介质
CN111914997B (zh) 训练神经网络的方法、图像处理方法及装置
WO2021244249A1 (zh) 一种分类器的训练方法、数据处理方法、系统以及设备
WO2021051987A1 (zh) 神经网络模型训练的方法和装置
CN110222718B (zh) 图像处理的方法及装置
WO2021018251A1 (zh) 图像分类方法及装置
CN112561028A (zh) 训练神经网络模型的方法、数据处理的方法及装置
US20230281973A1 (en) Neural network model training method, image processing method, and apparatus
WO2022267036A1 (zh) 神经网络模型训练方法和装置、数据处理方法和装置
CN113011562A (zh) 一种模型训练方法及装置
CN110705564B (zh) 图像识别的方法和装置
WO2021134519A1 (zh) 在神经网络推理中实现数据同步的装置和方法
WO2023122896A1 (zh) 一种数据处理方法和装置
WO2022227024A1 (zh) 神经网络模型的运算方法、训练方法及装置
WO2021238734A1 (zh) 一种神经网络的训练方法及相关设备
CN115169548A (zh) 基于张量的持续学习方法和装置
WO2023272431A1 (zh) 图像处理方法及装置

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19958452

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2019958452

Country of ref document: EP

Effective date: 20220715

NENP Non-entry into the national phase

Ref country code: DE