WO2020062299A1 - Neural network processor, data processing method and related device - Google Patents

Neural network processor, data processing method and related device

Info

Publication number
WO2020062299A1
WO2020062299A1 (PCT/CN2018/109208)
Authority
WO
WIPO (PCT)
Prior art keywords
cores
matrix
input
core
batch
Prior art date
Application number
PCT/CN2018/109208
Other languages
English (en)
Chinese (zh)
Inventor
顾雄礼
李艳华
张惠敏
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 filed Critical 华为技术有限公司
Priority to CN201880098253.3A priority Critical patent/CN112789627B/zh
Priority to PCT/CN2018/109208 priority patent/WO2020062299A1/fr
Publication of WO2020062299A1 publication Critical patent/WO2020062299A1/fr

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present application relates to the field of data computing technology in the field of artificial intelligence, and in particular, to a neural network processor, a data processing method, and related equipment.
  • CNN Convolutional neural network
  • BN Batch normalization
  • BN, also called the gradient normalization processing method, is performed by adding a normalization layer after a particular convolutional layer, or after each convolutional layer, in the CNN. The normalized output then enters the next layer of the network as that layer's input; that is, the BN algorithm standardizes the mean and variance of the input matrices of these layers, which addresses the following problem:
  • normalization processing (including calculation of the mean and the variance) needs to be performed before the input of some or all layers in the CNN, and the deeper the CNN network model, the more times this normalization is performed.
  • Embodiments of the present invention provide a neural network processor, a data processing method, and related equipment to improve the training speed of a neural network.
  • An embodiment of the present invention provides a neural network processor, which may include: n computing cores (Cores), an atomic operation accumulation unit, and an on-chip shared cache, where the n Cores and the on-chip shared cache are each coupled to the atomic operation accumulation unit, and n is an integer greater than 1. Each of the n Cores is configured to: calculate an intra-core mean μ of its input matrix, where the input matrix includes m training samples x, the intra-core mean μ is the mean of the m training samples, and m is an integer greater than or equal to 1, and write μ to the atomic operation accumulation unit; and calculate the mean v of the m values x² according to the input matrix, and write v to the atomic operation accumulation unit. The atomic operation accumulation unit is configured to accumulate the n values of μ written by the n Cores to obtain S1 and write S1 to the on-chip shared cache, and to accumulate the n values of v written by the n Cores to obtain S2 and write S2 to the on-chip shared cache.
  • When the neural network processor calculates the global variance of the batch, it no longer needs to first compute the global mean (each computing core calculating its intra-core mean, followed by a global summation and averaging) before computing the global variance (each computing core calculating its intra-core variance, followed by another global summation and averaging). Instead, each computing core calculates the intra-core mean μ of its training samples x and the intra-core mean v of x², and sends both to the atomic operation accumulation unit, which accumulates them and stores the accumulation results in the on-chip shared cache.
  • The n computing cores then obtain the accumulation results of the n values of μ and the n values of v from the on-chip shared cache in a single access and, based on these accumulation results, calculate the global mean and the global variance.
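  • As an illustration only (not part of the claimed hardware), the following Python/NumPy sketch mimics the computation just described: each of n simulated cores reduces its mini-batch to μ and v, the partial results are accumulated into S1 and S2, and the global mean and global variance are then recovered from S1 and S2 alone; the sizes and variable names are assumptions chosen for the example.

```python
import numpy as np

def per_core_stats(x):
    """One simulated computing core: reduce its mini-batch to (mu, v)."""
    mu = x.mean()            # intra-core mean of the m samples
    v = (x ** 2).mean()      # intra-core mean of x^2
    return mu, v

rng = np.random.default_rng(0)
n, m = 4, 8                                   # n cores, m samples per mini-batch
batch = rng.normal(size=(n, m))               # whole batch split into n equal mini-batches

stats = [per_core_stats(b) for b in batch]    # each core computes (mu, v) locally
S1 = sum(mu for mu, _ in stats)               # accumulated sum of the n intra-core means
S2 = sum(v for _, v in stats)                 # accumulated sum of the n means of x^2

global_mean = S1 / n
global_var = S2 / n - (S1 / n) ** 2           # E[x^2] - (E[x])^2

# Check against a direct computation over the whole batch.
assert np.isclose(global_mean, batch.mean())
assert np.isclose(global_var, batch.var())
```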
  • By contrast, in the previous approach each computing core in the neural network processor needs to obtain the global mean (that is, the intra-core means of all computing cores are summed and then averaged) before the global variance can be calculated.
  • all computing cores synchronize the global average from the atomic operation accumulation unit (or one of the n computing cores), which takes a certain amount of time, and the more the computing cores are synchronized, the longer the time may be.
  • Only then can each computing core calculate its intra-core variance, which in turn delays the time at which the atomic operation accumulation unit can accumulate the global variance. Therefore, in the embodiment of the present invention, the n computing cores need to be synchronized only once during the Batch Normalization calculation (that is, the quantities needed to compute both the global mean and the global variance are obtained in a single access), which eliminates one synchronization pass, greatly reduces the synchronization overhead and time, and increases the training speed of the entire neural network.
  • Each of the n Cores is further configured to obtain S1 and S2 from the on-chip shared cache and to calculate, according to S1 and S2, the global variance of the n input matrices of the n Cores.
  • Calculating the global variance of the n input matrices includes: obtaining S1 and S2 from the on-chip shared cache, and calculating the global variance of the n input matrices of the n Cores according to the calculation formula σ² = S2/n − (S1/n)².
  • That is, the global variance of the n input matrices of the n Cores is calculated from S1 and S2 according to the formula for σ², and the global mean (S1/n) can be obtained at the same time.
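  • For reference, the identity behind this formula follows directly from the definitions of S1 and S2 (a standard variance decomposition; equal-sized mini-batches and the symbol μ′ for the global mean are assumptions introduced here only for clarity):

```latex
% S1 and S2 as accumulated by the atomic operation accumulation unit
% (equal-sized mini-batches of m samples are assumed):
S_1 = \sum_{j=1}^{n} \mu_j, \qquad
S_2 = \sum_{j=1}^{n} v_j, \qquad
\mu_j = \frac{1}{m}\sum_{i=1}^{m} x_i^{(j)}, \qquad
v_j = \frac{1}{m}\sum_{i=1}^{m} \bigl(x_i^{(j)}\bigr)^{2}

% Global mean and global variance recovered from S1 and S2 alone:
\mu' = \frac{S_1}{n}, \qquad
\sigma^{2} = \frac{1}{nm}\sum_{j=1}^{n}\sum_{i=1}^{m}\bigl(x_i^{(j)} - \mu'\bigr)^{2}
           = \frac{S_2}{n} - \left(\frac{S_1}{n}\right)^{2}
```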
  • Each of the n Cores is further configured to normalize the m training samples x in its input matrix according to the formula x̂_i = (x_i − μ′)/√(σ² + ε), where μ′ is the global mean of the n input matrices of the n Cores, σ² is the global variance of the n input matrices of the n Cores, x̂_i is the normalized result of the i-th training sample x_i among the m training samples, and ε is a value greater than 0.
  • each of the n Cores may perform normalization processing of Batch Normalization according to the global mean and global variance calculated by itself.
  • Each of the n Cores is further configured to perform scaling and translation processing on the m normalized training samples according to the formula y_i = γ·x̂_i + β, where x̂_i is the normalized result of the i-th training sample x_i among the m training samples, y_i is the output result obtained by Batch Normalization of x_i, γ is the scaling parameter, and β is the translation parameter.
  • each of the n Cores can perform batch normalization translation and scaling processing according to the global mean and global variance calculated by itself.
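  • A minimal NumPy sketch of the normalization and the scaling/translation steps described above is given below; the values of the global mean, global variance, γ, β and ε are illustrative placeholders (in training, γ and β would be learned):

```python
import numpy as np

def batch_norm_apply(x, global_mean, global_var, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one core's m samples with the global statistics, then scale and translate."""
    x_hat = (x - global_mean) / np.sqrt(global_var + eps)   # normalization with global mean/variance
    return gamma * x_hat + beta                              # scaling and translation

x = np.array([0.5, -1.2, 3.0, 0.1])             # one core's m training samples (toy values)
y = batch_norm_apply(x, global_mean=0.6, global_var=2.4)
```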
  • each of the n Cores is used to obtain a feature map matrix and a weight matrix, and calculate the input matrix according to the feature map matrix and the weight matrix.
  • That is, the input matrix of the Batch Normalization layer is obtained by operating on the feature map matrix and the weight matrix of the layer preceding the Batch Normalization layer, and the result of the Batch Normalization process is then used as the input of the next layer.
  • An embodiment of the present invention provides a data processing method, which may include performing the following processing for each of n input matrices: calculating an intra-core mean μ of the input matrix, where the input matrix includes m training samples x, the intra-core mean μ is the mean of the m training samples, and m is an integer greater than or equal to 1, and writing μ to an atomic operation accumulation unit; calculating the mean v of the m values x² according to the input matrix and writing v to the atomic operation accumulation unit; performing the following processing on the n values of μ and the n values of v calculated from the n input matrices: accumulating the n values of μ to obtain S1, and accumulating the n values of v to obtain S2; and calculating the global variance of the n input matrices according to S1 and S2.
  • Calculating the global variance of the n input matrices according to S1 and S2 includes: calculating the global variance of the n input matrices according to the calculation formula σ² = S2/n − (S1/n)².
  • The method further includes: normalizing the m training samples x in the input matrix according to the formula x̂_i = (x_i − μ′)/√(σ² + ε), where μ′ is the global mean of the n input matrices, σ² is the global variance of the n input matrices, x̂_i is the normalized result of the i-th training sample x_i among the m training samples, and ε is a value greater than 0.
  • The method further includes: performing scaling and translation processing on the m normalized training samples according to the formula y_i = γ·x̂_i + β, where x̂_i is the normalized result of the i-th training sample x_i among the m training samples, y_i is the output result obtained by Batch Normalization of x_i, γ is the scaling parameter, and β is the translation parameter.
  • The method further includes: obtaining a feature map matrix and a weight matrix, and calculating any one of the n input matrices according to the feature map matrix and the weight matrix.
  • The present application provides a computing accelerator, where the computing accelerator is a computing core (Core) in the neural network processor according to any implementation of the first aspect, and the computing accelerator is configured to perform the functions performed by any one of the n computing cores in that neural network processor.
  • The present application provides a computer storage medium that stores a computer program which, when executed by a processor, implements the data processing method flow described in any implementation of the second aspect.
  • an embodiment of the present invention provides a computer program.
  • the computer program includes instructions.
  • When executed, the computer program performs the data processing method flow described in any implementation of the second aspect.
  • The present application provides a chip system including a processor, which is configured to implement the functions involved in the data processing method flow according to any implementation of the second aspect.
  • the chip system further includes a memory, and the memory is configured to store program instructions and data necessary for data processing.
  • the chip system can be composed of chips, and can also include chips and other discrete devices.
  • FIG. 1 is a schematic diagram of a convolutional neural network according to an embodiment of the present invention
  • FIG. 2 is a schematic diagram of another convolutional neural network according to an embodiment of the present invention.
  • FIG. 3 is a batch normalization forward flowchart provided by an embodiment of the present invention.
  • FIG. 4 is a schematic diagram of a hardware structure of a neural network processor according to an embodiment of the present invention.
  • FIG. 5 is a schematic diagram of a refined hardware structure of an NPU according to an embodiment of the present invention.
  • FIG. 6 is another batch normalization forward flowchart provided by an embodiment of the present invention.
  • FIG. 7 is a schematic flowchart of a data processing method according to an embodiment of the present invention.
  • an embodiment herein means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the present application.
  • the appearances of this phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are they independent or alternative embodiments that are mutually exclusive with other embodiments. It is explicitly and implicitly understood by those skilled in the art that the embodiments described herein may be combined with other embodiments.
  • a component may be, but is not limited to, a process running on a processor, a processor, an object, an executable, a thread of execution, a program, and / or a computer.
  • an application running on a computing device and a computing device can be components.
  • One or more components can reside within a process and / or thread of execution, and a component can be localized on one computer and / or distributed between 2 or more computers.
  • these components can execute from various computer readable media having various data structures stored thereon.
  • A component may, for example, communicate via local and/or remote processes based on a signal having one or more data packets (e.g., data from two components interacting with another component in a local system, in a distributed system, and/or across a network such as the Internet that interacts with other systems through signals).
  • AI Artificial Intelligence
  • Artificial intelligence (AI) is a theory, method, technique, and application system that uses digital computers or machines controlled by digital computers to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
  • artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can respond in a similar way to human intelligence.
  • Artificial intelligence is to study the design principles and implementation methods of various intelligent machines, so that the machines have functions of perception, reasoning and decision-making.
  • Research in the field of artificial intelligence includes robotics, natural language processing, computer vision, decision-making and reasoning, human-computer interaction, recommendation and search, and basic theories of AI.
  • Convolutional neural network is a multi-layer neural network. Each layer is composed of multiple two-dimensional planes, and each plane is composed of multiple independent neurons. The neurons share weights, and the number of parameters in the neural network can be reduced by weight sharing.
  • a processor performing a convolution operation usually converts a convolution of an input signal feature and a weight into a matrix multiplication operation between a signal matrix and a weight matrix.
  • The signal matrix and the weight matrix are divided into blocks to obtain multiple fractal signal matrices and fractal weight matrices, and matrix multiplication and accumulation are then performed on the multiple fractal signal matrices and fractal weight matrices.
  • the convolution kernel can be initialized in the form of a matrix of random size. During the training process of the convolutional neural network, the convolution kernel can obtain reasonable weights through learning. In addition, the direct benefit of sharing weights is to reduce the connections between the layers of the convolutional neural network, while reducing the risk of overfitting.
  • the convolutional neural network can use the backpropagation (BP) algorithm to modify the parameters of the initial super-resolution model during the training process, so that the reconstruction error loss of the super-resolution model is getting smaller and smaller.
  • BP backpropagation
  • An error loss arises as the input signal is transmitted forward until the output, and the parameters in the initial super-resolution model are updated by back-propagating the error loss information, so that the error loss converges.
  • The back-propagation algorithm is a backward pass driven by the error loss, and aims to obtain the optimal parameters of the super-resolution model, such as the weight matrix.
  • Convolution is the operation of the convolution kernel and the image matrix (the input matrix of the convolution layer).
  • the input matrix is a matrix extracted from the image matrix according to the stride of the convolution kernel during the convolution.
  • the convolution kernel is a small window that records the weights.
  • The convolution kernel slides over the image matrix in steps. At each position, the convolution kernel corresponds to a sub-matrix of the image matrix; the weights in the convolution kernel are multiplied by the values contained in that sub-matrix and summed, which gives the element of the output feature map (output matrix) to which the convolution kernel currently corresponds.
  • The distance the convolution kernel moves in one step along the height of the image matrix is the height sliding stride of the convolution kernel, and the distance it moves in one step along the width of the image matrix is the width sliding stride of the convolution kernel.
  • The sliding step of the convolution kernel is represented by the parameter stride.
  • the input matrix is extracted from the image matrix (that is, the input data) according to the stride of the convolution kernel during the convolution.
  • stride = [s1, s2], where s1 represents the height sliding stride of the convolution kernel and s2 represents the width sliding stride of the convolution kernel.
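  • For reference, under the common convention of no padding (an assumption for this illustration, not something specified above), a kernel of height k1 and width k2 sliding over an H×W image with stride [s1, s2] produces an output feature map of size:

```latex
H_{\text{out}} = \left\lfloor \frac{H - k_1}{s_1} \right\rfloor + 1, \qquad
W_{\text{out}} = \left\lfloor \frac{W - k_2}{s_2} \right\rfloor + 1
```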
  • Convolution operation is the most important operator in convolutional neural networks.
  • X represents the input feature map (input matrix of the convolutional layer)
  • X ' represents the matrix obtained by processing the X with im2col operation
  • W represents the weight matrix
  • b represents the offset
  • Y0 represents the result of the product of the matrix X' and W
  • Y represents the output feature map (the output matrix of the convolution layer).
  • The offset b is added to Y0 = X'·W, and the activation value of each element of the result is then calculated to obtain the final output feature map Y.
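  • The following NumPy sketch illustrates this im2col-plus-matrix-multiplication formulation for a single input channel; the kernel size, stride, absence of padding, and the use of ReLU as the activation are assumptions made for the example rather than requirements of this description:

```python
import numpy as np

def im2col(x, k, stride):
    """Unfold k x k patches of a 2-D input into the rows of X' (single channel, no padding)."""
    H, W = x.shape
    h_out = (H - k) // stride + 1
    w_out = (W - k) // stride + 1
    cols = [x[i*stride:i*stride+k, j*stride:j*stride+k].ravel()
            for i in range(h_out) for j in range(w_out)]
    return np.array(cols), (h_out, w_out)           # X' has one row per output element

x = np.arange(25, dtype=float).reshape(5, 5)        # toy 5x5 input feature map X
w = np.ones((3, 3)) / 9.0                           # toy 3x3 convolution kernel (weights W)
b = 0.1                                             # offset b

x_cols, out_shape = im2col(x, k=3, stride=1)
y0 = x_cols @ w.ravel()                             # Y0 = X' * W as a matrix multiplication
y = np.maximum(y0 + b, 0).reshape(out_shape)        # add offset, apply ReLU, reshape into Y
```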
  • CNN is a deep neural network with a convolution structure and is a deep learning architecture. A deep learning architecture refers to learning at multiple levels, at different levels of abstraction, using machine learning algorithms. As a deep learning architecture, CNN is a feed-forward artificial neural network in which each neuron responds to overlapping regions of the image input to it.
  • FIG. 1 is a schematic diagram of a convolutional neural network provided by an embodiment of the present invention.
  • A convolutional neural network (CNN) 100 may include an input layer 110, a convolutional layer/pooling layer 120, and a neural network layer 130, where the pooling layer is optional.
  • The convolutional layer/pooling layer 120 may include, for example, layers 121 to 126. In one example, layer 121 is a convolution layer, layer 122 is a pooling layer, layer 123 is a convolution layer, layer 124 is a pooling layer, layer 125 is a convolution layer, and layer 126 is a pooling layer. In another example, layers 121 and 122 are convolution layers, layer 123 is a pooling layer, layers 124 and 125 are convolution layers, and layer 126 is a pooling layer. That is, the output of a convolution layer can be used as the input of a subsequent pooling layer, or as the input of another convolution layer to continue the convolution operation.
  • the convolution layer 121 may include many convolution operators.
  • the convolution operator is also called a kernel. Its function in image processing is equivalent to a filter that extracts specific information from the input image matrix.
  • In practice, the convolution operator can be a weight matrix, which is usually defined in advance. In the process of performing the convolution operation on an image, the weight matrix is usually moved along the horizontal direction of the input image one pixel at a time (or two pixels at a time, and so on, depending on the value of the stride) to extract a specific feature from the image.
  • the size of the weight matrix should be related to the size of the image. It should be noted that the depth dimension of the weight matrix and the depth dimension of the input image are the same.
  • The weight matrix extends across the entire depth of the input image. Therefore, convolution with a single weight matrix produces a convolution output with a single depth dimension; in most cases, however, a single weight matrix is not used, but rather multiple weight matrices of the same dimensions are applied, and the output of each weight matrix is stacked to form the depth dimension of the convolved image. Different weight matrices can be used to extract different features from the image: for example, one weight matrix is used to extract image edge information, another weight matrix is used to extract a specific color of the image, and yet another weight matrix is used to blur unwanted noise in the image, and so on.
  • Since the multiple weight matrices have the same dimensions, the feature maps extracted by them also have the same dimensions, and the extracted feature maps of the same dimensions are then combined to form the output of the convolution operation.
  • weight values in these weight matrices need to be obtained through a large amount of training in practical applications.
  • Each weight matrix formed by the weight values obtained through training can extract information from the input image, thereby helping the convolutional neural network 100 to make correct predictions.
  • In a deep convolutional neural network, the initial convolutional layers (such as 121) often extract more general features, which may also be referred to as low-level features; as the depth of the convolutional neural network increases, the features extracted by the later convolutional layers (such as 126) become more and more complex, for example, features with high-level semantics.
  • The structure may be a single convolution layer followed by a pooling layer, or multiple convolution layers followed by one or more pooling layers.
  • the sole purpose of the pooling layer is to reduce the spatial size of the image.
  • the pooling layer may include an average pooling operator and / or a maximum pooling operator for sampling the input image to obtain a smaller-sized image.
  • the average pooling operator can calculate the pixel values in the image within a specific range to produce an average value.
  • the maximum pooling operator can take the pixel with the largest value in the range in a specific range as the result of the maximum pooling.
  • the operators in the pooling layer should also be related to the size of the image.
  • the size of the output image processed by the pooling layer may be smaller than the size of the image of the input pooling layer.
  • Each pixel in the image output by the pooling layer represents the average or maximum value of the corresponding subregion of the image of the input pooling layer.
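  • A small sketch of the average and maximum pooling operators described above (a 2x2 window with stride 2 is assumed purely for illustration):

```python
import numpy as np

def pool2d(x, k=2, stride=2, mode="max"):
    """Reduce each k x k window of a 2-D input to its maximum or average value."""
    H, W = x.shape
    h_out = (H - k) // stride + 1
    w_out = (W - k) // stride + 1
    out = np.empty((h_out, w_out))
    for i in range(h_out):
        for j in range(w_out):
            window = x[i*stride:i*stride+k, j*stride:j*stride+k]
            out[i, j] = window.max() if mode == "max" else window.mean()
    return out

img = np.arange(16, dtype=float).reshape(4, 4)
print(pool2d(img, mode="max"))   # 2x2 output, each value the maximum of a 2x2 region
print(pool2d(img, mode="avg"))   # 2x2 output, each value the average of a 2x2 region
```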
  • After processing by the convolutional layer/pooling layer 120, the convolutional neural network 100 is not yet able to output the required output information, because, as described above, the convolutional layer/pooling layer 120 only extracts features and reduces the number of parameters brought by the input image. To generate the final output information (the required class information or other related information), the convolutional neural network 100 uses the neural network layer 130 to generate the output of one or a set of required classes. Therefore, the neural network layer 130 may include multiple hidden layers (such as 131, 132 to 13n shown in FIG. 1) and an output layer 140. The parameters included in the multiple hidden layers may be obtained by pre-training based on relevant training data for the specific task type; for example, the task type can include image recognition, image classification, image super-resolution reconstruction, and so on.
  • After the multiple hidden layers in the neural network layer 130, that is, as the last layer of the entire convolutional neural network 100, there is an output layer 140, which has a loss function similar to categorical cross entropy and is specifically used to calculate the prediction error.
  • Once the forward propagation of the entire convolutional neural network 100 (the propagation from 110 to 140 in FIG. 1) is completed, back propagation (the propagation from 140 to 110 in FIG. 1) begins to update the weight values and biases of the aforementioned layers, so as to reduce the loss of the convolutional neural network 100 and the error between the result output by the convolutional neural network 100 through the output layer and the ideal result.
  • FIG. 1 is only used as an example of a convolutional neural network.
  • In a specific application, the convolutional neural network may also take the form of another network model, for example, as shown in FIG. 2; FIG. 2 is a schematic diagram of another convolutional neural network according to an embodiment of the present invention.
  • In FIG. 2, a plurality of convolutional layers/pooling layers are arranged in parallel, and the features extracted by each of them are all input to the neural network layer 130 for processing.
  • The normalization layer in this application can in principle be placed after any one of the above layers of the CNN, or before any one of the layers, with the output feature matrix of the previous layer as its input; its output can in turn serve as the input of any functional layer in the CNN.
  • the normalization layer is generally performed after the convolution layer, and the feature matrix output by the previous convolution layer is used as the input matrix.
  • A batch is a part (or all) of the entire training set of the neural network. The batch is further divided into multiple mini-batches, and each mini-batch is a subset of the training set corresponding to the batch. Batch Normalization calculates the mean and variance of the subset corresponding to each mini-batch and performs a normalization operation on each subset based on the global mean and global variance of the subsets corresponding to all the mini-batches, so as to obtain, as output, the normalized subset corresponding to each mini-batch.
  • the training set corresponding to the batch has Z training samples.
  • the batch is divided into n mini-batches. Each mini-batch is a subset of the training set corresponding to the batch. Each mini-batch has m training samples.
  • Use B to represent any of the mini-batch.
  • the following shows the forward calculation method of Batch Normalization:
  • Input: mini-batch B = {x_1, ..., x_m}, where B contains the m training samples x_1, ..., x_m. It should be noted that this input is the input of the normalization layer, or, equivalently, the output of the previous layer that is to be normalized.
  • μ_B represents the mean of the mini-batch B input to the normalization layer: the m training samples x_1, ..., x_m in mini-batch B are summed and divided by m to obtain the mini-batch mean μ_B = (1/m)(x_1 + ... + x_m), where x_i is the i-th training sample among the m samples of the mini-batch.
  • y i is the output result obtained after the i-th training sample x i in mini-batch B is processed by BN.
  • The scaling parameter γ and the translation parameter β are two parameters introduced to address the loss of the network's expressive ability caused by normalization (because the normalized values are essentially confined to a standard normal distribution).
  • γ and β are learned by the network itself during training, which facilitates the adaptive learning of the CNN.
  • It should be noted that the μ_B used for normalization in the above formula (3) is the global mean corresponding to the n mini-batches (that is, the intra-core means of the n mini-batches are accumulated and then averaged), and the σ_B² used is the global variance corresponding to the n mini-batches (that is, the intra-core variances of the n mini-batches are accumulated and then averaged).
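  • For reference, the standard Batch Normalization forward equations are reproduced below; the numbering (1) to (4) is assumed to correspond to the formulas referred to above, and, as just noted, in the multi-mini-batch setting the μ_B and σ_B² used in (3) are replaced by the global mean μ_B′ and the global variance over all n mini-batches:

```latex
\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i \quad (1) \qquad\qquad
\sigma_B^{2} = \frac{1}{m}\sum_{i=1}^{m}\bigl(x_i - \mu_B\bigr)^{2} \quad (2)

\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^{2} + \epsilon}} \quad (3) \qquad\qquad
y_i = \gamma\,\hat{x}_i + \beta \;\equiv\; \mathrm{BN}_{\gamma,\beta}(x_i) \quad (4)
```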
  • FIG. 3 is a forward flowchart of Batch Normalization provided by an embodiment of the present invention. The specific calculation process is as follows:
  • X_j is a matrix, and x_1, ..., x_m can be the m column vectors of that matrix; that is, the i-th column vector of the matrix X_j is the training sample x_i of the corresponding mini-batch.
  • Core j calculates the core average on the mini-batch corresponding to Core j
  • the value of j is 1, ... n;
  • the global average of the batch is calculated by a target core among the n cores, that is, other cores (cores other than the target core) in the n cores need to send the calculated average ⁇ B to the target core.
  • The target Core finally calculates the global mean μ_B′, and then each Core either reads the global mean μ_B′ from the target Core, or the target Core broadcasts it to each Core. It can be understood that the target Core can also store the global mean μ_B′ in the relevant calculation unit, so as to facilitate subsequent calculation of the global output result.
  • Core j then calculates the intra-core variance of its mini-batch, where j = 1, 2, ..., n.
  • The above implementation process requires two synchronizations during the Batch Normalization process: in the first synchronization, the global mean μ_B′ is synchronized across the n mini-batches; in the second synchronization, the global variance of the n mini-batches is synchronized.
  • Synchronizing twice is expensive; if there is a Batch Normalization layer after each network layer, the synchronization overhead greatly affects the training speed of the entire network.
  • Therefore, the problem that this application mainly solves is how to reduce the time spent on these two synchronizations in the Batch Normalization process, so as to reduce the synchronization overhead of Batch Normalization and improve the training speed of the entire network.
  • the foregoing application scenarios are merely exemplary implementations in the embodiments of the present invention, and the application scenarios in the embodiments of the present invention include but are not limited to the above application scenarios.
  • FIG. 4 is a schematic diagram of a hardware structure of a neural network processor according to an embodiment of the present invention.
  • The neural network processor NPU 20 is mounted on a CPU 30 (such as a host CPU) as a coprocessor, and the host CPU assigns tasks to it.
  • The NPU 20 may include N computing cores 201 (NPU Cores, or Cores for short), and the N Cores 201 may be connected to and communicate with each other through an on-chip interconnect network 202.
  • the N Cores 201 are coupled to the atomic operation accumulation unit 203, the on-chip shared cache 204, and the external memory 40 through an on-chip interconnect network 202.
  • the atomic operation accumulation unit 203 and the on-chip shared cache 204 may be integrated together;
  • the external memory 40 may be a double-rate synchronous dynamic random access memory (Double Data Rate, DDR), a high-bandwidth memory (High Bandwidth Memory, HBM), and the like.
  • FIG. 5 is a schematic diagram of an NPU detailed hardware structure provided by an embodiment of the present invention.
  • The core part of Core 201 is the arithmetic circuit 2013.
  • The direct memory access controller DMAC 2017 controls the arithmetic circuit 2013 to fetch matrix data from the memories (including the input memory 2011 and the weight memory 2012) for multiplication; further, the DMAC 2017 also controls the operation result of the arithmetic circuit 2013, or the matrix data in the unified memory 2016, to enter the accumulator 2015 and/or the vector calculation unit 2014 for further operations. Among them:
  • the arithmetic circuit 2013 can also be referred to as a matrix operation unit (Cube unit), which is used to complete the operation of matrix * matrix.
  • the computing circuit 2013 may include a plurality of processing units (Process Engines, PEs).
  • In some implementations, the arithmetic circuit 2013 is a two-dimensional systolic array. The arithmetic circuit 2013 may also be a one-dimensional systolic array or another electronic circuit capable of performing mathematical operations such as multiplication and addition.
  • In some implementations, the arithmetic circuit 2013 is a general-purpose matrix processor.
  • Vector calculation unit 2014 is used to further process the output of the arithmetic circuit 2013, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison, and so on. It is mainly used for non-convolution / FC layer network calculations in neural networks, such as pooling, batch normalization, local response normalization, and so on.
  • the vector calculation unit 2014 can store the processed output vector to the unified memory 2016.
  • the vector calculation unit 2014 may apply a non-linear function to the output of the arithmetic circuit 2013, such as a vector of accumulated values, to generate an activation value.
  • In some implementations, the vector calculation unit 2014 generates a normalized value, a merged value, or both.
  • a vector of the processed output can be used as an activation input for the arithmetic circuit 2013, for example for use in subsequent layers in a neural network.
  • the unified memory 2016 can also be called an on-chip memory (On-chip buffer), which is used to store input data and output data.
  • the weight data (weight matrix) is transferred to the weight memory 2012 through the DMAC 2017.
  • The input matrix is likewise transferred to the unified memory 2016 or to the input memory 2011 through the DMAC.
  • The input memory 2011 may also be referred to as a feature map memory and is used to store the feature map matrix.
  • Weight memory 2012 is used to store the weight matrix.
  • the format of the weight matrix includes four dimensions: convolution kernel height, convolution kernel width, input channel number (convolution kernel depth), and output channel number (convolution kernel number).
  • the weight matrix is the convolution kernel.
  • the weight matrix may be a matrix composed of each convolution kernel used by the convolution layer for convolution.
  • Direct memory access controller (DMAC) 2017 is used to move input data or input matrices from the external memory 40, such as the DDR/HBM, to the various memory buffers, or to move output data from the unified memory 2016 to the DDR/HBM.
  • DMAC Direct Memory Access Controller
  • a complete DMA transfer process needs to go through four steps: DMA request, DMA response, DMA transfer, and DMA end.
  • Control unit (flow control) 2018 is the on-chip control unit, used to control the processing flow and the data reading mode.
  • On-chip interconnect network 202 is a communication scheme for a system-on-chip (SoC). It is a main component of multi-core technology and is used for interactions among the multiple computing cores Core 201, and for interactions between the computing cores Core 201 and the external memory 40 and the internal memory 208.
  • Atomic operation accumulator unit 203 is used to store and accumulate output results obtained by multiple computing cores Core 201.
  • On-chip shared cache 204 is used to buffer the data written by the atomic operation accumulation unit, such as the global mean and the global variance; it is also used to synchronize the global mean and the global variance to the n Cores 201 under the control of the DMAC 2017.
  • Optionally, the on-chip shared cache 204 may be integrated with the atomic operation accumulation unit 203.
  • the unified memory 2016, the input memory 2011, and the weight memory 2012 are all On-Chip memories.
  • The external memory, that is, the DDR/HBM, can be private to the hardware architecture of the neural network processor 20, or it can also serve other processors while serving the neural network processor 20.
  • the main CPU 30 is further configured to run general operating system software, and control the neural network processor 20 to perform neural network training under the function of the general operating system software.
  • the neural network processor 20 described above may also be integrated in the main CPU 30 as part of the main CPU 30; it may also be another functional chip coupled to the main CPU 30 and capable of implementing related functions.
  • the functions performed by the main CPU 30 may also be distributed and executed on multiple different function chips, which are not specifically limited in this embodiment of the present invention.
  • the arithmetic circuit 2013 takes the data corresponding to the matrix B from the weight memory 2012 and buffers the data on each PE in the arithmetic circuit.
  • the arithmetic circuit 2013 takes matrix A data from the input memory 2011 and performs matrix operations on the matrix B, and then performs an addition operation in the accumulator 2015. Partial or final results of the obtained matrix are stored in the unified memory 2016.
  • It should be noted that the above hardware structure is only one exemplary hardware structure provided by the embodiment of the present invention.
  • The hardware structure in the embodiment of the present invention includes, but is not limited to, the above structure and connection relationships.
  • The neural network processor 20 includes n computing cores Core 201, an atomic operation accumulation unit 203, and an on-chip shared cache 204, where the n Cores 201 and the on-chip shared cache 204 are each coupled to the atomic operation accumulation unit 203, and n is an integer greater than 1.
  • Each of the n Cores 201 also includes multiple functional modules, such as a vector calculation unit 2014, a DMAC 2017, and the like; please refer to the relevant descriptions of FIG. 4 and FIG. 5, and details are not described herein again.
  • Each of the n Cores 201 uses its vector calculation unit 2014 to calculate the intra-core mean μ of its input matrix and writes μ to the atomic operation accumulation unit 203 under the control of the DMAC 2017, where the input matrix includes m training samples x, the intra-core mean μ is the mean of the m training samples, and m is an integer greater than or equal to 1; each Core 201 also uses the vector calculation unit 2014 to calculate the mean v of the m values x² according to the input matrix, and writes v to the atomic operation accumulation unit 203 under the control of the DMAC 2017.
  • The atomic operation accumulation unit 203 accumulates the n values of μ written by the n Cores to obtain S1 and writes S1 to the on-chip shared cache 204; it accumulates the n values of v written by the n Cores to obtain S2 and writes S2 to the on-chip shared cache 204.
  • Each of the n Cores 201 further obtains S1 and S2 from the on-chip shared cache 204, and calculates the global variance of the n input matrices of the n Cores 201 according to the S1 and S2.
  • In this way, when the neural network processor calculates the global variance of the batch, it no longer needs to first compute the global mean (each computing core calculating its intra-core mean, followed by a global summation and averaging) before computing the global variance (each computing core calculating its intra-core variance, followed by another global summation and averaging). Instead, each computing core calculates the intra-core mean μ of its training samples x and the intra-core mean v of x², and sends both to the atomic operation accumulation unit, which accumulates them and stores the accumulation results in the on-chip shared cache.
  • The n computing cores then obtain the accumulation results of the n values of μ and the n values of v from the on-chip shared cache in a single access, and calculate the global mean and the global variance based on these accumulation results.
  • Further, because an atomic operation accumulation unit is added to the on-chip interconnect network (NoC) of the neural network processor, the different computing cores complete the accumulation operation in the process of writing data to the same address (that is, to the atomic operation accumulation unit). On the one hand, this does not occupy the computing resources of the cores, since the accumulation is completed as part of each calculate-and-write step and the accumulation work is spread out, saving total time; on the other hand, it avoids sending all the data to one particular computing core to complete the accumulation, which would leave the other computing cores idle during the accumulation and be inefficient.
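  • The accumulate-on-write behaviour can be pictured with the thread-based sketch below; the lock is only a software stand-in for the hardware atomicity of the accumulation unit, not a description of the claimed circuit:

```python
import threading

class AtomicAccumulator:
    """Software analogy of the atomic operation accumulation unit: every write adds."""
    def __init__(self):
        self._lock = threading.Lock()
        self.value = 0.0

    def write(self, x):
        with self._lock:          # the hardware would do this atomically at one address
            self.value += x       # read the stored value, add the written value, store back

acc_s1 = AtomicAccumulator()
per_core_means = [0.2, -0.5, 1.1, 0.3]          # the n intra-core means (toy values)

threads = [threading.Thread(target=acc_s1.write, args=(mu,)) for mu in per_core_means]
for t in threads:
    t.start()
for t in threads:
    t.join()
# acc_s1.value now holds S1 without any core having performed the summation itself
```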
  • It should be noted that the order in which the vector calculation unit 2014 in each Core 201 calculates μ and v is not specifically limited; that is, either μ or v may be calculated first, and as soon as each computing core 201 has calculated its own μ or v, the value can be sent immediately to the atomic operation accumulation unit 203 for accumulation, so that accumulation proceeds in parallel with computation and time is saved.
  • Optionally, μ and v may instead be sent to one designated Core 201 among the n Cores 201 for the accumulation calculation; that is, the processor 20 may omit the above-mentioned atomic operation accumulation unit 203 and on-chip shared cache 204, with one Core 201 taking over their functions.
  • Each Core 201 of the n Cores also obtains S1 and S2 from the on-chip shared cache 204 under the control of the DMAC 2017, and uses its vector calculation unit 2014 to calculate, according to S1 and S2, the global variance of the n input matrices of the n Cores 201; specifically, this includes obtaining S1 and S2 from the on-chip shared cache and calculating the global variance of the n input matrices of the n Cores according to the formula σ² = S2/n − (S1/n)².
  • It can be understood that once the atomic operation accumulation unit 203 has completed the accumulation of S1 and S2, the DMAC 2017 can control the Core in which it is located to obtain S1 and S2 from the on-chip shared cache.
  • Each of the n Cores 201 further uses its vector calculation unit 2014 to normalize the m training samples x in its input matrix according to the formula x̂_i = (x_i − μ′)/√(σ² + ε), where μ′ is the global mean of the n input matrices of the n Cores, σ² is the global variance of the n input matrices of the n Cores, x̂_i is the normalized result of the i-th training sample x_i among the m training samples, and ε is a value greater than 0.
  • Each of the n Cores 201 further uses its vector calculation unit 2014 to perform scaling and translation processing on the m normalized training samples according to the formula y_i = γ·x̂_i + β.
  • Each of the n Cores 201 is used to obtain a feature map matrix and a weight matrix from the DDR/HBM 40, to read the obtained feature map matrix and weight matrix into the corresponding input memory 2011 and weight memory 2012, respectively, and then to use the arithmetic circuit 2013 and the accumulator 2015 to perform multiplication and addition operations on the feature map matrix and the weight matrix to obtain the input matrix.
  • the embodiment of the present invention provides the following Batch Normalization processing process in combination with actual application scenarios.
  • the specific calculation process of the neural network processor may be as follows:
  • the training set corresponding to Batch has Z training samples.
  • the batch is divided into n mini-batches.
  • Each mini-batch is a subset of the training set corresponding to Batch.
  • Each mini-batch has m training samples.
  • Input: mini-batch B = {x_1, ..., x_m}, where B contains the m training samples x_1, ..., x_m. It should be noted that this input is the input of the normalization layer, or, equivalently, the output of the previous layer that is to be normalized.
  • The above-mentioned Batch Normalization process is the BN process for one particular mini-batch in the batch. From the global perspective of the batch, all n mini-batches in the entire batch need to undergo the above BN processing, and ultimately the mean and standard deviation of the batch are calculated from the output results of the n mini-batches so that Batch Normalization can produce the output result of the entire batch. Assume that each mini-batch corresponds to one computing core (Core); then the n mini-batches correspond to n Cores, and the Core corresponding to the j-th mini-batch is Core j, where j = 1, 2, ..., n. Please refer to FIG. 6, which is another Batch Normalization forward flowchart provided by an embodiment of the present invention. The specific calculation process is as follows:
  • The DMAC 2017 uses the on-chip interconnect network 202 to control the reading of the feature map matrices and the weight coefficient matrix corresponding to the n mini-batches from the HBM/DDR 40 into the input memory 2011 (feature map buffer) and the weight memory 2012 (weight buffer) corresponding to each Core. For example, the feature map matrix and weights corresponding to the first mini-batch are read into the feature map buffer and weight buffer corresponding to Core 1, the feature map matrix and weights corresponding to the second mini-batch are read into Core 2, and so on, until the feature map matrix and weights corresponding to the n-th mini-batch are read into the corresponding buffers in Core n; this is not described further below. It can be understood that the weight coefficient matrices corresponding to the different mini-batches are the same.
  • The j-th Core, that is, the arithmetic circuit 2013 and the accumulator 2015 on Core j, calculates the result of multiplying the feature map matrix by the weight matrix as X_j, and temporarily stores it in the unified memory 2016, where j = 1, ..., n.
  • Each of the n Cores 201 calculates the intra-core mean within its own computing core: the vector calculation unit 2014 on Core j calculates the mean of its corresponding j-th mini-batch B.
  • μ_B represents the intra-core mean of the mini-batch B input to the normalization layer, obtained by summing the m training samples x_1, ..., x_m in mini-batch B and dividing by m (that is, μ_B is the intra-core mean of the input matrix of each Core 201); x_i is the i-th training sample among the m samples, with i = 1, 2, ..., m and j = 1, 2, ..., n.
  • The DMAC 2017 controls each of the n Cores 201 through the on-chip interconnect network 202 so that it writes the intra-core mean μ_B it has calculated into the atomic operation accumulation unit 203 through the on-chip interconnect network 202.
  • The atomic operation accumulation unit 203 reads the value already stored at the target address and adds the currently written value to it, thereby obtaining S1 = u_1 + u_2 + ... + u_n, where u_j represents the intra-core mean u (that is, μ_B) of the corresponding mini-batch calculated by the j-th Core of the n Cores.
  • The DMAC 2017 controls each of the n Cores 201 through the on-chip interconnect network 202 so that it calculates the mean v of the m values x².
  • Specifically, the vector calculation unit 2014 of the j-th Core j squares the m training samples x_i and averages them to obtain v_j = (1/m)(x_1² + x_2² + ... + x_m²), where i = 1, ..., m and j = 1, ..., n.
  • The DMAC 2017 controls all Cores 201 through the on-chip interconnect network 202 so that each writes its calculated v into the atomic operation accumulation unit 203.
  • The atomic operation accumulation unit 203 reads the value already stored at the target address and adds the currently written value to it, thereby obtaining, over all mini-batches, S2 = v_1 + v_2 + ... + v_n, where v_j represents the mean of the m values x² in the mini-batch corresponding to the j-th Core of the n Cores.
  • the atomic operation accumulation unit 203 writes the accumulated S1 into the on-chip shared cache 204, and writes the accumulated S2 into the on-chip shared cache 204.
  • DMAC 2017 controls each of the n Cores 201 through the on-chip interconnect network 202 to go to the on-chip shared cache 204 to obtain S1 and S2.
  • The j-th Core j obtains S1 and S2 from the on-chip shared cache 204 through the on-chip interconnect network and calculates from them the global mean μ_B′ = S1/n and the global variance σ_B′² = S2/n − (S1/n)², where j = 1, ..., n.
  • At this point, each Core 201 has calculated the global mean and the global variance, so each Core 201 can apply the normalization formula together with the scaling and translation operations to calculate its respective output y_i; after ReLU activation, the result is stored in the HBM/DDR through the on-chip interconnect network 202 for subsequent backward processing.
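  • Putting the above steps together, the following sequential NumPy sketch mirrors the flow of FIG. 6 for n simulated cores (matrix product, per-core statistics, accumulation into S1 and S2, a single read-back, then normalization, scaling/translation and ReLU); all sizes and the values of γ, β and ε are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, d = 4, 8, 6                       # n cores, m samples per mini-batch, feature width d
feature_maps = rng.normal(size=(n, m, d))
weight = rng.normal(size=(d,))          # the same weight coefficients for every mini-batch
gamma, beta, eps = 1.0, 0.0, 1e-5

# Each core computes X_j = FeatureMap_j x Weight (its input matrix).
X = feature_maps @ weight               # shape (n, m): row j holds core j's m samples

# Each core reduces X_j to (mu_j, v_j); the accumulator adds them up into S1 and S2.
S1 = X.mean(axis=1).sum()               # sum of the n intra-core means
S2 = (X ** 2).mean(axis=1).sum()        # sum of the n intra-core means of x^2

# Every core reads S1 and S2 once and derives the global statistics.
mu_g = S1 / n
var_g = S2 / n - mu_g ** 2

# Normalization, scaling/translation and ReLU, done per core.
y = gamma * (X - mu_g) / np.sqrt(var_g + eps) + beta
out = np.maximum(y, 0.0)                # ReLU-activated result to be written back to HBM/DDR
```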
  • It should be noted that, when calculating the intra-core mean μ and the intra-core mean of x² in each computing core, the cores may instead compute only sums; that is, only the sum of the m training samples x and the sum of the m values x² are written into the atomic operation accumulation unit, and the resulting accumulated values are m·S1 and m·S2. In that case, when each computing core subsequently calculates the global mean and the global variance, it needs to perform one additional division by m in order to obtain the final global mean and global variance.
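  • Written out explicitly (with S1′ and S2′ as assumed names for the accumulated sums in this variant), the extra division by m appears when the global statistics are formed:

```latex
S_1' = \sum_{j=1}^{n}\sum_{i=1}^{m} x_i^{(j)} = m\,S_1, \qquad
S_2' = \sum_{j=1}^{n}\sum_{i=1}^{m} \bigl(x_i^{(j)}\bigr)^{2} = m\,S_2

\mu' = \frac{S_1'}{n\,m}, \qquad
\sigma^{2} = \frac{S_2'}{n\,m} - \left(\frac{S_1'}{n\,m}\right)^{2}
```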
  • the implementation of other parts is similar to the principle of the foregoing embodiment of the invention, and will not be repeated here.
  • FIG. 7 is a schematic flowchart of a data processing method according to an embodiment of the present invention, which can be applied to the neural network processor corresponding to FIG. 4 or FIG. 5 described above.
  • the method may include the following steps S701 to S703.
  • the method may further include steps S700, S704, and S705.
  • Step S700: Obtain a feature map matrix and a weight matrix, and calculate any one of the n input matrices according to the feature map matrix and the weight matrix.
  • Step S701: For each of the n input matrices, perform the following processing: calculate the intra-core mean μ of the input matrix and write μ into the atomic operation accumulation unit, where the input matrix includes m training samples x, the intra-core mean μ is the mean of the m training samples, and m is an integer greater than or equal to 1; calculate the mean v of the m values x² according to the input matrix, and write v into the atomic operation accumulation unit.
  • Step S702: The n values of μ and the n values of v calculated from the n input matrices are processed as follows: the n values of μ are accumulated to obtain S1, and the n values of v are accumulated to obtain S2.
  • Step S703: Perform the following processing on S1 and S2: calculate the global variance of the n input matrices according to S1 and S2.
  • Calculating the global variance of the n input matrices according to S1 and S2 includes: calculating the global variance of the n input matrices according to the calculation formula σ² = S2/n − (S1/n)².
  • Step S704: Normalize the m training samples x in the input matrix according to the formula x̂_i = (x_i − μ′)/√(σ² + ε), where μ′ is the global mean of the n input matrices, σ² is the global variance of the n input matrices, x̂_i is the normalized result of the i-th training sample x_i among the m training samples, and ε is a value greater than 0.
  • Step S705: Perform scaling and translation processing on the m normalized training samples according to the formula y_i = γ·x̂_i + β, where x̂_i is the normalized result of the i-th training sample x_i among the m training samples, y_i is the output result obtained by Batch Normalization of x_i, γ is the scaling parameter, and β is the translation parameter.
  • An embodiment of the present invention further provides a computer storage medium, where the computer storage medium may store a program, and when the program is executed, it includes part or all of the steps described in any of the foregoing method embodiments.
  • An embodiment of the present invention also provides a computer program.
  • the computer program includes instructions.
  • the computer program can execute some or all steps of any data processing method.
  • the disclosed device may be implemented in other ways.
  • the device embodiments described above are only schematic.
  • the division of the above units is only a logical function division.
  • multiple units or components may be combined or integrated.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be electrical or other forms.
  • the units described above as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, which may be located in one place, or may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objective of the solution of this embodiment.
  • the functional units in the embodiments of the present application may be integrated into one processing unit, or each of the units may exist separately physically, or two or more units may be integrated into one unit.
  • the above integrated unit may be implemented in the form of hardware or in the form of software functional unit.
  • The technical solution of the present application, in essence, or the part thereof that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, and specifically a processor in a computer device) to perform all or part of the steps of the methods in the embodiments of the present application.
  • The foregoing storage medium may include: a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a read-only memory (ROM), or a random access memory (RAM).
  • ROM read-only memory
  • RAM random access memory

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Neurology (AREA)
  • Image Analysis (AREA)

Abstract

Embodiments of the present invention relate to a neural network processor, an associated method, and an associated device. The neural network processor comprises: n computing cores, an atomic operation accumulation unit, and an on-chip shared cache. Each core is used to calculate an intra-core mean μ of an input matrix according to the input matrix and to write μ to the atomic operation accumulation unit, and to calculate a mean v of the m values x² according to the input matrix and to write v to the atomic operation accumulation unit. The atomic operation accumulation unit is used to accumulate the n values of μ written by the n cores to obtain S1 and to write S1 to the on-chip shared cache, and to accumulate the n values of v written by the n cores to obtain S2 and to write S2 to the on-chip shared cache. Each core is further used to obtain S1 and S2 from the on-chip shared cache and to calculate the global variance of the n input matrices of the n cores according to S1 and S2. Use of the present invention makes it possible to increase the training speed of the neural network.
PCT/CN2018/109208 2018-09-30 2018-09-30 Neural network processor, data processing method and related device WO2020062299A1 (fr)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201880098253.3A priority Critical patent/CN112789627B/zh Neural network processor, data processing method and related device
Priority to PCT/CN2018/109208 priority patent/WO2020062299A1/fr Neural network processor, data processing method and related device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2018/109208 WO2020062299A1 (fr) 2018-09-30 2018-09-30 Processeur de réseau neuronal, procédé de traitement de données et dispositif associé

Publications (1)

Publication Number Publication Date
WO2020062299A1 true WO2020062299A1 (fr) 2020-04-02

Family

ID=69949555

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/109208 WO2020062299A1 (fr) 2018-09-30 2018-09-30 Neural network processor, data processing method and related device

Country Status (2)

Country Link
CN (1) CN112789627B (fr)
WO (1) WO2020062299A1 (fr)

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10083395B2 (en) * 2015-05-21 2018-09-25 Google Llc Batch processing in a neural network processor
US11062206B2 (en) * 2015-11-12 2021-07-13 Deepmind Technologies Limited Training neural networks using normalized target outputs
CN105488565A (zh) * 2015-11-17 2016-04-13 中国科学院计算技术研究所 加速深度神经网络算法的加速芯片的运算装置及方法
US10831444B2 (en) * 2016-04-04 2020-11-10 Technion Research & Development Foundation Limited Quantized neural network training and inference
WO2017185335A1 (fr) * 2016-04-29 2017-11-02 北京中科寒武纪科技有限公司 Appareil et procédé d'exécution d'une opération de normalisation par lots
KR102592721B1 (ko) * 2017-01-11 2023-10-25 한국전자통신연구원 이진 파라미터를 갖는 컨볼루션 신경망 시스템 및 그것의 동작 방법
CN108090565A (zh) * 2018-01-16 2018-05-29 电子科技大学 一种卷积神经网络并行化训练加速方法

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106022468A (zh) * 2016-05-17 2016-10-12 成都启英泰伦科技有限公司 人工神经网络处理器集成电路及该集成电路的设计方法
CN106056211A (zh) * 2016-05-25 2016-10-26 清华大学 神经元计算单元、神经元计算模块及人工神经网络计算核
CN106844294A (zh) * 2016-12-29 2017-06-13 华为机器有限公司 卷积运算芯片和通信设备
CN108256638A (zh) * 2018-01-05 2018-07-06 上海兆芯集成电路有限公司 微处理器电路以及执行神经网络运算的方法

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111881880A (zh) * 2020-08-10 2020-11-03 晶璞(上海)人工智能科技有限公司 一种基于新型网络的票据文本识别方法
CN112308762A (zh) * 2020-10-23 2021-02-02 北京三快在线科技有限公司 一种数据处理方法及装置
CN115278360A (zh) * 2022-07-18 2022-11-01 天翼云科技有限公司 一种视频数据处理方法及电子设备
CN115278360B (zh) * 2022-07-18 2023-11-07 天翼云科技有限公司 一种视频数据处理方法及电子设备

Also Published As

Publication number Publication date
CN112789627B (zh) 2023-08-22
CN112789627A (zh) 2021-05-11

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18934963

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18934963

Country of ref document: EP

Kind code of ref document: A1