
Data processing method and device, computer equipment and storage medium

Info

Publication number
CN111401538A
Authority
CN
China
Prior art keywords
operator
operators
concat
glue
processor
Prior art date
Legal status
Pending
Application number
CN201910910120.8A
Other languages
Chinese (zh)
Inventor
Inventor not disclosed
Current Assignee
Anhui Cambricon Information Technology Co Ltd
Original Assignee
Shanghai Cambricon Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Cambricon Information Technology Co Ltd

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G06N 3/08 Learning methods

Abstract

The embodiment of the present application discloses a data processing method, a data processing apparatus, a computer device, and a storage medium. When a structure that can be optimized exists in a neural network model, the neural network model can be optimized by performing at least one of a plurality of optimization operations on it, which improves the overall performance of the neural network model. When a request for a machine learning processing task is received, the optimized neural network model is invoked, which reduces redundant computation and thus reduces the resource consumption of the computer device.

Description

Data processing method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of information processing technologies, and in particular, to a data processing method and apparatus, a computer device, and a storage medium.
Background
A neural network is a mathematical computing model that simulates the behavioral characteristics of animal neural networks and performs distributed parallel information processing. The network is formed by a large number of interconnected nodes (also called neurons); by adjusting the interconnections among these nodes, it uses input data and weights to produce output data, simulating the information processing of the human brain and generating results after pattern recognition.
In the prior art, when an algorithm designer designs a neural network model, glue operators are often introduced into the neural network model to keep the description of the model neat and concise. Here, a "glue" operator is an operator that involves no computation logic of its own: neither the number of elements in its input and output data nor the values themselves change. However, the introduction and combination of unreasonable "glue" operators adds unnecessary and unreasonable memory access behavior at the computation-graph level during execution of the neural network model, which undermines the performance gains that the artificial intelligence processor obtains from optimizing the computation part of the neural network model in its hardware structure and instruction design, and reduces the overall performance of the neural network model. This clearly increases the resource consumption of the computer device when it runs such a neural network model containing "glue" operators that could be optimized.
Disclosure of Invention
The embodiment of the application provides a data processing method, a data processing device, computer equipment and a storage medium, which can optimize a neural network model and improve the overall performance of the neural network model. In addition, when the computer device runs the optimized neural network model, the resource consumption of the computer device can be reduced.
In order to achieve the above object, in a first aspect, an embodiment of the present application provides a data processing method, including:
a general processor acquires a calculation graph corresponding to a neural network model; wherein the calculation graph comprises a glue operator; the glue operator is used for adjusting parameters of tensor data in the calculation graph;
the general processor optimizes the calculation graph according to the logical relation of the glue operator in the calculation graph to obtain an optimization result; the logical relation of the glue operators comprises the logical relation between the concat operators, or the logical relation between the concat operators and other operators; the other operators comprise any one of a reshape operator, a transpose operator and a split operator;
and the general processor acquires the corresponding binary instructions according to the optimization result so as to distribute the binary instructions to the corresponding artificial intelligence processors to execute tasks.
To achieve the above object, in a second aspect, an embodiment provides a data processing apparatus including means for performing the method of the first aspect. Specifically, the apparatus may include:
the acquisition unit is used for acquiring a calculation graph corresponding to the neural network model; wherein the calculation graph comprises a glue operator; the glue operator is used for adjusting parameters of tensor data in the calculation graph;
the optimization unit is used for optimizing the calculation graph according to the logical relation of the glue operator in the calculation graph to obtain an optimization result; the logical relation of the glue operators comprises the logical relation between the concat operators, or the logical relation between the concat operators and other operators; the other operators comprise any one of a reshape operator, a transpose operator and a split operator;
and the execution unit is used for acquiring the corresponding binary instructions according to the optimization result so as to distribute the binary instructions to the corresponding artificial intelligence processors to execute tasks.
In order to achieve the above object, in a third aspect, the present application provides a computer device including a plurality of heterogeneous processors and a memory, the processors and the memory being connected to each other, where the processors include a general-purpose processor and an artificial intelligence processor, and the memory is used for storing a computer program that supports the computer device to execute the above method; the computer program includes program instructions, and the processor is configured to call the program instructions to execute the method of the first aspect.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium storing a computer program comprising program instructions that, when executed by a processor, cause the processor to perform the method of the first aspect.
In a fifth aspect, embodiments of the present application provide a computer program comprising program instructions, which, when executed by a processor, cause the processor to perform the method of the first aspect.
By implementing the embodiment of the application, the computer equipment optimizes the neural network model according to the logical relation of the glue operator in the neural network model so as to improve the overall performance of the neural network model. When the computer device calls the optimized neural network model to execute the machine learning processing task, the resource consumption of the computer device can be reduced.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings used in the description of the embodiments will be briefly introduced below.
Fig. 1A is a schematic diagram of reshape operator semantics provided in an embodiment of the present application;
fig. 1B is a schematic diagram of a transpose operator semantic provided in an embodiment of the present application;
FIG. 1C is a diagram illustrating concat operator semantics provided by an embodiment of the present application;
FIG. 1D is a diagram illustrating split operator semantics provided by an embodiment of the present application;
fig. 1E is a schematic diagram of continuous storage of tensor data provided by an embodiment of the present application;
FIG. 1F is a schematic diagram of ensuring operation equivalence provided by an embodiment of the present disclosure;
fig. 1G is a schematic diagram of a stride-containing memory distribution provided in an embodiment of the present application;
FIG. 2 is a schematic structural diagram of a computer device according to an embodiment of the present disclosure;
fig. 3 is a schematic flowchart of a data processing method according to an embodiment of the present application;
FIG. 4A is a schematic diagram illustrating an optimization of a neural network model provided by an embodiment of the present application;
FIG. 4B is a schematic diagram of another neural network model optimization provided by an embodiment of the present application;
FIG. 4C is a schematic diagram of another neural network model optimization provided by an embodiment of the present application;
FIG. 4D is a schematic diagram illustrating another neural network model optimization provided by an embodiment of the present application;
fig. 5 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
It should be understood that the terms "first," "second," and "third," etc. in the claims, description, and drawings of the present disclosure are used to distinguish between different objects and not to describe a particular order. The terms "comprises" and "comprising," when used in the specification and claims of this disclosure, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the disclosure herein is for the purpose of describing particular embodiments only, and is not intended to be limiting of the disclosure. As used in the specification and claims of this disclosure, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the specification and claims of this disclosure refers to any and all possible combinations of one or more of the associated listed items and includes such combinations.
As used in this specification and claims, the term "if" may be interpreted contextually as "when", "upon", "in response to a determination", or "in response to a detection". Similarly, the phrase "if it is determined" or "if a [described condition or event] is detected" may be interpreted contextually to mean "upon determining", "in response to determining", "upon detecting [the described condition or event]", or "in response to detecting [the described condition or event]".
In order to better understand the technical solutions described in the present application, the following first explains the technical terms related to the embodiments of the present application:
(1) tensor (tensor)
In the present technical solution, a tensor is merely a description of a piece of stored data; the tensor records information such as the shape and type of the data.
In the embodiment of the present application, the tensor should be understood as tensor data, and may include input tensor data and output tensor data in the neural network model, and may also include feature tensor data and the like.
Taking the artificial intelligence deep learning framework TensorFlow as an example, the rank, shape, and dimension number are generally used to describe the dimensions of a tensor, and their relationship can be expressed as shown in Table 1:
TABLE 1
[Table 1, which shows the relationship among rank, shape, and dimension number, is provided as an image in the original document.]
As shown in Table 1, the tensor A = 4 represents a single number.
As shown in Table 1, the tensor A = [6,2] represents a two-dimensional matrix, specifically a matrix of 6 rows and 2 columns.
(2) Partitioning of operators
In the prior art, an algorithm designer takes operators as basic units and constructs a computation graph describing a neural network algorithm using the tensor data associated with the operators. In the embodiment of the present application, the operators in current deep learning are divided into two classes according to their semantics. This is explained in detail below:
the first type of operators are responsible for acquiring output features from input features, and have respective specific calculation tasks, and multiply, add, nonlinear calculate, compare and select and other mathematical operations on input data. For example, the convolution operator performs convolution calculation on a local region of the input feature image by using a convolution kernel, and obtains an output feature by linear calculation on data in the input feature image; for another example, the full join operator linearly combines all the input features by using a matrix multiplication mode; also for example, the pooling operator samples input data to obtain output data, and so on.
The other class of operators involves no computation logic in its semantics: neither the number of elements in their input and output data nor the values themselves change. These operators are generally used to adjust the format, the shape, and the arrangement in memory of tensor data in the computation graph of a neural network model, in order to adjust the tensor data computed upstream in the neural network into a form that is better and more convenient for downstream computation; they serve as the "glue" connecting the computation contexts of the neural network. This class of operators is referred to as "glue" operators. Accordingly, the part of the computation graph that is made up of "glue" operators is called the "glue" subgraph.
(3) 'glue' operator
In the embodiment of the present application, there are 4 kinds of "glue" operators, namely the reshape operator, the transpose operator, the concat operator, and the split operator. Each of them is described next:
A. reshape operator
In the embodiment of the present application, the reshape operator, that is, the tensor reshaping operator, refers to the re-interpretation of the shape of the tensor.
In practical applications, the reshape operator can be used to reshape tensor data. Specifically, the reshape operator can be expressed as reshape(tensor, shape, name=None), which is used to transform tensor into the form given by the parameter shape.
In one case, the parameter shape = [-1] indicates that tensor is flattened into a one-dimensional list.
In another case, the parameter shape = [a, b, c, ..., n], where a, b, c, ..., n are all positive integers greater than 0, indicates that tensor is transformed into a multidimensional matrix. In yet another case, the parameter shape = [a, -1, c, ..., n], where b = -1 and a, c, ..., n are all positive integers greater than 0, indicates that this dimension is calculated automatically by TensorFlow from the original size of tensor.
Taking the tensor A = [3,2,4] as an example, after the reshape1 operator is applied to the tensor A, the tensor B = [2,6,2] is obtained. See the schematic diagram of reshape operator semantics shown in FIG. 1A.
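For ease of understanding, the following is a minimal NumPy sketch of the reshape semantics described above; the array values and variable names are illustrative only and do not form part of the embodiment.

    import numpy as np

    # A tensor with shape [3, 2, 4]; 24 elements in total.
    A = np.arange(24).reshape(3, 2, 4)

    # Reinterpreting the shape as [2, 6, 2] keeps the element count and the
    # flattened element order unchanged.
    B = A.reshape(2, 6, 2)
    assert np.array_equal(B.ravel(), A.ravel())

    # shape = [-1] flattens the tensor into a one-dimensional list.
    flat = A.reshape(-1)

    # shape = [2, -1, 4]: the -1 dimension is computed automatically (here 3).
    C = A.reshape(2, -1, 4)
    assert C.shape == (2, 3, 4)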
B. transpose operator
In the embodiment of the present application, the transpose operator, that is, the tensor transpose operator, refers to transposing the tensor.
In practical applications, the transpose operator can be used to adjust the dimension order of tensor data. Specifically, the transpose operator can be expressed as transpose(a, perm=None, name='transpose'), which is used to transpose the dimension order of the tensor a according to the perm parameter. Here, the perm parameter is a full permutation of the natural number sequence [1, 2, 3, ..., n], and different full permutations represent different transpose functions.
In general, a multidimensional tensor has multiple dimensions and has a precedence order among the dimensions, and a transpose operator can change the precedence order of the dimensions. Furthermore, it should be noted that in some scenarios, the transpose operator is also referred to as permute operator.
Taking the tensor A = [3,2,4] as an example, after a transpose operation is performed on the tensor A, the tensor B = [4,2,3] is obtained. See the schematic diagram of transpose operator semantics shown in FIG. 1B.
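A minimal NumPy sketch of the transpose semantics described above is given below; note that NumPy uses 0-based perm indices, whereas the description above writes permutations of [1, 2, ..., n].

    import numpy as np

    A = np.arange(24).reshape(3, 2, 4)      # shape [3, 2, 4]

    # perm = (2, 1, 0) reverses the dimension order, giving shape [4, 2, 3].
    B = np.transpose(A, (2, 1, 0))
    assert B.shape == (4, 2, 3)

    # The element at coordinates (i, j, k) of A appears at (k, j, i) of B.
    assert B[1, 0, 2] == A[2, 0, 1]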
C. concat operator
In the embodiment of the present application, the concat operator, that is, the concatenation operator, is configured to concatenate the plurality of tensor data into a tensor along the specified dimension. The other dimensions of the input tensor should be consistent except in the specified dimension. By means of the concat operator, the neural network splices a plurality of tensors representing features from different positions upstream into one, so that the features can be processed together in downstream calculations. In particular, see the schematic diagram of concat operator semantics shown in fig. 1C.
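The following is a minimal NumPy sketch of the concat semantics; the shapes are illustrative only and do not form part of the embodiment.

    import numpy as np

    A = np.ones((3, 4, 5))
    B = np.ones((3, 6, 5))

    # Concatenate along dimension 1; all other dimensions must be consistent.
    E = np.concatenate([A, B], axis=1)
    assert E.shape == (3, 10, 5)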
D. split operator
In the embodiment of the present application, the split operator, that is, the splitting operator, is used to split one tensor into a plurality of tensors in a specified dimension. The split tensors are consistent in other dimensions except for the specified dimension. Through the split operator, the features belonging to the same tensor data can be split into a plurality of parts, so that the targeted processing is respectively carried out in the subsequent calculation. In particular, see the schematic diagram of split operator semantics shown in fig. 1D.
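Correspondingly, a minimal NumPy sketch of the split semantics is given below; the split sizes are illustrative only.

    import numpy as np

    A = np.arange(3 * 10 * 5).reshape(3, 10, 5)

    # Split the tensor into parts of sizes 4 and 6 along dimension 1;
    # the other dimensions of the resulting tensors remain consistent.
    B, C = np.split(A, [4], axis=1)
    assert B.shape == (3, 4, 5) and C.shape == (3, 6, 5)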
In summary, in the embodiment of the present application, the glue operator is configured to adjust at least one of a format of tensor data, a shape of the tensor data, and an arrangement of the tensor data in the memory in the computation graph corresponding to the neural network model.
It should be noted that, in the embodiment of the present application, the glue operator may include, but is not limited to, the 4 different types of operators described above, and may also include other operators, and the embodiment of the present application is not particularly limited.
(4) Data arrangement of tensor data in storage
In neural network computation, multidimensional tensors serve as the basic unit of data transfer among operators. Typically, data is stored in memory in a continuous manner. For example, as shown in FIG. 1E, the data is stored in 16 consecutive storage locations I0 to I15.
For example, a tensor with shape (D0, D1, D2) is stored in a continuous memory block of size D0 × D1 × D2. To access the data at coordinates (n0, n1, n2) in the tensor, the address of the data in memory can be determined from the starting address of the data in memory and the computed offset (n0 × D1 + n1) × D2 + n2.
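The coordinate-to-offset calculation described above can be sketched as follows; the shape values are illustrative only.

    def offset_contiguous(n0, n1, n2, D1, D2):
        # Offset of element (n0, n1, n2) in a tensor of shape (D0, D1, D2)
        # stored contiguously in row-major order.
        return (n0 * D1 + n1) * D2 + n2

    # Example: shape (3, 4, 5); element (1, 2, 3) lies at offset (1*4 + 2)*5 + 3 = 33.
    assert offset_contiguous(1, 2, 3, 4, 5) == 33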
It can be understood that using such a tightly continuous storage method for multidimensional tensor data is very intuitive and convenient, and the conversion between element coordinates and their offsets in memory is also very concise. In the prior art, deep learning frameworks such as Caffe and MXNet manage the memory of tensor data in neural network models in this way, and on this basis implement the kernel functions of various operators such as convolution and pooling on general-purpose processors and artificial intelligence processors (e.g., GPUs). However, this memory arrangement is far from optimal in terms of performance. In order to fit the hardware design and improve performance, hardware manufacturers devise different arrangements of data in memory, and these distinctive arrangements are a main reason for the performance waste of "glue" subgraphs in neural network processing.
(5) Dimension order
Taking a convolutional neural network as an example (specifically, the convolutional neural network is used for image classification or object detection), tensor data in a calculation graph of a neural network model generally has 4 dimensions, which are N representing the batch size of data processed by current calculation, C representing the number of feature images, and H and W representing the size of the feature images.
In the embodiment of the present application, tensor data can use the NCHW dimension order, i.e., N is the outermost dimension when computing the offset and W is the innermost dimension. For example, tensor data in Caffe uses this dimension order by default, and MXNet and TensorFlow can support it. The offset in storage of the element with coordinates (n, c, h, w) is ((n × C + c) × H + h) × W + w.
In the embodiment of the present application, the dimension order of tensor data may also be NHWC (where C is the innermost dimension), and the corresponding coordinate-to-offset conversion is ((n × H + h) × W + w) × C + c. In practical applications, NHWC is closer than NCHW to the BMP (Bitmap) image data storage format: BMP files store data by pixel, and each pixel stores the color values of all channels, so no additional dimension conversion is needed when reading an input image.
In the embodiment of the present application, the dimension order of tensor data may also be CHWN (where N is the innermost dimension), and the corresponding coordinate-to-offset conversion is ((c × H + h) × W + w) × N + n.
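The three coordinate-to-offset conversions above can be summarized by the following sketch; the function names are illustrative only.

    def offset_nchw(n, c, h, w, C, H, W):
        return ((n * C + c) * H + h) * W + w

    def offset_nhwc(n, c, h, w, C, H, W):
        return ((n * H + h) * W + w) * C + c

    def offset_chwn(n, c, h, w, N, H, W):
        return ((c * H + h) * W + w) * N + n

    # The same logical element (n, c, h, w) lands at different offsets depending
    # on the dimension order chosen by the artificial intelligence processor.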
From the perspective of an artificial intelligence processor, in order to maximize performance benefits, the most suitable dimension order is selected for storing tensor data in combination with the processor's own microarchitecture design.
For example, an operator sequence consisting of transpose and reshape implements the shape transformation (N, C, H, W) → (N, H, W, C) → (N, H × W × C, 1, 1), whose intent is to merge the data in the C, H, and W dimensions into one dimension while ensuring that the original C dimension is at the innermost side of the merged dimension.
In the embodiment of the present application, for an artificial intelligence processor that stores tensor data using a dimension order other than NCHW, the difference in dimension order does not cause errors in the computation result, but it does affect performance. When the artificial intelligence processor adopts a different dimension order, the correctness of the final result can be guaranteed as long as, during execution, each operator performs on the actual dimension order an operation that is equivalent to its abstract semantics. For example, as shown in FIG. 1F, the tensor data actually adopts the NCWH data arrangement in storage, while the definition of the neural network model is given based on NCHW. In this case, in order to guarantee the equivalence of each operation, each operator in the actual execution process should first transform its input data into the dimension order assumed at the definition stage through a transformation phi, complete the operation of the specified operator, and then obtain the correct arrangement of the output tensor for the actual dimension order NCWH through the inverse transformation of phi. Since the assumed order is NCHW and the arrangement order of the tensor data actually used is NCWH, the transformation phi and its inverse transformation are both transpose operations with parameter (0,1,3,2). In a concrete implementation, a transpose operator can merge multiple internal transpose processes, but a reshape operator adds one more transpose process in its implementation; this is not something the algorithm designer envisaged when designing the algorithm, but it is necessary to guarantee consistency between the implementation and the abstract semantics. Therefore, moving the original computation graph structure onto the artificial intelligence processor as-is can hurt performance when the algorithm designer lacks an understanding of the underlying dimension order.
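A minimal NumPy sketch of the wrapping described above is given below, assuming an operator defined on the NCHW order is executed on data actually stored in the NCWH order; op_nchw is a hypothetical placeholder for any such operator.

    import numpy as np

    PERM = (0, 1, 3, 2)   # phi and its inverse both swap the last two dimensions

    def run_on_ncwh(op_nchw, data_ncwh):
        x = np.transpose(data_ncwh, PERM)   # phi: restore the assumed NCHW order
        y = op_nchw(x)                      # complete the specified operator
        return np.transpose(y, PERM)        # inverse of phi: back to NCWH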
(6) Stride (stride)
As mentioned above, tensor data is typically stored in memory in a continuous and compact manner, but the artificial intelligence processor may employ a discontinuous data storage manner.
In the embodiment of the present application, a discontinuous storage mode means that the mathematical size of some dimension of the tensor data is smaller than the actual size used for computing offsets in storage; the actual size used for computing the offset is called the stride. For example, as shown in FIG. 1G, the W dimension of the two-dimensional tensor is itself 4, but the actual storage is laid out as 6; accordingly, when reading data of the next position along the H dimension, 6 values need to be skipped instead of 4. More generally, stride_n, stride_c, stride_h, and stride_w respectively indicate the offsets that need to be skipped to read the next value along the N, C, H, and W dimensions. For an element with coordinates (n, c, h, w), its offset in storage relative to the starting address is n × stride_n + c × stride_c + h × stride_h + w × stride_w. The various tightly continuous layouts of tensor data, such as NCHW and NHWC, can be regarded as special cases of the stride-based layout; for example, for NCHW, stride_w = 1, stride_h = W, stride_c = H × W, and stride_n = C × H × W.
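The stride-based offset calculation can be sketched as follows; the numbers mirror the two-dimensional example of FIG. 1G and are illustrative only.

    def offset_strided(n, c, h, w, stride_n, stride_c, stride_h, stride_w):
        # Offset relative to the starting address in a stride-based layout.
        return n * stride_n + c * stride_c + h * stride_h + w * stride_w

    # FIG. 1G: the W dimension is 4 but is laid out as 6, so moving to the next
    # position along H skips stride_h = 6 values instead of 4.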
For an artificial intelligence processor, adopting strides in the data layout is often a matter of data alignment and memory-access bit width. Vector computation in neural network models runs into alignment and rounding problems. For example, hardware computes convolution in parallel along the C dimension; the vector compute instructions and long-bit-width registers allow 64 floating-point multiply-add operations to be processed at a time, and accordingly 64 C-dimension values can be read from storage at a time for computation. However, tensor data and operators whose C dimension is not an integer multiple of 64 always exist in neural network models. In order to handle the remaining tail portion, the memory-access and compute instructions have to be implemented separately, which makes the instruction design cumbersome. Furthermore, the memory unit may impose memory-access alignment restrictions, i.e., the starting address of each access must be a multiple of some constant, which further increases the difficulty of instruction implementation. To avoid this, a simpler approach is to align the relevant dimension of the tensor data directly up to the nearest integer multiple and fill the supplementary part with 0. For most operators, including the convolution, pooling, and fully connected operators, the extra 0s have no effect on the final computation result even if they participate in the computation. By padding with 0, the stride of the corresponding dimension becomes an integer multiple of the compute and memory-access bit width, thereby avoiding the trouble of handling the tail data separately.
In practical applications, for tensor data stored continuously, reshape is a zero-overhead operation: only the shape information of the tensor needs to be modified. However, when a dimension involved in stride alignment is affected, the overhead introduced by the reshape operator cannot be ignored. For example, if the two dimensions of the tensor in FIG. 1G are to be merged into one, the storage locations of most elements need to be readjusted so as to eliminate the last two 0s of the W dimension.
(7) Data segmentation (Blocking)
Specifically, vector registers and Single Instruction Multiple Data (SIMD) can be used to compute convolution in parallel along a certain dimension (usually C), but the data bit width that can be processed at one time is limited. In order to make full use of the intermediate results in the registers, the C dimension of the input tensor is further split: it is divided into sub-segments according to the data bit width that the processor can handle at once, and these sub-segments are stored contiguously in memory, which improves the utilization of the cache. Assuming that the SIMD instructions of the artificial intelligence processor can complete 8 floating-point computations at a time, the N, C, H, W layout is segmented and adjusted to N, C/8, H, W, 8. This segmentation idea is also suitable for the computation optimization of some artificial intelligence processors; the difference is that the latter can process wider vector data at a time, and the segmentation method also guarantees memory-access continuity in the computation stage, which helps improve memory-access efficiency.
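The segmentation of the C dimension described above can be sketched in NumPy as follows, assuming a sub-segment width of 8 and a C dimension already padded to a multiple of 8; the function name is illustrative only.

    import numpy as np

    def block_c(x, b=8):
        # Re-lay out an NCHW tensor as (N, C/b, H, W, b): the C dimension is split
        # into sub-segments of b values that are stored contiguously, so a SIMD
        # instruction handling b floating-point numbers reads one sub-segment at a time.
        n, c, h, w = x.shape
        assert c % b == 0
        return x.reshape(n, c // b, b, h, w).transpose(0, 1, 3, 4, 2)

    x = np.arange(1 * 16 * 2 * 2, dtype=np.float32).reshape(1, 16, 2, 2)
    assert block_c(x).shape == (1, 2, 2, 2, 8)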
In practical applications, for an artificial intelligence processor using a segmented data layout, adjustment of data layout involving segmentation dimensions needs to consider the influence of segmentation, and compared with the aforementioned dimension order and stride, performance improvement means that can be used for the segmented layout are fewer, but in some special cases, different neural network computation graph structures still have certain influence on performance.
Generally speaking, there are various reasons for an artificial intelligence processor to select a storage data arrangement that matches its own characteristics, and it is difficult for the algorithm designer to know the details hidden at the bottom layer. Moving the original computation graph structure onto the artificial intelligence processor as-is may therefore waste performance, whereas reasonably adjusting the structure of the "glue" subgraph (a "glue" subgraph is composed of "glue" operators) can avoid a large amount of unnecessary memory-access overhead and optimize the execution performance of the whole neural network model.
In the following embodiments of the present application, an implementation of how to optimize the neural network model according to the logical relationship of the "glue" operator in the neural network model will be described in detail.
(8) Artificial intelligence processor
An artificial intelligence processor, also referred to as a special purpose processor, in the embodiments of the present application refers to a processor that is specific to a particular application or domain. For example: a Graphics Processing Unit (GPU), also called a display core, a visual processor, and a display chip, is a special processor dedicated to image operation on a personal computer, a workstation, a game machine, and some mobile devices (such as a tablet computer and a smart phone). Another example is: a Neural Network Processor (NPU), which is a special processor for matrix multiplication in the field of artificial intelligence, adopts a structure of data-driven parallel computation, and is particularly good at Processing massive multimedia data such as video and images.
Fig. 2 is a schematic structural diagram of a computer device according to an embodiment of the present application. As shown in fig. 2, the computer device 20 may comprise a general purpose processor 201, a memory 202, a communication bus 203, a communication interface 204 and at least one artificial intelligence processor 205, the general purpose processor 201 and the artificial intelligence processor 205 being connected to the memory 202 and the communication interface 204 via the communication bus 203.
The general-purpose Processor 201 may be a Central Processing Unit (CPU), and the general-purpose Processor 201 may also be other general-purpose processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field-Programmable Gate arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components, etc. The general purpose processor 201 may be a microprocessor or the general purpose processor 201 may be any conventional processor or the like.
The general purpose processor 201 may also be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the data processing method of the present application may be implemented by integrated logic circuits of hardware in the general-purpose processor 201 or by instructions in the form of software.
The Memory 202 may be a Read-Only Memory (ROM), a Random Access Memory (RAM), or other Memory. In the embodiment of the present application, the memory 202 is used for storing data and various software programs, for example, a program for optimizing the neural network model according to the logical relationship of the glue operator in the embodiment of the present application, and the like.
Alternatively, in embodiments of the present application, the memory may include a physical device for storing information, typically a medium that digitizes the information and stores it in an electrical, magnetic, or optical manner. The memory according to this embodiment may further include: devices that store information using electrical energy, such as RAM, ROM, etc.; devices that store information using magnetic energy, such as hard disks, floppy disks, tapes, core memories, bubble memories, usb disks; devices for storing information optically, such as CDs or DVDs. Of course, there are other ways of memory, such as quantum memory, graphene memory, and so forth.
Communication interface 204 enables communication between computer device 20 and other devices or communication networks using transceiver means, such as, but not limited to, transceivers. For example, model files sent by other devices may be received via communication interface 204.
The artificial intelligence processor 205 may be mounted as a coprocessor to a main CPU (host CPU), which assigns tasks to it. In practical applications, the artificial intelligence processor 205 may implement one or more operations. For example, taking a neural network processing unit (NPU) as an example, the core portion of the NPU is an arithmetic circuit, and a controller controls the arithmetic circuit to extract matrix data from the memory 202 and perform multiply-add operations.
Optionally, the artificial intelligence processor 205 may include 8 clusters (clusters), each cluster including 4 artificial intelligence processor cores.
Alternatively, artificial intelligence processor 205 may be a reconfigurable architecture artificial intelligence processor. Here, the reconfigurable architecture means that if a certain artificial intelligent processor can flexibly change its own architecture according to different application requirements by using reusable hardware resources, so as to provide an architecture matching with each specific application requirement, then the artificial intelligent processor is called a reconfigurable computing system, and its architecture is called a reconfigurable architecture.
It should be understood that computer device 20 is only one example provided for the embodiments of the present application and that computer device 20 may have more or fewer components than shown, may combine two or more components, or may have a different configuration implementation of components.
The data processing method provided in the embodiment of the present application, whose flowchart is shown in fig. 3, is used below to specifically describe how the neural network model is optimized in the data preprocessing stage in the embodiment of the present application. The method may include, but is not limited to, the following steps:
step S300, a general processor obtains a calculation graph corresponding to the neural network model; wherein the calculation graph comprises a glue operator; the glue operator is used for adjusting parameters of tensor data in the calculation graph.
In an embodiment of the present application, a "Neural Network model," also referred to as a model, such as a "first Neural Network model," a "second Neural Network model," or a "third Neural Network model," may receive input data and generate a prediction output based on the received input data and current model parameters.
In the embodiment of the present application, the neural network model includes a glue operator. Specifically, the glue operator may include a reshape operator, a transpose operator, a concat operator, a split operator, and the like, and may further include other glue operators that may be used to adjust a format of tensor data in the neural network model, a shape of the tensor data, and arrangement of the tensor data in the memory, which is not specifically limited in the embodiment of the present application.
As previously mentioned, the arrangement of tensor data in memory may include, but is not limited to: tensor data is stored in a memory in a tightly continuous storage manner, tensor data is stored in a memory in a proper dimensional sequence, tensor data is stored in a memory in a discontinuous storage manner (for example, a memory distribution containing stride), a segmented layout, and the like.
In the embodiment of the application, the computation graph is a core data structure representing the neural network computation, and it reflects the scale of data in the neural network computation, the types of computation, and the complex dependency relationships between data and computation. Specifically, the graph structure has two basic elements: nodes and edges. The nodes are connected by directed edges, indicating that one data entity is obtained from another through a specific computation.
In the embodiments of the present application, an operator refers to a function that implements a specific function. For example, the reshape operator is used to re-interpret the shape of tensor data. For another example, the transpose operator is used to adjust the dimension order of tensor data.
In the embodiment of the application, the directed edge may be used to represent the connection relationship between the operators, and may also be used to represent the execution sequence of the artificial intelligence processor when executing the neural network model.
S302, the general processor optimizes the calculation graph according to the logic relation of the glue operator in the calculation graph to obtain an optimization result; the logical relation of the glue operators comprises the logical relation between the concat operators, or the logical relation between the concat operators and other operators; the other operators comprise any one of reshape operator, transpose operator and split operator.
In one possible implementation, the logical relationship of the glue operator includes a logical relationship between concat operators, for example, a plurality of consecutive concat operators; in another possible implementation, the logical relationship of the glue operator includes the logical relationship of the concat operator with other operators, for example, the concat operator is adjacent to the reshape operator; as another example, the concat operator is adjacent to the transpose operator; as another example, the concat operator is adjacent to the split operator, and so on. Here, two operators being adjacent means that the output data of one operator serves as the input data of the other operator.
In the embodiment of the present application, the logical relationship of the glue operator should be understood as the execution logic of the computer device in the process of executing the program code of the neural network model. For example, during the execution of a certain program code, a computer device executes a reshape operator first and then a transpose operator, in which case: the computer device takes the output tensor data of the reshape operator as the input tensor data of the transpose operator.
Specifically, in the embodiment of the present application, the logical relationship of the glue operator may include the following situations, which are described in detail below:
the first case: the output tensor data of the plurality of reshape operators are the input tensor data of the concat operator.
In a specific implementation, the logical relationship of the glue operator includes that output tensor data of a plurality of reshape operators are input tensor data of a concat operator; the general processor optimizes the calculation graph according to the logical relation of the glue operator in the calculation graph, and the method comprises the following steps:
and when the input tensors corresponding to the reshape operators are different in length of only one dimension at most, taking the output tensor data of the concat operator as the input tensor data of the reshape operators.
In the embodiment of the present application, the dimension refers to a dimension of tensor data in a computational graph in a neural network model. For example, taking a convolutional neural network as an example, the dimensions of tensor data in the computation graph in the convolutional neural network model may generally include 4 dimensions, which are N representing the batch size of data processed by the current computation, C representing the number of feature images, and H and W representing the feature image size, respectively.
In the embodiment of the present application, as shown in (a) in fig. 4A, the computation graph corresponding to the neural network model includes a concat operator and a plurality of reshape operators, where output tensor data of the plurality of reshape operators are input tensor data of the concat operator, and when an input tensor corresponding to each of the plurality of reshape operators has a length that is different by only one dimension at most, for example, a length in the W dimension, in this case, as shown in (b) in fig. 4A, the output tensor data of the concat operator is used as input tensor data of the plurality of reshape operators.
For ease of understanding, the following description is made with reference to a specific example. The tensor A = [3,4,5] and the tensor B = [3,6,5] pass through their respective reshape operators to obtain the tensor C = [6,2,5] and the tensor D = [9,2,5]; meanwhile, the tensor C and the tensor D pass through the concat operator to obtain the tensor E = [15,2,5]. Analyzing the tensors A and B (the input tensors of the reshape operators), it can be seen that the tensors A and B differ in length in only one dimension (the dimension of length 4 in the tensor A and the dimension of length 6 in the tensor B). In this case, the output tensor data of the concat operator is taken as the input tensor data of the reshape operator, so the implementation process can be described as follows: the tensor A = [3,4,5] and the tensor B = [3,6,5] pass through the concat operator to obtain the tensor C' = [3,10,5], and the tensor C' then passes through the reshape operator to obtain the tensor D' = [15,2,5]. It will be appreciated that, since the above-described optimization operation can improve the overall performance of the neural network model, the resource consumption of the computer device can be reduced when the processor (e.g., a general-purpose processor CPU or a dedicated artificial intelligence processor) runs the optimized neural network model.
It should be noted that, in the embodiment of the present application, when the plurality of reshape operators are a plurality of consecutive reshape operators, the consecutive reshape operators may be merged into one reshape operator. For example, the reshape1 operator is adjacent to the reshape2 operator: the tensor A = [A1, A2, A3, ..., An] is transformed by the reshape1 operator into the tensor B = [B1, B2, B3, ..., Bn], and the tensor B is then transformed by the reshape2 operator into the tensor C = [C1, C2, C3, ..., Cn]. It can be understood that the reshape3 operator obtained by merging the reshape1 operator with the reshape2 operator takes the tensor A as input and the tensor C as output. For example, A = [1,32,1,1] becomes B = [1,4,4,2] after the reshape1 operator and C = [16,2] after the reshape2 operator. With the technical solution described in this application, the reshape1 operator and the reshape2 operator are merged into a reshape3 operator, and the tensor A is changed directly from A = [1,32,1,1] into the tensor C = [16,2] after passing through the reshape3 operator. When a processor (e.g., a general-purpose processor CPU or a dedicated artificial intelligence processor) runs this neural network model, since the model has been optimized, the resource consumption of the computer device can be reduced.
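The merging of consecutive reshape operators can be checked with the following NumPy sketch; the shapes follow the example above.

    import numpy as np

    A = np.arange(32).reshape(1, 32, 1, 1)

    # Two consecutive reshape operators ...
    B = A.reshape(1, 4, 4, 2)          # reshape1
    C = B.reshape(16, 2)               # reshape2

    # ... can be merged into a single reshape operator with the same output.
    C_merged = A.reshape(16, 2)        # reshape3
    assert np.array_equal(C, C_merged)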
The second case: the output tensor data of the plurality of transpose operators are the input tensor data of the concat operator.
In a specific implementation, the logical relationship of the glue operator includes that output tensor data of a plurality of transpose operators are input tensor data of a concat operator; the general processor optimizes the calculation graph according to the logical relation of the glue operator in the calculation graph, and the method comprises the following steps:
and when the perm parameters corresponding to the plurality of transpose operators are the same, taking the output tensor data of the concat operator as the input tensor data of the plurality of transpose operators.
As previously mentioned, the transpose operator can be expressed as transpose(a, perm=None, name='transpose'); that is, a transpose operator may carry a perm parameter. In the embodiment of the present application, the perm parameter is a full permutation of the natural number sequence [1, 2, 3, ..., n], and different full permutations represent different transpose operators.
Specifically, a full permutation is defined as follows: taking m (m ≤ n) elements from n different elements and arranging them in a certain order is called a permutation of m out of the n elements; when m = n, all such arrangements are called full permutations. For example, the full permutations of the three elements 1, 2, 3 are: 1,2,3; 1,3,2; 2,1,3; 2,3,1; 3,1,2; 3,2,1.
In this embodiment of the present application, saying that the perm parameters corresponding to a plurality of transpose operators are the same means that the full permutations corresponding to the plurality of transpose operators are the same.
In the embodiment of the present application, as shown in (a) in fig. 4B, the computation graph corresponding to the neural network model includes a concat operator and a plurality of transit operators, where output tensor data of the plurality of transit operators is input tensor data of the concat operator, and when perm parameters corresponding to the plurality of transit operators are the same, as shown in (B) in fig. 4B, the output tensor data of the concat operator is used as input tensor data of the plurality of transit operators.
For ease of understanding, the following description is made with reference to a specific example. The tensor A = [3,4,5] and the tensor B = [3,6,5] pass through their respective transpose operators (specifically, the perm parameter corresponding to each transpose is [1,0,2]) to obtain the tensor C = [4,3,5] and the tensor D = [6,3,5]; meanwhile, the tensor C and the tensor D pass through the concat operator to obtain the tensor E = [10,3,5]. In this case, the output tensor data of the concat operator is taken as the input tensor data of the plurality of transpose operators, so the implementation process can be described as follows: the tensor A = [3,4,5] and the tensor B = [3,6,5] pass through the concat operator to obtain the tensor C' = [3,10,5], and the tensor C' then passes through the transpose operator to obtain the tensor D' = [10,3,5]. It will be appreciated that, since the above-described optimization operation can improve the overall performance of the neural network model, the resource consumption of the computer device can be reduced when the processor (e.g., a general-purpose processor CPU or a dedicated artificial intelligence processor) runs the optimized neural network model.
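The example above can be checked at the value level with the following NumPy sketch, assuming that after the adjustment the original concatenation axis (axis 0 of the transposed tensors) is remapped through the perm parameter to axis 1 of the original inputs; the perm here is written 0-based.

    import numpy as np

    A = np.random.rand(3, 4, 5)
    B = np.random.rand(3, 6, 5)
    perm = (1, 0, 2)

    # Before optimization: transpose each input, then concatenate along axis 0.
    E = np.concatenate([np.transpose(A, perm), np.transpose(B, perm)], axis=0)

    # After optimization: concatenate first along the remapped axis, then
    # transpose once.
    E_opt = np.transpose(np.concatenate([A, B], axis=perm[0]), perm)

    assert E.shape == (10, 3, 5) and np.array_equal(E, E_opt)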
It should be noted that, in the embodiment of the present application, when the plurality of transpose operators are consecutive transpose operators, the consecutive transpose operators may be combined to obtain one transpose operator. Specifically, the M consecutive transpose operators include a first transpose operator and a second transpose operator; the merging the M consecutive transpose operators into one transpose operator includes:
determining perm parameters corresponding to the first transpose operator and the second transpose operator respectively;
and determining a first parameter according to the perm parameters corresponding to the first and second transpose operators, wherein the first parameter is the perm parameter corresponding to the merged transpose operator.
In a specific implementation, the determining a first parameter according to perm parameters corresponding to the first transpose operator and the second transpose operator includes:
in determining the first parameter, calculating according to the following formula:
perm3[i]=perm1[perm2[i]]
wherein perm3 represents the first parameter, perm1 represents a perm parameter corresponding to the first transpose operator, and perm2 represents a perm parameter corresponding to the second transpose operator.
Here, the square brackets [ ] denote taking an element of the array.
For example, the perm parameter of the first transpose operator is perm1 = [1,2], and the perm parameter of the second transpose operator is perm2 = [2,1]. When i = 1, perm3[1] = perm1[perm2[1]] = perm1[2] = 2; when i = 2, perm3[2] = perm1[perm2[2]] = perm1[1] = 1. Thus, the perm parameter perm3 corresponding to the merged transpose operator is [2,1]. Further, the merged transpose operator transposes the order of the tensor according to the determined perm3 parameter.
For ease of understanding, the following description is set forth in connection with a specific example. The transpose_1423 operator is adjacent to the transpose_1243 operator: the tensor A = [1,4,3,2] becomes the tensor B = [1,2,4,3] after passing through the transpose_1423 operator, and the tensor B becomes the tensor C = [1,2,3,4] after passing through the transpose_1243 operator. With the technical solution described in this application, the transpose_1423 operator and the transpose_1243 operator are merged into a transpose_1432 operator, and the tensor A = [1,4,3,2] is changed directly into the tensor C = [1,2,3,4] after passing through the transpose_1432 operator. When a processor (e.g., a general-purpose processor CPU or a dedicated artificial intelligence processor) runs this neural network model, since the model has been optimized, the resource consumption of the computer device can be reduced.
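The merging of consecutive transpose operators and the formula perm3[i] = perm1[perm2[i]] can be checked with the following NumPy sketch; the perms are the 0-based equivalents of the 1-based values used in the example above.

    import numpy as np

    A = np.arange(1 * 4 * 3 * 2).reshape(1, 4, 3, 2)

    perm1 = (0, 3, 1, 2)            # transpose_1423 (0-based)
    perm2 = (0, 1, 3, 2)            # transpose_1243 (0-based)

    # perm3[i] = perm1[perm2[i]]
    perm3 = tuple(perm1[i] for i in perm2)
    assert perm3 == (0, 3, 2, 1)    # transpose_1432 (0-based)

    C = np.transpose(np.transpose(A, perm1), perm2)
    C_merged = np.transpose(A, perm3)
    assert C.shape == (1, 2, 3, 4) and np.array_equal(C, C_merged)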
The third situation: the output tensor data of the split operator is the input tensor data of the concat operator.
In a specific implementation, the logical relationship of the glue operator includes that output tensor data of the split operator is input tensor data of the concat operator; the general processor optimizes the calculation graph according to the logical relation of the glue operator in the calculation graph, and the method comprises the following steps:
and in the case that the concat operator and the split operator operate in the same dimension, combining the concat operator and the split operator for elimination.
In the embodiment of the present application, as shown in (a) in fig. 4C, the computation graph corresponding to the neural network model includes a concat operator and a split operator, where output tensor data of the split operator is input tensor data of the concat operator, and in a case where the same dimension is satisfied when the concat operator and the split operator operate respectively, for example, the concat operator and the split operator are the same in the C dimension during execution, in this case, as shown in (b) in fig. 4C, the concat operator and the split operator are merged and eliminated.
For ease of understanding, the following description is made with reference to a specific example. The tensor A = [3,10,5] passes through a split operator to obtain the tensor B = [3,4,5] and the tensor C = [3,6,5]; meanwhile, the tensor B and the tensor C pass through a concat operator to obtain the tensor D = [3,10,5]. Since the concat operator and the split operator operate on the same dimension, that is, the output tensor data of the split operator are the input tensor data of the concat operator, in this case the concat operator and the split operator are merged and eliminated. It will be appreciated that, since the above-described optimization operation can improve the overall performance of the neural network model, the resource consumption of the computer device can be reduced when the processor (e.g., a general-purpose processor CPU or a dedicated artificial intelligence processor) runs the optimized neural network model.
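The elimination of a split operator followed by a concat operator on the same dimension can be checked with the following NumPy sketch; the shapes follow the example above.

    import numpy as np

    A = np.arange(3 * 10 * 5).reshape(3, 10, 5)

    # split followed by concat on the same dimension reproduces the input ...
    B, C = np.split(A, [4], axis=1)
    D = np.concatenate([B, C], axis=1)
    assert np.array_equal(D, A)

    # ... so the pair of operators can be merged and eliminated, and downstream
    # operators can read the tensor A directly.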
A fourth scenario: n consecutive concat operators.
In a specific implementation, the logical relationship of the glue operator includes N consecutive concat operators; wherein N is a positive integer greater than or equal to 2; the general processor optimizes the calculation graph according to the logical relation of the glue operator in the calculation graph, and the method comprises the following steps:
and under the condition that the operation dimensionality of the N continuous concat operators is the same dimensionality, combining the N continuous concat operators.
In this embodiment of the present application, as shown in (a) in fig. 4D, the computation graph corresponding to the neural network model includes a plurality of concat operators that operate in the same dimension, for example, the N dimension. In this case, the computer device may merge the plurality of concat operators to obtain one concat operator; specifically, please refer to the optimized structure shown in (b) in fig. 4D.
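The equivalence that justifies this merging can be illustrated as follows, assuming NumPy concatenate semantics; the example is a sketch rather than the implementation of the application.

```python
import numpy as np

x = np.ones((2, 3))
y = np.ones((2, 4))
z = np.ones((2, 5))

# Two consecutive concat operators on the same axis ...
nested = np.concatenate([np.concatenate([x, y], axis=1), z], axis=1)
# ... collapse into a single concat over all inputs.
merged = np.concatenate([x, y, z], axis=1)

assert nested.shape == merged.shape == (2, 12)
assert np.array_equal(nested, merged)
```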
It should be noted that, in the embodiment of the present application, the logical relationships of the glue operators in the computation graph corresponding to the neural network model include, for example, the logical relationship between a concat operator and a plurality of reshape operators, the logical relationship between a concat operator and a plurality of transpose operators, the logical relationship between a concat operator and a split operator, and N consecutive concat operators. When the computer device optimizes the computation graph according to these logical relationships, it may perform at least one of the optimization operations. For example, the computation graph corresponding to the neural network model is optimized according to the logical relationship between the concat operator and the plurality of reshape operators; for another example, the computation graph is optimized according to the logical relationship between the concat operator and the plurality of transpose operators. A combination of one or more optimization operations may also be performed, or the optimization operations may be performed for all situations that can be optimized. Further, combined optimization may be performed on the basis of a situation that has already been optimized. For example, the computation graph corresponding to the neural network model includes a plurality of reshape operators, a first concat operator, and a second concat operator, where, before the optimization, the output tensor data of the reshape operators are the input tensor data of the first concat operator, and the output tensor data of the first concat operator are the input tensor data of the second concat operator. The computer device may first merge the first concat operator and the second concat operator according to the optimization manner described above to obtain an optimized third concat operator. In this case, the logical relationship between the remaining operators is that the output tensor data of the reshape operators are the input tensor data of the third concat operator. The computer device determines that this logical relationship belongs to one of the optimizable implementations described in the application, and can therefore continue to optimize according to the logical relationship between the reshape operators and the third concat operator; specifically, the third concat operator is moved before the reshape operators, so as to obtain the finally optimized neural network model, and so on. A sketch of chaining such optimization passes is given below.
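By way of illustration only, the following sketch chains rewrite passes over a toy graph representation until no pass can rewrite the graph any more; the node format and the names merge_consecutive_concats and optimize are assumptions made for this sketch and are not the data structures of the application.

```python
def merge_consecutive_concats(nodes):
    """Fold a concat whose input is produced by another concat on the same axis
    (the parent concat is assumed here to have no other consumers)."""
    for child in nodes:
        if child["op"] != "concat":
            continue
        for name in list(child["inputs"]):
            parent = next((n for n in nodes if n["name"] == name), None)
            if parent is not None and parent["op"] == "concat" and parent["axis"] == child["axis"]:
                i = child["inputs"].index(name)
                child["inputs"][i:i + 1] = parent["inputs"]  # splice the parent's inputs in place
                nodes.remove(parent)                          # drop the now-redundant concat
                return True                                   # report one rewrite per call
    return False

def optimize(nodes, passes):
    # Keep applying the passes until none of them rewrites the graph.
    changed = True
    while changed:
        changed = any(p(nodes) for p in passes)
    return nodes

graph = [
    {"name": "concat_1", "op": "concat", "axis": 1, "inputs": ["x", "y"]},
    {"name": "concat_2", "op": "concat", "axis": 1, "inputs": ["concat_1", "z"]},
]
optimize(graph, [merge_consecutive_concats])
# graph now holds a single concat over ["x", "y", "z"]
```

Passes for the reshape, transpose and split cases described above could be added to the same pass list, so that a rewrite produced by one pass (such as the third concat operator in the example above) can be picked up and further optimized by another.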
S304, the general processor acquires corresponding binary instructions according to the optimization result, so as to distribute the binary instructions to the corresponding artificial intelligence processor to execute tasks.
In the embodiment of the application, the general processor can call the compiling interface of the artificial intelligence learning library to compile according to the optimization result of the neural network model, so as to obtain the corresponding binary instructions. The binary instructions are processed by the runtime library to generate a machine learning processing task. In practical application, the general processor can put the machine learning processing task into a task queue, and finally the driver schedules the machine learning processing task in the task queue to be executed by the artificial intelligence processor, so as to obtain an operation result.
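Purely as an illustration of the flow described above, the following sketch mocks the compile-and-dispatch steps; every name in it (compile_with_learning_library, build_task, TaskQueue, run_on_ai_processor) is hypothetical and merely stands in for the compiling interface of the artificial intelligence learning library, the runtime library, the task queue and the driver mentioned in the text.

```python
from collections import deque

def compile_with_learning_library(optimized_graph):
    """Stand-in for the compiling interface: lower the optimized graph to binary instructions."""
    return b"\x01\x02\x03"  # placeholder binary instructions

def build_task(binary_instructions, inputs):
    """Stand-in for the runtime library: wrap the binary instructions into a processing task."""
    return {"instructions": binary_instructions, "inputs": inputs}

class TaskQueue:
    """Stand-in for the task queue that the general processor fills and the driver drains."""
    def __init__(self):
        self._tasks = deque()

    def put(self, task):
        self._tasks.append(task)

    def get(self):
        return self._tasks.popleft()

def run_on_ai_processor(optimized_graph, inputs):
    binary = compile_with_learning_library(optimized_graph)  # done on the general processor
    queue = TaskQueue()
    queue.put(build_task(binary, inputs))   # enqueue the machine learning processing task
    task = queue.get()                      # the driver schedules the task
    return {"scheduled_task": task}         # to be executed by the artificial intelligence processor
```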
In the embodiment of the application, a machine learning processing task refers to a task that a neural network model completes by acquiring learning capability. In order to improve the practicability of the neural network model, different neural network models correspond to different machine learning processing tasks. For example, the machine learning processing task corresponding to a deep learning neural network model may be image classification, text classification, and the like; the machine learning processing task corresponding to a convolutional neural network model may be image recognition, video classification, and the like; and the machine learning processing task corresponding to a Long Short-Term Memory network model (Long Short Term Memory Network, LSTM) may be speech recognition, picture description, natural language processing, and the like.
In an embodiment of the present application, the request of the machine learning processing task may be an execution instruction input by a user for the neural network model. When the computer equipment receives a request of a machine learning processing task, the corresponding neural network model is obtained according to the type of the machine learning processing task, the neural network model is operated on the artificial intelligence processor, and then an operation result aiming at the machine learning processing task can be obtained. It should be noted that the neural network model run by the processor (e.g., general purpose processor, artificial intelligence processor) is an optimized neural network model.
In the embodiment of the present application, the operation result of the machine learning processing task refers to the result obtained when the computer device executes the machine learning processing task, and may include, but is not limited to: the accuracy of the neural network model when executing the machine learning processing task; the running time of the neural network model when executing the machine learning processing task, and so on. Further optionally, the computer device may output the operation result, for example, the computer device displays the operation result through a display screen. It can be understood that, since at least one optimization operation is performed on the computation graph corresponding to the neural network model, the overall performance of the neural network model can be improved, redundant computation can be reduced when the artificial intelligence processor calls the optimized neural network model to execute the machine learning processing task, and the resource consumption of the computer device can be further reduced.
By implementing the embodiment of the application, the computer equipment optimizes the neural network model according to the logical relation of the glue operator in the neural network model so as to improve the overall performance of the neural network model. When the computer device calls the optimized neural network model to execute the machine learning processing task, the resource consumption of the computer device can be reduced.
It is noted that, while for simplicity of explanation the foregoing method embodiments have been described as a series or combination of acts, it will be appreciated by those skilled in the art that the present disclosure is not limited by the order of acts, as some steps may, in accordance with the present disclosure, occur in other orders or concurrently. Further, those skilled in the art will also appreciate that the embodiments described in the specification are exemplary embodiments and that the acts and modules referred to are not necessarily required by the disclosure.
It should be further noted that, although the steps in the flowchart of fig. 3 are shown in sequence as indicated by the arrows, these steps are not necessarily executed in that order. Unless explicitly stated otherwise herein, the execution of these steps is not strictly limited to the order shown, and they may be performed in other orders. Moreover, at least a portion of the steps in fig. 3 may include multiple sub-steps or multiple stages, which are not necessarily performed at the same time but may be performed at different times; their order of execution is not necessarily sequential, and they may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.
The method of the embodiments of the present application has been described in detail above. In order to better implement the above-described aspects of the embodiments of the present application, a corresponding apparatus for implementing the above aspects in a coordinated manner is provided below.
Referring to fig. 5, fig. 5 is a schematic structural diagram of a data processing apparatus provided in an embodiment of the present application, where the apparatus 50 may include at least:
an obtaining unit 500, configured to obtain a computation graph corresponding to the neural network model; wherein the computation graph comprises a glue operator; the glue operator is used for adjusting parameters of tensor data in the computation graph;
the optimizing unit 502 is configured to optimize the computation graph according to the logical relationship of the glue operator in the computation graph, so as to obtain an optimization result; the logical relation of the glue operators comprises the logical relation between the concat operators, or the logical relation between the concat operators and other operators; the other operators comprise any one of a reshape operator, a transpose operator and a split operator;
and the execution unit 504 is configured to obtain a corresponding binary instruction according to the optimization result, so as to allocate the binary instruction to a corresponding artificial intelligence processor to execute a task.
In one possible implementation manner, the logical relationship of the glue operator includes that the output tensor data of the plurality of reshape operators are the input tensor data of the concat operator; the optimization unit 502 is specifically configured to:
and when the input tensors corresponding to the reshape operators are different in length of only one dimension at most, taking the output tensor data of the concat operator as the input tensor data of the reshape operators.
In one possible implementation, the logical relationship of the glue operator includes that the output tensor data of the plurality of transpose operators are the input tensor data of the concat operator; the optimization unit 502 is specifically configured to:
and when the perm parameters corresponding to the plurality of transpose operators are the same, taking the output tensor data of the concat operator as the input tensor data of the plurality of transpose operators.
In one possible implementation, the logical relationship of the glue operator includes that the output tensor data of the split operator is the input tensor data of the concat operator; the optimization unit 502 is specifically configured to:
and in the case that the concat operator and the split operator operate in the same dimension, combining the concat operator and the split operator for elimination.
In one possible implementation manner, the logical relationship of the glue operator includes N consecutive concat operators; wherein N is a positive integer greater than or equal to 2; the optimization unit 502 is specifically configured to:
and under the condition that the operation dimensionality of the N continuous concat operators is the same dimensionality, combining the N continuous concat operators.
It should be understood that the above-described apparatus embodiments are merely exemplary, and that the apparatus of the present disclosure may be implemented in other ways. For example, the division of the units/modules in the above embodiments is only one logical function division, and there may be another division manner in actual implementation. For example, multiple units, modules, or components may be combined, or may be integrated into another system, or some features may be omitted, or not implemented.
The units or modules described as separate parts may or may not be physically separate. A component described as a unit or a module may or may not be a physical unit, and may be located in one apparatus or may be distributed over a plurality of apparatuses. The solution of the embodiments in the present disclosure can be implemented by selecting some or all of the units according to actual needs.
Furthermore, it should be noted that the present application also provides a computer storage medium for storing computer software instructions for the computer device shown in fig. 2, which contains a program for executing the method embodiments described above. By executing the stored program, the neural network model can be optimized according to the logical relation of the glue operator in the neural network model, so that the overall performance of the neural network model is improved. When the computer device calls the optimized neural network model, the resource consumption of the computer device can be reduced because redundant operation does not need to be executed.
As can be seen from the above, according to the data processing method, the data processing apparatus, the computer device, and the storage medium provided in the embodiments of the present application, the method can optimize the neural network model according to the logical relationship of the glue operator in the neural network model, so as to improve the overall performance of the neural network model. When the computer device calls the optimized neural network model, the resource consumption of the computer device can be reduced because redundant operation does not need to be executed.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Further, the foregoing may be better understood in light of the following clauses:
For example, Clause A1. A data processing method, comprising:
a general processor acquires a calculation graph corresponding to a neural network model; wherein the calculation graph comprises a glue operator; the glue operator is used for adjusting parameters of tensor data in the calculation graph;
the general processor optimizes the calculation graph according to the logical relation of the glue operator in the calculation graph to obtain an optimization result; the logical relation of the glue operators comprises the logical relation between the concat operators, or the logical relation between the concat operators and other operators; the other operators comprise any one of a reshape operator, a transpose operator and a split operator;
and the general processor acquires the corresponding binary instructions according to the optimization result so as to distribute the binary instructions to the corresponding artificial intelligence processors to execute tasks.
A2. According to the method described in A1, the logical relationship of the glue operator includes that the output tensor data of the plurality of reshape operators are the input tensor data of the concat operator.
A3. The method according to A2, wherein the general purpose processor optimizes the computation graph according to the logical relationship of glue operators in the computation graph, and the method comprises the following steps:
and when the input tensors corresponding to the reshape operators are different in length of only one dimension at most, taking the output tensor data of the concat operator as the input tensor data of the reshape operators.
A4. The method of A1, wherein the logical relationship of the glue operator includes that the output tensor data of the plurality of transpose operators are the input tensor data of the concat operator.
A5. The method of A4, the general purpose processor optimizing the computation graph according to logical relationships of glue operators in the computation graph, comprising:
and when the perm parameters corresponding to the plurality of transpose operators are the same, taking the output tensor data of the concat operator as the input tensor data of the plurality of transpose operators.
A6. The method of A1, wherein the logical relationship of the glue operator includes that the output tensor data of the split operator is the input tensor data of the concat operator.
A7. The method of A6, the general purpose processor optimizing the computation graph according to logical relationships of glue operators in the computation graph, comprising:
and in the case that the concat operator and the split operator operate in the same dimension, combining the concat operator and the split operator for elimination.
A8. According to the method of A1, the logical relationship of the glue operator comprises N consecutive concat operators; wherein N is a positive integer greater than or equal to 2.
A9. The method of A8, the general purpose processor optimizing the computation graph according to logical relationships of glue operators in the computation graph, comprising:
and under the condition that the operation dimensionality of the N continuous concat operators is the same dimensionality, combining the N continuous concat operators.
B1. A data processing apparatus comprising:
the acquisition unit is used for acquiring a calculation graph corresponding to the neural network model; wherein the calculation graph comprises a glue operator; the glue operator is used for adjusting parameters of tensor data in the calculation graph;
the optimization unit is used for optimizing the calculation graph according to the logical relation of the glue operator in the calculation graph to obtain an optimization result; the logical relation of the glue operators comprises the logical relation between the concat operators, or the logical relation between the concat operators and other operators; the other operators comprise any one of a reshape operator, a transpose operator and a split operator;
and the execution unit is used for acquiring the corresponding binary instructions according to the optimization result so as to distribute the binary instructions to the corresponding artificial intelligence processors to execute tasks.
B2. The apparatus of B1, wherein the logical relationship of the glue operator includes that the output tensor data of the plurality of reshape operators are the input tensor data of the concat operator.
B3. According to the apparatus of B2, the optimization unit is specifically configured to:
and when the input tensors corresponding to the reshape operators are different in length of only one dimension at most, taking the output tensor data of the concat operator as the input tensor data of the reshape operators.
B4. The apparatus of B1, the logical relationship of the glue operator comprising the output tensor data of the plurality of transpose operators being the input tensor data of the concat operator.
B5. The apparatus according to B4, wherein the optimization unit is specifically configured to:
and when the perm parameters corresponding to the plurality of transpose operators are the same, taking the output tensor data of the concat operator as the input tensor data of the plurality of transpose operators.
B6. The apparatus of B1, the logical relationship of the glue operator comprising the output tensor data of the split operator being the input tensor data of the concat operator.
B7. The apparatus according to B6, wherein the optimization unit is specifically configured to:
and in the case that the concat operator and the split operator operate in the same dimension, combining the concat operator and the split operator for elimination.
B8. According to the apparatus of B1, the logical relationship of the glue operator includes N consecutive concat operators; wherein N is a positive integer greater than or equal to 2.
B9. The apparatus according to B8, wherein the optimization unit is specifically configured to:
and under the condition that the operation dimensionality of the N continuous concat operators is the same dimensionality, combining the N continuous concat operators.
C1. A computer device comprising a processor and a memory, the processor and the memory being interconnected, wherein the processor comprises a general purpose processor and an artificial intelligence processor, the memory is for storing a computer program comprising program instructions, and the processor is configured to invoke the program instructions to perform the method of any one of clauses A1-A9.
D1. A computer-readable storage medium storing a computer program comprising program instructions that, when executed by a processor, cause the processor to perform the method of any one of clauses A1-A9.
The foregoing detailed description of the embodiments of the present disclosure has been presented for purposes of illustration and description; it is intended to be exemplary only, and is not intended to be exhaustive or to limit the disclosure to the precise forms disclosed. A person skilled in the art may, according to the idea of the present disclosure, make changes or modifications to the specific embodiments and application scope. In view of the above, this description should not be taken as limiting the present disclosure.

Claims (12)

1. A data processing method, comprising:
a general processor acquires a calculation graph corresponding to a neural network model; wherein the calculation graph comprises a glue operator; the glue operator is used for adjusting parameters of tensor data in the calculation graph;
the general processor optimizes the calculation graph according to the logical relation of the glue operator in the calculation graph to obtain an optimization result; the logical relation of the glue operators comprises the logical relation between the concat operators, or the logical relation between the concat operators and other operators; the other operators comprise any one of a reshape operator, a transpose operator and a split operator;
and the general processor acquires the corresponding binary instructions according to the optimization result so as to distribute the binary instructions to the corresponding artificial intelligence processors to execute tasks.
2. The method of claim 1, wherein the logical relationship of the glue operator includes that the output tensor data of the plurality of reshape operators is the input tensor data of the concat operator.
3. The method of claim 2, wherein the general purpose processor optimizes the computation graph according to logical relationships of glue operators in the computation graph, comprising:
and when the input tensors corresponding to the reshape operators are different in length of only one dimension at most, taking the output tensor data of the concat operator as the input tensor data of the reshape operators.
4. The method of claim 1, wherein the logical relationship of the glue operator comprises the output tensor data of the plurality of transpose operators being the input tensor data of a concat operator.
5. The method of claim 4, wherein the general purpose processor optimizes the computation graph according to logical relationships of glue operators in the computation graph, comprising:
and when the perm parameters corresponding to the plurality of transpose operators are the same, taking the output tensor data of the concat operator as the input tensor data of the plurality of transpose operators.
6. The method of claim 1, wherein the logical relationship of the glue operator comprises that the output tensor data of the split operator is the input tensor data of the concat operator.
7. The method of claim 6, wherein the general purpose processor optimizes the computation graph according to logical relationships of glue operators in the computation graph, comprising:
and in the case that the concat operator and the split operator operate in the same dimension, combining the concat operator and the split operator for elimination.
8. The method of claim 1, wherein the logical relationship of the glue operator comprises N consecutive concat operators; wherein N is a positive integer greater than or equal to 2.
9. The method of claim 8, wherein the general purpose processor optimizes the computation graph according to logical relationships of glue operators in the computation graph, comprising:
and under the condition that the operation dimensionality of the N continuous concat operators is the same dimensionality, combining the N continuous concat operators.
10. A data processing apparatus, comprising:
the acquisition unit is used for acquiring a calculation graph corresponding to the neural network model; wherein the calculation graph comprises a glue operator; the glue operator is used for adjusting parameters of tensor data in the calculation graph;
the optimization unit is used for optimizing the calculation graph according to the logical relation of the glue operator in the calculation graph to obtain an optimization result; the logical relation of the glue operators comprises the logical relation between the concat operators, or the logical relation between the concat operators and other operators; the other operators comprise any one of a reshape operator, a transpose operator and a split operator;
and the execution unit is used for acquiring the corresponding binary instructions according to the optimization result so as to distribute the binary instructions to the corresponding artificial intelligence processors to execute tasks.
11. A computer device comprising a processor and a memory, the processor and memory being interconnected, wherein the processor comprises a general purpose processor and an artificial intelligence processor, the memory being for storing a computer program comprising program instructions, the processor being configured to invoke the program instructions to perform the method of any of claims 1-9.
12. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program comprising program instructions that, when executed by a processor, cause the processor to carry out the method according to any one of claims 1-9.
CN201910910120.8A 2019-09-24 2019-09-24 Data processing method and device, computer equipment and storage medium Pending CN111401538A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910910120.8A CN111401538A (en) 2019-09-24 2019-09-24 Data processing method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910910120.8A CN111401538A (en) 2019-09-24 2019-09-24 Data processing method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN111401538A true CN111401538A (en) 2020-07-10

Family

ID=71428300

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910910120.8A Pending CN111401538A (en) 2019-09-24 2019-09-24 Data processing method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111401538A (en)


Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112669852A (en) * 2020-12-15 2021-04-16 北京百度网讯科技有限公司 Memory allocation method and device and electronic equipment
CN112669852B (en) * 2020-12-15 2023-01-31 北京百度网讯科技有限公司 Memory allocation method and device and electronic equipment
CN112862071A (en) * 2021-01-28 2021-05-28 展讯通信(上海)有限公司 Data processing method and device
CN112767230A (en) * 2021-02-26 2021-05-07 清华大学 GPU graph neural network optimization method and device
CN113297860A (en) * 2021-06-24 2021-08-24 上海携旅信息技术有限公司 Method, system, electronic device and storage medium for optimizing machine translation model
CN115577760A (en) * 2021-07-14 2023-01-06 华为技术有限公司 Data processing method, system and related equipment
CN115577760B (en) * 2021-07-14 2023-06-02 华为技术有限公司 Data processing method, system and related equipment
WO2024012491A1 (en) * 2022-07-15 2024-01-18 北京有竹居网络技术有限公司 Method for optimizing computing power of neural network module, chip, electronic device and medium
CN117273115A (en) * 2023-11-24 2023-12-22 上海燧原科技有限公司 Static generation method, device, equipment and medium of reverse calculation graph
CN117273115B (en) * 2023-11-24 2024-03-29 上海燧原科技股份有限公司 Static generation method, device, equipment and medium of reverse calculation graph

Similar Documents

Publication Publication Date Title
CN110659728B (en) Neural network optimization method, device, computer equipment and storage medium
CN111401510A (en) Data processing method and device, computer equipment and storage medium
WO2021057746A1 (en) Neural network processing method and apparatus, computer device and storage medium
CN111401538A (en) Data processing method and device, computer equipment and storage medium
US20220391678A1 (en) Neural network model processing method and apparatus, computer device, and storage medium
CN111401539A (en) Data processing method and device, computer equipment and storage medium
Lian et al. High-performance FPGA-based CNN accelerator with block-floating-point arithmetic
US20220391665A1 (en) Method for splitting neural network model by using multi-core processor, and related product
Liang et al. Evaluating fast algorithms for convolutional neural networks on FPGAs
CN111401511A (en) Data processing method and device, computer equipment and storage medium
US20220121903A1 (en) Method of performing splitting in neural network model by means of multi-core processor, and related product
US11740870B2 (en) Convolutional network hardware accelerator device, system and method
US20160342888A1 (en) Memory efficiency for convolutional neural networks operating on graphics processing units
CN110826708B (en) Method for realizing neural network model splitting by using multi-core processor and related product
CN111401537A (en) Data processing method and device, computer equipment and storage medium
US12086711B2 (en) Data dividing method and processor for convolution operation
Zhou et al. Addressing sparsity in deep neural networks
Wu Review on FPGA-based accelerators in deep learning
Odetola et al. 2l-3w: 2-level 3-way hardware–software co-verification for the mapping of convolutional neural network (cnn) onto fpga boards
CN111860825A (en) Data processing method and related product
Lin Convolutional Layer Implementations in High-Level Synthesis for FPGAs
KR20210014897A (en) Matrix operator and matrix operation method for artificial neural network
Dube et al. Tunable precision control for approximate image filtering in an in-memory architecture with embedded neurons
Kang et al. Tensor virtualization technique to support efficient data reorganization for CNN accelerators
US20230130747A1 (en) Computer-readable recording medium storing learning program, learning method, and information processing device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
TA01 Transfer of patent application right

Effective date of registration: 20200911

Address after: Room 611-194, R & D center building, China (Hefei) international intelligent voice Industrial Park, 3333 Xiyou Road, hi tech Zone, Hefei City, Anhui Province

Applicant after: Anhui Cambrian Information Technology Co.,Ltd.

Address before: 201306 floor 6, block B, 168 Tonghui Road, Pudong New Area, Shanghai.

Applicant before: Shanghai Cambricon Information Technology Co.,Ltd.

RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20200710