CN114385182A - Data processing method, device and equipment and computer storage medium

Info

Publication number
CN114385182A
Authority
CN
China
Prior art keywords
intermediate representation, subgraph, optimized, static, computational
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111556390.7A
Other languages
Chinese (zh)
Inventor
阳柳
邬轩
杨强
李文成
顾剑
刘勇鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Phytium Technology Co Ltd
Original Assignee
Phytium Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Phytium Technology Co Ltd filed Critical Phytium Technology Co Ltd
Priority to CN202111556390.7A priority Critical patent/CN114385182A/en
Publication of CN114385182A publication Critical patent/CN114385182A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 8/00 Arrangements for software engineering
    • G06F 8/40 Transformation of program code
    • G06F 8/41 Compilation
    • G06F 8/44 Encoding
    • G06F 8/447 Target code generation
    • G06F 8/443 Optimisation
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/08 Learning methods

Abstract

The invention discloses a data processing method, apparatus, device, and computer storage medium. The method comprises: acquiring a static intermediate representation and a dynamic intermediate representation of a deep learning model, wherein each directed edge in the computational subgraph corresponding to the static intermediate representation carries static tensor data, and each directed edge in the computational subgraph corresponding to the dynamic intermediate representation carries dynamic tensor data; optimizing the static intermediate representation and the dynamic intermediate representation to obtain an optimized static intermediate representation and an optimized dynamic intermediate representation; and compiling the optimized static and dynamic intermediate representations to obtain object code executable by a back-end hardware device, so that the back-end hardware device executes the object code to realize its function. By optimizing the static and dynamic intermediate representations simultaneously, the invention achieves a better optimization effect, makes the intermediate representation independent of both the front-end framework and the back-end platform, and improves its portability so that it suits a variety of hardware platforms.

Description

Data processing method, device and equipment and computer storage medium
Technical Field
The present invention relates to the field of computer software technologies, and in particular, to a data processing method, apparatus, device, and computer storage medium.
Background
The wave of enthusiasm for artificial intelligence driven by machine learning and deep learning has kept building for years. Deep learning has been a research hotspot of artificial intelligence in recent years and has achieved breakthroughs in many fields, and today there are numerous deep learning frameworks and many hardware platforms that support them. This diversity of frameworks and hardware brings great benefits to users and is critical to the healthy development of the artificial intelligence ecosystem, but supporting multiple frameworks and hardware targets requires an enormous amount of work, which poses a significant challenge to artificial intelligence developers.
As deep learning is applied more widely, the efficiency with which deep learning algorithms train and infer on different hardware architectures draws increasing attention, and because deep learning has many different front ends and back ends, a bridge is needed to optimize and map between them effectively. An IR (Intermediate Representation) is the intermediary for translation between source code and object code during compilation; its design is critical for a compiler, as the IR must balance the completeness of compiling from source code to object code, ease of use, and the performance of compiler optimization. Competition among intermediate representations will therefore be an important arena for future frameworks. Drawing on experience from traditional compiler design, NNVM (Neural Network Virtual Machine), TVM (Tensor Virtual Machine), and XLA (Accelerated Linear Algebra) are all promising early attempts at optimization and compilation tools built around a particular intermediate representation. However, they do not perform adaptive testing and optimization across multiple hardware platforms and artificial intelligence development frameworks, so the portability of their intermediate representations is low and re-optimization is needed for each hardware platform. Moreover, in practical application the varieties of intermediate representation differ widely, and the difficulty of optimizing an intermediate representation grows exponentially with the number of expression items it contains. Based on this, how to optimize the intermediate representation is an urgent problem to be solved.
Disclosure of Invention
Embodiments of the present invention provide a data processing method, apparatus, device, and computer storage medium that optimize the static intermediate representation and the dynamic intermediate representation simultaneously, so that the optimization effect is better, independence from both the front-end framework and the back-end platform can be achieved, and the portability of the intermediate representation is improved, making it suitable for a variety of hardware platforms.
In a first aspect, an embodiment of the present invention provides a data processing method, including:
obtaining a static intermediate representation and a dynamic intermediate representation of the deep learning model; the deep learning model is a model constructed based on a front-end model framework; in the computational subgraph corresponding to the static intermediate representation, each directed edge carries static tensor data; in the computational subgraph corresponding to the dynamic intermediate representation, each directed edge carries dynamic tensor data;
converting the static intermediate representation into a static intermediate representation independent of the front-end model framework, and optimizing the converted static intermediate representation to obtain an optimized static intermediate representation;
screening, among the computational subgraphs corresponding to the dynamic intermediate representation, at least one computational subgraph according to its respective running time, to obtain at least one computational subgraph to be optimized;
optimizing the at least one computational subgraph to be optimized to obtain an optimized dynamic intermediate representation; wherein the optimized dynamic intermediate representation comprises at least an optimized computational subgraph;
compiling the optimized static intermediate representation and the optimized dynamic intermediate representation to obtain object code executable by a back-end hardware device, so that the back-end hardware device executes the object code to realize its function.
In this embodiment, the static intermediate representation and the dynamic intermediate representation are optimized simultaneously, so that the optimization effect is better, independence from both the front-end framework and the back-end platform can be achieved, and the portability of the intermediate representation is improved, making it suitable for a variety of hardware platforms.
Optionally, in the embodiment of the present invention, the optimization of the converted static intermediate representation may adopt several possible implementation manners as follows:
in a first possible implementation manner, the optimizing the converted static intermediate representation includes:
fusing nodes contained in the converted static intermediate representation so as to fuse a plurality of nodes with the same attribute in the converted static intermediate representation into the same kernel function;
or,
obtaining a layout requirement corresponding to tensor operation of each node in the converted static intermediate representation;
and performing data layout conversion on the static tensor data of the converted static intermediate representation according to the layout requirement.
In this embodiment, operator fusion is performed on the converted static intermediate representation: each group of nodes having the same attribute is fused into the same kernel function, so that intermediate results are not written back to global memory, the movement of intermediate data is reduced, and frequent reads and writes are avoided. Alternatively, data layout conversion is performed on the converted static intermediate representation so as to optimize local memory access efficiency.
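For illustration only, the following is a minimal Python sketch of attribute-based operator fusion; the Node class and function names are assumptions introduced here, not structures defined by the patent:

```python
# A toy graph whose nodes carry an attribute; nodes sharing an attribute
# are mapped to one fused kernel so intermediate results stay out of
# global memory. All names here are illustrative.
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    attribute: str  # e.g. "elementwise", "reduction"

def fuse_by_attribute(nodes):
    """Group nodes with the same attribute and emit one fused kernel per group."""
    groups = defaultdict(list)
    for node in nodes:
        groups[node.attribute].append(node)
    return {attr: "fused_kernel(" + ", ".join(n.name for n in group) + ")"
            for attr, group in groups.items()}

nodes = [Node("add0", "elementwise"), Node("relu0", "elementwise"),
         Node("sum0", "reduction")]
print(fuse_by_attribute(nodes))
# {'elementwise': 'fused_kernel(add0, relu0)', 'reduction': 'fused_kernel(sum0)'}
```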
In a second possible implementation manner, the optimizing the converted static intermediate representation includes:
fusing nodes contained in the converted static intermediate representation so as to fuse a plurality of nodes with the same attribute in the converted static intermediate representation into the same kernel function to obtain at least one fused node;
acquiring layout requirements corresponding to tensor operation of each fusion node;
and performing data layout conversion on the static tensor data of the fused static intermediate representation according to the layout requirement.
In this embodiment, operator fusion and data layout conversion are both applied to the converted static intermediate representation, so that intermediate results are not written back to global memory, the movement of intermediate data is reduced, frequent reads and writes are avoided, and local memory access efficiency is optimized.
Optionally, in the embodiment of the present invention, the at least one computational subgraph is screened according to its respective running time to obtain the at least one computational subgraph to be optimized, and the following possible implementation manners may be adopted:
in a first possible implementation manner, the screening the at least one computation sub-graph according to the respective corresponding running time of the at least one computation sub-graph to obtain at least one computation sub-graph to be optimized includes:
for each changed computational subgraph, acquiring the running time of the changed computational subgraph executed on the back-end hardware device;
and in a case that the running time is greater than a first threshold, taking the computational subgraph corresponding to that running time as a computational subgraph to be optimized.
In this embodiment, computational subgraphs with long running times are optimized during execution, yielding the optimized dynamic intermediate representation. In this implementation, the performance of the device executing the object code may be improved.
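A minimal sketch of this threshold-based screening, assuming per-subgraph running times (in seconds) have already been measured on the back-end device; the names and the threshold value are illustrative, not taken from the patent:

```python
FIRST_THRESHOLD = 0.05  # seconds; assumed value, not from the patent

def select_by_threshold(runtimes, threshold=FIRST_THRESHOLD):
    """Keep subgraphs whose measured runtime exceeds the first threshold."""
    return [name for name, t in runtimes.items() if t > threshold]

runtimes = {"subgraph_a": 0.12, "subgraph_b": 0.01, "subgraph_c": 0.08}
print(select_by_threshold(runtimes))  # ['subgraph_a', 'subgraph_c']
```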
In a second possible implementation manner, the screening the at least one computation sub-graph according to the respective corresponding running time of the at least one computation sub-graph to obtain at least one computation sub-graph to be optimized includes:
for each changed computational subgraph, acquiring the running time of the changed computational subgraph executed on the back-end hardware device, and sorting the running times to obtain a ranking result; wherein, in the ranking result, the longer the running time, the higher the rank;
and taking the computational subgraphs corresponding to the top L running times in the ranking result as the computational subgraphs to be optimized, wherein L is greater than 0.
In the embodiment, a certain number of computation subgraphs with long running time are optimized in the running process, so that the optimized dynamic intermediate representation is obtained. In this implementation, the performance of the device to execute the object code may be improved.
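A corresponding sketch of the ranking-based screening, with the same assumed runtime dictionary as above:

```python
def select_top_l(runtimes, l):
    """Sort subgraphs by runtime, longest first, and take the top L."""
    ranked = sorted(runtimes.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:l]]

runtimes = {"subgraph_a": 0.12, "subgraph_b": 0.01, "subgraph_c": 0.08}
print(select_top_l(runtimes, l=2))  # ['subgraph_a', 'subgraph_c']
```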
In a third possible implementation manner, the screening the at least one computation sub-graph according to the respective corresponding running time of the at least one computation sub-graph to obtain at least one computation sub-graph to be optimized includes:
for the same changed computational subgraph, acquiring a first running time for which the changed computational subgraph executes on the back-end hardware device and a second running time for which the pre-change computational subgraph executes on the back-end hardware device;
and in a case that the difference between the first running time and the second running time is greater than a second threshold, taking the computational subgraph corresponding to the first running time as a computational subgraph to be optimized.
In this embodiment, computational subgraphs whose running times differ greatly before and after a change are optimized during execution, yielding the optimized dynamic intermediate representation. In this implementation, the performance of the device executing the object code may be improved.
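A sketch of this third strategy under the same assumptions, comparing runtimes before and after a change; the second threshold is again an assumed value:

```python
SECOND_THRESHOLD = 0.02  # seconds; assumed value, not from the patent

def select_by_regression(after, before, threshold=SECOND_THRESHOLD):
    """Keep changed subgraphs whose runtime grew by more than the threshold."""
    return [name for name, t in after.items()
            if t - before.get(name, 0.0) > threshold]

after_change  = {"subgraph_a": 0.12, "subgraph_b": 0.015}
before_change = {"subgraph_a": 0.05, "subgraph_b": 0.012}
print(select_by_regression(after_change, before_change))  # ['subgraph_a']
```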
In a second aspect, an embodiment of the present invention further provides a data processing method, including:
acquiring object code and an image to be processed;
processing the image to be processed according to the object code to obtain an image processing result; wherein the object code is obtained by the data processing method of any one of the above first aspects.
According to the embodiment, the static intermediate representation and the dynamic intermediate representation are optimized and then compiled to obtain the object code, so that the running efficiency of the object code can be improved, and the image processing efficiency is improved.
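As an illustration of how a back-end device might run such object code on an image, the following sketch uses TVM's graph executor as a representative runtime; the patent does not prescribe this API, and the library file name, input name, and input shape are assumptions:

```python
import numpy as np
import tvm
from tvm.contrib import graph_executor

# Load previously compiled object code (file name is an assumption).
lib = tvm.runtime.load_module("compiled_model.so")
dev = tvm.cpu(0)
module = graph_executor.GraphModule(lib["default"](dev))

# A stand-in tensor; real code would preprocess the image to be processed.
image = np.random.rand(1, 3, 224, 224).astype("float32")
module.set_input("data", tvm.nd.array(image, dev))  # "data" is an assumed input name
module.run()
result = module.get_output(0).numpy()  # the image processing result
```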
In a third aspect, an embodiment of the present invention further provides a data processing apparatus, including a receiving unit and a processing unit:
the receiving unit is used for acquiring a static intermediate representation and a dynamic intermediate representation of the deep learning model; the deep learning model is a model constructed based on a front-end model framework; in a computational sub-graph corresponding to the static intermediate representation, each directed edge bears static tensor data; in a computational sub-graph corresponding to the dynamic intermediate representation, dynamic tensor data are loaded on each directed edge;
the processing unit is used for converting the static intermediate representation into a static intermediate representation independent of the front-end model framework and optimizing the converted static intermediate representation to obtain an optimized static intermediate representation; screening, among the computational subgraphs corresponding to the dynamic intermediate representation, at least one computational subgraph according to its respective running time to obtain at least one computational subgraph to be optimized; optimizing the at least one computational subgraph to be optimized to obtain an optimized dynamic intermediate representation, wherein the optimized dynamic intermediate representation comprises at least an optimized computational subgraph; and compiling the optimized static intermediate representation and the optimized dynamic intermediate representation to obtain object code executable by a back-end hardware device, so that the back-end hardware device executes the object code to realize its function.
Optionally, the processing unit in the embodiment of the present invention may optimize the converted static intermediate representation by using the following possible implementation manners:
in a first possible implementation manner, the processing unit optimizes the converted static intermediate representation, specifically:
fusing nodes contained in the converted static intermediate representation so as to fuse a plurality of nodes with the same attribute in the converted static intermediate representation into the same kernel function;
or,
obtaining a layout requirement corresponding to tensor operation of each node in the converted static intermediate representation;
and performing data layout conversion on the static tensor data of the converted static intermediate representation according to the layout requirement.
In a second possible implementation manner, the processing unit optimizes the converted static intermediate representation, specifically:
fusing nodes contained in the converted static intermediate representation so as to fuse a plurality of nodes with the same attribute in the converted static intermediate representation into the same kernel function to obtain at least one fused node;
acquiring layout requirements corresponding to tensor operation of each fusion node;
and performing data layout conversion on the static tensor data of the fused static intermediate representation according to the layout requirement.
Optionally, in the embodiment of the present invention, the processing unit filters the at least one computation sub-graph according to the respective running time length corresponding to the at least one computation sub-graph, and the following possible implementation manners may be adopted for obtaining the at least one computation sub-graph to be optimized:
in a first possible implementation manner, the processing unit filters at least one computation subgraph according to the respective corresponding running time of the at least one computation subgraph to obtain at least one computation subgraph to be optimized, and specifically includes:
for each changed computational subgraph, acquiring the running time of the changed computational subgraph executed on the back-end hardware device;
and in a case that the running time is greater than a first threshold, taking the computational subgraph corresponding to that running time as a computational subgraph to be optimized.
In a second possible implementation manner, the processing unit filters the at least one computation subgraph according to the respective corresponding running time of the at least one computation subgraph to obtain at least one computation subgraph to be optimized, and specifically includes:
for each changed computational subgraph, acquiring the running time of the changed computational subgraph executed on the back-end hardware device, and sorting the running times to obtain a ranking result; wherein, in the ranking result, the longer the running time, the higher the rank;
and taking the computational subgraphs corresponding to the top L running times in the ranking result as the computational subgraphs to be optimized, wherein L is greater than 0.
In a third possible implementation manner, the processing unit filters the at least one computation subgraph according to the respective corresponding running time of the at least one computation subgraph to obtain at least one computation subgraph to be optimized, and specifically includes:
for the same changed computational subgraph, acquiring a first running time for which the changed computational subgraph executes on the back-end hardware device and a second running time for which the pre-change computational subgraph executes on the back-end hardware device;
and in a case that the difference between the first running time and the second running time is greater than a second threshold, taking the computational subgraph corresponding to the first running time as a computational subgraph to be optimized.
It should be noted that, specific implementation schemes and advantageous effects related to the embodiments of the data processing apparatus provided in the third aspect of the embodiment of the present invention are the same as those of the embodiments of the data processing method provided in the first aspect of the embodiment of the present invention, and are not described herein again.
In a fourth aspect, an embodiment of the present invention further provides a data processing apparatus, including a receiving unit and a processing unit:
the receiving unit is used for acquiring object code and an image to be processed;
the processing unit is used for processing the image to be processed according to the object code to obtain an image processing result; wherein the object code is obtained by the data processing method of any one of the above first aspects.
In a fifth aspect, an embodiment of the present invention further provides a data processing apparatus, including: a memory for storing a program, and a processor for executing the program stored in the memory, wherein when the program stored in the memory is executed, the processor is configured to execute the data processing method according to any one of the first aspect or the second aspect.
In a sixth aspect, an embodiment of the present invention further provides a computer-readable storage medium, where the computer-readable storage medium includes a stored computer program, where when the computer program runs, the apparatus where the computer-readable storage medium is located is controlled to execute the data processing method according to any one of the foregoing first aspects, or the data processing method according to the foregoing second aspect.
Compared with the prior art, the data processing method, apparatus, device, and computer storage medium provided by the embodiments of the present invention have the following beneficial effects: a static intermediate representation and a dynamic intermediate representation of a deep learning model are obtained, where the deep learning model is a model constructed based on a front-end model framework, each directed edge in the computational subgraph corresponding to the static intermediate representation carries static tensor data, and each directed edge in the computational subgraph corresponding to the dynamic intermediate representation carries dynamic tensor data; the static intermediate representation is converted into a static intermediate representation independent of the front-end model framework, and the converted static intermediate representation is optimized to obtain an optimized static intermediate representation; among the computational subgraphs corresponding to the dynamic intermediate representation, at least one computational subgraph is screened according to its respective running time to obtain at least one computational subgraph to be optimized; the at least one computational subgraph to be optimized is optimized to obtain an optimized dynamic intermediate representation, which comprises at least an optimized computational subgraph; and the optimized static intermediate representation and the optimized dynamic intermediate representation are compiled to obtain object code executable by a back-end hardware device, so that the back-end hardware device executes the object code to realize its function. The embodiments of the present invention optimize the static intermediate representation and the dynamic intermediate representation simultaneously, achieve a better optimization effect, can realize independence from both the front-end framework and the back-end platform, and improve the portability of the intermediate representation, making it suitable for a variety of hardware platforms.
Drawings
FIG. 1 is a flow chart diagram illustrating one embodiment of a data processing method provided by the present invention;
FIG. 2 is a flow chart illustrating a data processing method according to another embodiment of the present invention;
FIG. 3 is a flow chart illustrating a data processing method according to another embodiment of the present invention;
FIG. 4 is a flow chart illustrating a data processing method applied to image processing according to the present invention;
FIG. 5 is a block diagram of an embodiment of a data processing apparatus according to the present invention;
FIG. 6 is a schematic diagram of another embodiment of a data processing apparatus according to the present invention;
FIG. 7 is a schematic structural diagram of an embodiment of a data processing apparatus provided by the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to fig. 1, fig. 1 is a schematic flowchart illustrating a data processing method according to an embodiment of the present invention. The data processing method comprises the following steps:
S101, obtaining a static intermediate representation and a dynamic intermediate representation of a deep learning model; the deep learning model is a model constructed based on a front-end model framework; in the computational subgraph corresponding to the static intermediate representation, each directed edge carries static tensor data; and in the computational subgraph corresponding to the dynamic intermediate representation, each directed edge carries dynamic tensor data.
Specifically, in the early days of deep learning, every deep learning researcher needed to write large amounts of repetitive code. To work more efficiently, some researchers packaged their code as frameworks and published them online for all researchers to use together, and various frameworks thus appeared on the web. Over time, the most usable of them gained large numbers of users and became popular. Currently, the most popular deep learning frameworks worldwide are PaddlePaddle, TensorFlow, Caffe, Theano, MXNet, Torch, and PyTorch. A deep learning framework is an interface, library, or tool that lets us construct deep learning models more easily and quickly without delving into the details of the underlying algorithms; it defines models with pre-built and optimized sets of components and provides a clear and concise method for constructing deep learning models. Since deep learning has many different front ends and back ends, an IR (Intermediate Representation) mediates the translation between source code and object code and can express the data structures and related algorithms of the source program.
Illustratively, the static intermediate representation and the dynamic intermediate representation of the deep learning model obtained in this embodiment each include at least one computational subgraph, where each computational subgraph includes N nodes and directed edges connecting different nodes. In a computational subgraph corresponding to the static intermediate representation, static tensor data are loaded on each directed edge; and in the computational subgraph corresponding to the dynamic intermediate representation, dynamic tensor data are loaded on each directed edge.
It should be noted that the data of the deep learning model may be classified by the trained neural network, and a static data stream and a dynamic data stream are obtained accordingly, so that a static intermediate representation may be obtained according to the obtained static data stream, and a dynamic intermediate representation may be obtained according to the obtained dynamic data stream.
S102, converting the static intermediate representation into a static intermediate representation irrelevant to the front-end model frame, and optimizing the converted static intermediate representation to obtain an optimized static intermediate representation.
Specifically, the static intermediate representation of the deep learning model obtained in S101 is tied to the front-end model framework. In order to make the optimized static intermediate representation applicable to a variety of back-end hardware devices, the static intermediate representation tied to the front-end model framework needs to be converted into one that is independent of it: an NNVM compiler may be used to decouple the static intermediate representation, removing the nodes and static tensor data related to the front-end model framework to obtain a static intermediate representation independent of the front-end model framework, namely the converted static intermediate representation. The compiler used for decoupling is not limited to NNVM; it may be a TVM compiler, an XLA compiler, or the like, as long as it can decouple the static intermediate representation. The converted, framework-independent static intermediate representation is then optimized to improve the running performance of the compiled object code.
It should be noted that the static intermediate representation may be converted into a form of a decoupling framework, where the form of the decoupling framework specifically means that the converted intermediate representation has no relation with the model framework on the upper layer, so as to obtain the static intermediate representation independent of the front-end model framework. Because the static intermediate representation is a computation graph containing various attribute mapping information, when optimization is performed on the form of a decoupling framework, that is, optimization is performed on the static intermediate representation which is independent of a front-end model framework, the computation graph can have more attributes (for example, adding some new operators and adjusting the attributes for specific operators) or become another new computation graph.
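For illustration, the following sketch shows one way such a framework-independent intermediate representation can be produced, using TVM's Relay importer as a representative tool; the patent names NNVM, TVM, and XLA as candidate compilers but does not mandate this API, and the model file and input shape are assumptions:

```python
import onnx
from tvm import relay

# Load a front-end model exported to ONNX (file name is an assumption).
onnx_model = onnx.load("model.onnx")
shape_dict = {"input": (1, 3, 224, 224)}  # assumed input name and shape

# `mod` is an IRModule that no longer depends on the original front-end
# framework; `params` holds the model weights.
mod, params = relay.frontend.from_onnx(onnx_model, shape_dict)
print(mod)
```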
S103, in the computation subgraphs corresponding to the dynamic intermediate representation, screening the computation subgraphs according to the respective corresponding running time of the computation subgraphs to obtain at least one computation subgraph to be optimized.
Specifically, for the computational subgraphs corresponding to the dynamic intermediate representation, each computational subgraph is executed on the back-end hardware device to obtain its corresponding running time; all the computational subgraphs are then screened according to these running times, yielding at least one computational subgraph to be optimized.
S104, optimizing the at least one computational subgraph to be optimized to obtain an optimized dynamic intermediate representation; wherein the optimized dynamic intermediate representation comprises at least an optimized computational subgraph.
Specifically, the intermediate representation serves as the intermediary for translation between source code and object code: it can express the data structures and related algorithms of the source program, and compiling it yields object code executable by the back-end hardware device. A good intermediate representation can represent the source program information accurately and consistently while also being easy to transform. The intermediate representation may be operated on many times during transformation and thus become very complex. Therefore, in order to improve the running performance of the compiled object code, the acquired computational subgraphs to be optimized in the dynamic intermediate representation need to be optimized.
It should be noted that the dynamic intermediate representation obtained from the deep learning model includes at least one computational subgraph, so that when the computational subgraphs are screened to obtain computational subgraphs to be optimized, some computational subgraphs which do not need to be optimized may appear, and further, the optimized dynamic intermediate representation may include only the optimized computational subgraph, or may include the optimized computational subgraph and the computational subgraph which does not need to be optimized.
And S105, compiling the optimized static intermediate representation and the optimized dynamic intermediate representation to obtain an executable target code of the back-end hardware equipment, so that the back-end hardware equipment executes the target code to realize the function of the target code.
Specifically, the optimized static intermediate representation and the optimized dynamic intermediate representation are compiled, and an executable object code of the back-end hardware device can be obtained. And the back-end hardware equipment executes the object code, so that the function of the object code can be realized.
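Continuing the representative TVM example above, a sketch of compiling an optimized intermediate representation into object code that the back-end device can load and execute; the target string and output file name are assumptions:

```python
import tvm
from tvm import relay

# `mod` and `params` are the (optimized) IRModule and weights from the
# earlier import sketch.
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target="llvm", params=params)

# Export the compiled artifact; a back-end device can load it with
# tvm.runtime.load_module, as in the earlier inference sketch.
lib.export_library("compiled_model.so")
```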
In this embodiment, the static intermediate representation and the dynamic intermediate representation are optimized simultaneously, so that the optimization effect is better, independence from both the front-end framework and the back-end platform can be achieved, and the portability of the intermediate representation is improved, making it suitable for a variety of hardware platforms.
As an alternative embodiment, the optimizing the converted static intermediate representation includes:
fusing nodes contained in the converted static intermediate representation so as to fuse a plurality of nodes with the same attribute in the converted static intermediate representation into the same kernel function;
or,
obtaining a layout requirement corresponding to tensor operation of each node in the converted static intermediate representation;
and performing data layout conversion on the static tensor data of the converted static intermediate representation according to the layout requirement.
Specifically, operator fusion is performed on the converted static intermediate representation: all nodes contained in it are grouped by node attribute to obtain at least one group of nodes, all nodes within a group having the same attribute, and each group of nodes with the same attribute is fused into the same kernel function, so that intermediate results are not written back to global memory, the movement of intermediate data is reduced, and frequent reads and writes are avoided. Alternatively, data layout conversion may be performed on the converted static intermediate representation. Tensor operations are the basic operators of the computational graph, and tensors in different operators may have different layout requirements; the layout requirement corresponding to the tensor operation of each node in the converted static intermediate representation is obtained, and data layout conversion is performed on the static tensor data of the converted static intermediate representation according to those requirements, so as to optimize local memory access efficiency. For example, a certain node in the converted static intermediate representation may use a 4x4 tensor operation, so the original 2x2 static tensor data needs to be converted into 4x4 blocks for storage, thereby optimizing local memory access efficiency.
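A minimal sketch of such a layout conversion, repacking a row-major tensor into contiguous 4x4 blocks; the shapes and function name are illustrative:

```python
import numpy as np

def to_blocked_layout(x, block=4):
    """Repack an (H, W) array into (H//block, W//block, block, block)
    so that each block is contiguous in memory."""
    h, w = x.shape
    assert h % block == 0 and w % block == 0
    return (x.reshape(h // block, block, w // block, block)
             .transpose(0, 2, 1, 3)
             .copy())  # copy() materializes each 4x4 tile contiguously

x = np.arange(64, dtype=np.float32).reshape(8, 8)
print(to_blocked_layout(x).shape)  # (2, 2, 4, 4)
```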
As an alternative embodiment, the optimizing the converted static intermediate representation includes:
fusing nodes contained in the converted static intermediate representation so as to fuse a plurality of nodes with the same attribute in the converted static intermediate representation into the same kernel function to obtain at least one fused node;
acquiring layout requirements corresponding to tensor operation of each fusion node;
and performing data layout conversion on the static tensor data of the fused static intermediate representation according to the layout requirement.
Specifically, operator fusion and data layout conversion are carried out on the converted static intermediate representation, all nodes contained in the converted static intermediate representation are grouped according to node attributes to obtain at least one group of nodes, all nodes in each group of nodes have the same attribute, and each group of nodes with the same attribute in the converted static intermediate representation are correspondingly fused into the same kernel function to obtain at least one fusion node. And acquiring a layout requirement corresponding to tensor operation of each fusion node, and performing data layout conversion on the static tensor data of the converted static intermediate representation after fusion according to the layout requirement. For example, a certain fusion node in the converted static intermediate representation may use a tensor operation of 4x4, so the original static tensor data of 2x2 needs to be converted into a block of 4x4 to be stored, so as to optimize the local access efficiency.
In this embodiment, operator fusion and data layout conversion are both applied to the converted static intermediate representation, so that intermediate results are not written back to global memory, the movement of intermediate data is reduced, frequent reads and writes are avoided, and local memory access efficiency is optimized.
As an optional embodiment, the screening the at least one computational subgraph according to the respective corresponding running time of the at least one computational subgraph to obtain at least one computational subgraph to be optimized includes:
aiming at the changed computation subgraph, acquiring the running time of the changed computation subgraph executed on the back-end hardware equipment;
and under the condition that the running time length is greater than a first threshold value, taking the computation subgraph corresponding to the running time length as the computation subgraph to be optimized.
Illustratively, the dynamic intermediate representation comprises at least one computational subgraph, and the computational subgraphs change dynamically. For each computational subgraph that has changed, the running time of the changed computational subgraph executed on the back-end hardware device is obtained. The running time is compared with a preset first time threshold, and in a case that the running time is greater than the preset first time threshold, the computational subgraph corresponding to that running time is taken as a computational subgraph to be optimized.
In this embodiment, computational subgraphs with long running times are optimized during execution, yielding the optimized dynamic intermediate representation.
As an optional embodiment, the screening the at least one computational subgraph according to the respective corresponding running time of the at least one computational subgraph to obtain at least one computational subgraph to be optimized includes:
for each changed computational subgraph, acquiring the running time of the changed computational subgraph executed on the back-end hardware device, and sorting the running times to obtain a ranking result; wherein, in the ranking result, the longer the running time, the higher the rank;
and taking the computational subgraphs corresponding to the top L running times in the ranking result as the computational subgraphs to be optimized, wherein L is greater than 0.
Illustratively, the first dynamic intermediate representation includes at least one computational subgraph, and the computational subgraphs change dynamically. For each computational subgraph that has changed, the running time of the changed computational subgraph executed on the back-end hardware device is obtained. All the obtained running times are sorted in descending order to obtain a ranking result, and the computational subgraphs corresponding to the top L running times in the ranking result are taken as the computational subgraphs to be optimized, where L is a positive integer greater than 0.
In the embodiment, a certain number of computation subgraphs with long running time are optimized in the running process, so that the optimized dynamic intermediate representation is obtained. In this implementation, the performance of the device to execute the object code may be improved.
As an optional embodiment, the screening the at least one computational subgraph according to the respective corresponding running time of the at least one computational subgraph to obtain at least one computational subgraph to be optimized includes:
aiming at the same changed computation subgraph, acquiring a first running time of the changed computation subgraph executed on the back-end hardware equipment and a second running time of the computation subgraph before changing executed on the back-end hardware equipment;
and under the condition that the difference value between the first operation time length and the second operation time length is larger than a second threshold value, taking the computation subgraph corresponding to the first operation time length as the computation subgraph to be optimized.
Illustratively, the first dynamic intermediate representation includes at least one computational subgraph, and the computational subgraphs change dynamically. For the same computational subgraph that has changed, a first running time for which the changed computational subgraph executes on the back-end hardware device and a second running time for which the pre-change computational subgraph executes on the back-end hardware device are obtained. The difference between the first running time and the second running time is calculated and compared with a preset second time threshold, and in a case that the difference is greater than the second threshold, the computational subgraph corresponding to the first running time is taken as a computational subgraph to be optimized.
In this embodiment, computational subgraphs whose running times differ greatly before and after a change are optimized during execution, yielding the optimized dynamic intermediate representation. In this implementation, the performance of the device executing the object code may be improved.
The present invention further provides a data processing method for optimizing a static intermediate representation, please refer to fig. 2, and fig. 2 is a flowchart illustrating another embodiment of the data processing method according to the present invention. The data processing method comprises the following steps:
s201, acquiring a first static intermediate representation of a deep learning model; the deep learning model is a model constructed based on a front-end model framework, the first static intermediate representation has a graph structure, the graph structure comprises M nodes and directed edges connecting different nodes, static tensor data are carried on the directed edges, and M is greater than 1.
Specifically, in the early days of deep learning, every deep learning researcher needed to write large amounts of repetitive code. To work more efficiently, some researchers packaged their code as frameworks and published them online for all researchers to use together, and various frameworks thus appeared on the web. Over time, the most usable of them gained large numbers of users and became popular. Currently, the most popular deep learning frameworks worldwide are PaddlePaddle, TensorFlow, Caffe, Theano, MXNet, Torch, and PyTorch. A deep learning framework is an interface, library, or tool that lets us construct deep learning models more easily and quickly without delving into the details of the underlying algorithms; it defines models with pre-built and optimized sets of components and provides a clear and concise method for constructing deep learning models. Since deep learning has many different front ends and back ends, an IR (Intermediate Representation) mediates the translation between source code and object code and can express the data structures and related algorithms of the source program.
Illustratively, the first static intermediate representation of the deep learning model obtained in this embodiment has a graph structure, where the graph structure includes M nodes and directed edges connecting different nodes, the directed edges carry static tensor data, and M is a positive integer greater than 1. The graph structure, i.e., the computation graph, represents the workload of the deep learning model from the front-end model framework by an intermediate representation in the form of the computation graph.
It should be noted that the data of the deep learning model may be classified by the trained neural network, and a static data stream and a dynamic data stream are correspondingly obtained, so that the first static intermediate representation may be obtained according to the obtained static data stream.
S202, converting the first static intermediate representation into a second static intermediate representation independent of the front-end model framework.
Specifically, the first static intermediate representation of the deep learning model obtained in S201 is related to the front-end model framework, and in order to enable the optimized static intermediate representation to be applicable to a variety of back-end hardware devices, the first static intermediate representation related to the front-end model framework needs to be converted into a second static intermediate representation unrelated to the front-end model framework.
It should be noted that the first static intermediate representation may be converted into a form of a decoupling framework, and the form of the decoupling framework specifically means that the converted intermediate representation has no relation with the model framework of the upper layer, so as to obtain a second static intermediate representation independent of the front-end model framework.
S203, optimizing the second static intermediate representation to obtain a third static intermediate representation.
Specifically, the intermediate representation serves as the intermediary for translation between source code and object code: it can express the data structures and related algorithms of the source program, and compiling it yields object code executable by the back-end hardware device. A good intermediate representation can represent the source program information accurately and consistently while also being easy to transform. The intermediate representation may be operated on many times during transformation and thus become very complex. Therefore, in order to improve the running performance of the compiled object code, the second static intermediate representation needs to be optimized.
Since the static intermediate representation is a computation graph containing various attribute mapping information, when the form of the decoupling framework is optimized, that is, the second static intermediate representation is optimized, the computation graph can have more attributes (for example, adding some new operators and adjusting the attributes for specific operators) or become another new computation graph.
S204, compiling the third static intermediate representation to obtain an object code which can be executed by the back-end hardware equipment, so that the back-end hardware equipment can execute the object code to realize the function of the object code.
Specifically, the optimized third static intermediate representation is compiled, and an executable object code of the back-end hardware device can be obtained. And the back-end hardware equipment executes the object code, so that the function of the object code can be realized.
In this embodiment, the static intermediate representation is converted into a static intermediate representation unrelated to the front-end model framework, and the converted static intermediate representation is optimized, so that independence of the front-end framework and independence of the back-end platform can be achieved, and portability of the intermediate representation is improved, so that the method is suitable for various hardware platforms.
As an alternative embodiment, the S202, converting the first static intermediate representation into a second static intermediate representation independent of the front-end model framework, includes:
and decoupling the first static intermediate representation, and removing nodes and static tensor data related to the front-end model framework in the first static intermediate representation to obtain a second static intermediate representation unrelated to the front-end model framework.
Illustratively, the first static intermediate representation has a graph structure comprising M nodes and directed edges connecting the different nodes, the directed edges carrying static tensor data. The first static intermediate representation can be decoupled using an NNVM compiler, removing the nodes and static tensor data related to the front-end model framework to obtain a second static intermediate representation independent of the front-end model framework. The compiler used for decoupling the first static intermediate representation is not limited to NNVM; it may be a TVM compiler, an XLA compiler, or the like, as long as it can decouple the static intermediate representation.
In this embodiment, the first static intermediate representation related to the front-end model framework is converted into the second static intermediate representation unrelated to the front-end model framework, so that independence of the front-end framework and independence of the back-end platform can be achieved, and portability of the intermediate representation is improved, so that the method is suitable for various hardware platforms.
As an optional embodiment, the optimizing the second static intermediate representation includes:
and fusing the nodes contained in the second static intermediate representation so as to fuse a plurality of nodes with the same attribute in the second static intermediate representation into the same kernel function.
Specifically, operator fusion is performed on the second static intermediate representation, all nodes included in the second static intermediate representation are grouped according to node attributes to obtain at least one group of nodes, all nodes in each group of nodes have the same attribute, and each group of nodes having the same attribute in the second static intermediate representation are correspondingly fused into the same kernel function, so that intermediate results are not stored back to the global memory, the migration of intermediate data is reduced, and frequent back-and-forth reading and writing are avoided.
As an optional embodiment, the optimizing the second static intermediate representation includes:
obtaining layout requirements corresponding to tensor operation of each node in the second static intermediate representation;
data layout conversion is performed on the static tensor data of the second static intermediate representation according to the layout requirement.
Specifically, data layout conversion is performed on the second static intermediate representation, tensor operation is a basic operator of the computational graph, tensors in different operators may have different layout requirements, the layout requirement corresponding to the tensor operation of each node in the second static intermediate representation is obtained, and data layout conversion is performed on static tensor data of the second static intermediate representation according to the layout requirement, so that local memory access efficiency is optimized. For example, a certain node in the second static intermediate representation may use a tensor operation of 4x4, so the original static tensor data of 2x2 needs to be converted into a block of 4x4 to be stored, so as to optimize the local access and storage efficiency.
As an optional embodiment, the optimizing the second static intermediate representation includes:
fusing nodes contained in the second static intermediate representation so as to fuse a plurality of nodes with the same attribute in the second static intermediate representation into the same kernel function to obtain at least one fused node;
acquiring layout requirements corresponding to tensor operation of each fusion node;
and performing data layout conversion on the static tensor data of the fused second static intermediate representation according to the layout requirement.
Specifically, operator fusion and data layout conversion are carried out on the second static intermediate representation, all nodes contained in the second static intermediate representation are grouped according to node attributes to obtain at least one group of nodes, all nodes in each group of nodes have the same attribute, and each group of nodes with the same attribute in the second static intermediate representation are correspondingly fused into the same kernel function to obtain at least one fusion node. And acquiring a layout requirement corresponding to tensor operation of each fusion node, and performing data layout conversion on the static tensor data of the second static intermediate representation after fusion according to the layout requirement. For example, a certain fusion node in the second static intermediate representation may use a tensor operation of 4x4, so the original 2x2 static tensor data needs to be converted into a block of 4x4 for storage, so as to optimize the local access efficiency.
In this embodiment, operator fusion and data layout conversion are applied to the second static intermediate representation, so that intermediate results are not stored back to the global memory, the movement of intermediate data is reduced, frequent read-write operations are avoided, and local memory access efficiency is optimized.
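Putting the two passes together, a sketch of this combined embodiment could look as follows, reusing the hypothetical fuse_by_attribute and to_blocked_layout helpers from the sketches above; the function name and signature are illustrative.

    def optimize_static_ir(nodes, tensors):
        """Fuse nodes by attribute first, then convert the layout of the
        static tensor data of the fused representation (all helper names
        are the illustrative sketches earlier, not the patented API)."""
        fused_groups = fuse_by_attribute(nodes)
        converted = {name: to_blocked_layout(t) for name, t in tensors.items()}
        return fused_groups, converted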
The present invention further provides a data processing method for optimizing the dynamic intermediate representation, please refer to fig. 3, and fig. 3 is a flowchart illustrating another embodiment of the data processing method according to the present invention. The data processing method comprises the following steps:
S301, acquiring a first dynamic intermediate representation of the deep learning model; the first dynamic intermediate representation comprises at least one computational subgraph, the computational subgraph comprises N nodes and directed edges connecting different nodes, dynamic tensor data are loaded on the directed edges, and N is greater than 1.
Specifically, in the early days of deep learning, every researcher had to write large amounts of repetitive code. To improve efficiency, researchers packaged their code into frameworks and published them online for common use, so different frameworks appeared; over time, the most usable frameworks gained large user bases and became popular. Currently, the most popular deep learning frameworks worldwide include PaddlePaddle, TensorFlow, Caffe, Theano, MXNet, Torch, and PyTorch. A deep learning framework is an interface, library, or tool that enables us to construct deep learning models more easily and quickly without delving into the details of the underlying algorithms; it defines models using pre-built and optimized component sets and provides a clear, concise method for constructing deep learning models. Since deep learning has many different front ends and back ends, an IR (Intermediate Representation) mediates the translation between source code and object code and can express the data structures of the source program and the related algorithms.
Illustratively, the first dynamic intermediate representation of the deep learning model obtained in this embodiment includes at least one computational subgraph, where each computational subgraph includes N nodes and directed edges connecting different nodes, the directed edges carry dynamic tensor data, and N is a positive integer greater than 1.
It should be noted that the data of the deep learning model may be classified by the trained neural network, and a static data stream and a dynamic data stream are correspondingly obtained, so that the first dynamic intermediate representation may be obtained according to the obtained dynamic data stream.
S302, screening the at least one computation subgraph according to the running time length corresponding to the at least one computation subgraph respectively to obtain at least one computation subgraph to be optimized.
Specifically, each computation subgraph is executed on the back-end hardware device, so the running time of each subgraph on that device can be obtained; all the computation subgraphs are then screened according to these running times to obtain at least one computation subgraph to be optimized.
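As an illustrative sketch of this measurement step: run_on_backend below is a placeholder for whatever executes a subgraph on the back-end hardware device, and subgraphs are assumed to be hashable objects; neither is specified by the embodiment.

    import time

    def profile_subgraphs(subgraphs, run_on_backend):
        """Measure each subgraph's wall-clock running time on the backend."""
        runtimes = {}
        for subgraph in subgraphs:
            start = time.perf_counter()
            run_on_backend(subgraph)            # placeholder executor
            runtimes[subgraph] = time.perf_counter() - start
        return runtimes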
S303, optimizing the at least one computational subgraph to be optimized to obtain a second dynamic intermediate representation; wherein the second dynamic intermediate representation comprises at least an optimized computational subgraph.
Specifically, the intermediate representation serves as an intermediary for translation between the source code and the object code; it can express the data structures and related algorithms of the source program, and compiling it yields object code executable by the back-end hardware device. A good intermediate representation can represent the source program information accurately and unambiguously while remaining easy to transform. Because the intermediate representation may be operated on many times during transformation, it can become very complex; therefore, to improve the running performance of the compiled object code, the acquired computational subgraphs to be optimized in the first dynamic intermediate representation need to be optimized.
It should be noted that the first dynamic intermediate representation obtained from the deep learning model includes at least one computational subgraph, so when the computational subgraphs are screened to obtain the subgraphs to be optimized, some subgraphs may turn out not to need optimization; the second dynamic intermediate representation obtained after optimization may therefore include only optimized computational subgraphs, or both optimized computational subgraphs and subgraphs that did not need optimization.
S304, compiling the second dynamic intermediate representation to obtain an executable object code of the back-end hardware device, so that the back-end hardware device executes the object code to realize the function of the object code.
Specifically, the optimized second dynamic intermediate representation is compiled to obtain an executable object code of the back-end hardware device, and the back-end hardware device executes the object code to realize its function.
In this embodiment, the computational subgraphs in the dynamic intermediate representation are screened to obtain the subgraphs to be optimized, and those subgraphs are optimized, so that independence from the front-end framework and from the back-end platform can be achieved and the portability of the intermediate representation improved, making it suitable for various hardware platforms.
As an optional embodiment, the screening the at least one computational sub-graph according to the respective running time lengths corresponding to the at least one computational sub-graph to obtain at least one computational sub-graph to be optimized includes:
aiming at the changed computation subgraph, acquiring the running time of the changed computation subgraph executed on the back-end hardware equipment;
and under the condition that the running time length is greater than a first threshold value, taking the computation subgraph corresponding to the running time length as the computation subgraph to be optimized.
Illustratively, the first dynamic intermediate representation includes at least one computational subgraph, and the computational subgraphs change dynamically. For each changed computation subgraph, the running time of its execution on the back-end hardware device is acquired and compared with a preset first threshold; when the running time is greater than the first threshold, the computation subgraph corresponding to that running time is taken as a computation subgraph to be optimized.
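Under the same assumptions as the profiling sketch above, this first screening rule reduces to a single filter; first_threshold is whatever value the deployment chooses.

    def screen_by_threshold(runtimes, first_threshold):
        """Keep every changed subgraph whose running time exceeds the threshold."""
        return [sg for sg, t in runtimes.items() if t > first_threshold]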
In this embodiment, computation subgraphs with long running times are optimized during execution, yielding the optimized dynamic intermediate representation.
As an optional embodiment, the screening the at least one computational sub-graph according to the respective running time lengths corresponding to the at least one computational sub-graph to obtain at least one computational sub-graph to be optimized includes:
aiming at the changed computation subgraph, acquiring the running time of the changed computation subgraph executed on the back-end hardware equipment, and sorting the running times to obtain a ranking result; wherein, in the ranking result, the longer the running time, the higher the ranking;
and taking the computation subgraphs corresponding to the top L running times in the ranking result as the computation subgraphs to be optimized, wherein L is greater than 0.
Illustratively, the first dynamic intermediate representation includes at least one computational subgraph, and the computational subgraphs change dynamically. For each changed computation subgraph, the running time of its execution on the back-end hardware device is acquired; all the obtained running times are then sorted in descending order to obtain a ranking result, and the computation subgraphs corresponding to the top L running times in the ranking result are taken as the computation subgraphs to be optimized, where L is a positive integer greater than 0.
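A sketch of this top-L rule over the measured running times, under the same assumptions as the sketches above:

    def screen_top_l(runtimes, l):
        """Rank subgraphs by running time, longest first, and keep the top L."""
        ranked = sorted(runtimes, key=runtimes.get, reverse=True)
        return ranked[:l]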
In this embodiment, a fixed number of the longest-running computation subgraphs are optimized during execution, yielding the optimized dynamic intermediate representation.
As an optional embodiment, the screening the at least one computational sub-graph according to the respective running time lengths corresponding to the at least one computational sub-graph to obtain at least one computational sub-graph to be optimized includes:
aiming at the same changed computation subgraph, acquiring a first running time of the changed computation subgraph executed on the back-end hardware equipment and a second running time of the computation subgraph before changing executed on the back-end hardware equipment;
and under the condition that the difference value between the first operation time length and the second operation time length is larger than a second threshold value, taking the computation subgraph corresponding to the first operation time length as the computation subgraph to be optimized.
Illustratively, the first dynamic intermediate representation includes at least one computational subgraph, and the computational subgraphs change dynamically. For the same computation subgraph that has changed, a first running time (execution of the changed subgraph on the back-end hardware device) and a second running time (execution of the subgraph before the change) are acquired. The difference between the first and second running times is calculated and compared with a preset second threshold; when the difference is greater than the second threshold, the computation subgraph corresponding to the first running time is taken as a computation subgraph to be optimized.
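A sketch of this third rule, assuming `before` and `after` map each subgraph to its running time measured before and after the change:

    def screen_by_delta(before, after, second_threshold):
        """Keep a subgraph when the change increased its running time by
        more than the second threshold."""
        return [sg for sg, t in after.items()
                if sg in before and t - before[sg] > second_threshold]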
In this embodiment, computation subgraphs whose running times differ greatly before and after a change are optimized during execution, yielding the optimized dynamic intermediate representation.
In practical applications, the invention may, according to actual requirements, optimize only the static intermediate representation, only the dynamic intermediate representation, or both simultaneously, so as to achieve different optimization goals.
In a specific application, the present embodiment takes image processing as an example, please refer to fig. 4, and fig. 4 is a schematic flow chart illustrating a data processing method applied to image processing according to the present invention. The data processing method comprises the following steps:
S401, acquiring a target code and an image to be processed;
S402, processing the image to be processed according to the target code to obtain an image processing result; the object code is obtained by the data processing method according to any of the above embodiments of the present invention.
Specifically, this embodiment may be executed by a graphics processing unit (GPU). After obtaining the object code and the image to be processed, the GPU processes the image to be processed according to the object code and accordingly obtains the image processing result. The object code can be obtained by optimization using the data processing method described in any of the above embodiments, and the image processing may include, but is not limited to, any of convolution processing, classification processing, and grayscale processing. Compiling the intermediate representation after optimizing it improves the running efficiency of the object code and hence the efficiency of image processing.
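Purely for illustration: if the compiled object code were exposed to the host as a loadable module with a run entry point (an assumption; the embodiment does not specify the runtime interface), the GPU-side flow would reduce to the following.

    def process_image(compiled_module, image):
        """Feed one image through the compiled object code; compiled_module.run
        is a hypothetical entry point supplied by the backend runtime."""
        return compiled_module.run(image)   # e.g. convolution or classification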
It is noted that, while for simplicity of explanation the foregoing method embodiments have been described as a series of acts or a combination of acts, those skilled in the art will appreciate that the present disclosure is not limited by the order of acts, as some steps may, in accordance with the present disclosure, occur in other orders or concurrently. Further, those skilled in the art will also appreciate that the embodiments described in the specification are exemplary embodiments and that the acts and modules referred to are not necessarily required by the disclosure.
It should be further noted that, taking fig. 1 as an example, although the steps in the flowchart of fig. 1 are shown in sequence as indicated by the arrows, they are not necessarily executed in that sequence. Unless explicitly stated otherwise herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least a portion of the steps in fig. 1 may include multiple sub-steps or stages that are not necessarily performed at the same time but may be performed at different times, and their order of performance is not necessarily sequential; they may be performed in turn or alternately with other steps or with at least a portion of the sub-steps or stages of other steps.
Correspondingly, the invention further provides a data processing device, which can realize all the processes of the data processing method in the embodiment.
Referring to fig. 5, fig. 5 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present invention. The data processing device comprises a receiving unit 501 and a processing unit 502;
the receiving unit 501 is configured to obtain a static intermediate representation and a dynamic intermediate representation of a deep learning model; the deep learning model is a model constructed based on a front-end model framework; in a computational sub-graph corresponding to the static intermediate representation, each directed edge bears static tensor data; in a computational sub-graph corresponding to the dynamic intermediate representation, dynamic tensor data are loaded on each directed edge;
the processing unit 502 is configured to convert the static intermediate representation into a static intermediate representation that is independent of the front-end model frame, and optimize the converted static intermediate representation to obtain an optimized static intermediate representation; screening at least one computational subgraph in the computational subgraph corresponding to the dynamic intermediate representation according to the respective corresponding running time of the at least one computational subgraph to obtain at least one computational subgraph to be optimized; optimizing the at least one computational subgraph to be optimized to obtain an optimized dynamic intermediate representation; wherein the optimized dynamic intermediate representation comprises at least an optimized computational subgraph; compiling the optimized static intermediate representation and the optimized dynamic intermediate representation to obtain an executable target code of a back-end hardware device, so that the back-end hardware device executes the target code to realize the function of the target code.
As an optional embodiment, the processing unit 502 optimizes the converted static intermediate representation, specifically including:
fusing nodes contained in the converted static intermediate representation so as to fuse a plurality of nodes with the same attribute in the converted static intermediate representation into the same kernel function;
or,
obtaining a layout requirement corresponding to tensor operation of each node in the converted static intermediate representation;
and performing data layout conversion on the static tensor data of the converted static intermediate representation according to the layout requirement.
As an optional embodiment, the processing unit 502 optimizes the converted static intermediate representation, specifically including:
fusing nodes contained in the converted static intermediate representation so as to fuse a plurality of nodes with the same attribute in the converted static intermediate representation into the same kernel function to obtain at least one fused node;
acquiring layout requirements corresponding to tensor operation of each fusion node;
and performing data layout conversion on the static tensor data of the fused static intermediate representation according to the layout requirement.
As an optional embodiment, the processing unit 502 filters at least one computation subgraph according to the respective corresponding running time of the at least one computation subgraph to obtain at least one computation subgraph to be optimized, which specifically includes:
aiming at the changed computation subgraph, acquiring the running time of the changed computation subgraph executed on the back-end hardware equipment;
and under the condition that the running time length is greater than a first threshold value, taking the computation subgraph corresponding to the running time length as the computation subgraph to be optimized.
As an optional embodiment, the processing unit 502 filters at least one computation subgraph according to the respective corresponding running time of the at least one computation subgraph to obtain at least one computation subgraph to be optimized, which specifically includes:
aiming at the changed computation subgraph, acquiring the running time of the changed computation subgraph executed on the back-end hardware equipment, and sorting the running times to obtain a ranking result; wherein, in the ranking result, the longer the running time, the higher the ranking;
and taking the computation subgraphs corresponding to the top L running times in the ranking result as the computation subgraphs to be optimized, wherein L is greater than 0.
As an optional embodiment, the processing unit 502 filters at least one computation subgraph according to the respective corresponding running time of the at least one computation subgraph to obtain at least one computation subgraph to be optimized, which specifically includes:
aiming at the same changed computation subgraph, acquiring a first running time of the changed computation subgraph executed on the back-end hardware equipment and a second running time of the computation subgraph before changing executed on the back-end hardware equipment;
and under the condition that the difference value between the first operation time length and the second operation time length is larger than a second threshold value, taking the computation subgraph corresponding to the first operation time length as the computation subgraph to be optimized.
In a specific application, the present embodiment takes image processing as an example, please refer to fig. 6, and fig. 6 is a schematic structural diagram of another embodiment of a data processing apparatus according to the present invention. The data processing device comprises a receiving unit 601 and a processing unit 602;
the receiving unit 601 is configured to obtain a target code and an image to be processed;
the processing unit 602 is configured to process the image to be processed according to the target code to obtain an image processing result; wherein the object code is obtained by any one of the data processing methods described above.
In a specific implementation, the working principle, the control flow and the technical effect of the data processing apparatus provided in the embodiment of the present invention are the same as those of the data processing method in the above embodiment, and are not described herein again.
Referring to fig. 7, fig. 7 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present invention. The data processing device comprises a memory 701 and a processor 702, wherein the memory 701 is used for storing programs, the processor 702 executes the programs stored in the memory 701, and when the programs stored in the memory 701 are executed, the processor 702 is used for executing the data processing method of any one of the above embodiments.
As an alternative embodiment, the computer program may be divided into one or more modules/units (e.g. computer program 1, computer program 2, …) which are stored in the memory 701 and executed by the processor 702 to implement the invention. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, which describe the execution of the computer program in the data processing device.
The Processor 702 may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. A general-purpose processor may be a microprocessor, or the Processor 702 may be any conventional processor. The Processor 702 is the control center of the data processing device and connects the various parts of the device using various interfaces and lines.
The memory 701 mainly includes a program storage area and a data storage area: the program storage area may store an operating system, an application program required for at least one function, and the like, and the data storage area may store related data and the like. In addition, the memory 701 may be a high-speed random access memory or a non-volatile memory such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, or a Flash Card, or the memory 701 may be another volatile solid-state memory device.
It should be noted that the data processing device may include, but is not limited to, a processor and a memory. Those skilled in the art will understand that the structural diagram of fig. 7 is only an example of the data processing device and does not constitute a limitation on it; the device may include more or fewer components than shown, combine some components, or use different components.
The embodiment of the present invention further provides a computer-readable storage medium, where the computer-readable storage medium includes a stored computer program, and when the computer program runs, the apparatus where the computer-readable storage medium is located is controlled to execute the data processing method according to any of the above embodiments.
The constituent modules of the above-described apparatus, if implemented in the form of software functional units and sold or used as independent products, may be stored in a computer-readable storage medium. Based on this understanding, all or part of the technical solution of the present application may, in essence, be embodied in the form of a software product, and the computer product is stored in the computer-readable storage medium.
The computer readable storage medium may be an internal storage unit of the device of the foregoing embodiment, such as a hard disk or a memory. The computer readable storage medium may be an external storage device of the above-described apparatus, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like. Further, the computer-readable storage medium may include both an internal storage unit and an external storage device of the device. The computer-readable storage medium is used for storing the computer program and other programs and data required by the apparatus. The above-described computer-readable storage medium may also be used to temporarily store data that has been output or is to be output.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments may be implemented by a computer program, which can be stored in a computer-readable storage medium, and can include the processes of the above embodiments of the methods when the computer program is executed. And the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of a unit is merely a logical division, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
The integrated unit implemented in the form of a software functional unit may be stored in a computer readable storage medium. The software functional unit is stored in a storage medium and includes several instructions to enable a computer device (which may be a personal computer, a server, or a network device) or a processor (processor) to execute some steps of the methods according to the embodiments of the present application. And the aforementioned storage medium includes: a U disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
To sum up, the embodiments of the present invention provide a data processing method, apparatus, device and computer storage medium, which obtain a static intermediate representation and a dynamic intermediate representation of a deep learning model, the deep learning model being a model constructed based on a front-end model framework, where in the computational subgraph corresponding to the static intermediate representation each directed edge bears static tensor data, and in the computational subgraph corresponding to the dynamic intermediate representation each directed edge bears dynamic tensor data; convert the static intermediate representation into a static intermediate representation independent of the front-end model framework and optimize the converted static intermediate representation to obtain an optimized static intermediate representation; screen at least one computational subgraph among the computational subgraphs corresponding to the dynamic intermediate representation according to the running time corresponding to each subgraph to obtain at least one computational subgraph to be optimized; optimize the at least one computational subgraph to be optimized to obtain an optimized dynamic intermediate representation, the optimized dynamic intermediate representation comprising at least an optimized computational subgraph; and compile the optimized static intermediate representation and the optimized dynamic intermediate representation to obtain an executable target code of a back-end hardware device, so that the back-end hardware device executes the target code to realize the function of the target code. The embodiments of the present invention optimize the static intermediate representation and the dynamic intermediate representation simultaneously, achieving a better optimization effect; they can realize independence from the front-end framework and from the back-end platform, improving the portability of the intermediate representation and making it suitable for various hardware platforms.
While the foregoing is directed to embodiments of the present invention, it will be appreciated by those skilled in the art that various changes may be made in the embodiments without departing from the principles of the invention, and that such changes and modifications are intended to be included within the scope of the invention.

Claims (16)

1. A data processing method, comprising:
obtaining static intermediate representation and dynamic intermediate representation of the deep learning model; the deep learning model is a model constructed based on a front-end model framework; in a computational sub-graph corresponding to the static intermediate representation, each directed edge bears static tensor data; in a computational sub-graph corresponding to the dynamic intermediate representation, dynamic tensor data are loaded on each directed edge;
converting the static intermediate representation into a static intermediate representation irrelevant to the front-end model frame, and optimizing the converted static intermediate representation to obtain an optimized static intermediate representation;
screening at least one computational subgraph in the computational subgraph corresponding to the dynamic intermediate representation according to the respective corresponding running time of the at least one computational subgraph to obtain at least one computational subgraph to be optimized;
optimizing the at least one computational subgraph to be optimized to obtain an optimized dynamic intermediate representation; wherein the optimized dynamic intermediate representation comprises at least an optimized computational subgraph;
compiling the optimized static intermediate representation and the optimized dynamic intermediate representation to obtain an executable target code of a back-end hardware device, so that the back-end hardware device executes the target code to realize the function of the target code.
2. The data processing method of claim 1, wherein the optimizing the translated static intermediate representation comprises:
fusing nodes contained in the converted static intermediate representation so as to fuse a plurality of nodes with the same attribute in the converted static intermediate representation into the same kernel function;
or,
obtaining a layout requirement corresponding to tensor operation of each node in the converted static intermediate representation;
and performing data layout conversion on the static tensor data of the converted static intermediate representation according to the layout requirement.
3. The data processing method of claim 1, wherein the optimizing the translated static intermediate representation comprises:
fusing nodes contained in the converted static intermediate representation so as to fuse a plurality of nodes with the same attribute in the converted static intermediate representation into the same kernel function to obtain at least one fused node;
acquiring layout requirements corresponding to tensor operation of each fusion node;
and performing data layout conversion on the static tensor data of the fused static intermediate representation according to the layout requirement.
4. The data processing method of claim 1, wherein the screening the at least one computational subgraph according to the respective running time duration of the at least one computational subgraph to obtain at least one computational subgraph to be optimized comprises:
aiming at the changed computation subgraph, acquiring the running time of the changed computation subgraph executed on the back-end hardware equipment;
and under the condition that the running time length is greater than a first threshold value, taking the computation subgraph corresponding to the running time length as the computation subgraph to be optimized.
5. The data processing method of claim 1, wherein the screening the at least one computational subgraph according to the respective running time duration of the at least one computational subgraph to obtain at least one computational subgraph to be optimized comprises:
aiming at the changed computation subgraph, acquiring the running time of the changed computation subgraph executed on the back-end hardware equipment, and sorting the running times to obtain a ranking result; wherein, in the ranking result, the longer the running time, the higher the ranking;
and taking the computation subgraphs corresponding to the top L running times in the ranking result as the computation subgraphs to be optimized, wherein L is greater than 0.
6. The data processing method of claim 1, wherein the screening the at least one computational subgraph according to the respective running time duration of the at least one computational subgraph to obtain at least one computational subgraph to be optimized comprises:
aiming at the same changed computation subgraph, acquiring a first running time of the changed computation subgraph executed on the back-end hardware equipment and a second running time of the computation subgraph before changing executed on the back-end hardware equipment;
and under the condition that the difference value between the first operation time length and the second operation time length is larger than a second threshold value, taking the computation subgraph corresponding to the first operation time length as the computation subgraph to be optimized.
7. A data processing method, comprising:
acquiring a target code and an image to be processed;
processing the image to be processed according to the target code to obtain an image processing result; wherein the object code is obtained by a data processing method according to any one of claims 1 to 6.
8. A data processing apparatus, characterized by comprising a receiving unit and a processing unit:
the receiving unit is used for acquiring a static intermediate representation and a dynamic intermediate representation of the deep learning model; the deep learning model is a model constructed based on a front-end model framework; in a computational sub-graph corresponding to the static intermediate representation, each directed edge bears static tensor data; in a computational sub-graph corresponding to the dynamic intermediate representation, dynamic tensor data are loaded on each directed edge;
the processing unit is used for converting the static intermediate representation into a static intermediate representation irrelevant to the front-end model frame and optimizing the converted static intermediate representation to obtain an optimized static intermediate representation; screening at least one computational subgraph in the computational subgraph corresponding to the dynamic intermediate representation according to the respective corresponding running time of the at least one computational subgraph to obtain at least one computational subgraph to be optimized; optimizing the at least one computational subgraph to be optimized to obtain an optimized dynamic intermediate representation; wherein the optimized dynamic intermediate representation comprises at least an optimized computational subgraph; compiling the optimized static intermediate representation and the optimized dynamic intermediate representation to obtain an executable target code of a back-end hardware device, so that the back-end hardware device executes the target code to realize the function of the target code.
9. The data processing apparatus according to claim 8, wherein the processing unit optimizes the converted static intermediate representation by:
fusing nodes contained in the converted static intermediate representation so as to fuse a plurality of nodes with the same attribute in the converted static intermediate representation into the same kernel function;
or,
obtaining a layout requirement corresponding to tensor operation of each node in the converted static intermediate representation;
and performing data layout conversion on the static tensor data of the converted static intermediate representation according to the layout requirement.
10. The data processing apparatus according to claim 8, wherein the processing unit optimizes the converted static intermediate representation by:
fusing nodes contained in the converted static intermediate representation so as to fuse a plurality of nodes with the same attribute in the converted static intermediate representation into the same kernel function to obtain at least one fused node;
acquiring layout requirements corresponding to tensor operation of each fusion node;
and performing data layout conversion on the static tensor data of the fused static intermediate representation according to the layout requirement.
11. The data processing apparatus according to claim 8, wherein the processing unit screens the at least one computational subgraph according to the respective running time duration corresponding to the at least one computational subgraph to obtain the at least one computational subgraph to be optimized, specifically comprising:
aiming at the changed computation subgraph, acquiring the running time of the changed computation subgraph executed on the back-end hardware equipment;
and under the condition that the running time length is greater than a first threshold value, taking the computation subgraph corresponding to the running time length as the computation subgraph to be optimized.
12. The data processing apparatus according to claim 8, wherein the processing unit screens the at least one computational subgraph according to the respective running time duration corresponding to the at least one computational subgraph to obtain the at least one computational subgraph to be optimized, specifically comprising:
aiming at the changed computation subgraph, acquiring the running time of the changed computation subgraph executed on the back-end hardware equipment, and sorting the running times to obtain a ranking result; wherein, in the ranking result, the longer the running time, the higher the ranking;
and taking the computation subgraphs corresponding to the top L running times in the ranking result as the computation subgraphs to be optimized, wherein L is greater than 0.
13. The data processing apparatus according to claim 8, wherein the processing unit screens the at least one computational subgraph according to the respective running time duration corresponding to the at least one computational subgraph to obtain the at least one computational subgraph to be optimized, specifically comprising:
aiming at the same changed computation subgraph, acquiring a first running time of the changed computation subgraph executed on the back-end hardware equipment and a second running time of the computation subgraph before changing executed on the back-end hardware equipment;
and under the condition that the difference value between the first operation time length and the second operation time length is larger than a second threshold value, taking the computation subgraph corresponding to the first operation time length as the computation subgraph to be optimized.
14. A data processing apparatus, characterized by comprising a receiving unit and a processing unit:
the receiving unit is used for acquiring a target code and an image to be processed;
the processing unit is used for processing the image to be processed according to the target code to obtain an image processing result; wherein the object code is obtained by a data processing method according to any one of claims 1 to 6.
15. A data processing apparatus, characterized by comprising: a memory for storing a program and a processor for executing the program stored by the memory, the processor being adapted to perform the data processing method of any of claims 1-6 or the data processing method of claim 7 when the program stored by the memory is executed.
16. A computer-readable storage medium, comprising a stored computer program, wherein the computer program, when executed, controls an apparatus in which the computer-readable storage medium is located to perform a data processing method according to any one of claims 1-6 or a data processing method according to claim 7.