CN114385180A - Data processing method, device and equipment and computer storage medium - Google Patents

Data processing method, device and equipment and computer storage medium

Info

Publication number
CN114385180A
CN114385180A
Authority
CN
China
Prior art keywords
intermediate representation
loop
scop
expansion factor
iteration
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111554513.3A
Other languages
Chinese (zh)
Inventor
阳柳
杨强
邬轩
刘勇鹏
顾剑
李文成
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Phytium Technology Co Ltd
Original Assignee
Phytium Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Phytium Technology Co Ltd filed Critical Phytium Technology Co Ltd
Priority to CN202111554513.3A
Publication of CN114385180A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/40 Transformation of program code
    • G06F8/41 Compilation
    • G06F8/44 Encoding
    • G06F8/447 Target code generation
    • G06F8/443 Optimisation
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Devices For Executing Special Programs (AREA)

Abstract

The invention provides a data processing method, apparatus, device, and computer storage medium. The method comprises: acquiring an intermediate representation of a deep learning model; setting a loop unrolling factor for the intermediate representation, wherein the loop unrolling factor is related to information about the intermediate representation when executed by a back-end hardware device and/or to device information of the back-end hardware device; performing loop unrolling on the intermediate representation according to the loop unrolling factor to obtain an optimized intermediate representation; and compiling the optimized intermediate representation into object code executable by the back-end hardware device, so that the back-end hardware device executes the object code to realize its function. With embodiments of the invention, the loop unrolling factor can be computed from the runtime information of the intermediate representation and/or the device information of the back-end hardware device, yielding a more accurate unrolling factor; unrolling the intermediate representation by this factor enables instruction scheduling over a wider range and improves the portability of the intermediate representation.

Description

Data processing method, device and equipment and computer storage medium
Technical Field
The present invention relates to the field of computer software technologies, and in particular, to a data processing method, apparatus, device, and computer storage medium.
Background
The wave of artificial-intelligence enthusiasm driven by machine learning and deep learning has been building for years. Deep learning is a research hotspot in artificial intelligence and has achieved breakthroughs in many fields; today there are many deep learning frameworks and many hardware platforms that support them. This diversity of frameworks and hardware benefits users greatly and is vital to the healthy development of the artificial-intelligence ecosystem, but supporting multiple frameworks and multiple hardware targets demands an enormous amount of work, which poses a significant challenge to artificial-intelligence developers.
As deep learning is applied more widely, the efficiency with which deep learning training and inference run on different hardware architectures draws increasing attention. Because deep learning has many different front ends and back ends, a bridge is needed to realize optimization and mapping between them effectively. An IR (Intermediate Representation) serves as the intermediary between source code and object code during compilation; its design is critical for a compiler and must balance the completeness of compiling from source code to object code against the usability and performance of compiler optimizations. Since most of a program's run time is spent in loops, optimizing the loop code in a program is particularly important.
Disclosure of Invention
Embodiments of the present invention provide a data processing method, apparatus, device, and computer storage medium that can optimize the loop code in an intermediate representation, thereby further improving the effect of compiler optimization on the intermediate representation, making a front-end-language-independent intermediate representation usable across various back-end hardware devices, and improving the portability of the intermediate representation.
In a first aspect, an embodiment of the present invention provides a data processing method, including:
acquiring an intermediate representation of a deep learning model, the deep learning model being constructed based on a front-end model framework; setting a loop unrolling factor for the intermediate representation, the loop unrolling factor being related to information about the intermediate representation when executed by a back-end hardware device and/or to device information of the back-end hardware device;
performing loop unrolling on the intermediate representation according to the loop unrolling factor to obtain an optimized intermediate representation;
compiling the optimized intermediate representation into object code executable by the back-end hardware device, so that the back-end hardware device executes the object code to realize its function.
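To make the unrolling step concrete, the sketch below is a generic illustration (not taken from the patent) of unrolling a simple reduction loop by a factor of 4, including the epilogue loop that covers iterations left over when the trip count is not a multiple of the factor:

```python
def sum_rolled(xs):
    """Original loop: one addition per iteration."""
    total = 0
    for i in range(len(xs)):
        total += xs[i]
    return total

def sum_unrolled_by_4(xs):
    """The same loop unrolled by a factor of 4: four additions per
    iteration of the main loop, plus an epilogue loop that handles the
    iterations left over when the trip count is not a multiple of 4."""
    total = 0
    n = len(xs)
    main = n - n % 4              # iterations covered by the unrolled body
    i = 0
    while i < main:
        # The four additions are independent of the loop-counter update,
        # giving the instruction scheduler a wider window to fill.
        total += xs[i]
        total += xs[i + 1]
        total += xs[i + 2]
        total += xs[i + 3]
        i += 4
    for j in range(main, n):      # epilogue: remaining 0-3 iterations
        total += xs[j]
    return total
```

The unrolled form executes fewer branch and counter-update instructions per element and exposes more independent operations per loop body, which is what enables the wider-range instruction scheduling described above.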
Compared with the prior art, the data processing method provided by the first aspect computes the loop unrolling factor from runtime information of the intermediate representation and/or device information of the back-end hardware device, yielding a more accurate unrolling factor. Unrolling the intermediate representation by this factor allows instruction scheduling over a wider range and better exploits instruction-level parallelism, thereby further optimizing the loop code in the intermediate representation, improving the effect of compiler optimization on it, making the front-end-language-independent intermediate representation usable across various back-end hardware devices, and improving its portability.
As an optional embodiment of the first aspect, the information about the intermediate representation when executed by the back-end hardware device includes at least one of: M loop structures in the intermediate representation, and the number of iterations of each of the M loop structures, where M is a positive integer greater than 1.
It should be noted that loop code accounts for a large share of the intermediate representation, and some parameter information in the loop code cannot be determined at compile time. Computing the loop unrolling factor from parameter information that only becomes known at run time, that is, the information about the intermediate representation when executed by the back-end hardware device, yields a more accurate unrolling factor and improves the optimization of the intermediate representation.
As an optional embodiment of the first aspect, the device information of the back-end hardware device includes at least one of a register parameter, a code-size parameter, and a functional-unit parameter;
the register parameter indicates the number of registers in the back-end hardware device; the code-size parameter indicates the maximum size of program code the back-end hardware device can run; the functional-unit parameter indicates the number of functional units in the back-end hardware device, a functional unit being a component that performs arithmetic operations.
It should be noted that when performing loop unrolling, too small an unrolling factor leaves exploitable parallelism on the table, while too large a factor increases register pressure and may even cause register spills, so the choice of unrolling factor is crucial. By taking the device information of the back-end hardware device into account when computing the unrolling factor, embodiments of the invention obtain a more accurate factor and select a more suitable number of unrolled copies, further optimizing the loop code in the intermediate representation.
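As a sketch of how these three device parameters could bound the factor, the function below picks the largest factor that fits the register file, the code-size budget, and the number of functional units. The specific formulas, defaults, and the divisibility preference are illustrative assumptions, not taken from the patent:

```python
def pick_unroll_factor(loop_body_regs, loop_body_insts, trip_count,
                       num_registers=32, max_code_bytes=4096,
                       num_alus=4, inst_bytes=4):
    """Pick the largest unroll factor that (a) keeps live values within
    the register file, (b) keeps the unrolled body within the code-size
    budget, and (c) does not exceed the parallelism the functional units
    can consume. All heuristics here are illustrative."""
    by_regs = max(1, num_registers // max(1, loop_body_regs))
    by_code = max(1, max_code_bytes // max(1, loop_body_insts * inst_bytes))
    by_alus = num_alus
    factor = min(by_regs, by_code, by_alus)
    # Never unroll past the trip count, and prefer a factor that divides
    # it evenly so no epilogue loop is needed.
    factor = min(factor, trip_count)
    while factor > 1 and trip_count % factor != 0:
        factor -= 1
    return factor
```

For a body using 4 registers and 16 instructions with a trip count of 100, the functional-unit bound (4) dominates and 100 is divisible by 4, so the factor is 4; a register-hungry body (16 registers) with trip count 7 is capped at 2 by the register file and then lowered to 1 by the divisibility preference.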
As an optional embodiment of the first aspect, after the acquiring of the intermediate representation of the deep learning model, the method further includes:
setting an iteration-count threshold N for the intermediate representation, and judging whether the loop unrolling factor K_i computed in the i-th loop iteration over the intermediate representation is the same as the loop unrolling factor K_{i-1} computed in the (i-1)-th loop iteration, where i ≥ 2 and N ≥ 2;
if K_i and K_{i-1} are the same, ending the loop-iteration process and taking K_i (equal to K_{i-1}) as the final loop unrolling factor;
if K_i and K_{i-1} are different, judging whether i < N holds;
if so, performing the (i+1)-th loop iteration on the intermediate representation and judging whether the loop unrolling factor K_{i+1} computed in the (i+1)-th loop iteration is the same as the loop unrolling factor K_i computed in the i-th loop iteration, until a first loop unrolling factor computed in the current loop iteration is the same as a second loop unrolling factor computed in the previous loop iteration; the loop-iteration process then ends, and either of the first and second loop unrolling factors is taken as the final loop unrolling factor;
if not, ending the loop-iteration process, computing the differences between the loop unrolling factors of each pair of adjacent loop iterations, finding the smallest of these differences, and taking either of the two loop unrolling factors corresponding to the smallest difference as the final loop unrolling factor;
then the loop unrolling of the intermediate representation according to the loop unrolling factor to obtain an optimized intermediate representation is specifically:
performing loop unrolling on the intermediate representation according to the final loop unrolling factor to obtain the optimized intermediate representation.
It is worth noting that the loop unrolling factor is continually updated by an iterative-compilation-like method; by weighing multiple factors, a suitable unrolling factor can be selected for loop optimization, which optimizes the loop code in the program, mines data-level and instruction-level parallelism, and markedly improves program performance.
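The convergence procedure described in this embodiment can be sketched as follows, with `compute_factor(i)` standing in for one profiling-and-recomputation round (the callback and its shape are an assumption for illustration):

```python
def converge_unroll_factor(compute_factor, max_iters):
    """Iterate the factor computation until two consecutive rounds agree
    (K_i == K_{i-1}); if the iteration budget `max_iters` (the threshold
    N) is exhausted without convergence, fall back to the adjacent pair
    of factors whose difference is smallest, as the embodiment describes.
    `compute_factor(i)` represents one recompilation/profiling round."""
    factors = [compute_factor(1)]
    for i in range(2, max_iters + 1):
        factors.append(compute_factor(i))
        if factors[-1] == factors[-2]:      # K_i == K_{i-1}: converged
            return factors[-1]
    # No convergence within N rounds: take either member of the adjacent
    # pair with the smallest difference.
    diffs = [abs(a - b) for a, b in zip(factors, factors[1:])]
    best = diffs.index(min(diffs))
    return factors[best]
```

For a sequence of factors 8, 4, 4 the process converges to 4 in the third round; for a sequence 8, 4, 2, 1 that never repeats, the smallest adjacent difference is between 2 and 1, so 2 (the first of that pair) is returned.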
As an optional embodiment of the first aspect, after the acquiring of the intermediate representation of the deep learning model, the method further includes:
acquiring dynamic SCOPs and static SCOPs based on the intermediate representation, converting the dynamic SCOPs into static SCOPs, and obtaining a first intermediate representation from the acquired static SCOPs and the converted static SCOPs; a dynamic SCOP is program code represented by a polyhedral model that carries parameter information the back-end hardware device cannot resolve when compiling the SCOP; a static SCOP is program code represented by a polyhedral model that carries no parameter information the back-end hardware device cannot resolve when compiling the SCOP;
performing polyhedral optimization on the first intermediate representation to obtain a second intermediate representation;
then the loop unrolling of the intermediate representation according to the loop unrolling factor to obtain an optimized intermediate representation is specifically:
performing loop unrolling on the second intermediate representation according to the loop unrolling factor to obtain the optimized intermediate representation.
It is worth noting that, for loops containing runtime parameter information, runtime optimization obtains that information and derives a more accurate loop unrolling factor from the actual requirements of the program code, so that loop unrolling delivers its benefit more effectively.
As an optional embodiment of the first aspect, the converting the dynamic SCOP into a static SCOP includes:
acquiring the parameter information of the dynamic SCOP while the back-end hardware device is running;
adjusting, according to the acquired parameter information, the parameter information carried by the dynamic SCOP that the back-end hardware device cannot resolve when compiling the SCOP, to obtain the converted static SCOP.
It should be noted that some loop code contains parameter information that can only be determined at run time, and static detection with a polyhedral tool alone cannot handle such loops. Because more information about the actual loop structure is available at run time, and more loops can therefore be optimized, runtime optimization is needed to gather this information, optimize more loops, and obtain better performance.
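The dynamic-to-static conversion amounts to substituting observed runtime values for the symbolic parameters the compiler could not resolve. The toy sketch below uses plain dictionaries in place of a real SCOP data structure; all field names are illustrative assumptions:

```python
def specialize_scop(scop_template, runtime_params):
    """A dynamic SCOP carries symbolic parameters (e.g. an unknown trip
    count 'n') that cannot be analyzed at compile time. Substituting the
    values observed at run time yields a static SCOP that polyhedral
    tooling can fully analyze. Field names here are illustrative."""
    static_scop = dict(scop_template)
    unresolved = static_scop.pop("symbolic_params")
    static_scop["params"] = {p: runtime_params[p] for p in unresolved}
    static_scop["kind"] = "static"
    return static_scop

# A dynamic SCOP whose iteration-domain bound 'n' is unknown at compile time.
dynamic = {"kind": "dynamic",
           "domain": "{ S[i] : 0 <= i < n }",
           "symbolic_params": ["n"]}

# At run time the back-end device observes n = 1024 and specializes.
static = specialize_scop(dynamic, {"n": 1024})
```

After specialization the iteration domain is fully bounded, so the loop becomes amenable to the same polyhedral optimization as loops that were static to begin with.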
As an optional embodiment of the first aspect, performing polyhedral optimization on the first intermediate representation to obtain a second intermediate representation includes:
performing polyhedral modeling on the first intermediate representation to obtain its iteration domain, access function, and affine schedule; the iteration domain characterizes the multidimensional iteration space of each statement in the SCOP within its multi-layer loop; the access function characterizes the mapping from the iteration domain to arrays with affine subscripts; the affine schedule characterizes the mapping from the iteration domain to the logical execution time of each statement in the SCOP;
optimizing the first intermediate representation according to its iteration domain, access function, and affine schedule to obtain an intermediate representation of the deep learning model that can be loop-unrolled.
It is worth noting that, compared with applying loop transformation and iterative compilation separately, combining the polyhedral model with iterative compilation mines the optimal sequence of loop transformations and the optimal transformation parameters, performs code transformation and parameter search automatically, and achieves a better performance-optimization effect.
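The three components named in this embodiment can be made tangible with a toy polyhedral view of a two-level loop nest, using plain Python data in place of a real polyhedral library (such as isl); everything below is an illustrative sketch, not the patent's representation:

```python
# Toy polyhedral model of the loop nest
#   for i in range(N):
#       for j in range(M):
#           A[i][j] = A[i][j] + 1      # statement S
N, M = 4, 3

# Iteration domain: the set of integer points at which S executes.
iteration_domain = [(i, j) for i in range(N) for j in range(M)]

# Access function: affine map from an iteration point to the array cell
# it touches; here the identity access A[i][j].
def access_fn(i, j):
    return (i, j)

# Affine schedule: maps each iteration point to a logical execution time.
# The identity schedule preserves the original i-outer, j-inner order;
# an optimizer would substitute a different affine map here.
def schedule(i, j):
    return (i, j)

# Sorting the domain by the schedule recovers the execution order.
execution_order = sorted(iteration_domain, key=lambda p: schedule(*p))
```

Optimization in this framework means replacing `schedule` with a different affine map (e.g. swapping `i` and `j` for loop interchange) while dependence constraints derived from `access_fn` over `iteration_domain` guarantee legality.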
In a second aspect, an embodiment of the present invention provides a data processing method, including:
acquiring a target code and an image to be processed;
processing the image to be processed according to the object code to obtain an image processing result, the object code being obtained by the data processing method of any embodiment of the first aspect.
It is worth noting that loops account for most of the running time of many applications, especially applications in wireless communication and video/image processing. Optimizing the loop code in a program therefore further improves its performance.
In a third aspect, an embodiment of the present invention provides a data processing apparatus, including a receiving unit and a processing unit:
the receiving unit is configured to acquire an intermediate representation of a deep learning model, the deep learning model being constructed based on a front-end model framework, and to set a loop unrolling factor for the intermediate representation, the loop unrolling factor being related to information about the intermediate representation when executed by a back-end hardware device and/or to device information of the back-end hardware device;
the processing unit is configured to perform loop unrolling on the intermediate representation according to the loop unrolling factor to obtain an optimized intermediate representation;
the processing unit is further configured to compile the optimized intermediate representation into object code executable by the back-end hardware device, so that the back-end hardware device executes the object code to realize its function.
As an optional embodiment of the third aspect, the information about the intermediate representation when executed by the back-end hardware device includes at least one of: M loop structures in the intermediate representation, and the number of iterations of each of the M loop structures, where M is a positive integer greater than 1.
As an optional embodiment of the third aspect, the device information of the back-end hardware device includes at least one of a register parameter, a code-size parameter, and a functional-unit parameter;
the register parameter indicates the number of registers in the back-end hardware device; the code-size parameter indicates the maximum size of program code the back-end hardware device can run; the functional-unit parameter indicates the number of functional units in the back-end hardware device, a functional unit being a component that performs arithmetic operations.
As an alternative embodiment of the third aspect, after the receiving unit obtains the intermediate representation of the deep learning model, the processing unit is further configured to:
set an iteration-count threshold N for the intermediate representation, and judge whether the loop unrolling factor K_i computed in the i-th loop iteration over the intermediate representation is the same as the loop unrolling factor K_{i-1} computed in the (i-1)-th loop iteration, where i ≥ 2 and N ≥ 2;
if K_i and K_{i-1} are the same, end the loop-iteration process and take K_i (equal to K_{i-1}) as the final loop unrolling factor;
if K_i and K_{i-1} are different, judge whether i < N holds;
if so, perform the (i+1)-th loop iteration on the intermediate representation and judge whether the loop unrolling factor K_{i+1} computed in the (i+1)-th loop iteration is the same as the loop unrolling factor K_i computed in the i-th loop iteration, until a first loop unrolling factor computed in the current loop iteration is the same as a second loop unrolling factor computed in the previous loop iteration; the loop-iteration process then ends, and either of the first and second loop unrolling factors is taken as the final loop unrolling factor;
if not, end the loop-iteration process, compute the differences between the loop unrolling factors of each pair of adjacent loop iterations, find the smallest of these differences, and take either of the two loop unrolling factors corresponding to the smallest difference as the final loop unrolling factor;
then the processing unit's loop unrolling of the intermediate representation according to the loop unrolling factor to obtain an optimized intermediate representation is specifically:
performing loop unrolling on the intermediate representation according to the final loop unrolling factor to obtain the optimized intermediate representation.
As an alternative embodiment of the third aspect, after the receiving unit obtains the intermediate representation of the deep learning model, the processing unit is further configured to:
acquire dynamic SCOPs and static SCOPs based on the intermediate representation, convert the dynamic SCOPs into static SCOPs, and obtain a first intermediate representation from the acquired static SCOPs and the converted static SCOPs; a dynamic SCOP is program code represented by a polyhedral model that carries parameter information the back-end hardware device cannot resolve when compiling the SCOP; a static SCOP is program code represented by a polyhedral model that carries no parameter information the back-end hardware device cannot resolve when compiling the SCOP;
performing polyhedral optimization on the first intermediate representation to obtain a second intermediate representation;
then the processing unit's loop unrolling of the intermediate representation according to the loop unrolling factor to obtain an optimized intermediate representation is specifically:
performing loop unrolling on the second intermediate representation according to the loop unrolling factor to obtain the optimized intermediate representation.
As an optional embodiment of the third aspect, the converting, by the processing unit, the dynamic SCOP into a static SCOP includes:
acquiring the parameter information of the dynamic SCOP while the back-end hardware device is running;
adjusting, according to the acquired parameter information, the parameter information carried by the dynamic SCOP that the back-end hardware device cannot resolve when compiling the SCOP, to obtain the converted static SCOP.
As an optional embodiment of the third aspect, the performing, by the processing unit, polyhedral optimization on the first intermediate representation to obtain a second intermediate representation includes:
performing polyhedral modeling on the first intermediate representation to obtain its iteration domain, access function, and affine schedule; the iteration domain characterizes the multidimensional iteration space of each statement in the SCOP within its multi-layer loop; the access function characterizes the mapping from the iteration domain to arrays with affine subscripts; the affine schedule characterizes the mapping from the iteration domain to the logical execution time of each statement in the SCOP;
optimizing the first intermediate representation according to its iteration domain, access function, and affine schedule to obtain an intermediate representation of the deep learning model that can be loop-unrolled.
It should be noted that the specific implementations and advantageous effects of the embodiments of the data processing apparatus provided in the third aspect are the same as those of the embodiments of the data processing method provided in the first aspect, and are not repeated here.
In a fourth aspect, an embodiment of the present invention provides a data processing apparatus, including a receiving unit and a processing unit:
the receiving unit is configured to acquire object code and an image to be processed;
the processing unit is configured to process the image to be processed according to the object code to obtain an image processing result, the object code being obtained by the data processing method of any embodiment of the first aspect.
It is worth noting that loops account for most of the running time of many applications, especially applications in wireless communication and video/image processing. Optimizing the loop code in a program therefore further improves its performance.
In a fifth aspect, an embodiment of the present invention provides a data processing device, including a memory for storing a program and a processor for executing the program stored in the memory; when the program is executed, the processor performs the data processing method of any embodiment of the first aspect or the data processing method of the second aspect.
In a sixth aspect, an embodiment of the present invention provides a computer-readable storage medium storing a computer program; when the computer program runs, it controls the device on which the computer-readable storage medium resides to perform the data processing method of any embodiment of the first aspect or the data processing method of the second aspect.
Compared with the prior art, the data processing method, apparatus, device, and computer storage medium provided by embodiments of the invention compute the loop unrolling factor from runtime information of the intermediate representation and/or device information of the back-end hardware device, yielding a more accurate unrolling factor. Unrolling the intermediate representation by this factor allows instruction scheduling over a wider range and better exploits instruction-level parallelism, thereby further optimizing the loop code in the intermediate representation, improving the effect of compiler optimization on it, making the front-end-language-independent intermediate representation usable across various back-end hardware devices, and improving its portability.
Drawings
FIG. 1 is a schematic flow chart diagram illustrating one embodiment of a data processing method according to a first aspect of the present invention;
FIG. 2 is a flow chart illustrating one embodiment of a data processing method according to a second aspect of the present invention;
FIG. 3 is a schematic block diagram of an embodiment of a data processing apparatus according to a third aspect of the present invention;
FIG. 4 is a schematic structural diagram of an embodiment of a data processing apparatus according to a fourth aspect of the present invention;
FIG. 5 is a schematic structural diagram of an embodiment of a data processing device according to a fifth aspect of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In a first aspect, an embodiment of the present invention provides a data processing method, please refer to fig. 1, which is a schematic flow chart of an embodiment of the data processing method provided in the first aspect of the present invention, where the method includes steps S101 to S103:
S101, acquiring an intermediate representation of a deep learning model, the deep learning model being constructed based on a front-end model framework; setting a loop unrolling factor for the intermediate representation, the loop unrolling factor being related to information about the intermediate representation when executed by a back-end hardware device and/or to device information of the back-end hardware device.
In step S101, an intermediate representation of a deep learning model is obtained, the deep learning model being constructed based on a front-end model framework, and a loop unrolling factor is set for the obtained intermediate representation. The specific value of the unrolling factor is related to information about the intermediate representation when executed by a back-end hardware device and/or to device information of the back-end hardware device. Specifically, the unrolling factor may be computed only from the information about the intermediate representation when executed by the back-end hardware device, or only from the device information of the back-end hardware device, or from a combination of the two.
It can be understood that the new wave of artificial-intelligence enthusiasm represented by machine learning and deep learning has continued for years. Deep learning has been a research focus in the field of artificial intelligence in recent years and has made breakthrough progress in many fields, and nowadays there are many deep learning frameworks and many hardware platforms supporting them. As deep learning is applied more and more widely, increasing attention is paid to the implementation efficiency of training and inference of deep learning algorithms on different hardware architectures, and because deep learning has many different front ends and back ends, a bridge is needed to effectively realize the optimization and mapping between them. An IR (Intermediate Representation) is the intermediary for the translation between source code and object code in the program compiling process; the design of the IR is critical for a compiler, since the IR must take into account the completeness of the compilation from source code to object code as well as the ease of use and performance of compilation optimization. Therefore, competition over intermediate representations will be an important part of future frameworks.
The loop unrolling factor is the number of times the loop body is copied, i.e., the number of times the loop is unrolled, and it determines the optimization effect of loop unrolling. The time at which the loop unrolling factor is determined can be divided into compile time and run time: if the various pieces of parameter information in the loop code of the intermediate representation can be obtained at compile time, the loop unrolling factor can be determined at compile time; if part of the parameter information of the loop is known only when the back-end hardware device runs, the parameter information of the code at run time, i.e., the information of the intermediate representation when executed by the back-end hardware device, needs to be collected, and the loop unrolling factor is determined by iterative compilation, so that a more accurate loop unrolling factor is calculated.
S102, performing loop unrolling on the intermediate representation according to the loop unrolling factor to obtain an optimized intermediate representation.
In step S102, after the intermediate representation of the deep learning model and the loop unrolling factor set for it are acquired according to step S101, loop unrolling optimization is performed on the intermediate representation with the acquired loop unrolling factor, and the optimized intermediate representation is obtained accordingly. Loop unrolling is an important loop optimization method in the field of compilation: it refers to copying the loop body code multiple times, or extracting one or more iterations at the beginning or end of the loop out of the loop. Loop unrolling is closely related to the structure of the target machine, the actual situation of the loop code, the choice of the loop unrolling factor and the time at which the unrolling is performed, and many factors can influence its effect. After the intermediate representation is unrolled according to the loop unrolling factor, the loop body code is copied the selected number of times, which enlarges the basic blocks, widens the data reorganization window and exposes more data reorganization opportunities, thereby further optimizing the intermediate representation and improving the compilation effect.
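As an intuitive illustration of the transformation itself (a sketch for exposition, not taken from the embodiment), the following Python example mimics unrolling a reduction loop by a factor of 4, with an epilogue loop for the leftover iterations; the function names and the summation workload are illustrative assumptions:

```python
def sum_rolled(a):
    """Reference loop: one copy of the loop body per iteration."""
    total = 0
    for i in range(len(a)):
        total += a[i]
    return total

def sum_unrolled4(a):
    """Same loop unrolled by a factor of 4: the loop body is copied
    4 times, so the trip count (and per-iteration loop overhead)
    drops to 1/4, and the 4 copies form a larger basic block over
    which an instruction scheduler could work."""
    n = len(a)
    total = 0
    i = 0
    main_trips = n - n % 4   # iterations covered by the unrolled body
    while i < main_trips:
        total += a[i]
        total += a[i + 1]
        total += a[i + 2]
        total += a[i + 3]
        i += 4
    while i < n:             # epilogue: remaining n % 4 iterations
        total += a[i]
        i += 1
    return total
```

Both functions compute the same result; the unrolled version trades code size for fewer branch instructions, which is precisely the trade-off that the loop unrolling factor controls.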
It should be noted that there are many kinds of loop transformations, including loop unrolling, loop fusion, loop distribution, loop splitting, loop merging, loop tiling, loop padding, loop interchange, loop software pipelining, scalar expansion, and the like. In practice, a suitable type of loop transformation can be selected according to actual requirements to perform loop optimization on the intermediate representation; the loop optimization method of unrolling the intermediate representation according to the loop unrolling factor described in this embodiment is only one implementation, and all other embodiments obtained by a person of ordinary skill in the art by applying other loop transformation methods without creative effort fall within the protection scope of the present invention.
It is worth noting that, in practical applications, the types of intermediate representations vary greatly, and the difficulty of optimizing an intermediate representation increases exponentially with the scale of the expression items in it. Loop unrolling enlarges the basic blocks and allows instruction scheduling over a larger range, so instruction-level parallelism is better exploited; at the same time, it widens the data reorganization window and exposes more data reorganization opportunities, so data-level parallelism is also better exploited. Therefore, unrolling the intermediate representation with the loop unrolling factor can further optimize the intermediate representation and improve the compilation effect.
S103, compiling the optimized intermediate representation to obtain object code executable by the back-end hardware device, so that the back-end hardware device executes the object code to realize its function.
In step S103, after the optimized intermediate representation is obtained according to steps S101 and S102, it is compiled by a compiler into object code that can be executed by the back-end hardware device, so as to further improve, on the basis of the optimized intermediate representation, the adaptability and general performance of deep learning algorithms on different hardware architectures and artificial-intelligence development frameworks.
It can be understood that, as deep learning is applied more and more widely, increasing attention is paid to the implementation efficiency of training and inference of deep learning algorithms on different hardware architectures. The intermediate representation is an important link in a compiler; a good intermediate representation can express the information of the source program accurately and is convenient to transform. The intermediate representation may be operated on many times during transformation and thus become very complex, so in order to improve the running performance of the compiled object code, the intermediate representation often needs to be optimized. Compared with the prior art, the data processing method provided by the embodiment of the present invention calculates the loop unrolling factor from the information of the intermediate representation when executed by the back-end hardware device and/or the device information of the back-end hardware device, obtaining a more accurate loop unrolling factor; it then unrolls the intermediate representation with this factor so that instruction scheduling can be performed over a larger range and instruction-level parallelism is better exploited, thereby further optimizing the loop code in the intermediate representation, improving its compilation optimization effect, realizing the universality of a front-end-language-independent intermediate representation across various back-end hardware devices, and improving the portability of the intermediate representation.
As an optional embodiment of the first aspect, the information of the intermediate representation when executed by the back-end hardware device includes at least one of the M loop structures in the intermediate representation and the number of iterations corresponding to each of the M loop structures; M is a positive integer greater than 1.
In this embodiment, the information of the intermediate representation when executed by the back-end hardware device includes at least one of the M loop structures in the intermediate representation and the number of iterations corresponding to each of the M loop structures. Accordingly, when the loop unrolling factor is calculated from this information, it may be calculated from the M loop structures alone, from the iteration counts of the M loop structures alone, or from the M loop structures combined with their respective iteration counts; together with the calculation based on the device information of the back-end hardware device, this further improves the accuracy of the loop unrolling factor and yields a better optimized intermediate representation.
It can be understood that, in the prior art, loops account for most of the running time of many application programs, so optimizing the loop code in a program can improve its performance. However, the parameter information that loop code can provide at compile time is limited; for example, for a loop program with non-linear bounds, more parameter information about the actual loop structure can only be obtained at run time. Therefore, in order to obtain more information and optimize more loops, the loop code needs to be optimized at run time.
Both M and N are set according to the actual code structure of the intermediate representation, and are not limited herein.
It should be noted that, because loop code occupies a large proportion of the intermediate representation, and part of the parameter information in the loop code cannot be determined at the program compilation stage, calculating the loop unrolling factor by taking into account the parameter information that can only be determined when the program runs, i.e., the information of the intermediate representation when executed by the back-end hardware device, yields a more accurate loop unrolling factor and improves the optimization effect on the intermediate representation.
As an optional embodiment of the first aspect, the device information of the back-end hardware device includes at least one of a register parameter, a code volume parameter and a functional unit parameter;
the register parameter is used to indicate the number of registers contained in the back-end hardware device; the code volume parameter is used to indicate the size of the program code volume that the back-end hardware device can run; the functional unit parameter is used to indicate the number of functional units contained in the back-end hardware device; the functional units are the components that perform arithmetic functions.
In this embodiment, the device information of the back-end hardware device includes at least one of a register parameter, a code volume parameter and a functional unit parameter. Accordingly, when the loop unrolling factor is calculated from the device information, the calculation may use the register parameter alone, the code volume parameter alone, or the functional unit parameter alone; or any two of the parameters may be combined; or all three parameters may be combined. In practice, the three parameters contained in the device information can be combined in different ways according to the actual application scenario and requirements to calculate different loop unrolling factors, so that loop optimization can be performed in a targeted manner.
It will be appreciated that the choice of the loop unrolling factor is very important for loop optimization, and several factors, including the register parameter, the code volume parameter and the functional unit parameter, need to be considered when calculating it. Assuming that the number of registers available in the back-end hardware device is NR, the register parameter needs to satisfy the following constraints:
P_reg ≤ NR
1 ≤ R_i ≤ NR
where P_reg denotes the register parameter and R_i denotes the number of variables in the i-th loop level.
The code volume parameter needs to satisfy the following constraint:
P_code ≤ IC
where P_code denotes the code volume parameter and IC denotes the instruction cache size available in the back-end hardware device.
For the functional unit parameter P_fu, assume that the back-end hardware device contains n_l functional units of type l, which execute operations with instruction-level parallelism, and that the loop body contains s_l operations handled by the type-l functional units. The number of execution cycles required for the operations handled by the type-l functional units is then:
⌈s_l / n_l⌉
and after the loop is unrolled by a factor u, the execution overhead of the loop body is:
max_l ⌈u · s_l / n_l⌉
It should be noted that, when loop unrolling is performed, if the loop unrolling factor is too small, not enough parallelism can be mined, and if it is too large, register pressure increases and registers may even spill, so the choice of the loop unrolling factor is very important. The embodiment of the present invention introduces the device information of the back-end hardware device into the calculation of the loop unrolling factor, so that a more accurate factor is obtained and a more appropriate number of unrollings is selected, thereby further optimizing the loop code in the intermediate representation.
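The constraints above can be sketched in a short Python routine; the per-iteration register and instruction counts, the candidate range, and the greedy selection rule are illustrative assumptions rather than the embodiment's exact algorithm:

```python
import math

def choose_unroll_factor(regs_per_iter, insts_per_iter, ops_per_iter,
                         NR, IC, units, max_factor=16):
    """Return the largest unrolling factor u such that u copies of
    the loop body still fit the device: register pressure must not
    exceed the register count NR, and code volume must not exceed
    the instruction-cache size IC. `ops_per_iter` maps an operation
    type l to its count s_l per iteration; `units` maps the same
    type to the number n_l of functional units of that type."""
    best = 1
    for u in range(1, max_factor + 1):
        if u * regs_per_iter > NR:   # register constraint: P_reg <= NR
            break
        if u * insts_per_iter > IC:  # code volume constraint: P_code <= IC
            break
        best = u
    # execution overhead of the unrolled body in cycles, assuming the
    # type-l units process their operations in instruction-level parallel
    cycles = max(math.ceil(best * s / units[l])
                 for l, s in ops_per_iter.items())
    return best, cycles
```

For instance, with 4 live values and 10 instructions per iteration, NR = 32 registers and IC = 64 instructions, the code-volume constraint is the binding one and caps the factor at 6.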
As an optional embodiment of the first aspect, after the step S101, the method further includes:
setting an iteration number threshold N for the intermediate representation, and judging whether the loop unrolling factor K_i calculated in the i-th loop iteration of the intermediate representation is the same as the loop unrolling factor K_{i-1} calculated in the (i-1)-th loop iteration; wherein i ≥ 2 and N ≥ 2;
if K_i and K_{i-1} are the same, ending the loop iteration process, and taking either of K_i and K_{i-1} as the final loop unrolling factor;
if K_i and K_{i-1} are not the same, judging whether i < N is satisfied;
if so, performing the (i+1)-th loop iteration on the intermediate representation, and judging whether the loop unrolling factor K_{i+1} calculated in the (i+1)-th loop iteration is the same as the loop unrolling factor K_i calculated in the i-th loop iteration, until the first loop unrolling factor calculated in the current loop iteration is the same as the second loop unrolling factor calculated in the previous loop iteration, whereupon the loop iteration process ends and either of the first and second loop unrolling factors is taken as the final loop unrolling factor;
if not, ending the loop iteration process, respectively calculating the differences between the loop unrolling factors obtained in each pair of adjacent loop iterations, obtaining the minimum of these differences, and taking either of the two loop unrolling factors corresponding to the minimum difference as the final loop unrolling factor;
then, the step S102 specifically includes:
performing loop unrolling on the intermediate representation according to the final loop unrolling factor to obtain the optimized intermediate representation.
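The iterative procedure described in this embodiment can be sketched as follows; `measure_factor` is a hypothetical callback standing in for one compile-and-run pass of the iterative compilation, not an API of the embodiment:

```python
def final_unroll_factor(measure_factor, N):
    """Run up to N loop iterations of factor calculation. Stop early
    as soon as two consecutive factors agree (K_i == K_{i-1});
    otherwise fall back to the pair of adjacent factors with the
    smallest difference and return either member of that pair."""
    ks = [measure_factor(1)]           # K_1
    for i in range(2, N + 1):
        ks.append(measure_factor(i))   # K_i
        if ks[-1] == ks[-2]:           # K_i == K_{i-1}: converged
            return ks[-1]
    # no convergence within N iterations: pick the adjacent pair
    # with the minimal difference and take one of its two factors
    diffs = [abs(a - b) for a, b in zip(ks, ks[1:])]
    j = diffs.index(min(diffs))
    return ks[j]
```

With N = 3 and measured factors 8, 4, 3, no pair of consecutive factors agrees, the adjacent differences are 4 and 1, and the procedure falls back to the pair (4, 3).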
In this embodiment, an iteration number threshold N is further set for the intermediate representation, and the loop unrolling factor calculated in each loop iteration can be further refined through the loop iteration process of the intermediate representation; for the specific calculation of the loop unrolling factor, reference may be made to the methods described in the above embodiments, which are not repeated here.
Illustratively, assuming i = 2, the intermediate representation is in its second loop iteration, and it is judged whether the loop unrolling factor K_2 calculated in the second loop iteration is the same as the loop unrolling factor K_1 calculated in the first loop iteration;
if K_2 and K_1 are the same, the whole loop iteration process of the intermediate representation ends, and either of K_2 and K_1 is taken as the final loop unrolling factor;
if K_2 and K_1 are not the same, it is judged whether the current loop iteration number i satisfies i < N;
assuming the iteration number threshold N is 3: since the loop iteration number i is less than the threshold N, a third loop iteration is performed on the intermediate representation, and it is judged whether the loop unrolling factor K_3 calculated in the third loop iteration is the same as the loop unrolling factor K_2 calculated in the second loop iteration;
if K_3 and K_2 are the same, the whole loop iteration process of the intermediate representation ends, and either of K_3 and K_2 is taken as the final loop unrolling factor;
if K_3 and K_2 are not the same, it is again judged whether the current loop iteration number satisfies i < N; since the current loop iteration number is now i + 1 and i + 1 = N = 3, the loop iteration process ends, the differences between K_2 and K_1 and between K_3 and K_2 are calculated respectively, the minimum of the two differences is obtained, and either of the two loop unrolling factors corresponding to the minimum difference is taken as the final loop unrolling factor.
After the final loop unrolling factor is obtained through the above loop iteration, loop unrolling is performed on the intermediate representation according to it, and the optimized intermediate representation is obtained.
In addition, it should be added that, in the first loop iteration, i.e., in the special case where i = 1, after the loop unrolling factor is calculated by the loop optimization module, it is recorded and transferred to the polyhedron optimization module. After the polyhedron optimization module receives the loop unrolling factor and acquires the information of the intermediate representation when executed by the back-end hardware device, it performs polyhedral modeling on the loop code in the intermediate representation according to this information to obtain a program transformation space, searches the space and performs program transformation, obtaining the transformed intermediate representation. The polyhedron optimization module then transfers the transformed intermediate representation to the loop optimization module, which performs the next round of loop iteration on it, that is, the next round of calculation and checking of the loop unrolling factor; with the loop iteration number now i + 1 = 2, the loop iteration continues according to the judging method described above until it ends and the final loop unrolling factor is obtained.
It should be noted that the iteration number threshold is a positive integer greater than or equal to 2; in practice its size may be set according to actual requirements or testing, and is not limited herein.
It should be noted that iterative compilation, as one of the adaptive optimization methods, refers to unrolling a loop with several different unrolling factors, running the unrolled code and collecting run-time information, and repeating these iterative operations to obtain a better loop unrolling factor. By continuously updating the loop unrolling factor with such an iterative-compilation-like method, a suitable loop unrolling factor can be selected by weighing the various factors, and loop optimization performed accordingly, so that the loop code in the program is optimized, data-level and instruction-level parallelism in the program is exploited, and program performance is significantly improved.
As an optional embodiment of the first aspect, after step S101, the method further includes:
acquiring dynamic SCOPs and static SCOPs based on the intermediate representation, converting the dynamic SCOPs into static SCOPs, and obtaining a first intermediate representation from the acquired static SCOPs and the converted static SCOPs; a dynamic SCOP is used to represent program code that can be expressed by a polyhedral model and carries parameter information that the back-end hardware device cannot identify when compiling the SCOP; a static SCOP is used to represent program code that can be expressed by a polyhedral model and does not carry parameter information that the back-end hardware device cannot identify when compiling the SCOP;
performing polyhedral optimization on the first intermediate representation to obtain a second intermediate representation;
then, the step S102 specifically includes:
performing loop unrolling on the second intermediate representation according to the loop unrolling factor to obtain the optimized intermediate representation.
In this embodiment, after the intermediate representation is obtained according to step S101, the types of the SCOPs contained in it need to be detected. The types of SCOP include dynamic SCOP and static SCOP; both represent program code that can be expressed by a polyhedral model, the dynamic SCOP carrying parameter information that the back-end hardware device cannot identify when compiling the SCOP, and the static SCOP carrying no such information. If no parameter information that the back-end hardware device cannot identify when compiling the SCOP is detected in the intermediate representation, the intermediate representation contains only static SCOPs and no dynamic SCOPs, and can be optimized directly. If, on the other hand, such parameter information is detected in the intermediate representation, for example pointer aliases and array parameters, the intermediate representation contains dynamic SCOPs. In this case, the detected dynamic SCOPs need to be converted into static SCOPs; once only static SCOPs remain in the intermediate representation, the first intermediate representation is obtained from the static SCOPs initially present in the intermediate representation and the converted static SCOPs.
Further, after the first intermediate representation is obtained, polyhedral optimization is performed on it to obtain the optimized intermediate representation, namely the second intermediate representation, and loop unrolling is performed on the second intermediate representation according to the loop unrolling factor calculated by the method described in the above embodiments, to obtain the optimized intermediate representation. It can be understood that a SCOP is a program segment that can be represented by a polyhedron. A static SCOP is a program segment that can be represented by the basic polyhedral model, can be analyzed and transformed at compile time, and can be detected directly by existing polyhedral tools. A dynamic SCOP is a program segment that can also be represented by the basic polyhedral model, but needs run-time information in order to be detected as a SCOP, and cannot be detected directly by a polyhedral tool.
It is worth noting that, for loops containing run-time parameter information, the run-time information is obtained through run-time optimization, and a more accurate loop unrolling factor is obtained according to the actual requirements of the program code, so that the efficiency of loop unrolling is better exploited.
As an optional embodiment of the first aspect, the converting the dynamic SCOP into a static SCOP includes:
acquiring the parameter information of the dynamic SCOP when the back-end hardware device runs;
and adjusting, according to the acquired parameter information, the parameter information carried by the dynamic SCOP that the back-end hardware device cannot identify when compiling the SCOP, to obtain the converted static SCOP.
In this embodiment, after a dynamic SCOP is detected in the intermediate representation, it needs to be converted into a static SCOP. Specifically, since part of the parameter information contained in the dynamic SCOP can be identified only at run time, only the parameter name can be used as a placeholder at the compilation stage. Therefore, after the parameter value corresponding to the parameter information is obtained at the stage when the back-end hardware device runs the SCOP, the now-known parameter value can be substituted into the source code corresponding to the intermediate representation. The dynamic SCOP in that source code can then be converted into a static SCOP using the known parameter values, so that the parameter information carried by the dynamic SCOP that the back-end hardware device cannot identify when compiling the SCOP is adjusted, and the converted static SCOP is obtained.
It should be noted that part of the loop code in a program contains parameter information that can only be determined at run time; with static detection by a polyhedral tool alone, such dynamic SCOPs cannot be detected. Since more information about the actual structure of a loop can be obtained at run time, and more loops can then be optimized, run-time optimization is needed in order to obtain more information, optimize more loops, and achieve better performance.
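A toy illustration of the dynamic-to-static conversion, assuming the only unknown parameter is a symbolic loop bound; the `dynamic_scop` dictionary, the `specialize` helper and the isl-style domain string are invented for this sketch and are not an API of the embodiment:

```python
# A "dynamic SCOP": the loop bound `n` is a symbolic parameter that
# the compiler cannot resolve, so the iteration domain is unknown
# at compile time.
dynamic_scop = {"domain": "{ S[i] : 0 <= i < n }", "params": ["n"]}

def specialize(scop, runtime_values):
    """Substitute the parameter values observed at run time into the
    SCOP description. Once no unresolved parameter remains, the SCOP
    is static and a polyhedral tool can analyze it directly."""
    domain = scop["domain"]
    remaining = []
    for p in scop["params"]:
        if p in runtime_values:
            # replace the symbolic name by the known runtime value
            domain = domain.replace(p, str(runtime_values[p]))
        else:
            remaining.append(p)
    return {"domain": domain, "params": remaining}

# once the back end observes n = 1024 at run time:
static_scop = specialize(dynamic_scop, {"n": 1024})
```

After specialization, `static_scop` has an empty parameter list and a fully numeric domain, which is exactly the condition under which the embodiment treats a SCOP as static.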
As an optional embodiment of the first aspect, performing polyhedral optimization on the first intermediate representation to obtain a second intermediate representation includes:
performing polyhedral modeling on the first intermediate representation to obtain the iteration domain, the memory access function and the affine scheduling corresponding to the first intermediate representation; wherein the iteration domain is used to characterize the multidimensional iteration space of each statement in the SCOP within a multi-layer loop; the memory access function is used to characterize the mapping relationship between the iteration domain and the arrays with affine subscripts; and the affine scheduling is used to characterize the mapping relationship between the iteration domain and the logical execution time of each statement in the SCOP;
and optimizing the first intermediate representation according to the iteration domain, the memory access function and the affine scheduling corresponding to the first intermediate representation, to obtain an intermediate representation of the deep learning model on which loop unrolling can be performed.
In this embodiment, the process of performing polyhedral optimization on the intermediate representation mainly includes loop analysis and detection, parameter reception, polyhedral modeling and program transformation. Specifically, the polyhedron optimization module first analyzes and detects the types of the SCOPs contained in the intermediate representation; after the first intermediate representation is obtained by the method of the above embodiment, it acquires the information of the intermediate representation when executed by the back-end hardware device and receives the loop unrolling factor transmitted by the loop optimization module, and performs polyhedral modeling on the loop code in the intermediate representation according to this information to obtain the iteration domain, the memory access function and the affine scheduling of each statement. After the polyhedral modeling is completed, program transformation is performed on the modeled intermediate representation to obtain a program transformation space, which is then searched and transformed; the intermediate representation after this program transformation is the second intermediate representation after polyhedral optimization.
It should be noted that, several common basic concepts in the polyhedral model are described herein, specifically as follows:
Affine function: if a function of one or more variables x_1, x_2, x_3, …, x_n can be expressed as a constant plus a sum of constants multiplied by these variables, i.e. c_0 + c_1·x_1 + c_2·x_2 + … + c_n·x_n, where c_0, c_1, c_2, …, c_n are all constants, then this function is affine.
Convex polyhedron: the iteration space of a loop nest is defined by the combination of the index variables of all loops in the nest. A loop nest of depth d can be modeled as a d-dimensional space. The dimensions of the space are ordered, the k-th dimension representing the k-th loop counted from the outermost loop of the nest. A point (x_1, x_2, x_3, …, x_d) in this space represents a value for every loop index: the value of the outermost loop index is x_1, the value of the second loop index is x_2, and so on, the value of the innermost loop index being x_d. Not every point in this space represents a combination of index values that actually occurs when the loop nest is executed. As affine functions of the indices of the outer loops, the upper and lower bounds of each loop define inequalities, each dividing the space into two halves: the part corresponding to loop iterations (the positive half-space) and the part not corresponding to iterations (the negative half-space). The conjunction (logical AND) of all the linear inequalities represents the intersection of the positive half-spaces, which defines a convex polyhedron called the iteration space of the loop nest. A convex polyhedron has the following property: if two points are within the polyhedron, then all points on the line segment between them are within the polyhedron. The polyhedron is described by the loop-bound inequalities; each iteration of the loop can be represented by an integer-coordinate point in the polyhedron, and each integer point within the polyhedron represents one iteration of the loop nest.
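Under the definition above, the integer points of a small iteration space can be enumerated directly; the triangular loop nest chosen here (i from 0 to 2, j from 0 to i) and the constraint encoding are illustrative examples, not part of the embodiment:

```python
from itertools import product

def iteration_space(constraints, box):
    """Enumerate the integer points x in the bounding `box` that
    satisfy every affine constraint a . x + c >= 0, i.e. the integer
    points of the convex polyhedron forming the iteration space.
    The enumeration order of `product` is ascending lexicographic,
    matching the original serial execution order of the loop nest."""
    ranges = [range(lo, hi + 1) for lo, hi in box]
    points = []
    for x in product(*ranges):
        if all(sum(a_k * x_k for a_k, x_k in zip(a, x)) + c >= 0
               for a, c in constraints):
            points.append(x)
    return points

# for (i = 0; i <= 2; i++) for (j = 0; j <= i; j++) S(i, j);
tri = iteration_space(
    [((1, 0), 0),    #  i     >= 0
     ((-1, 0), 2),   # -i + 2 >= 0  (i <= 2)
     ((0, 1), 0),    #  j     >= 0
     ((1, -1), 0)],  #  i - j >= 0  (j <= i)
    box=[(0, 2), (0, 2)])
```

The six surviving points are exactly the statement instances the triangular nest executes, illustrating that each integer point of the polyhedron corresponds to one iteration.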
Iteration domain: one execution of a statement is referred to as a statement instance; a statement outside a loop has only one instance, and a statement inside a loop has multiple instances. In a loop nest of depth d, the iteration domain is represented as:
D = { x ∈ Z^d | A·x + a ≥ 0 }
where Z represents the set of integers, A is an integer matrix with d columns, x is an integer vector of length d (the iteration vector), a is a constant integer vector, and 0 is a vector consisting of d zeros.
Memory access function (in the standard polyhedral-model notation): each array reference with affine subscripts is an affine function of the iteration vector,

    f(x) = F·x + f_0

where F is an integer matrix and f_0 an integer vector; the function maps each point of the iteration domain to the array subscripts accessed there.

Affine scheduling:

    θ(x) = Θ·x + t

an affine map that assigns to each point of the iteration domain a (possibly multidimensional) logical execution time.
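Both the memory access function and the affine schedule are affine maps of the iteration vector, so a single helper suffices to illustrate them (a sketch of ours; the helper name and the example reference A[i+1][j] are assumptions, not from the patent):

```python
def affine_map(M, c, x):
    """Apply the affine function M*x + c to the iteration vector x,
    where M is a matrix given as a list of rows and c is a constant vector."""
    return tuple(sum(m * xi for m, xi in zip(row, x)) + ci
                 for row, ci in zip(M, c))

# Access function of a reference A[i+1][j] in a 2-deep nest:
#   F = [[1, 0], [0, 1]], f_0 = (1, 0)
print(affine_map([[1, 0], [0, 1]], (1, 0), (2, 3)))  # (3, 3)

# Affine schedule theta(i, j) = (j, i), i.e. a loop interchange:
print(affine_map([[0, 1], [1, 0]], (0, 0), (2, 3)))  # (3, 2)
```

The same machinery thus covers both "which array cell does iteration x touch" and "at which logical time does iteration x execute".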
It will further be appreciated that the original program and its transformations must be analyzed in depth before any program transformation is performed, to ensure that the transformation does not change the semantics of the original program. In the polyhedral model, if all loads and stores go through arrays with affine subscripts, and those affine subscripts depend only on the outer-loop iteration variables and on global parameters, then they can be analyzed precisely.
Illustratively, the iteration domain is mapped to a multidimensional vector that records the logical execution time of each statement instance, and lexicographic order on these vectors is used to compare the execution order of statements. The affine schedule of a statement S is represented as:

    θ_S : D_S → Z^n

mapping each iteration vector x in the iteration domain D_S to its logical time vector θ_S(x).
A loop nest executes the iterations of its iteration space one by one in ascending lexicographic order. A vector (i_1, …, i_n) is lexicographically smaller than another vector (i'_1, …, i'_n'), written i < i', if and only if there exists m < min(n, n') such that i_1 = i'_1, …, i_m = i'_m and i_(m+1) < i'_(m+1); m may be equal to 0.
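The lexicographic comparison just defined can be written directly (a minimal sketch of ours, using 0-based indexing for the vector components):

```python
def lex_less(u, v):
    """True iff vector u is lexicographically smaller than vector v:
    there is an m < min(len(u), len(v)) with u[:m] == v[:m] and u[m] < v[m]
    (m may be 0, i.e. the vectors differ already in the first component)."""
    for m in range(min(len(u), len(v))):
        if u[m] != v[m]:
            return u[m] < v[m]
    return False

print(lex_less((0, 5), (1, 0)))  # True: they differ already at position 0
print(lex_less((1, 2), (1, 3)))  # True: equal prefix, then 2 < 3
print(lex_less((1, 3), (1, 3)))  # False: equal vectors
```

Sorting iteration vectors with this comparison reproduces the serial execution order of the original loop nest.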
The linear inequalities drawn from the upper and lower bounds of the loop bodies define the set of iterations as a convex polyhedron, and this representation assumes no execution order between the iterations in the iteration domain. The original program imposes a serial order on the iterations, namely the lexicographic order of the values of the loop index variables arranged from outermost to innermost. However, the iterations in this space may be executed in any order, as long as the data dependences between them are respected (i.e., the order of the write/read operations performed on any array element by different assignment statements in the loop nest is not changed). How to select an order that both respects the data dependences and optimizes data locality and parallelism is a complex problem. If no ordering information is provided, the statement instances may be executed in any order; but some statement instances are related to other statement instances for which execution in a given order is essential, and these therefore require additional information. When a dependence exists between two statement instances, a program transformation must not violate it, since doing so would cause the dependent statement instances to be executed in the wrong order or incorrectly in parallel. In order for the data dependences to be computable, all domain, scheduling and memory-access relations must be affine functions.
For example, if the scheduling function is θ_S(i, j) = (j, i), then the corresponding transformation is a loop interchange.
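The interchange can be checked concretely: applying the schedule θ_S(i, j) = (j, i) to a dependence-free nest executes exactly the same statement instances, only in a different order, so the semantics are preserved (a hand-written sketch of ours, not the patent's implementation):

```python
def original_order(n):
    """Record the statement instances S(i, j) in the original schedule (i, j)."""
    out = []
    for i in range(n):
        for j in range(n):
            out.append((i, j))          # statement instance S(i, j)
    return out

def interchanged_order(n):
    """Same instances executed under the schedule theta(i, j) = (j, i)."""
    out = []
    for j in range(n):                  # the j loop is now outermost
        for i in range(n):
            out.append((i, j))          # same instance set, new execution order
    return out

# Same set of statement instances, different execution order:
print(sorted(original_order(2)) == sorted(interchanged_order(2)))  # True
print(original_order(2) == interchanged_order(2))                  # False
```

For a nest with dependences, the interchange would additionally have to be validated against the dependence relation, as discussed above.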
Multidimensional affine schedules are very useful for rescheduling the original program. The idea is to build an abstract syntax tree of the program and read in the schedule of each statement. These schedules depend on the iteration variables and assign an execution order to every instance of every statement; new code can then be regenerated from such affine schedules.
It is worth noting that, compared with applying loop transformations or iterative compilation separately, this method combines the polyhedral model with iterative compilation, mines the optimal sequence of loop transformations together with the optimal transformation parameters, and performs code transformation and parameter search automatically, so a better performance optimization effect can be obtained.
In specific applications, taking image processing as an example, in a second aspect, an embodiment of the present invention provides a data processing method, please refer to fig. 2, which is a flowchart illustrating an embodiment of the data processing method according to the second aspect of the present invention, where the data processing method includes steps S201 to S202:
S201, acquiring a target code and an image to be processed;
S202, processing the image to be processed according to the target code to obtain an image processing result; wherein the object code is obtained by the data processing method according to any of the embodiments of the first aspect.
Specifically, the present embodiment may be executed by a graphics processing unit (GPU): after obtaining the object code and the image to be processed, the processor processes the image to be processed according to the object code and accordingly obtains the image processing result. The object code can be obtained through optimization by the data processing method described in any of the above embodiments, and the image processing may include, but is not limited to, any one of convolution processing, classification processing and grayscale processing. Because the intermediate representation is optimized before being compiled into the object code, the running efficiency of the object code is improved, and the image processing efficiency is improved accordingly.
It can be understood that the latest wave of artificial intelligence, represented by machine learning and deep learning, has been rising continuously for years; deep learning has been a research focus in the field of artificial intelligence in recent years and has achieved breakthrough progress in many fields. Today there are many deep learning frameworks and many hardware platforms that support them. This diversity of frameworks and hardware brings great benefits to users and is critical to the healthy development of the artificial intelligence ecosystem, but supporting multiple frameworks and multiple hardware targets requires an enormous workload, which poses a significant challenge to artificial intelligence developers. Deep learning now has many different front-ends and many different back-ends, and a bridge is required to efficiently implement optimization and mapping between them. As deep learning is applied more and more widely, increasing attention is paid to how efficiently deep learning algorithms can be trained and run for inference on different hardware architectures. Competition over intermediate representations will be an important link in future frameworks.
It is worth noting that, in the prior art, loops occupy most of the running time of many applications; applications such as wireless communication and video image processing in particular involve a large number of loops. Therefore, optimizing the loop code in a program can further improve the program's performance.
It is noted that, while for simplicity of explanation the foregoing method embodiments have been described as a series or combination of acts, those skilled in the art will appreciate that the present disclosure is not limited by the order of the acts described, as in accordance with the present disclosure some steps may occur in other orders or concurrently. Further, those skilled in the art will also appreciate that the embodiments described in the specification are exemplary embodiments, and the acts and modules referred to are not necessarily required by the disclosure.
It should be further noted that, taking fig. 1 as an example, although the steps in the flowchart of fig. 1 are shown in a sequence indicated by the arrows, they are not necessarily executed in that sequence. Unless explicitly stated otherwise, the steps need not be performed in the exact order shown and may be performed in other orders. Moreover, at least some of the steps in fig. 1 may comprise multiple sub-steps or stages, which need not be performed at the same moment but may be performed at different moments, and whose order of performance need not be sequential: they may be performed in turn or in alternation with other steps or with at least some of the sub-steps or stages of other steps.
Correspondingly, in a third aspect, an embodiment of the present invention further provides a data processing apparatus, which is capable of implementing all the flows of the data processing method in the foregoing embodiment.
Referring to fig. 3, which is a schematic structural diagram of an embodiment of a data processing apparatus according to a third aspect of the present invention, the data processing apparatus includes a receiving unit 301 and a processing unit 302:
the receiving unit 301 is configured to obtain an intermediate representation of a deep learning model; the deep learning model is a model constructed based on a front-end model framework; setting a cyclic expansion factor for the intermediate representation; the loop unroll factor relates to information of the intermediate representation when executed by a back-end hardware device and/or device information of the back-end hardware device;
the processing unit 302 is configured to perform cyclic expansion on the intermediate representation according to the cyclic expansion factor to obtain an optimized intermediate representation;
the processing unit 302 is further configured to compile the optimized intermediate representation to obtain an object code executable by the back-end hardware device, so that the back-end hardware device executes the object code to implement a function of the object code.
It is worth noting that, compared with the prior art, the data processing apparatus provided in this embodiment of the present invention can calculate the loop unrolling factor from information about execution of the intermediate representation on the back-end hardware device and/or from the device information of the back-end hardware device, thereby obtaining a more accurate loop unrolling factor. Performing loop unrolling on the intermediate representation with this factor allows instruction scheduling over a wider range and better exploits instruction-level parallelism, which further optimizes the loop code in the intermediate representation, improves the compilation optimization effect for the intermediate representation, realizes the universality of a front-end-language-independent intermediate representation across various back-end hardware devices, and improves the portability of the intermediate representation.
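As a concrete illustration of loop unrolling itself (a hand-written sketch of ours, not the patent's code generator): unrolling by a factor of 4 replicates the loop body four times per iteration of the main loop, and an epilogue loop handles the leftover iterations when the trip count is not a multiple of the factor.

```python
def sum_unrolled4(a):
    """Sum a list with the loop body unrolled by a factor of 4; an
    epilogue loop handles the n % 4 leftover iterations."""
    s, n, i = 0, len(a), 0
    while i + 4 <= n:            # main loop: four bodies per iteration
        s += a[i] + a[i + 1] + a[i + 2] + a[i + 3]
        i += 4
    while i < n:                 # epilogue loop for the remainder
        s += a[i]
        i += 1
    return s

print(sum_unrolled4(list(range(10))))  # 45, same as sum(range(10))
```

The unrolled main loop exposes four independent additions per iteration to the instruction scheduler, which is exactly the wider scheduling window the paragraph above refers to.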
As an optional embodiment of the third aspect, the information of the intermediate representation when executed by the back-end hardware device includes at least one of M loop structures in the intermediate representation, and respective numbers of iterations corresponding to the M loop structures; m is a positive integer greater than 1.
As an optional embodiment of the third aspect, the device information of the back-end hardware device includes at least one of a register parameter, a code volume parameter, and a functional unit parameter;
the register parameter is used for indicating the number of registers contained in the back-end hardware equipment; the code volume parameter is used for indicating the size of a program code volume which can be run by the back-end hardware equipment; the functional component parameter is used for indicating the number of functional components contained in the back-end hardware equipment; the functional components are used to characterize components that perform arithmetic functions.
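One plausible way to combine these device parameters into an unrolling factor is to bound the factor by each resource and take the minimum (a hypothetical heuristic of ours — the patent does not give this exact formula; all names and the power-of-two rounding are assumptions):

```python
def unroll_factor(trip_count, regs_free, regs_per_body,
                  code_budget, code_per_body, functional_units):
    """Candidate loop-unrolling factor limited by register pressure,
    code-size budget and the number of functional units."""
    by_regs = max(1, regs_free // regs_per_body)    # register parameter
    by_code = max(1, code_budget // code_per_body)  # code volume parameter
    k = min(trip_count, by_regs, by_code, functional_units)
    while k & (k - 1):       # round down to a power of two
        k &= k - 1           # so the epilogue loop stays short
    return max(1, k)

print(unroll_factor(trip_count=100, regs_free=24, regs_per_body=3,
                    code_budget=256, code_per_body=32,
                    functional_units=4))  # 4: limited by the functional units
```

With these numbers the registers would allow a factor of 8 and the code budget a factor of 8, but only 4 functional units are available, so 4 is chosen.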
As an alternative embodiment of the third aspect, after the receiving unit obtains the intermediate representation of the deep learning model, the processing unit 302 is further configured to:
setting an iteration count threshold N for the intermediate representation, and judging whether the loop unrolling factor K_i calculated in the i-th loop iteration over the intermediate representation is the same as the loop unrolling factor K_(i-1) calculated in the (i-1)-th loop iteration; wherein i ≥ 2 and N ≥ 2;
if K_i and K_(i-1) are the same, ending the loop iteration process and taking K_i (= K_(i-1)) as the final loop unrolling factor;
if K_i and K_(i-1) are different, judging whether i < N holds;
if so, performing the (i+1)-th loop iteration on the intermediate representation, and judging whether the loop unrolling factor K_(i+1) calculated in the (i+1)-th loop iteration is the same as the loop unrolling factor K_i calculated in the i-th loop iteration, and so on, until the first loop unrolling factor calculated in the current loop iteration is the same as the second loop unrolling factor calculated in the previous loop iteration, whereupon the loop iteration process ends and either of the first and second loop unrolling factors is taken as the final loop unrolling factor;
if not, ending the loop iteration process, respectively calculating the differences between the loop unrolling factors obtained in every two adjacent loop iterations, finding the minimum of these differences, and taking either of the two loop unrolling factors corresponding to the minimum difference as the final loop unrolling factor;
the processing unit 302 then performs loop unrolling on the intermediate representation according to the unrolling factor to obtain the optimized intermediate representation, specifically:
performing loop unrolling on the intermediate representation according to the final loop unrolling factor to obtain the optimized intermediate representation.
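The convergence procedure described above can be sketched as follows (a sketch of ours; `compute_factor` is a stand-in for the real per-iteration factor calculation, which the patent leaves to the device-information heuristics):

```python
def final_unroll_factor(compute_factor, n_max):
    """Iterate the factor calculation until two consecutive loop
    iterations yield the same factor; if that never happens within
    n_max iterations, return a factor from the adjacent pair whose
    difference is smallest."""
    ks = [compute_factor(1)]
    for i in range(2, n_max + 1):
        ks.append(compute_factor(i))
        if ks[-1] == ks[-2]:               # K_i == K_(i-1): converged
            return ks[-1]
    # no convergence within n_max iterations: pick the adjacent pair
    # with the minimum difference and take either factor of that pair
    diffs = [abs(a - b) for a, b in zip(ks, ks[1:])]
    m = diffs.index(min(diffs))
    return ks[m]

print(final_unroll_factor(lambda i: [8, 4, 4][i - 1], 3))  # 4 (converged)
print(final_unroll_factor(lambda i: [8, 4, 2][i - 1], 3))  # 4 (pair (4, 2) has the minimum difference)
```

The first call converges because iterations 2 and 3 both produce 4; the second never converges, so the pair (4, 2) with difference 2 wins over (8, 4) with difference 4.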
As an alternative embodiment of the third aspect, after the receiving unit obtains the intermediate representation of the deep learning model, the processing unit 302 is further configured to:
acquiring a dynamic SCOP and a static SCOP based on the intermediate representation, converting the dynamic SCOP into the static SCOP, and obtaining a first intermediate representation according to the acquired static SCOP and the converted static SCOP; the dynamic SCOP is used for representing a program code represented by a polyhedral model, and carries parameter information which cannot be identified when the back-end hardware equipment compiles the SCOP; the static SCOP is used for representing a program code represented by a polyhedral model, and the static SCOP does not carry parameter information which cannot be identified by the back-end hardware equipment when the SCOP is compiled;
performing polyhedral optimization on the first intermediate representation to obtain a second intermediate representation;
then, the processing unit 302 performs cyclic expansion on the intermediate representation according to the cyclic expansion factor to obtain an optimized intermediate representation, specifically:
and circularly expanding the second intermediate representation according to the circular expansion factor to obtain an optimized intermediate representation.
As an optional embodiment of the third aspect, the processing unit 302 converts the dynamic SCOP into a static SCOP, including:
acquiring parameter information of the dynamic SCOP when the back-end hardware equipment runs;
and adjusting the parameter information which cannot be identified by the back-end hardware equipment carried by the dynamic SCOP when the SCOP is compiled according to the acquired parameter information to obtain the converted static SCOP.
As an optional embodiment of the third aspect, the performing, by the processing unit 302, polyhedral optimization on the first intermediate representation to obtain a second intermediate representation includes:
performing polyhedral modeling on the first intermediate representation to obtain an iteration domain, a memory access function and an affine scheduling corresponding to the first intermediate representation; wherein, the iteration domain is used for characterizing a multidimensional iteration space domain of each statement in the SCOP in a multi-layer loop; the memory access function is used for representing the mapping relation between the iteration domain and the array with the affine subscript; the affine scheduling is used for representing a mapping relation between the iteration domain and the logic execution time of each statement in the SCOP;
and optimizing the first intermediate representation according to the iteration domain, the memory access function and the affine scheduling corresponding to the first intermediate representation to obtain the intermediate representation of the deep learning model which can be circularly expanded.
In addition, it should be noted that specific implementation schemes and advantageous effects of the embodiments of the data processing apparatus provided in the third aspect of the embodiment of the present invention are the same as those of the embodiments of the data processing method provided in the first aspect of the embodiment of the present invention, and are not described herein again.
In a specific application, the embodiment takes image processing as an example, and in a fourth aspect, the embodiment of the present invention provides a data processing apparatus, please refer to fig. 4, which is a schematic structural diagram of an embodiment of a data processing apparatus according to the fourth aspect of the present invention. The data processing apparatus includes a receiving unit 401 and a processing unit 402;
the receiving unit 401 is configured to obtain a target code and an image to be processed;
the processing unit 402 is configured to process the image to be processed according to the target code to obtain an image processing result; wherein the object code is obtained by the data processing method according to any of the embodiments of the first aspect.
Specifically, the present embodiment may be executed by a graphics processing unit (GPU): after obtaining the object code and the image to be processed, the processor processes the image to be processed according to the object code and accordingly obtains the image processing result. The object code can be obtained through optimization by the data processing method described in any of the above embodiments, and the image processing may include, but is not limited to, any one of convolution processing, classification processing and grayscale processing. Because the intermediate representation is optimized before being compiled into the object code, the running efficiency of the object code is improved, and the image processing efficiency is improved accordingly.
It can be understood that the latest wave of artificial intelligence, represented by machine learning and deep learning, has been rising continuously for years; deep learning has been a research focus in the field of artificial intelligence in recent years and has achieved breakthrough progress in many fields. Today there are many deep learning frameworks and many hardware platforms that support them. This diversity of frameworks and hardware brings great benefits to users and is critical to the healthy development of the artificial intelligence ecosystem, but supporting multiple frameworks and multiple hardware targets requires an enormous workload, which poses a significant challenge to artificial intelligence developers. Deep learning now has many different front-ends and many different back-ends, and a bridge is required to efficiently implement optimization and mapping between them. As deep learning is applied more and more widely, increasing attention is paid to how efficiently deep learning algorithms can be trained and run for inference on different hardware architectures. Competition over intermediate representations will be an important link in future frameworks.
It is worth noting that, in the prior art, loops occupy most of the running time of many applications; applications such as wireless communication and video image processing in particular involve a large number of loops. Therefore, optimizing the loop code in a program can further improve the program's performance.
In addition, it should be noted that specific implementation schemes and advantageous effects of the embodiments of the data processing apparatus provided in the fourth aspect of the embodiment of the present invention are the same as those of the embodiments of the data processing method provided in the second aspect of the embodiment of the present invention, and are not described herein again.
In a fifth aspect, an embodiment of the present invention provides a data processing apparatus, please refer to fig. 5, which is a schematic structural diagram of an embodiment of the data processing apparatus according to the fifth aspect of the present invention. The data processing apparatus comprises a memory 501 and a processor 502, wherein the memory 501 is used for storing programs, the processor 502 is used for executing the programs stored in the memory 501, and when the programs stored in the memory 501 are executed, the processor 502 is used for executing the data processing method according to any one of the embodiments of the first aspect or the data processing method according to the embodiment of the second aspect.
As an alternative embodiment, the computer program may be divided into one or more modules/units (e.g. computer program 1, computer program 2, … …) which are stored in the memory 501 and executed by the processor 502 to implement the invention. The one or more modules/units may be a series of computer program instruction segments capable of performing certain functions, which are used to describe the execution of the computer program in the data processing device.
The processor 502 may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. The general-purpose processor may be a microprocessor, or the processor 502 may be any conventional processor. The processor 502 is the control center of the data processing device and connects the various parts of the device through various interfaces and lines.
The memory 501 mainly includes a program storage area and a data storage area, where the program storage area may store an operating system, an application program required by at least one function, and the like, and the data storage area may store related data and the like. In addition, the memory 501 may be a high-speed random access memory, or a non-volatile memory such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a Flash Card, or the memory 501 may be another non-volatile solid-state storage device.
It should be noted that the data processing device may include, but is not limited to, a processor and a memory, and those skilled in the art will understand that the structural diagram of fig. 5 is only an example of the data processing device, and does not constitute a limitation of the data processing device, and may include more or less components than those shown, or combine some components, or different components.
In a sixth aspect, an embodiment of the present invention provides a computer-readable storage medium, where the computer-readable storage medium includes a stored computer program, and when the computer program runs, the apparatus on which the computer-readable storage medium is located is controlled to execute the data processing method according to any one of the foregoing first aspect embodiments or the data processing method according to the foregoing second aspect embodiments.
Based on this understanding, the constituent modules of the above-described apparatus, if implemented in the form of software functional units and sold or used as independent products, may be stored in a computer-readable storage medium. In essence, part or all of the technical solution of the present application may be embodied in the form of a software product, and this computer product is stored in the computer-readable storage medium.
The computer readable storage medium may be an internal storage unit of the device of the foregoing embodiment, such as a hard disk or a memory. The computer readable storage medium may be an external storage device of the above-described apparatus, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like. Further, the computer-readable storage medium may include both an internal storage unit and an external storage device of the device. The computer-readable storage medium is used for storing the computer program and other programs and data required by the apparatus. The above-described computer-readable storage medium may also be used to temporarily store data that has been output or is to be output.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments may be implemented by a computer program, which can be stored in a computer-readable storage medium, and can include the processes of the above embodiments of the methods when the computer program is executed. And the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, a division of a unit is merely a logical division, and an actual implementation may have another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
Units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
The integrated unit implemented in the form of a software functional unit may be stored in a computer readable storage medium. The software functional unit is stored in a storage medium and includes several instructions to enable a computer device (which may be a personal computer, a server, or a network device) or a processor (processor) to execute some steps of the methods according to the embodiments of the present application. And the aforementioned storage medium includes: a U disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
While the foregoing is directed to embodiments of the present invention, it will be appreciated by those skilled in the art that various changes may be made in the embodiments without departing from the principles of the invention, and that such changes and modifications are intended to be included within the scope of the invention.

Claims (18)

1. A data processing method, comprising:
acquiring intermediate representation of a deep learning model; the deep learning model is a model constructed based on a front-end model framework; setting a cyclic expansion factor for the intermediate representation; the loop unroll factor relates to information of the intermediate representation when executed by a back-end hardware device and/or device information of the back-end hardware device;
circularly spreading the intermediate representation according to the circular spreading factor to obtain an optimized intermediate representation;
compiling the optimized intermediate representation to obtain an executable object code of the back-end hardware equipment, so that the back-end hardware equipment executes the object code to realize the function of the object code.
2. The data processing method of claim 1, wherein the information of the intermediate representation when executed by the back-end hardware device comprises at least one of M loop structures in the intermediate representation, and respective numbers of iterations corresponding to the M loop structures; m is a positive integer greater than 1.
3. The data processing method of claim 1, wherein the device information of the back-end hardware device comprises at least one of a register parameter, a code volume parameter, and a functional component parameter;
the register parameter is used for indicating the number of registers contained in the back-end hardware equipment; the code volume parameter is used for indicating the size of a program code volume which can be run by the back-end hardware equipment; the functional component parameter is used for indicating the number of functional components contained in the back-end hardware equipment; the functional components are used to characterize components that perform arithmetic functions.
4. The data processing method of any of claims 1-3, wherein after the obtaining the intermediate representation of the deep learning model, the method further comprises:
setting an iteration count threshold N for the intermediate representation, and judging whether a loop unrolling factor K_i calculated in an i-th loop iteration over the intermediate representation is the same as a loop unrolling factor K_(i-1) calculated in an (i-1)-th loop iteration; wherein i ≥ 2 and N ≥ 2;
if K_i and K_(i-1) are the same, ending the loop iteration process and taking K_i (= K_(i-1)) as a final loop unrolling factor;
if K_i and K_(i-1) are different, judging whether i < N holds;
if so, performing an (i+1)-th loop iteration on the intermediate representation, and judging whether a loop unrolling factor K_(i+1) calculated in the (i+1)-th loop iteration is the same as the loop unrolling factor K_i calculated in the i-th loop iteration, and so on, until a first loop unrolling factor calculated in a current loop iteration is the same as a second loop unrolling factor calculated in a previous loop iteration, whereupon the loop iteration process is ended and either of the first and second loop unrolling factors is taken as the final loop unrolling factor;
if not, ending the loop iteration process, respectively calculating the differences between the loop unrolling factors obtained in every two adjacent loop iterations, obtaining the minimum of these differences, and taking either of the two loop unrolling factors corresponding to the minimum difference as the final loop unrolling factor;
then, circularly expanding the intermediate representation according to the circular expansion factor to obtain an optimized intermediate representation, specifically:
and circularly spreading the intermediate representation according to the final circular spreading factor to obtain the optimized intermediate representation.
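The convergence procedure in claim 4 is a fixed-point search with a fallback. A minimal Python sketch, assuming a caller-supplied `compute_factor(i)` (hypothetical; the claim does not name this function) that recomputes the factor in iteration i:

```python
def final_unroll_factor(compute_factor, max_iters):
    """Iterate until two consecutive factors agree (K_i == K_{i-1}), or,
    after max_iters (the threshold N) tries without convergence, fall back
    to either member of the adjacent pair with the smallest difference."""
    factors = [compute_factor(1)]
    for i in range(2, max_iters + 1):
        k = compute_factor(i)
        if k == factors[-1]:                 # converged: K_i == K_{i-1}
            return k
        factors.append(k)
    # no convergence within N iterations: minimise |K_i - K_{i-1}|
    diffs = [abs(a - b) for a, b in zip(factors, factors[1:])]
    j = diffs.index(min(diffs))
    return factors[j]                        # either factor of the closest pair

# toy converging sequence 8, 4, 4 -> settles on 4
print(final_unroll_factor(lambda i: [0, 8, 4, 4][i], 3))
```

A non-converging sequence such as 9, 5, 7, 6 has adjacent differences 4, 2, 1, so the sketch returns a factor from the (7, 6) pair.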
5. The data processing method of claim 1, wherein after obtaining the intermediate representation of the deep learning model, the method further comprises:
acquiring a dynamic SCOP and a static SCOP based on the intermediate representation, converting the dynamic SCOP into a static SCOP, and obtaining a first intermediate representation from the acquired static SCOP and the converted static SCOP; wherein the dynamic SCOP represents program code expressed by a polyhedral model and carries parameter information that the back-end hardware device cannot identify when compiling the SCOP; the static SCOP represents program code expressed by a polyhedral model and carries no such unidentifiable parameter information;
performing polyhedral optimization on the first intermediate representation to obtain a second intermediate representation;
wherein the unrolling of the intermediate representation according to the loop unrolling factor to obtain an optimized intermediate representation is specifically:
unrolling the second intermediate representation according to the loop unrolling factor to obtain an optimized intermediate representation.
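Unrolling the intermediate representation by a factor K typically replicates the loop body K times and adds a remainder loop for leftover iterations. A minimal sketch of that shape on a plain Python summation (the function and data are illustrative, not the patent's IR):

```python
def unrolled_sum(xs, k):
    """Sum a list with the body unrolled k times plus a remainder loop --
    the classic structure loop unrolling produces in the optimized IR."""
    total, i, n = 0, 0, len(xs)
    while i + k <= n:
        for u in range(k):      # in generated code these k additions are
            total += xs[i + u]  # emitted as straight-line statements
        i += k
    while i < n:                # remainder when n is not a multiple of k
        total += xs[i]
        i += 1
    return total

print(unrolled_sum(list(range(10)), 4))   # same result as sum(range(10))
```

The transformation preserves semantics while cutting the number of branch tests roughly by a factor of K.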
6. The data processing method of claim 5, wherein the converting the dynamic SCOP into a static SCOP comprises:
acquiring parameter information of the dynamic SCOP while the back-end hardware device runs;
and adjusting, according to the acquired parameter information, the parameter information carried by the dynamic SCOP that cannot be identified when the SCOP is compiled, to obtain the converted static SCOP.
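The conversion in claim 6 amounts to substituting runtime-observed values for the symbolic parameters a dynamic SCOP carries. A minimal Python sketch, in which the dict-of-bounds representation is an illustrative assumption rather than the patent's data structure:

```python
def to_static_scop(dynamic_scop, runtime_params):
    """Replace every symbolic parameter in a dynamic SCoP's loop bounds
    with the concrete value observed at run time, yielding a static SCoP
    whose bounds the back end can compile directly.

    A SCoP is modelled here as {loop_var: (lower, upper)}, where a bound
    is either an int or the name of a runtime parameter.
    """
    resolve = lambda b: runtime_params[b] if isinstance(b, str) else b
    return {var: (resolve(lo), resolve(hi))
            for var, (lo, hi) in dynamic_scop.items()}

dyn = {"i": (0, "n"), "j": (0, 16)}        # upper bound of i unknown at compile time
print(to_static_scop(dyn, {"n": 1024}))    # -> {'i': (0, 1024), 'j': (0, 16)}
```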
7. The data processing method of claim 5, wherein the performing polyhedral optimization on the first intermediate representation to obtain a second intermediate representation comprises:
performing polyhedral modeling on the first intermediate representation to obtain an iteration domain, an access function and an affine schedule corresponding to the first intermediate representation; wherein the iteration domain characterizes the multidimensional iteration space of each statement in the SCOP within a multi-level loop; the access function characterizes the mapping between the iteration domain and arrays with affine subscripts; the affine schedule characterizes the mapping between the iteration domain and the logical execution time of each statement in the SCOP;
and optimizing the first intermediate representation according to the iteration domain, the access function and the affine schedule corresponding to the first intermediate representation, to obtain an intermediate representation of the deep learning model that can be loop-unrolled.
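The three polyhedral objects named in claim 7 can be written out concretely for a tiny two-level nest. The sketch below is a didactic enumeration (real polyhedral tools keep these as symbolic integer sets and affine maps, not Python lists):

```python
# For the two-level nest
#   for i in range(N):
#       for j in range(M):
#           S: A[i][j] = A[i][j] + 1
N, M = 4, 3

# Iteration domain: every (i, j) point at which statement S executes.
domain = [(i, j) for i in range(N) for j in range(M)]

# Access function: affine map from an iteration point to the array cell it touches.
access = lambda i, j: ("A", i, j)

# Affine schedule: maps each iteration point to a logical execution time;
# this lexicographic order reproduces the original loop order.
schedule = lambda i, j: i * M + j

assert len(domain) == N * M
assert access(2, 1) == ("A", 2, 1)
assert sorted(domain, key=lambda p: schedule(*p)) == domain
print("polyhedral triple consistent")
```

Optimizations such as interchange or tiling then amount to choosing a different affine `schedule` over the same `domain` while respecting the dependences implied by `access`.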
8. A data processing method, comprising:
acquiring object code and an image to be processed;
and processing the image to be processed according to the object code to obtain an image processing result; wherein the object code is obtained by the data processing method of any one of claims 1-7.
9. A data processing apparatus, characterized by comprising a receiving unit and a processing unit:
the receiving unit is configured to acquire an intermediate representation of a deep learning model, the deep learning model being a model constructed on a front-end model framework, and to set a loop unrolling factor for the intermediate representation; the loop unrolling factor relates to information of the intermediate representation when executed by a back-end hardware device and/or device information of the back-end hardware device;
the processing unit is configured to unroll the intermediate representation according to the loop unrolling factor to obtain an optimized intermediate representation;
the processing unit is further configured to compile the optimized intermediate representation into object code executable by the back-end hardware device, so that the back-end hardware device executes the object code to implement the function of the object code.
10. The data processing apparatus of claim 9, wherein the information of the intermediate representation when executed by the back-end hardware device comprises at least one of: M loop structures in the intermediate representation, and the respective iteration counts corresponding to the M loop structures; M is a positive integer greater than 1.
11. The data processing apparatus of claim 9, wherein the device information of the back-end hardware device comprises at least one of a register parameter, a code size parameter, and a functional unit parameter;
the register parameter is used for indicating the number of registers contained in the back-end hardware device; the code size parameter is used for indicating the size of the program code that the back-end hardware device can run; the functional unit parameter is used for indicating the number of functional units contained in the back-end hardware device; the functional units are components that perform arithmetic functions.
12. The data processing apparatus of any of claims 9-11, wherein after the receiving unit obtains the intermediate representation of the deep learning model, the processing unit is further configured to:
set an iteration count threshold N for the intermediate representation, and judge whether the loop unrolling factor K_i calculated in the i-th loop iteration over the intermediate representation is the same as the loop unrolling factor K_{i-1} calculated in the (i-1)-th loop iteration; wherein i ≥ 2 and N ≥ 2;
if K_i and K_{i-1} are the same, end the loop iteration process and take K_i (= K_{i-1}) as the final loop unrolling factor;
if K_i and K_{i-1} are different, judge whether i < N is satisfied;
if so, perform the (i+1)-th loop iteration on the intermediate representation and judge whether the loop unrolling factor K_{i+1} calculated in the (i+1)-th loop iteration is the same as the loop unrolling factor K_i calculated in the i-th loop iteration, ending the loop iteration process when the first loop unrolling factor calculated in the current loop iteration is the same as the second loop unrolling factor calculated in the previous loop iteration, and taking either of the two as the final loop unrolling factor;
if not, end the loop iteration process, calculate the differences between the loop unrolling factors obtained in every two adjacent loop iterations, find the minimum of these differences, and take either of the two loop unrolling factors corresponding to the minimum difference as the final loop unrolling factor;
wherein the processing unit unrolling the intermediate representation according to the loop unrolling factor to obtain an optimized intermediate representation is specifically:
unrolling the intermediate representation according to the final loop unrolling factor to obtain the optimized intermediate representation.
13. The data processing apparatus of claim 9, wherein after the receiving unit obtains the intermediate representation of the deep learning model, the processing unit is further configured to:
acquire a dynamic SCOP and a static SCOP based on the intermediate representation, convert the dynamic SCOP into a static SCOP, and obtain a first intermediate representation from the acquired static SCOP and the converted static SCOP; wherein the dynamic SCOP represents program code expressed by a polyhedral model and carries parameter information that the back-end hardware device cannot identify when compiling the SCOP; the static SCOP represents program code expressed by a polyhedral model and carries no such unidentifiable parameter information;
perform polyhedral optimization on the first intermediate representation to obtain a second intermediate representation;
wherein the processing unit unrolling the intermediate representation according to the loop unrolling factor to obtain an optimized intermediate representation is specifically:
unrolling the second intermediate representation according to the loop unrolling factor to obtain an optimized intermediate representation.
14. The data processing apparatus of claim 13, wherein the processing unit converting the dynamic SCOP into a static SCOP comprises:
acquiring parameter information of the dynamic SCOP while the back-end hardware device runs;
and adjusting, according to the acquired parameter information, the parameter information carried by the dynamic SCOP that cannot be identified when the SCOP is compiled, to obtain the converted static SCOP.
15. The data processing apparatus of claim 13, wherein the processing unit performing polyhedral optimization on the first intermediate representation to obtain a second intermediate representation comprises:
performing polyhedral modeling on the first intermediate representation to obtain an iteration domain, an access function and an affine schedule corresponding to the first intermediate representation; wherein the iteration domain characterizes the multidimensional iteration space of each statement in the SCOP within a multi-level loop; the access function characterizes the mapping between the iteration domain and arrays with affine subscripts; the affine schedule characterizes the mapping between the iteration domain and the logical execution time of each statement in the SCOP;
and optimizing the first intermediate representation according to the iteration domain, the access function and the affine schedule corresponding to the first intermediate representation, to obtain an intermediate representation of the deep learning model that can be loop-unrolled.
16. A data processing apparatus, characterized by comprising a receiving unit and a processing unit:
the receiving unit is configured to acquire object code and an image to be processed;
the processing unit is configured to process the image to be processed according to the object code to obtain an image processing result; wherein the object code is obtained by the data processing method of any one of claims 1-7.
17. A data processing device, characterized by comprising: a memory for storing a program, and a processor for executing the program stored in the memory; when the program stored in the memory runs, the processor performs the data processing method of any one of claims 1-7 or the data processing method of claim 8.
18. A computer-readable storage medium, comprising a stored computer program, wherein the computer program, when executed, controls an apparatus in which the computer-readable storage medium is located to perform a data processing method according to any one of claims 1-7 or a data processing method according to claim 8.
CN202111554513.3A 2021-12-17 2021-12-17 Data processing method, device and equipment and computer storage medium Pending CN114385180A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111554513.3A CN114385180A (en) 2021-12-17 2021-12-17 Data processing method, device and equipment and computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111554513.3A CN114385180A (en) 2021-12-17 2021-12-17 Data processing method, device and equipment and computer storage medium

Publications (1)

Publication Number Publication Date
CN114385180A true CN114385180A (en) 2022-04-22

Family

ID=81197635

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111554513.3A Pending CN114385180A (en) 2021-12-17 2021-12-17 Data processing method, device and equipment and computer storage medium

Country Status (1)

Country Link
CN (1) CN114385180A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115469931A (en) * 2022-11-02 2022-12-13 北京燧原智能科技有限公司 Instruction optimization method, device, system, equipment and medium of loop program


Similar Documents

Publication Publication Date Title
Zheng et al. Flextensor: An automatic schedule exploration and optimization framework for tensor computation on heterogeneous system
US11803404B2 (en) Deep learning algorithm compiling method, device, and related product
Kennedy et al. Automatic data layout for distributed-memory machines
Ansel et al. Petabricks: A language and compiler for algorithmic choice
Kim et al. Efficient SIMD code generation for irregular kernels
US9823911B2 (en) Method and apparatus for compiling code based on a dependency tree
Zerrell et al. Stripe: Tensor compilation via the nested polyhedral model
US9256437B2 (en) Code generation method, and information processing apparatus
CN113283613B (en) Deep learning model generation method, optimization method, device, equipment and medium
KR102013582B1 (en) Apparatus and method for detecting error and determining corresponding position in source code of mixed mode application program source code thereof
WO2021000971A1 (en) Method and device for generating operation data and related product
CN107851002A (en) A kind of code compiling method and code encoder
Falk et al. Source code optimization techniques for data flow dominated embedded software
CN114385180A (en) Data processing method, device and equipment and computer storage medium
Yang et al. Auto-tuning fixed-point precision with TVM on RISC-V packed SIMD extension
CN114398080A (en) Data processing method, device and equipment and computer storage medium
Di Martino et al. Two program comprehension tools for automatic parallelization
US11262989B2 (en) Automatic generation of efficient vector code with low overhead in a time-efficient manner independent of vector width
US9367291B2 (en) Apparatus and method for generating vector code
Shin et al. Exploiting superword-level locality in multimedia extension architectures
Custers Algorithmic species: Classifying program code for parallel computing
Liu et al. Improving the performance of OpenMP by array privatization
Patabandi et al. SWIRL++: Evaluating performance models to guide code transformation in convolutional neural networks
Bai et al. Gtco: Graph and tensor co-design for transformer-based image recognition on tensor cores
US11782706B1 (en) Reconfigurable neural network processing based on subgraph recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination