WO2017035748A1 - Code compiling method and code compiler - Google Patents

Code compiling method and code compiler

Info

Publication number
WO2017035748A1
Authority
WO
WIPO (PCT)
Prior art keywords
digital signal
signal processor
code
matrix
configuration information
Prior art date
Application number
PCT/CN2015/088637
Inventor
曾德军
王继辉
宁洪
Original Assignee
华为技术有限公司
Priority date
Filing date
Publication date
Application filed by 华为技术有限公司
Priority to PCT/CN2015/088637 (WO2017035748A1)
Priority to CN201580081768.9A (CN107851002A)
Publication of WO2017035748A1

Definitions

  • the embodiments of the present invention relate to the field of communications technologies, and in particular, to a code compiling method and a code compiler.
  • The digital signal processor (DSP) is widely used in signal processing, communications, radar, automatic control, and other fields. Using its powerful computing resources, it can perform real-time processing on baseband signals, such as transformation, filtering, estimation, compression, and identification.
  • The digital signal processing performed by a DSP is generally based on matrices. To implement digital signal processing on a DSP, matrix modeling and simulation are performed first, and the matrix operations are then manually converted into specific DSP instructions.
  • There are many types of DSPs, and their instruction architectures differ. If the same algorithm on the same matrix must be implemented on different DSPs, existing solutions require DSP instructions to be developed separately for each type of DSP. As shown in Figure 1, to implement the same matrix operation on three different types of DSPs whose core architectures and instruction languages differ considerably, three sets of DSP instructions must be written manually.
  • Embodiments of the present invention provide a code compiling method and a code compiler for implementing an efficient and general code compiling process across digital signal processor platforms.
  • a code compilation method including:
  • the code is a user-oriented high-level programming language program code
  • the performing the dimension reduction on the matrix according to the configuration information of the digital signal processor includes:
  • the matrix is reduced to K vectors, one vector includes P scalars, each scalar has a length of Q bits, and K is an integer greater than or equal to 1.
  • the generating, according to the matrix operation description information in the source code and the configuration information of the digital signal processor, of digital signal processor instructions for processing the operation object includes:
  • the method further includes: acquiring an optimization rule
  • a code compiler comprising:
  • a first obtaining module configured to acquire source code for describing a matrix operation and a matrix participating in the matrix operation, the source code being a user-oriented high-level programming language program code;
  • a second obtaining module configured to acquire configuration information of a digital signal processor running the target code
  • a code compiling module configured to generate, according to the source code and the configuration information of the digital signal processor, target code running on the digital signal processor, where the target code includes digital signal processor instructions, and the digital signal processor instructions are used to perform the matrix operation on the matrix.
  • the code compiling module is specifically configured to:
  • the code compiling module is specifically configured to:
  • the matrix is reduced to K vectors, one vector includes P scalars, each scalar has a length of Q bits, and K is an integer greater than or equal to 1.
  • the code compiling module is specifically configured to:
  • the code compiling module is specifically configured to:
  • in a third aspect, a computer device is provided, which can include a processor, a memory, an input/output device, and a bus architecture.
  • the processor is responsible for managing the bus architecture and the usual processing, and the memory can store the data that the processor uses when performing operations.
  • Input/output devices are used to receive and output data under the control of the processor.
  • the input/output devices include, but are not limited to, a display, a mouse, a keyboard, and the like.
  • the bus architecture may include any number of interconnected buses and bridges, which link together various circuits, specifically one or more processors represented by the processor and memory represented by the memory.
  • the bus architecture can also link various other circuits such as peripherals, voltage regulators, and power management circuits, which are well known in the art and, therefore, will not be further described herein.
  • the bus architecture provides an interface.
  • the code compilation process disclosed in the embodiment of the present invention may be applied to a processor or implemented by a processor.
  • the steps of the code compilation process can be completed by the integrated logic circuit of the hardware in the processor or the instruction in the form of software.
  • the processor may be a general purpose processor, a digital signal processor, an application specific integrated circuit, a field programmable gate array or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or may implement or perform the embodiments of the present invention.
  • a general purpose processor can be a microprocessor or any conventional processor or the like.
  • the steps of the method disclosed in the embodiments of the present invention may be directly implemented as a hardware processor, or may be performed by a combination of hardware and software modules in the processor.
  • the software module can be located in a conventional storage medium such as random access memory, flash memory, read only memory, programmable read only memory or electrically erasable programmable memory, registers, and the like.
  • the storage medium is located in the memory, and the processor reads the information in the memory and combines the steps of the hardware to complete the code compilation process.
  • Source code for describing a matrix operation and a matrix participating in the matrix operation, and configuration information of a digital signal processor running the target code, are acquired; then, according to the source code and the configuration information of the digital signal processor, target code running on the digital signal processor is generated. The target code includes digital signal processor instructions, which are used to perform the matrix operation on the matrix.
  • On the one hand, because the source code is user-oriented high-level programming language program code, it is independent of the various types of digital signal processors. On the other hand, the source code is compiled into target code on the basis of the configuration information of the digital signal processor that runs the target code.
  • Therefore, when the same matrix operation must be implemented on different types of digital signal processors, target code for each processor can be obtained from the same set of source code, which improves the versatility and efficiency of code compilation.
  • FIG. 1 is a schematic diagram of a cross-platform code migration scheme in the prior art
  • FIG. 2 is a schematic diagram of a code compilation process according to an embodiment of the present invention.
  • FIG. 3 is an implementation flow of step 203 in FIG. 2;
  • FIG. 4 is a schematic diagram of a cross-platform code migration solution according to an embodiment of the present invention.
  • FIG. 5 is a schematic structural diagram of a code compiler according to an embodiment of the present invention.
  • FIG. 6 is a schematic structural diagram of a computer device according to an embodiment of the present invention.
  • The embodiment of the invention provides a code compiling scheme that, based on the same set of source code, can obtain target code suitable for different digital signal processors according to the configuration information of those processors. Compared with the prior art, a common code compilation process is implemented across different digital signal processor platforms, which improves the versatility and efficiency of code compilation.
  • The digital signal processor in the embodiment of the present invention is a generalized concept: it refers to any device that performs digital signal processing and is not limited to devices named DSP.
  • The technical solution of the embodiment of the present invention can be extended to support all processors that perform digital signal processing, such as a general-purpose CPU (Central Processing Unit) or a GPU (Graphics Processing Unit).
  • FIG. 2 is a schematic diagram of a code compilation process according to an embodiment of the present invention.
  • The process may be implemented by a code compiler. As shown in the figure, the process can include the following steps:
  • Step 201: Acquire source code for describing a matrix operation and a matrix participating in the matrix operation, the source code being user-oriented high-level programming language program code.
  • High-level programming languages are generally user-oriented languages that are largely independent of the type and structure of the computer. Their greatest strength is that they are close in form to arithmetic and natural language and close in concept to the way people usually think. One statement in a high-level programming language can replace several, dozens, or even hundreds of assembly language instructions. High-level languages are therefore easy to learn, easy to use, versatile, and widely used. There are many high-level programming languages; for example, C and PASCAL are both high-level programming languages.
  • Embodiments of the present invention do not limit the high-level programming language program code to which the source code describing the matrix operation belongs.
  • The following description takes source code based on the C language as an example; the principle can be applied to source code in other high-level programming languages.
  • The embodiment of the invention proposes a cross-platform programming language that extends the C language with matrices.
  • The programming language is abbreviated as the CM (C with Matrix) language, that is, a matrix-based C language. Correspondingly, source code written in the CM language for describing matrix operations is called CM source code.
  • The CM source code describes only the matrix algorithm, which can be a fixed-point algorithm or a floating-point algorithm.
  • the CM source code is independent of the digital signal processor and does not reflect the characteristics of the digital signal processor platform.
  • CM language is a matrix-level high-level abstract language. It defines various algorithms and operations of the matrix, describes the matrix model of the algorithm, and provides a rich matrix operation syntax and operation library. Using these matrix arithmetic grammars and arithmetic libraries, users can easily describe matrix digital signal processing algorithms.
  • The CM language adds syntax for matrix operations while leaving the syntax of the C language unchanged.
  • the syntax definitions of matrix operations provided by several CM languages are described below.
  • These may include matrix addition (operator +), matrix subtraction (operator -), matrix multiplication (operator *), and matrix reciprocal (algorithm descriptor RECIP) operations.
  • the specific ones may include:
  • Complex-number operations may include: conjugation (algorithm descriptor CONJ), conjugate transposition (algorithm descriptor CTRAN), and squared modulus (algorithm descriptor MODU).
  • Rearrangement operations may include: transposition (algorithm descriptor TRAN), conjugate transposition (algorithm descriptor CTRAN), and matrix element reversal (algorithm descriptor ELEREV).
  • A hash operation of the matrix, a matrix split operation, a matrix merge operation, and the like may be included.
  • These may also include a matrix inversion operation (algorithm descriptor INVERSE) and a matrix eigenvalue and eigenvector decomposition operation.
  • various decomposition operations of the matrix such as Cholesky decomposition, LU decomposition, QR decomposition, etc., may be included.
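As a rough illustration of what a few of the descriptors listed above could mean, the following Python sketch gives plain reference semantics on matrices stored as lists of rows. The function bodies and the `CM_OPS` table are assumptions made for illustration; the text names the descriptors but does not define their exact behavior, and this is not CM syntax.

```python
# Illustrative reference semantics for a few CM descriptors.
# Assumption: matrices are lists of rows; nothing here is CM syntax.

def tran(m):
    """TRAN: transpose."""
    return [list(col) for col in zip(*m)]

def conj(m):
    """CONJ: element-wise complex conjugate."""
    return [[complex(x).conjugate() for x in row] for row in m]

def ctran(m):
    """CTRAN: conjugate transpose."""
    return tran(conj(m))

def modu(m):
    """MODU: element-wise squared modulus (assumed meaning)."""
    return [[complex(x).real ** 2 + complex(x).imag ** 2 for x in row] for row in m]

def add(a, b):
    """Operator +: element-wise sum of equally sized matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def mul(a, b):
    """Operator *: matrix product."""
    bt = tran(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

# Hypothetical dispatch table from descriptor to reference implementation.
CM_OPS = {"+": add, "*": mul, "TRAN": tran, "CONJ": conj, "CTRAN": ctran, "MODU": modu}

m = [[1 + 2j, 3], [0, 4 - 1j]]
ctran_m = CM_OPS["CTRAN"](m)
```

A table of this kind is one simple way for a front end to look up the meaning of an operator before any platform-specific lowering takes place.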
  • the source code can be written in CM language according to the required matrix operation.
  • The code compiler can automatically decompose the source code into digital signal processor instructions according to the operational meaning expressed in the CM language. Users only need to focus on describing the matrix operations in the CM language, without having to care about which digital signal processor platform or platforms these operations run on. Because the user only needs to master the CM language rather than a variety of digital signal processor instruction languages, the development cycle is shortened and efficiency is improved.
  • Step 202: Acquire configuration information of a digital signal processor running the target code.
  • The configuration information of the digital signal processor refers to information related to the digital signal processor platform (system), which reflects the platform characteristics of the digital signal processor.
  • The configuration information of the digital signal processor may include the parameters required for the code compilation process on that digital signal processor, such as the vectorization scheme applicable to the digital signal processor and its related parameters (for example, a vectorization length parameter), and the instruction set associated with the digital signal processor type or platform. Digital signal processors from different platforms (systems) have different instruction sets.
  • The correspondence between digital signal processor types and configuration information can be established in advance, so that only description information of the digital signal processor type, such as the digital signal processor model, needs to be input to the code compiler for the code compiler to obtain the configuration information of the corresponding digital signal processor.
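The pre-established correspondence between processor model and configuration information could be sketched as a simple lookup table. In the Python sketch below, the model names `dsp_a` and `dsp_b`, the field names, and the pairing of instruction names with a model are all hypothetical; only the kinds of information stored (vectorization length and instruction set) come from the text.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DspConfig:
    lanes: int              # P: operands handled by one SIMD instruction
    bits: int               # Q: bit length of each operand
    instruction_set: tuple  # mnemonics available on this platform

# Hypothetical registry, established in advance, keyed by DSP model name.
DSP_CONFIGS = {
    "dsp_a": DspConfig(lanes=8, bits=16, instruction_set=("LV16", "SV16", "HFMULA_R_8X16")),
    "dsp_b": DspConfig(lanes=4, bits=32, instruction_set=("LOAD4", "STORE4", "MAC4")),
}

def get_config(model):
    """Return the configuration registered for a DSP model name."""
    try:
        return DSP_CONFIGS[model]
    except KeyError:
        raise ValueError(f"no configuration registered for DSP model {model!r}") from None

cfg = get_config("dsp_a")
```

With such a table, the user passes only the model name to the compiler, and everything platform-specific is derived from the returned record.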
  • Step 203: Generate, according to the source code and the configuration information of the digital signal processor, target code that runs on the digital signal processor, where the target code includes digital signal processor instructions, and the digital signal processor instructions are used to perform the matrix operation on the matrix.
  • Digital signal processor instructions are usually instructions written in assembly language and conform to assembly language statement formats.
  • the assembly language statement format can contain four parts: the label field, the instruction field, the operand field, and the comment field. Taking the mnemonic instruction as an example, the assembly language statement format is as follows:
  • the instruction field contains the opcode
  • the operand field contains the operand
  • the assembly language allows you to specify a constant, symbol, or expression as an address, immediate, or indirect.
  • For a digital signal processor instruction of the SIMD (single instruction stream, multiple data streams) type, one instruction can fetch multiple elements of a matrix operand for operation.
  • step 203 may include the following steps 2031 to 2032:
  • Step 2031: Perform dimension reduction on the matrix participating in the matrix operation according to the configuration information of the digital signal processor, to obtain the operation objects (also referred to as operands) of the digital signal processor instructions.
  • Matrix dimension reduction refers to placing the operands of a multidimensional (N0 × N1 × ... × Nm) matrix into X one-dimensional vectors according to the characteristics of the operation performed, where X is an integer greater than or equal to 1.
  • The operation on the multidimensional matrix is then equivalent to operations on the X one-dimensional vectors.
  • the method of matrix dimensionality reduction is closely related to the characteristics of matrix operation operations.
  • The dimensionality reduction methods for different matrix operations (e.g., +, *, summation, etc.) may differ.
  • The code compiler in the embodiment of the present invention can identify the optimal vectorization method for matrix dimensionality reduction according to the operation type, the operand dimension information of the input matrix, the vectorization feature information of the digital signal processor platform, and the instruction template of the digital signal processor platform.
  • The instruction template of the digital signal processor platform can be used to convert the matrix operations described by the CM source code into digital signal processor instructions that match the type of the digital signal processor platform. For example, a matrix multiplication statement in the CM source code may be converted, according to the template, into one or more DSP instructions that multiply and accumulate the elements of the matrices in a certain way.
  • the corresponding vectorization scheme may be configured in advance for the digital signal processor.
  • The vectorization scheme defines parameters such as the vectorization length, according to which operation objects conforming to the instructions of the digital signal processor may be obtained.
  • The specific implementation process of step 2031 may include: acquiring a vectorization scheme according to the configuration information of the digital signal processor, where the vectorization length parameter in the vectorization scheme is expressed as P×Q, P represents the number of operation objects processed by a digital signal processor instruction of the single instruction stream, multiple data streams type, and P and Q are each integers greater than or equal to 1; and, according to the vectorization length, reducing the matrix to K vectors, where one vector includes P scalars, each scalar has a length of Q bits, and K is an integer greater than or equal to 1.
  • a complete matrix operation is split into multiple sub-operations. Each sub-operation can be implemented with a specific digital signal processor instruction.
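The reduction of a matrix to K vectors of P scalars can be sketched in Python as follows. Splitting each row into consecutive chunks and zero-padding a short final chunk are assumptions made for this sketch; the text states that the actual reduction method depends on the operation being performed. The function name `reduce_to_vectors` is hypothetical.

```python
def reduce_to_vectors(matrix, p, pad=0):
    """Split each row of a 2-D matrix into vectors of p scalars.

    Assumption: row-wise chunking; a short last chunk is padded with `pad`.
    Returns the list of K vectors.
    """
    vectors = []
    for row in matrix:
        for start in range(0, len(row), p):
            chunk = list(row[start:start + p])
            chunk += [pad] * (p - len(chunk))  # pad only when the row length is not a multiple of p
            vectors.append(chunk)
    return vectors

# A 24 x 24 matrix with P = 8 yields 3 vectors per row, so K = 72.
a = [[r * 24 + c for c in range(24)] for r in range(24)]
vecs = reduce_to_vectors(a, p=8)
```

Each resulting vector can then serve as the operand of a single SIMD instruction that processes P scalars at once.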
  • Step 2032: Generate, according to the matrix operation description information in the source code and the configuration information of the digital signal processor, digital signal processor instructions that process the operation object.
  • The instruction set of the digital signal processor is obtained according to the configuration information of the digital signal processor, and the digital signal processor instructions for processing the operation object are generated according to the description information of the matrix operation and the instruction set of the digital signal processor, where the opcode and instruction format of the generated digital signal processor instructions are adapted to the instructions in the instruction set of the digital signal processor.
  • For example, to generate a DSP instruction for an addition operation, an instruction implementing addition is acquired from the digital signal processor instruction set, and the content of the data field in the instruction is then determined according to the data object obtained by dimension reduction in step 2031.
  • Other portions of the instruction (such as the opcode) may remain unchanged, resulting in a DSP instruction that adds the data objects from step 2031.
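The addition example above amounts to filling the data fields of an instruction template while leaving the opcode fixed. A minimal Python sketch follows; the mnemonics `VADD` and `VMUL`, the register names, and the template format are hypothetical and do not belong to any real DSP instruction set.

```python
# Hypothetical instruction templates: the opcode is fixed, the data fields are filled in.
TEMPLATES = {
    "add": "VADD {dst}, {src1}, {src2}",
    "mul": "VMUL {dst}, {src1}, {src2}",
}

def emit(op, dst, src1, src2, templates=TEMPLATES):
    """Fill the data fields of the template registered for `op`."""
    if op not in templates:
        raise ValueError(f"instruction set has no template for operation {op!r}")
    return templates[op].format(dst=dst, src1=src1, src2=src2)

def emit_vector_op(op, k):
    """Emit one instruction per vector pair produced by dimension reduction."""
    return [emit(op, f"out{i}", f"a{i}", f"b{i}") for i in range(k)]

program = emit_vector_op("add", 3)
```

Swapping in a different `templates` dictionary is enough to retarget the same operation description to another instruction set, which is the role the text assigns to the platform's instruction template.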
  • The code compiler can complete both the generation of the target code and code optimization.
  • A step of acquiring an optimization rule may further be included.
  • The target code running on the digital signal processor may then be generated according to the source code, the configuration information of the digital signal processor, and the optimization rule.
  • The optimization rule may include an efficiency priority rule, a performance priority rule, and a space priority rule.
  • Depending on the optimization rule adopted, the generated digital signal processor instructions may differ. For example, if a performance priority rule is adopted, the generated digital signal processor instructions have better performance but may take up more storage space. Optimization rules can also specify the number of loop unrollings, whether to perform loop merging, and so on.
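To make the trade-off between performance and storage space concrete, the Python sketch below lets an optimization rule select a loop-unroll factor. The rule-to-factor mapping, the `LOOP`/`ENDLOOP`/`MAC` mnemonics, and the `{off}` placeholder are assumptions for illustration only; the text does not specify how a rule maps to concrete transformations.

```python
# Hypothetical mapping from optimization rule to loop-unroll factor.
UNROLL_BY_RULE = {"space": 1, "performance": 4}

def emit_loop(body, trip_count, rule):
    """Emit a loop whose body is replicated according to the chosen rule.

    `body` is a list of instruction templates containing an {off} placeholder
    for the element offset within one unrolled iteration.
    """
    unroll = UNROLL_BY_RULE[rule]
    if trip_count % unroll:
        raise ValueError("trip count must be a multiple of the unroll factor")
    lines = [f"LOOP count={trip_count // unroll} step={unroll}"]
    for offset in range(unroll):
        lines += [instr.format(off=offset) for instr in body]
    lines.append("ENDLOOP")
    return lines

body = ["MAC acc, b[i+{off}], c[i+{off}]"]
small = emit_loop(body, 8, "space")        # fewer instructions, more loop iterations
fast = emit_loop(body, 8, "performance")   # more instructions, fewer loop iterations
```

The performance-oriented variant emits four copies of the body and a quarter of the iterations, which reduces loop overhead at the cost of code size.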
  • In the above embodiments of the present invention, the source code for describing the matrix operation and the matrix participating in the matrix operation, and the configuration information of the digital signal processor running the target code, are first acquired. Then, according to the source code and the configuration information of the digital signal processor, target code running on the digital signal processor is generated; the target code includes digital signal processor instructions, which are used to perform the matrix operation on the matrix.
  • On the one hand, because the source code is user-oriented high-level programming language program code, it is independent of the various types of digital signal processors, that is, it does not depend on any particular digital signal processor. On the other hand, the configuration information of the digital signal processor running the target code is used as the basis for compiling. Therefore, when the same matrix operation must run on different types of digital signal processors, target code suitable for each digital signal processor can be obtained from the same set of source code according to the configuration information of the respective processor. Compared with the prior art, a common code compilation process is implemented across different digital signal processor platforms, which improves the versatility and efficiency of code compilation.
  • Because the CM language only describes the algorithm and does not involve the underlying digital signal processor code, it is a general-purpose language. The user only needs to write the source code in the CM language, and the source code can be converted into the digital signal processor instructions of the corresponding DSP platform, as shown in Figure 3.
  • On the one hand, cross-platform functionality is realized automatically; on the other hand, the cost of cross-platform porting is reduced.
  • In a mobile communication system, the embodiment of the present invention can be used for DSP implementation in a base station, a user equipment, or a relay device such as a repeater.
  • Current wireless communication systems evolve quickly, and the CM language enables rapid DSP implementation of new technologies and algorithms.
  • an image processing algorithm implementation may be implemented by using an embodiment of the present invention.
  • a radar processing algorithm implementation or a transmitter/receiver or other field that may involve a DSP implementation may be implemented by using an embodiment of the present invention.
  • The matrix B is a matrix with 24 rows and 12 columns (24×12).
  • The matrix C is a matrix with 12 rows and 24 columns (12×24).
  • The source code is obtained by programming in the CM language.
  • The operand B and the operand C can be defined as two-dimensional matrices, represented as B[s][n] and C[n][m] respectively, and the matrix operation result is expressed as A[s][m], as follows:
  • The prefix matrix is a CM language identifier that marks the operand as a matrix type.
  • Half indicates that the elements of the operand matrix are of the half-precision type.
  • MULXYCR is a matrix multiplication operation defined by CM.
  • the above source code is input to the code compiler, and the code compiler implements the conversion process of the source code to the DSP instruction.
  • The code compiler first reduces the dimension of the matrices, dividing the multiplication of the two-dimensional matrices into multiply-accumulate operations on multiple one-dimensional vectors. For example, the dimension can be reduced along rows and columns (see the for loop statements in the following code). After dimension reduction of matrix A, for example, each row becomes 3 vectors, for a total of 24 rows; each vector contains 8 scalars, and each scalar is 16 bits long. Then, MULXYCR is decomposed into vectorized operations and finally adapted to the DSP platform to generate the DSP instruction code.
  • the generated DSP instruction code can be as follows:
  • HFMULA_R_8X16 is a vectorized multiply and accumulate operation; LV16 and SV16 are the load of the operand and the store operation of the result respectively.
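The decomposition of A = B × C into 8-wide multiply-accumulate steps can be simulated in plain Python. In this sketch, `lv`, `sv`, and `mula` are hypothetical stand-ins for the roles the text assigns to LV16, SV16, and HFMULA_R_8X16; their exact operand forms are not given in the text, so the broadcast form `acc += scalar * vector` is an assumption, and integers are used instead of half-precision values to keep the check exact.

```python
P = 8  # scalars per vector, matching the 8 x 16-bit vectorization in the example

def lv(row, start):
    """Load P consecutive scalars (stand-in for the role of LV16)."""
    return row[start:start + P]

def sv(row, start, vec):
    """Store P scalars back into a result row (stand-in for the role of SV16)."""
    row[start:start + P] = vec

def mula(acc, scalar, vec):
    """Vector multiply-accumulate, acc + scalar * vec (assumed form of HFMULA_R_8X16)."""
    return [x + scalar * v for x, v in zip(acc, vec)]

def matmul_vectorized(b, c):
    """Compute A = B x C with one P-wide accumulator per result vector."""
    s, n, m = len(b), len(c), len(c[0])
    if m % P:
        raise ValueError("column count of C must be a multiple of P")
    a = [[0] * m for _ in range(s)]
    for i in range(s):                # 24 result rows
        for j in range(0, m, P):      # 3 vectors per row
            acc = [0] * P
            for k in range(n):        # accumulate over the shared dimension
                acc = mula(acc, b[i][k], lv(c[k], j))
            sv(a[i], j, acc)
    return a

b = [[(i + k) % 5 for k in range(12)] for i in range(24)]   # 24 x 12
c = [[(k * j) % 7 for j in range(24)] for k in range(12)]   # 12 x 24
a = matmul_vectorized(b, c)                                 # 24 x 24
```

The loop nest mirrors the description above: each of the 24 result rows is produced as 3 vectors of 8 scalars, and every vector is built from 12 multiply-accumulate steps.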
  • the code compiler can automatically optimize, such as optimization pass analysis, load/store elimination, loop invariant extraction, loop unroll/merge, and branch elimination. This process does not require human involvement, which frees up manpower and shortens the development cycle.
  • an embodiment of the present invention further provides a code compiler.
  • FIG. 5 is a schematic structural diagram of a code compiler according to an embodiment of the present invention.
  • the code compiler may include: a first obtaining module 501, a second obtaining module 502, and a code compiling module 503, where:
  • a first obtaining module 501 configured to acquire source code for describing a matrix operation and a matrix participating in the matrix operation, where the source code is a user-oriented high-level programming language program code;
  • a second obtaining module 502, configured to acquire configuration information of a digital signal processor running the target code;
  • a code compiling module 503, configured to generate, according to the source code and the configuration information of the digital signal processor, target code running on the digital signal processor, where the target code includes digital signal processor instructions, and the digital signal processor instructions are used to perform the matrix operation on the matrix.
  • The code compiling module 503 is specifically configured to: perform dimension reduction on the matrix according to the configuration information of the digital signal processor, to obtain an operation object of the digital signal processor instruction; and generate, according to the matrix operation description information in the source code and the configuration information of the digital signal processor, digital signal processor instructions that process the operation object.
  • The code compiling module 503 is specifically configured to: obtain a vectorization scheme according to the configuration information of the digital signal processor, where the vectorization length parameter in the vectorization scheme is expressed as P×Q, P represents the number of operation objects processed by a digital signal processor instruction of the single instruction stream, multiple data streams type, and P and Q are each integers greater than or equal to 1; and, according to the vectorization length, reduce the matrix to K vectors, where one vector includes P scalars, each scalar has a length of Q bits, and K is an integer greater than or equal to 1.
  • The code compiling module 503 is specifically configured to: acquire the instruction set of the digital signal processor according to the configuration information of the digital signal processor; and generate, according to the description information of the matrix operation and the instruction set of the digital signal processor, digital signal processor instructions that process the operation object, where the opcode and instruction format of the digital signal processor instructions are adapted to the instructions in the instruction set of the digital signal processor.
  • The code compiling module 503 is specifically configured to: obtain an optimization rule; and generate, according to the source code, the configuration information of the digital signal processor, and the optimization rule, the target code running on the digital signal processor.
  • an embodiment of the present invention further provides a computer device.
  • FIG. 6 is a schematic structural diagram of a computer device according to an embodiment of the present invention.
  • the computer device can include a processor 601, a memory 602, an input/output device 603, and a bus architecture 604.
  • the input/output device 603 is for receiving and outputting data under the control of the processor 601.
  • the input/output device 603 includes, but is not limited to, a display, a mouse, a keyboard, and the like.
  • the bus architecture may include any number of interconnected buses and bridges, which link together various circuits, specifically one or more processors represented by the processor 601 and memory represented by the memory 602.
  • the bus architecture can also link various other circuits such as peripherals, voltage regulators, and power management circuits, which are well known in the art and, therefore, will not be further described herein.
  • the bus architecture provides an interface.
  • the processor 601 is responsible for managing the bus architecture and general processing, and the memory 602 can store data used by the processor 601 in performing operations.
  • the code compilation process disclosed in the embodiment of the present invention may be applied to the processor 601 or implemented by the processor 601.
  • each step of the code compilation process may be completed by an integrated logic circuit of the hardware in the processor 601 or an instruction in the form of software.
  • the processor 601 can be a general purpose processor, a digital signal processor, an application specific integrated circuit, a field programmable gate array or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and can implement or perform the embodiments of the present invention.
  • a general purpose processor can be a microprocessor or any conventional processor or the like.
  • the steps of the method disclosed in the embodiments of the present invention may be directly implemented as a hardware processor, or may be performed by a combination of hardware and software modules in the processor.
  • the software module can be located in a conventional storage medium such as random access memory, flash memory, read only memory, programmable read only memory or electrically erasable programmable memory, registers, and the like.
  • the storage medium is located in the memory 602, and the processor 601 reads the information in the memory 602 and completes the steps of the code compilation process in conjunction with its hardware.
  • the processor 601 is configured to read a program in the memory 602 and perform the following process:
  • obtaining source code that describes a matrix operation and the matrix participating in the matrix operation, the source code being user-oriented high-level programming language program code;
  • obtaining configuration information of the digital signal processor that is to run the object code;
  • generating, according to the source code and the configuration information of the digital signal processor, object code that runs on the digital signal processor, where the object code contains digital signal processor instructions configured to perform the matrix operation on the matrix.
  • the processor 601 is specifically configured to: perform dimension reduction on the matrix according to the configuration information of the digital signal processor, to obtain the operation objects of the digital signal processor instructions; and generate, according to the matrix operation description information in the source code and the configuration information of the digital signal processor, digital signal processor instructions that process the operation objects.
  • the processor 601 is specifically configured to: obtain a vectorization scheme according to the configuration information of the digital signal processor, where the vectorization length parameter in the vectorization scheme is expressed as P×Q, P represents the number of operation objects processed by one single-instruction-multiple-data (SIMD) digital signal processor instruction, and P and Q are each integers greater than or equal to 1; and reduce the matrix, according to the vectorization length, to K vectors, where each vector includes P scalars, each scalar is Q bits long, and K is an integer greater than or equal to 1.
  • the processor 601 is specifically configured to: obtain the instruction set of the digital signal processor according to the configuration information of the digital signal processor; and generate, according to the description information of the operation on the matrix and the instruction set of the digital signal processor, digital signal processor instructions that process the operation objects, where the opcodes and instruction formats in the digital signal processor instructions match the instructions in the instruction set of the digital signal processor.
  • the processor 601 is further configured to: obtain an optimization rule, and generate, according to the source code, the configuration information of the digital signal processor, and the optimization rule, object code that runs on the digital signal processor.
  • embodiments of the present invention can be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or a combination of software and hardware. Moreover, the invention can take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) including computer usable program code.
  • the computer program instructions can also be stored in a computer readable memory that can direct a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer readable memory produce an article of manufacture comprising an instruction apparatus that implements the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
  • these computer program instructions can also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing, and the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

Landscapes

  • Devices For Executing Special Programs (AREA)

Abstract

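A code compilation method and a code compiler. Source code that describes a matrix operation and the matrix participating in that operation is obtained first (201), together with configuration information of the digital signal processor that is to run the object code (202). Object code that runs on the digital signal processor is then generated from the source code and the configuration information; the object code contains digital signal processor instructions that perform the matrix operation on the matrix (203). On the one hand, the source code is user-oriented high-level programming language program code and can be independent of any particular digital signal processor. On the other hand, when the source code is compiled into object code, the configuration information of the digital signal processor that is to run the object code serves as the basis for compilation. Object code suited to different digital signal processors can therefore be obtained from a single set of source code, which gives a general-purpose compilation process and improves the generality and efficiency of code compilation.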

Description

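Code compilation method and code compiler
Technical Field
Embodiments of the present invention relate to the field of communications technologies, and in particular to a code compilation method and a code compiler.
Background
Digital signal processors (DSPs) are widely used in signal processing, communications, radar, automatic control, and other fields. Using their substantial computing resources, they can perform real-time processing of baseband signals such as transformation, filtering, estimation, compression, and recognition.
The digital signal processing performed by a DSP is generally matrix-based. To implement digital signal processing on a DSP, the matrix model is first built and simulated, and the matrix operations are then converted by hand into DSP-specific instructions.
DSPs come in many types, and their instruction architectures differ. When the same algorithm on the same matrix has to be implemented on different DSPs, existing solutions require DSP instructions to be developed separately for each DSP type. As shown in Figure 1, when one matrix operation is to be implemented on three different types of DSP whose core architectures and instruction languages differ considerably, three sets of DSP instructions have to be written by hand.
A code compilation solution that is efficient, general, and works across DSP platforms is therefore urgently needed.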
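Summary
Embodiments of the present invention provide a code compilation method and a code compiler, so as to implement an efficient and general code compilation process across DSP platforms.
According to a first aspect, a code compilation method is provided, including:
obtaining source code that describes a matrix operation and the matrix participating in the matrix operation, where the source code is user-oriented high-level programming language program code;
obtaining configuration information of the DSP that is to run the object code; and
generating, according to the source code and the configuration information of the DSP, object code that runs on the DSP, where the object code contains DSP instructions, and the DSP instructions are used to perform the matrix operation on the matrix.
With reference to the first aspect, in a first possible implementation of the first aspect, the generating, according to the source code and the configuration information of the DSP, of object code that runs on the DSP includes:
performing dimension reduction on the matrix according to the configuration information of the DSP, to obtain the operation objects of the DSP instructions; and
generating, according to the matrix operation description information in the source code and the configuration information of the DSP, DSP instructions that process the operation objects.
With reference to the first possible implementation of the first aspect, in a second possible implementation of the first aspect, the performing of dimension reduction on the matrix according to the configuration information of the DSP includes:
obtaining a vectorization scheme according to the configuration information of the DSP, where the vectorization length parameter in the vectorization scheme is expressed as P×Q, P represents the number of operation objects processed by one single-instruction-multiple-data (SIMD) DSP instruction, and P and Q are each integers greater than or equal to 1; and
reducing the matrix, according to the vectorization length, to K vectors, where each vector includes P scalars, each scalar is Q bits long, and K is an integer greater than or equal to 1.
With reference to the first possible implementation of the first aspect, in a third possible implementation of the first aspect, the generating, according to the matrix operation description information in the source code and the configuration information of the DSP, of DSP instructions that process the operation objects includes:
obtaining the instruction set of the DSP according to the configuration information of the DSP; and
generating, according to the description information of the operation on the matrix and the instruction set of the DSP, DSP instructions that process the operation objects, where the opcodes and instruction formats in the DSP instructions match the instructions in the instruction set of the DSP.
With reference to the first aspect or any one of the first to third possible implementations of the first aspect, in a fourth possible implementation of the first aspect, the method further includes: obtaining an optimization rule; and
the generating, according to the source code and the configuration information of the DSP, of object code that runs on the DSP includes:
generating, according to the source code, the configuration information of the DSP, and the optimization rule, object code that runs on the DSP.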
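According to a second aspect, a code compiler is provided, including:
a first obtaining module, configured to obtain source code that describes a matrix operation and the matrix participating in the matrix operation, where the source code is user-oriented high-level programming language program code;
a second obtaining module, configured to obtain configuration information of the DSP that is to run the object code; and
a code compilation module, configured to generate, according to the source code and the configuration information of the DSP, object code that runs on the DSP, where the object code contains DSP instructions, and the DSP instructions are used to perform the matrix operation on the matrix.
With reference to the second aspect, in a first possible implementation of the second aspect, the code compilation module is specifically configured to:
perform dimension reduction on the matrix according to the configuration information of the DSP, to obtain the operation objects of the DSP instructions; and
generate, according to the matrix operation description information in the source code and the configuration information of the DSP, DSP instructions that process the operation objects.
With reference to the first possible implementation of the second aspect, in a second possible implementation of the second aspect, the code compilation module is specifically configured to:
obtain a vectorization scheme according to the configuration information of the DSP, where the vectorization length parameter in the vectorization scheme is expressed as P×Q, P represents the number of operation objects processed by one SIMD DSP instruction, and P and Q are each integers greater than or equal to 1; and
reduce the matrix, according to the vectorization length, to K vectors, where each vector includes P scalars, each scalar is Q bits long, and K is an integer greater than or equal to 1.
With reference to the first possible implementation of the second aspect, in a third possible implementation of the second aspect, the code compilation module is specifically configured to:
obtain the instruction set of the DSP according to the configuration information of the DSP; and
generate, according to the description information of the operation on the matrix and the instruction set of the DSP, DSP instructions that process the operation objects, where the opcodes and instruction formats in the DSP instructions match the instructions in the instruction set of the DSP.
With reference to the second aspect or any one of the first to third possible implementations of the second aspect, in a fourth possible implementation of the second aspect, the code compilation module is specifically configured to:
obtain an optimization rule, and generate, according to the source code, the configuration information of the DSP, and the optimization rule, object code that runs on the DSP.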
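According to a third aspect, a computer device is provided. The computer device may include a processor, a memory, an input/output apparatus, and a bus architecture.
The processor is responsible for managing the bus architecture and for general processing, and the memory can store data used by the processor when performing operations. The input/output apparatus receives and outputs data under the control of the processor, and includes but is not limited to a display, a mouse, and a keyboard.
The bus architecture may include any number of interconnected buses and bridges, which link together the various circuits of one or more processors, represented by the processor, and of memory, represented by the memory. The bus architecture may also link various other circuits such as peripherals, voltage regulators, and power management circuits; these are well known in the art and are not described further here. The bus architecture provides an interface.
The code compilation process disclosed in the embodiments of the present invention may be applied to, or implemented by, the processor. During implementation, each step of the process may be completed by an integrated logic circuit of hardware in the processor or by instructions in the form of software. The processor may be a general-purpose processor, a digital signal processor, an application-specific integrated circuit, a field-programmable gate array or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and can implement or perform the methods, steps, and logic block diagrams disclosed in the embodiments. A general-purpose processor may be a microprocessor or any conventional processor. The steps of the method disclosed in the embodiments may be carried out directly by a hardware processor, or by a combination of hardware and software modules in the processor. A software module may reside in a storage medium that is mature in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or registers. The storage medium is located in the memory; the processor reads the information in the memory and completes the steps of the code compilation process in combination with its hardware.
In the foregoing embodiments, source code that describes a matrix operation and the matrix participating in it is obtained first, together with configuration information of the DSP that is to run the object code. Object code that runs on the DSP is then generated from the source code and the configuration information, and the object code contains DSP instructions that perform the matrix operation on the matrix. On the one hand, the source code is user-oriented high-level programming language program code and is therefore independent of, that is, does not depend on, any particular DSP. On the other hand, when the source code is compiled into object code, the configuration information of the DSP that is to run the object code serves as the basis for compilation. When the same matrix operation has to be implemented on different types of DSP, object code suited to each DSP can therefore be obtained from a single set of source code by relying on the configuration information of that DSP. Compared with the prior art, this gives a general-purpose code compilation process across DSP platforms and improves the generality and efficiency of code compilation.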
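Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the accompanying drawings needed for the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
Figure 1 is a schematic diagram of a cross-platform code porting scheme in the prior art;
Figure 2 is a schematic flowchart of code compilation according to an embodiment of the present invention;
Figure 3 shows the implementation flow of step 203 in Figure 2;
Figure 4 is a schematic diagram of the cross-platform code porting scheme in an embodiment of the present invention;
Figure 5 is a schematic structural diagram of a code compiler according to an embodiment of the present invention;
Figure 6 is a schematic structural diagram of a computer device according to an embodiment of the present invention.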
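Detailed Description
To make the objectives, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
Embodiments of the present invention provide a code compilation solution in which object code suited to different DSPs can be obtained from a single set of source code by relying on the configuration information of each DSP. Compared with the prior art, this gives a general-purpose code compilation process across DSP platforms and improves the generality and efficiency of code compilation.
The digital signal processor in the embodiments of the present invention is a broad concept: it refers to any device that performs digital signal processing, not only to devices named DSP. The technical solutions of the embodiments can be extended to all processors that perform digital signal processing. For example, general-purpose CPUs (Central Processing Units) and GPUs (Graphics Processing Units) that perform digital signal processing functions also fall within the scope of the present invention.
The embodiments of the present invention are described in detail below with reference to the accompanying drawings.
Figure 2 is a schematic flowchart of code compilation according to an embodiment of the present invention. The process may be implemented by a code compiler. As shown in the figure, the process may include the following steps:
Step 201: Obtain source code that describes a matrix operation and the matrix participating in the matrix operation, where the source code is user-oriented high-level programming language program code.
A high-level programming language is generally user-oriented and largely independent of the type and structure of the computer. Its main advantages are that its form is close to arithmetic and natural language and its concepts are close to those people ordinarily use. One statement in a high-level programming language can replace several, dozens, or even hundreds of assembly language instructions. High-level languages are therefore easy to learn and use, general, and widely applied. There are many high-level programming languages; C and Pascal are examples.
The embodiments of the present invention do not restrict which high-level programming language the source code describing the matrix operation is written in. The description below uses source code based on the C language only as an example; the principle extends to source code in other high-level programming languages.
An embodiment of the present invention proposes a cross-platform programming language that extends C with matrices. For ease of description, this language is abbreviated as CM (C with Matrix), that is, matrix-based C, and source code written in CM to describe matrix operations is called CM source code. CM source code may describe only the matrix algorithm, which may be a fixed-point or floating-point algorithm. CM source code is independent of the DSP and does not reflect the characteristics of any DSP platform.
CM is a high-level abstract language at the matrix level. It defines the algorithms and operations on matrices, describes the matrix model of an algorithm, and can provide a rich matrix operation syntax and operation library, with which users can conveniently describe matrix-based digital signal processing algorithms.
The syntax provided by CM keeps the existing C syntax unchanged and adds syntax for matrix operations. As an example, several of the matrix operation syntax definitions provided by CM are introduced below.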
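(1) Point (element-wise) operations
These may include, for example, matrix addition (operator +), matrix subtraction (operator -), matrix multiplication (operator *), and matrix reciprocal (algorithm descriptor RECIP).
(2) Complex matrix operations
These may include, for example:
Complex splitting: taking the real part (algorithm descriptor REAL) and taking the imaginary part (algorithm descriptor IMAG).
Complex merging: combining real and imaginary parts (algorithm descriptor COMPLEX).
Complex conjugate (algorithm descriptor CONJ), conjugate transpose (algorithm descriptor CTRAN), and squared modulus (algorithm descriptor MODU).
(3) Matrix reshaping operations
These may include, for example, transpose (algorithm descriptor TRAN), conjugate transpose (algorithm descriptor CTRAN), and matrix element reversal (algorithm descriptor ELEREV).
They may also include matrix scatter operations, matrix splitting operations, matrix merging operations, and so on.
(4) Other operations
These may include, for example, matrix inversion (algorithm descriptor INVERSE) and eigenvalue/eigenvector decomposition.
They may also include the various matrix decompositions, such as Cholesky, LU, and QR decomposition.
These syntax definitions cover almost all digital signal processing scenarios.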
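To make the descriptors above concrete, the following is a minimal reference sketch in Python of what a few of them could compute on small matrices stored as nested lists. Only the descriptor names come from the text; the behaviour shown (for instance, RECIP as an element-wise reciprocal and MODU as the squared modulus of each element) is an assumption inferred from those names, not the CM specification.

```python
# Reference semantics for a few CM descriptors over plain nested lists.
# The descriptor names come from the text; the Python behaviour is assumed.

def _map(f, m):
    """Apply f to every element of matrix m (a list of rows)."""
    return [[f(x) for x in row] for row in m]

def REAL(m):
    return _map(lambda z: complex(z).real, m)

def IMAG(m):
    return _map(lambda z: complex(z).imag, m)

def COMPLEX(re, im):
    """Combine a real-part matrix and an imaginary-part matrix."""
    return [[complex(a, b) for a, b in zip(r, i)] for r, i in zip(re, im)]

def CONJ(m):
    return _map(lambda z: complex(z).conjugate(), m)

def MODU(m):
    """Squared modulus of each element (assumed meaning)."""
    return _map(lambda z: complex(z).real ** 2 + complex(z).imag ** 2, m)

def RECIP(m):
    """Element-wise reciprocal (assumed meaning)."""
    return _map(lambda x: 1 / x, m)

def TRAN(m):
    return [list(col) for col in zip(*m)]

def CTRAN(m):
    return TRAN(CONJ(m))

M = [[1 + 2j, 3 - 1j],
     [0 + 1j, 2 + 0j]]
```

Splitting and merging are inverse to each other in this sketch: `COMPLEX(REAL(M), IMAG(M))` gives back `M`.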
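In practice, for a data processing task implemented through matrix operations, the source code is first written in CM according to the required matrix operations. The code compiler can then automatically decompose the operations expressed in CM into DSP instructions. The user only has to describe the matrix operations in CM and does not need to care on which DSP platform, or which kind of platform, they will run. The user therefore needs to master only CM rather than several DSP instruction languages, which shortens the development cycle and improves efficiency.
Step 202: Obtain configuration information of the DSP that is to run the object code.
The configuration information of a DSP is information related to the DSP platform (system) and reflects the platform's characteristics. Specifically, it may include the parameters needed to compile code for that DSP, such as the vectorization scheme suited to the DSP and its parameters (for example, the vectorization length parameter), and the instruction set associated with the DSP type or platform. DSPs of different platforms (systems) have different instruction sets.
In practice, a correspondence between DSP configuration information and DSP type may be established in advance. It is then enough to give the code compiler a description of the DSP type, such as the DSP model, for the compiler to obtain the corresponding configuration information.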
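The pre-established correspondence between DSP type and configuration information can be pictured as a simple lookup table. The sketch below is illustrative only: the model names `dsp_a` and `dsp_b`, the field names, and the instruction mnemonics are all hypothetical and do not describe any real DSP.

```python
from dataclasses import dataclass

@dataclass
class DspConfig:
    lanes: int              # P: operation objects handled by one SIMD instruction
    scalar_bits: int        # Q: width of each scalar, in bits
    instruction_set: dict   # operation name -> instruction template

# Hypothetical models and mnemonics, for illustration only.
DSP_CONFIGS = {
    "dsp_a": DspConfig(8, 16, {"add": "VADD_8X16 {dst}, {a}, {b}"}),
    "dsp_b": DspConfig(4, 32, {"add": "ADD.V4 {dst} = {a} + {b}"}),
}

def get_config(model: str) -> DspConfig:
    """Return the configuration registered for a DSP model name."""
    try:
        return DSP_CONFIGS[model]
    except KeyError:
        raise ValueError(f"no configuration registered for DSP model {model!r}") from None
```

With such a table, the only platform-specific input the compiler needs from the user is the model string.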
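Step 203: Generate, according to the source code and the configuration information of the DSP, object code that runs on the DSP, where the object code contains DSP instructions, and the DSP instructions are used to perform the matrix operation on the matrix.
DSP instructions are usually written in assembly language and follow the assembly statement format. An assembly statement may contain four fields: a label field, an instruction field, an operand field, and a comment field. Taking mnemonic instructions as an example, the statement format is:
[label][:] instruction [operand list] [; comment]
Parts in [] are optional. The instruction field contains the opcode and the operand field contains the operation objects. Assembly language allows constants, symbols, or expressions to be specified as addresses, immediate values, or indirect addresses.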
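As a small illustration of the statement format above, the following Python sketch renders one statement from its four fields. The comma-separated operand list and the register names used in the usage note are assumptions; the text fixes only the order of the fields and which of them are optional.

```python
def format_statement(instruction, operands=(), label=None, comment=None):
    """Render one statement as: [label][:] instruction [operand list] [; comment]."""
    parts = []
    if label:
        parts.append(f"{label}:")          # optional label field
    parts.append(instruction)              # mandatory instruction field (opcode)
    if operands:
        parts.append(", ".join(operands))  # optional operand field
    text = " ".join(parts)
    if comment:
        text += f" ; {comment}"            # optional comment field
    return text
```

For example, `format_statement("LV16", ["v0", "r1"], label="loop", comment="load")` produces `loop: LV16 v0, r1 ; load`, while `format_statement("NOP")` produces just `NOP`.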
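There are two kinds of DSP architecture. One is single instruction, single data (SISD), in which one instruction can fetch and operate on only one matrix element of an operand (operation object). The other uses single instruction, multiple data (SIMD) parallelism, typified by vector processors and array processors; in this architecture, one DSP instruction can fetch several elements of a matrix operand and operate on them.
Taking a SIMD DSP as an example, as shown in Figure 3, step 203 may include the following steps 2031 and 2032:
Step 2031: Perform dimension reduction on the matrix participating in the matrix operation according to the configuration information of the DSP, to obtain the operation objects (also called operands) of the DSP instructions.
Because a DSP cannot recognize matrix operations, they have to be converted into vector operations. Vector operations map directly onto DSP instructions; that is, a vector can be the operation object of a DSP instruction. An N-dimensional matrix (N >= 1) therefore has to be reduced in dimension, which is a transformation between equivalent computations. Matrix dimension reduction means placing the operands of a multidimensional (N0*N1*...*Nm) matrix into X one-dimensional vectors according to the characteristics of the operation being performed, where X is an integer greater than or equal to 1. The result of operating on the multidimensional matrix is equivalent to the result of operating on those X one-dimensional vectors.
The dimension reduction method is closely tied to the characteristics of the matrix operation, and each operation (for example +, *, or summation) may need a different method. The code compiler in the embodiments can examine the operation type of the input matrix, the dimension information of the operands, the vectorization characteristics of the DSP platform, and the instruction templates of the DSP platform, and automatically search for the best vectorization method for reducing the matrix.
The instruction templates of the DSP platform are used to convert the matrix operations described in the CM source code into instructions from the instruction set that matches the DSP platform type. For example, a matrix multiplication statement in the CM source code can be converted according to the template into one or more DSP instructions that accumulate and sum the elements of the matrices in a particular way.
Because DSP platforms differ, the format and length of DSP instructions and the requirements on operation objects (operands) also differ. A vectorization scheme can be configured in advance for each DSP. The scheme specifies parameters such as the vectorization length, and reducing the matrix according to these parameters yields operation objects that meet the requirements of that DSP's instructions.
Taking a SIMD DSP as an example, step 2031 may include: obtaining a vectorization scheme according to the configuration information of the DSP, where the vectorization length parameter in the scheme is expressed as P×Q, P represents the number of operation objects processed by one SIMD DSP instruction, and P and Q are each integers greater than or equal to 1; and then reducing the matrix, according to the vectorization length, to K vectors, where each vector includes P scalars, each scalar is Q bits long, and K is an integer greater than or equal to 1.
After dimension reduction, a complete matrix operation has been split into several sub-operations, each of which can be implemented by one specific DSP instruction.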
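One simple way to picture the reduction to K vectors of P scalars is to cut each row of a two-dimensional matrix into chunks of P elements. The sketch below assumes this row-wise scheme and zero padding of a short final chunk; both are illustrative choices, since the text notes that the appropriate reduction depends on the operation being performed. The scalar width Q does not appear because plain Python numbers stand in for the scalars.

```python
import math

def reduce_matrix(matrix, lanes, pad=0):
    """Split each row of a 2-D matrix into vectors of `lanes` scalars (P)."""
    vectors = []
    for row in matrix:
        for start in range(0, len(row), lanes):
            chunk = list(row[start:start + lanes])
            chunk += [pad] * (lanes - len(chunk))   # pad a short final chunk
            vectors.append(chunk)
    return vectors

def vector_count(rows, cols, lanes):
    """K = rows * ceil(cols / P) for the row-wise scheme above."""
    return rows * math.ceil(cols / lanes)

# A 24 x 24 matrix with P = 8 gives 3 vectors per row and K = 72 in total.
matrix = [[r * 24 + c for c in range(24)] for r in range(24)]
vectors = reduce_matrix(matrix, lanes=8)
```

Each of the resulting vectors can then serve as the operation object of one SIMD instruction.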
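Step 2032: Generate, according to the matrix operation description information in the source code and the configuration information of the DSP, DSP instructions that process the operation objects.
In a specific implementation, the instruction set of the DSP may be obtained from its configuration information, and DSP instructions that process the operation objects are generated from the description information of the operation on the matrix and that instruction set. The opcodes and instruction formats of the generated instructions match the instructions in the DSP's instruction set. For example, to generate a DSP instruction for addition, the instruction that implements addition is taken from the DSP instruction set, and the content of its data field is determined from the data objects obtained by dimension reduction in step 2031; the rest of the instruction (such as the opcode) stays unchanged. The result is a DSP instruction that adds the data objects from step 2031.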
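The addition example can be sketched as template filling: the opcode comes from the instruction set unchanged, and only the operand fields are filled in from the reduced vectors. The mnemonic `VADD_8X16`, the template syntax, and the register names below are hypothetical and serve only to illustrate the idea.

```python
# Hypothetical instruction set: operation name -> template with operand fields.
INSTRUCTION_SET = {
    "add": "VADD_8X16 {dst}, {a}, {b}",
}

def emit(operation, instruction_set, **operands):
    """Fill the operand fields of the template; the opcode is left untouched."""
    template = instruction_set[operation]
    return template.format(**operands)

def emit_elementwise(operation, instruction_set, vector_count):
    """Emit one instruction per reduced vector (register names are illustrative)."""
    return [
        emit(operation, instruction_set, dst=f"vd{k}", a=f"va{k}", b=f"vb{k}")
        for k in range(vector_count)
    ]
```

Swapping in a different instruction set dictionary changes the emitted text without touching the code that decides which operands go into each instruction.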
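On the basis of the foregoing embodiments, if specific optimizations are needed during compilation, custom optimization rules can be added before compilation. The code compiler generates and optimizes the object code according to these rules.
Specifically, the method may further include a step of obtaining an optimization rule; in step 203, the object code that runs on the DSP is then generated according to the source code, the configuration information of the DSP, and the optimization rule.
The optimization rules may include an efficiency-first rule, a performance-first rule, and a space-first rule. The generated DSP instructions differ with the rule. For example, under the performance-first rule the generated instructions perform better but may occupy more storage. An optimization rule can also specify the loop unrolling count, whether loops are merged, and so on.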
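The trade-off between performance and space can be illustrated with loop unrolling. The sketch below assumes a mapping from rule name to unrolling factor; the text names the rules but does not say what each one does, so the factors here are purely illustrative.

```python
# Assumed mapping from rule name to loop-unrolling factor (illustrative only).
UNROLL_FACTOR = {"space_first": 1, "performance_first": 4}

def generate_loop(body, trip_count, rule):
    """Return (loop body instructions, iteration count) after unrolling.

    `body` is the list of instructions for one original iteration.
    """
    factor = UNROLL_FACTOR[rule]
    if trip_count % factor != 0:
        factor = 1                      # keep the sketch simple: no remainder loop
    return body * factor, trip_count // factor

small_code, small_iters = generate_loop(["HFMULA_R_8X16"], 12, "space_first")
fast_code, fast_iters = generate_loop(["HFMULA_R_8X16"], 12, "performance_first")
```

Both variants execute the same twelve multiply-accumulate instructions; the performance-first variant has a body four times as long but takes a quarter of the loop branches.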
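As the above description shows, in the foregoing embodiments, source code that describes a matrix operation and the matrix participating in it is obtained first, together with configuration information of the DSP that is to run the object code. Object code that runs on the DSP is then generated from the source code and the configuration information, and the object code contains DSP instructions that perform the matrix operation on the matrix. On the one hand, the source code is user-oriented high-level programming language program code and is therefore independent of, that is, does not depend on, any particular DSP. On the other hand, when the source code is compiled into object code, the configuration information of the DSP that is to run the object code serves as the basis for compilation. When the same matrix operation has to be implemented on different types of DSP, object code suited to each DSP can therefore be obtained from a single set of source code by relying on the configuration information of that DSP. Compared with the prior art, this gives a general-purpose code compilation process across DSP platforms and improves the generality and efficiency of code compilation.
In the embodiments of the present invention, CM describes only the algorithm and does not involve low-level DSP code, so it is a general-purpose language. The user writes the CM source code once. To switch between DSP platforms, the user only needs to give the code compiler different DSP platform configuration information (such as the model), and the CM source code is converted into DSP instructions for the corresponding platform, as shown in Figure 4. Porting to a different DSP platform only requires replacing the DSP configuration information given to the code compiler. Compared with the prior art, this achieves cross-platform operation automatically and lowers the cost of cross-platform porting.
The embodiments of the present invention apply to many scenarios in which signal processing algorithms are implemented on DSPs. Some typical scenarios are:
(1) Digital signal processing algorithms in wireless communication systems
In wireless communication systems such as GSM (Global System for Mobile Communication), UMTS (Universal Mobile Telecommunications System), LTE (Long Term Evolution), or 5G (fifth-generation mobile communication) systems, the embodiments can be used for DSP implementations in base stations, user equipment, or relay devices such as repeaters. In addition, wireless communication systems are currently evolving quickly, and CM allows new techniques and algorithms to be implemented on DSPs quickly.
(2) Other fields and scenarios that may involve signal processing
For example, image processing algorithms, radar processing algorithms, transmitters/receivers, and other fields that may involve DSP implementations can all use the embodiments of the present invention.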
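To make the embodiments of the present invention easier to understand, a specific example of the implementation process is given below.
In this example, given a matrix $B=(b_{ij})_{s\times n}$ and a matrix $C=(c_{jk})_{n\times m}$, the product of B and C is to be computed, with result $A=(a_{ik})_{s\times m}$, expressed as:
A=BC
where s=24, n=12, and m=24. That is, B is a 24×12 matrix and C is a 12×24 matrix.
The source code is first written in CM. Operands B and C are defined as two-dimensional matrices, written B[s][n] and C[n][m], and the result of the matrix operation is written A[s][m], as follows:
matrix half A[24][24];
matrix half B[24][12];
matrix half C[12][24];
A=MULXYCR(B,C);// matrix multiplication
Here the prefix matrix is a CM identifier indicating that the operand is of matrix type, half indicates that the matrix elements are of half-precision type, and MULXYCR is the matrix multiplication operation defined by CM.
The source code is fed to the code compiler, which converts it into DSP instructions. The compiler first reduces the matrices in dimension, dividing the multiplication of two-dimensional matrices into multiply-accumulate operations on one-dimensional vectors. For example, the reduction can loop over the rows and columns (see the for loops in the code below). After reduction, matrix A has 24 rows of 3 vectors each; each vector contains 8 scalars, and each scalar is 16 bits long. The compiler then vectorizes and decomposes MULXYCR, and finally adapts the result to the DSP platform to produce the DSP instruction code.
The generated DSP instruction code may look like this: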
Figure PCTCN2015088637-appb-000001
Figure PCTCN2015088637-appb-000002
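In the DSP instruction code above, HFMULA_R_8X16 is the vectorized multiply-accumulate operation, and LV16 and SV16 are the operand load and result store operations, respectively. The formats of these instructions differ between DSP platforms. The code compiler can optimize automatically, using techniques such as optimization pass analysis, load/store elimination, loop-invariant code motion, loop unrolling/merging, and branch elimination. No manual work is involved, which frees up engineering effort and shortens the development cycle.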
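The generated listing survives here only as figure references, so the following Python sketch simulates one plausible decomposition of A=BC into 8-lane multiply-accumulate steps and checks it against a direct matrix product. The helper functions only stand in for LV16, HFMULA_R_8X16, and SV16; their real semantics are not given in the text, and the scheme of multiplying one scalar of B by one row vector of C is an assumption. Integers replace half-precision values so that the comparison is exact.

```python
LANES = 8  # scalars per vector, as in the example (8 x 16 bits)

def load_vector(row, start):
    """Stand-in for LV16: fetch LANES consecutive scalars."""
    return row[start:start + LANES]

def mul_acc(acc, scalar, vec):
    """Stand-in for HFMULA_R_8X16: acc[l] += scalar * vec[l] on every lane."""
    return [a + scalar * v for a, v in zip(acc, vec)]

def matmul_vectorized(B, C):
    s, n, m = len(B), len(C), len(C[0])
    assert m % LANES == 0, "sketch assumes the row length is a multiple of LANES"
    A = [[0] * m for _ in range(s)]
    for i in range(s):                    # rows of A
        for v in range(0, m, LANES):      # the m / LANES vectors of one row
            acc = [0] * LANES
            for j in range(n):            # accumulate B[i][j] * C[j][v:v+LANES]
                acc = mul_acc(acc, B[i][j], load_vector(C[j], v))
            A[i][v:v + LANES] = acc       # stand-in for SV16
    return A

def matmul_reference(B, C):
    """Direct definition: a_ik = sum over j of b_ij * c_jk."""
    return [[sum(B[i][j] * C[j][k] for j in range(len(C)))
             for k in range(len(C[0]))] for i in range(len(B))]

B = [[(i + j) % 5 for j in range(12)] for i in range(24)]   # 24 x 12
C = [[(j * k) % 7 for k in range(24)] for j in range(12)]   # 12 x 24
A = matmul_vectorized(B, C)
```

With s=24, n=12, and m=24, each row of A is produced as 3 vectors of 8 scalars, matching the reduction described in the example.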
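Based on the same technical concept, an embodiment of the present invention further provides a code compiler.
Figure 5 is a schematic structural diagram of the code compiler according to an embodiment of the present invention. The code compiler may include a first obtaining module 501, a second obtaining module 502, and a code compilation module 503, where:
the first obtaining module 501 is configured to obtain source code that describes a matrix operation and the matrix participating in the matrix operation, where the source code is user-oriented high-level programming language program code;
the second obtaining module 502 is configured to obtain configuration information of the DSP that is to run the object code; and
the code compilation module 503 is configured to generate, according to the source code and the configuration information of the DSP, object code that runs on the DSP, where the object code contains DSP instructions, and the DSP instructions are used to perform the matrix operation on the matrix.
In a possible implementation, the code compilation module 503 is preferably configured to: perform dimension reduction on the matrix according to the configuration information of the DSP, to obtain the operation objects of the DSP instructions; and generate, according to the matrix operation description information in the source code and the configuration information of the DSP, DSP instructions that process the operation objects.
In a possible implementation, the code compilation module 503 is preferably configured to: obtain a vectorization scheme according to the configuration information of the DSP, where the vectorization length parameter in the vectorization scheme is expressed as P×Q, P represents the number of operation objects processed by one SIMD DSP instruction, and P and Q are each integers greater than or equal to 1; and reduce the matrix, according to the vectorization length, to K vectors, where each vector includes P scalars, each scalar is Q bits long, and K is an integer greater than or equal to 1.
In a possible implementation, the code compilation module 503 is preferably configured to: obtain the instruction set of the DSP according to the configuration information of the DSP; and generate, according to the description information of the operation on the matrix and the instruction set of the DSP, DSP instructions that process the operation objects, where the opcodes and instruction formats in the DSP instructions match the instructions in the instruction set of the DSP.
In a possible implementation, the code compilation module 503 may be configured to: obtain an optimization rule, and generate, according to the source code, the configuration information of the DSP, and the optimization rule, object code that runs on the DSP.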
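Based on the same technical concept, an embodiment of the present invention further provides a computer device.
Figure 6 is a schematic structural diagram of the computer device according to an embodiment of the present invention. The computer device may include a processor 601, a memory 602, an input/output apparatus 603, and a bus architecture 604.
The processor 601 is responsible for managing the bus architecture and for general processing, and the memory 602 can store data used by the processor 601 when performing operations. The input/output apparatus 603 receives and outputs data under the control of the processor 601, and includes but is not limited to a display, a mouse, and a keyboard.
The bus architecture may include any number of interconnected buses and bridges, which link together the various circuits of one or more processors, represented by the processor 601, and of memory, represented by the memory 602. The bus architecture may also link various other circuits such as peripherals, voltage regulators, and power management circuits; these are well known in the art and are not described further here. The bus architecture provides an interface.
The code compilation process disclosed in the embodiments of the present invention may be applied to, or implemented by, the processor 601. During implementation, each step of the process may be completed by an integrated logic circuit of hardware in the processor 601 or by instructions in the form of software. The processor 601 may be a general-purpose processor, a digital signal processor, an application-specific integrated circuit, a field-programmable gate array or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and can implement or perform the methods, steps, and logic block diagrams disclosed in the embodiments. A general-purpose processor may be a microprocessor or any conventional processor. The steps of the method disclosed in the embodiments may be carried out directly by a hardware processor, or by a combination of hardware and software modules in the processor. A software module may reside in a storage medium that is mature in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or registers. The storage medium is located in the memory 602; the processor 601 reads the information in the memory 602 and completes the steps of the code compilation process in combination with its hardware.
Specifically, the processor 601 is configured to read a program in the memory 602 and perform the following process:
obtaining source code that describes a matrix operation and the matrix participating in the matrix operation, where the source code is user-oriented high-level programming language program code;
obtaining configuration information of the DSP that is to run the object code; and
generating, according to the source code and the configuration information of the DSP, object code that runs on the DSP, where the object code contains DSP instructions, and the DSP instructions are used to perform the matrix operation on the matrix.
Preferably, the processor 601 may be configured to: perform dimension reduction on the matrix according to the configuration information of the DSP, to obtain the operation objects of the DSP instructions; and generate, according to the matrix operation description information in the source code and the configuration information of the DSP, DSP instructions that process the operation objects.
Preferably, the processor 601 may be configured to: obtain a vectorization scheme according to the configuration information of the DSP, where the vectorization length parameter in the vectorization scheme is expressed as P×Q, P represents the number of operation objects processed by one SIMD DSP instruction, and P and Q are each integers greater than or equal to 1; and reduce the matrix, according to the vectorization length, to K vectors, where each vector includes P scalars, each scalar is Q bits long, and K is an integer greater than or equal to 1.
Preferably, the processor 601 may be configured to: obtain the instruction set of the DSP according to the configuration information of the DSP; and generate, according to the description information of the operation on the matrix and the instruction set of the DSP, DSP instructions that process the operation objects, where the opcodes and instruction formats in the DSP instructions match the instructions in the instruction set of the DSP.
Preferably, the processor 601 may be further configured to: obtain an optimization rule, and generate, according to the source code, the configuration information of the DSP, and the optimization rule, object code that runs on the DSP.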
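A person skilled in the art should understand that the embodiments of the present invention may be provided as a method, a system, or a computer program product. The present invention may therefore take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) that contain computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of the method, the device (system), and the computer program product according to the embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device, so that the instructions executed by the processor of the computer or other programmable data processing device implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to operate in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture that includes an instruction apparatus, and the instruction apparatus implements the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing, and the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
Although preferred embodiments of the present invention have been described, a person skilled in the art can make further changes and modifications to these embodiments once the basic inventive concept is known. The appended claims are therefore intended to be construed as covering the preferred embodiments and all changes and modifications that fall within the scope of the present invention.
Obviously, a person skilled in the art can make various changes and variations to the present invention without departing from its spirit and scope. If these modifications and variations fall within the scope of the claims of the present invention and their equivalent technologies, the present invention is intended to cover them as well.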

Claims (10)

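  1. A code compilation method, characterized by comprising:
    obtaining source code that describes a matrix operation and the matrix participating in the matrix operation, wherein the source code is user-oriented high-level programming language program code;
    obtaining configuration information of the digital signal processor that is to run the object code; and
    generating, according to the source code and the configuration information of the digital signal processor, object code that runs on the digital signal processor, wherein the object code contains digital signal processor instructions, and the digital signal processor instructions are used to perform the matrix operation on the matrix.
  2. The method according to claim 1, characterized in that the generating, according to the source code and the configuration information of the digital signal processor, of object code that runs on the digital signal processor comprises:
    performing dimension reduction on the matrix according to the configuration information of the digital signal processor, to obtain the operation objects of the digital signal processor instructions; and
    generating, according to the matrix operation description information in the source code and the configuration information of the digital signal processor, digital signal processor instructions that process the operation objects.
  3. The method according to claim 2, characterized in that the performing of dimension reduction on the matrix according to the configuration information of the digital signal processor comprises:
    obtaining a vectorization scheme according to the configuration information of the digital signal processor, wherein the vectorization length parameter in the vectorization scheme is expressed as P×Q, P represents the number of operation objects processed by one single-instruction-multiple-data digital signal processor instruction, and P and Q are each integers greater than or equal to 1; and
    reducing the matrix, according to the vectorization length, to K vectors, wherein each vector comprises P scalars, each scalar is Q bits long, and K is an integer greater than or equal to 1.
  4. The method according to claim 2, characterized in that the generating, according to the matrix operation description information in the source code and the configuration information of the digital signal processor, of digital signal processor instructions that process the operation objects comprises:
    obtaining the instruction set of the digital signal processor according to the configuration information of the digital signal processor; and
    generating, according to the description information of the operation on the matrix and the instruction set of the digital signal processor, digital signal processor instructions that process the operation objects, wherein the opcodes and instruction formats in the digital signal processor instructions match the instructions in the instruction set of the digital signal processor.
  5. The method according to any one of claims 1 to 4, characterized by further comprising: obtaining an optimization rule; wherein
    the generating, according to the source code and the configuration information of the digital signal processor, of object code that runs on the digital signal processor comprises:
    generating, according to the source code, the configuration information of the digital signal processor, and the optimization rule, object code that runs on the digital signal processor.
  6. A code compiler, characterized by comprising:
    a first obtaining module, configured to obtain source code that describes a matrix operation and the matrix participating in the matrix operation, wherein the source code is user-oriented high-level programming language program code;
    a second obtaining module, configured to obtain configuration information of the digital signal processor that is to run the object code; and
    a code compilation module, configured to generate, according to the source code and the configuration information of the digital signal processor, object code that runs on the digital signal processor, wherein the object code contains digital signal processor instructions, and the digital signal processor instructions are used to perform the matrix operation on the matrix.
  7. The code compiler according to claim 6, characterized in that the code compilation module is specifically configured to:
    perform dimension reduction on the matrix according to the configuration information of the digital signal processor, to obtain the operation objects of the digital signal processor instructions; and
    generate, according to the matrix operation description information in the source code and the configuration information of the digital signal processor, digital signal processor instructions that process the operation objects.
  8. The code compiler according to claim 7, characterized in that the code compilation module is specifically configured to:
    obtain a vectorization scheme according to the configuration information of the digital signal processor, wherein the vectorization length parameter in the vectorization scheme is expressed as P×Q, P represents the number of operation objects processed by one single-instruction-multiple-data digital signal processor instruction, and P and Q are each integers greater than or equal to 1; and
    reduce the matrix, according to the vectorization length, to K vectors, wherein each vector comprises P scalars, each scalar is Q bits long, and K is an integer greater than or equal to 1.
  9. The code compiler according to claim 7, characterized in that the code compilation module is specifically configured to:
    obtain the instruction set of the digital signal processor according to the configuration information of the digital signal processor; and
    generate, according to the description information of the operation on the matrix and the instruction set of the digital signal processor, digital signal processor instructions that process the operation objects, wherein the opcodes and instruction formats in the digital signal processor instructions match the instructions in the instruction set of the digital signal processor.
  10. The code compiler according to any one of claims 6 to 9, characterized in that the code compilation module is specifically configured to:
    obtain an optimization rule, and generate, according to the source code, the configuration information of the digital signal processor, and the optimization rule, object code that runs on the digital signal processor.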
PCT/CN2015/088637 2015-08-31 2015-08-31 Code compilation method and code compiler WO2017035748A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2015/088637 WO2017035748A1 (zh) 2015-08-31 2015-08-31 Code compilation method and code compiler
CN201580081768.9A CN107851002A (zh) 2015-08-31 2015-08-31 Code compilation method and code compiler

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2015/088637 WO2017035748A1 (zh) 2015-08-31 2015-08-31 Code compilation method and code compiler

Publications (1)

Publication Number Publication Date
WO2017035748A1 true WO2017035748A1 (zh) 2017-03-09

Family

ID=58186588

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/088637 WO2017035748A1 (zh) 2015-08-31 2015-08-31 Code compilation method and code compiler

Country Status (2)

Country Link
CN (1) CN107851002A (zh)
WO (1) WO2017035748A1 (zh)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110334031A (zh) * 2019-07-16 2019-10-15 腾讯科技(深圳)有限公司 内存分配代码检测方法、装置、计算机设备及存储介质
CN112306502A (zh) * 2019-07-31 2021-02-02 上海华为技术有限公司 一种代码生成方法及装置
CN113986245A (zh) * 2021-10-28 2022-01-28 平安银行股份有限公司 基于halo平台的目标代码生成方法、装置、设备及介质

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111694557B (zh) * 2019-03-15 2024-04-16 上海商汤智能科技有限公司 数据处理方法及装置、图像处理方法及装置、电子设备
CN110825380A (zh) * 2019-09-30 2020-02-21 上海寒武纪信息科技有限公司 核函数的生成方法、目标代码的生成方法和组合处理装置
CN111290759B (zh) * 2020-01-19 2023-09-19 龙芯中科技术股份有限公司 指令生成方法、装置及设备
CN113391813B (zh) * 2020-12-04 2024-08-13 腾讯科技(深圳)有限公司 程序编译方法和装置、存储介质及电子设备
CN118605850A (zh) * 2024-08-07 2024-09-06 之江实验室 一种面向Triton编译器流水线的优化系统及优化方法

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101587445A (zh) * 2009-06-19 2009-11-25 国网电力科学研究院 一种plc编译执行方法
US20120185820A1 (en) * 2011-01-19 2012-07-19 Suresh Kadiyala Tool generator
CN103116513A (zh) * 2012-07-13 2013-05-22 北京时代民芯科技有限公司 一种异构多核处理器编译器
CN103744684A (zh) * 2014-01-24 2014-04-23 中国科学院自动化研究所 一种异构软硬件协同开发的方法及系统
CN104572234A (zh) * 2014-12-29 2015-04-29 杭州华为数字技术有限公司 生成用于并行计算架构的源代码的方法及源到源编译器

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6286134B1 (en) * 1999-04-23 2001-09-04 Sun Microsystems, Inc. Instruction selection in a multi-platform environment
US8418155B2 (en) * 2009-02-10 2013-04-09 International Business Machines Corporation Generating parallel SIMD code for an arbitrary target architecture
CN102667717A (zh) * 2009-12-21 2012-09-12 诺基亚公司 用于编译的方法、装置和系统
CN103631632B (zh) * 2013-11-29 2017-08-04 华为技术有限公司 移植方法及源到源编译器

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110334031A (zh) * 2019-07-16 2019-10-15 腾讯科技(深圳)有限公司 内存分配代码检测方法、装置、计算机设备及存储介质
CN110334031B (zh) * 2019-07-16 2023-11-03 腾讯科技(深圳)有限公司 内存分配代码检测方法、装置、计算机设备及存储介质
CN112306502A (zh) * 2019-07-31 2021-02-02 上海华为技术有限公司 一种代码生成方法及装置
CN113986245A (zh) * 2021-10-28 2022-01-28 平安银行股份有限公司 基于halo平台的目标代码生成方法、装置、设备及介质

Also Published As

Publication number Publication date
CN107851002A (zh) 2018-03-27

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15902555

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15902555

Country of ref document: EP

Kind code of ref document: A1