CN105242907A - NEON vectorization conversion method for ARM (Advanced RISC Machine) binary code - Google Patents


Info

Publication number
CN105242907A
CN105242907A
Authority
CN
China
Prior art keywords
instruction
register
arm
access
neon
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510574950.XA
Other languages
Chinese (zh)
Other versions
CN105242907B (en)
Inventor
梅魁志
温哲西
李博良
张少愚
刘辉
黄雄
高榕
付帅
伍健
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian Jiaotong University
Original Assignee
Xian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian Jiaotong University
Priority to CN201510574950.XA
Publication of CN105242907A
Application granted
Publication of CN105242907B
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Abstract

The present invention discloses a NEON vectorization conversion method for ARM (Advanced RISC Machine) binary code. The method comprises the following steps: step 1, disassembly; step 2, flow-graph generation; step 3, loop detection; step 4, memory-access analysis; step 5, instruction translation; and step 6, assembly-instruction output. In the method, after the ARM binary code is disassembled, a control-flow graph is built and reaching-definition analysis is performed; the basic block containing the optimization target is located and the memory-access patterns in that basic block are analyzed. By scheduling the idle on-chip extended and core registers, part of the repeated memory-access results are kept in idle on-chip registers, so that the time overhead of the program's memory accesses is reduced by reading fast registers instead of memory, thereby achieving the goal of acceleration.

Description

A NEON vectorization conversion method for ARM binary code
[Technical field]
The invention belongs to the technical field of automatic SIMD parallelization for embedded systems, and in particular relates to a NEON vectorization conversion method for ARM binary code, suitable for accelerating low-level functions in related fields such as image processing and matrix computation.
[background technology]
Owing to its high performance and low power consumption, the ARM processor has become the most popular embedded application processor. As users place ever stricter demands on ARM program execution time, ARM programs that contain heavy data computation need to be accelerated. When the source of an ARM program is unavailable, the SIMD unit can still be used to accelerate the program at the binary-code level.
Among the algorithms that currently exploit SIMD instructions, the SLP algorithm is relatively mature, and according to published test results it achieves good speedups; the SLP algorithm was therefore tried first for accelerating ARM binary instructions.
However, tests after implementing SLP showed that programs optimized by the SLP algorithm were not accelerated, and some optimized results even took longer than the original programs, for the following three reasons:
First, the original SLP algorithm does not reduce the number of loop iterations. Although SLP locally merges several computation instructions into one NEON instruction, the number of executed instructions it saves is negligible compared with reducing the instruction count at the loop level.
Second, and most importantly, the dominant time cost in an ARM program comes from memory-access instructions (STR/LDR; special ARM instructions are not considered here), and the SLP algorithm neither removes a large number of access instructions nor substantially optimizes the access pattern. In an ARM program, ordinary computation instructions such as ADD, SUB, LSL and MUL actually take far less time than a single memory access. Because SLP works mainly on ARM instructions and turns only a small number of computation statements into NEON instructions, it does remove some computation instructions, but it adds data-moving instructions between the NEON q registers and the ARM r registers; in the end the total instruction count is not reduced and in general may even grow.
Finally, the SLP-optimized result contains a large number of value exchanges between q registers and r registers, which lowers pipeline throughput and increases the program's running time.
[summary of the invention]
The object of the present invention is to provide a NEON vectorization conversion method for ARM binary code that uses the SIMD architecture on ARM and the corresponding NEON instructions to parallelize and accelerate a program automatically. Because of the limitations of the NEON instruction set and of the SIMD unit's data requirements, NEON instructions place strict constraints on how memory may be read, so the binary code must first be analyzed to decide whether it can be accelerated with NEON instructions. To make this feasible, the present invention selects the most deeply nested loop as the optimization target: the innermost nested loop is usually simple in structure, dominates the running time, and its loop structure makes the memory-access pattern easy to analyze. The invention therefore optimizes nested loop instructions with NEON instructions, thereby accelerating the program.
To achieve these goals, the present invention adopts following technical scheme:
The NEON vectorization conversion method for ARM binary code comprises the following steps:
Step 1, disassembly: disassemble the binary code file of the ARM program to obtain the original ARM instruction corresponding to each binary word;
Step 2, flow-graph generation: build the control-flow graph of basic blocks on top of the disassembled ARM instructions, and perform reaching-definition analysis on each basic block of that graph;
Step 3, loop detection: find the most deeply nested loop by loop detection, and mark the basic blocks that make up its innermost loop;
Step 4, memory-access analysis: find the ARM instructions in the optimization target that must be translated as a group and mark them; marked instructions do not take part in memory optimization. Likewise find and mark the instructions that implement the loop itself, so that they do not take part in memory optimization either.
On top of these marks, find all access instructions and merge those that access the same memory; after merging, mark the access type of each access instruction according to the memory-analysis scheme;
Step 5, instruction translation: translate the loop instructions, the grouped instructions, the access instructions with their address-offset computation instructions, and the hot computation instructions; instructions that do not change with the loop are left untranslated;
Step 6, assembly-instruction output: collect and output the translation results of step 5.
Further, step 3 specifically comprises the following steps: abstract each basic block of the control-flow graph as a node and each jump between basic blocks as a directed edge, so that the control-flow graph becomes a directed graph; find all cycles in the directed graph and check whether each cycle found really forms a loop rather than a mere chain of jump instructions; at the same time, merge loops that share the same entry and exit basic blocks; after merging, determine the nesting relations between loops, obtain the loop with the greatest nesting depth, i.e. the innermost loop, and mark which part of the loop each of its basic blocks constitutes.
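The loop-detection step above can be sketched in Python. This is an illustrative reconstruction, not the patent's implementation: basic blocks and jumps are abstracted into a directed graph, back edges found by depth-first search yield the loops, and the loop contained in the most other loops is taken as the innermost optimization target. The block names and the toy CFG below are invented examples.

```python
def find_loops(cfg):
    """Return loops as (header, body_blocks) via back-edge detection."""
    loops = []
    visited, stack = set(), []

    def dfs(node):
        visited.add(node)
        stack.append(node)
        for succ in cfg.get(node, []):
            if succ in stack:                    # back edge -> a loop
                loops.append((succ, set(stack[stack.index(succ):])))
            elif succ not in visited:
                dfs(succ)
        stack.pop()

    dfs(next(iter(cfg)))
    return loops

def innermost(loops):
    """The innermost loop is nested inside the most other loop bodies."""
    def depth(l):
        return sum(1 for _, other in loops if l[1] < other)
    return max(loops, key=depth)

# Toy CFG: B2 -> B3 -> B2 is the inner loop, B1 .. B4 -> B1 the outer loop
cfg = {"B1": ["B2"], "B2": ["B3"], "B3": ["B2", "B4"],
       "B4": ["B1", "B5"], "B5": []}
loops = find_loops(cfg)
header, body = innermost(loops)
```

On this example the method finds two loops and selects the B2/B3 cycle as the innermost one, whose basic blocks would then be marked for optimization.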
Further, step 4 specifically comprises the following steps:
Before the memory analysis, first mark the assembly instructions that implement the loop in the target ARM program; then, according to the recognition rules for grouped instructions, find and mark the instructions that must be translated as a group.
After this marking, exclude the loop-implementing instructions and the group-translated instructions, and collect all access instructions in the target program. Using the information carried by each access instruction and the semantic analysis of the address-offset computation instructions that precede it, merge the access instructions that address the same memory. Then analyze the access pattern of each memory address. If the stepping of an access pattern satisfies at least one of the following three conditions:
(1) the pattern steps by a constant stride (forward or negative);
(2) it accesses memory at consecutively changing addresses;
(3) 2 to 4 instructions together read, or together write, contiguous memory;
then the original ARM access instructions are replaced by NEON access instructions. At the same time, special access patterns are recognized and their translation schemes marked.
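The stride check above can be illustrated with a small Python sketch (not the patent's implementation): an address trace recorded per loop iteration is a vectorization candidate when it steps by a constant stride (forward or negative) or is fully contiguous; anything irregular is rejected. The function name and trace representation are invented for illustration.

```python
def classify(addresses, elem_size):
    """Classify a per-iteration address trace for NEON vectorization."""
    strides = {b - a for a, b in zip(addresses, addresses[1:])}
    if len(strides) != 1:
        return "non-vectorizable"        # irregular stepping
    step = strides.pop()
    if abs(step) == elem_size:
        return "contiguous"              # plain VLD1/VST1 candidate
    if step != 0 and step % elem_size == 0:
        return "strided"                 # VLDn/VSTn or lane-load candidate
    return "non-vectorizable"

assert classify([0, 4, 8, 12], 4) == "contiguous"
assert classify([0, 8, 16, 24], 4) == "strided"        # equidistant stepping
assert classify([12, 8, 4, 0], 4) == "contiguous"      # negative stepping
assert classify([0, 4, 4, 12], 4) == "non-vectorizable"
```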
Further, after the memory analysis, the three address classes that advance with the loop (linearly contiguous stepping, constant-stride stepping, and odd/even alternating access) are designated dynamic memory addresses, while addresses that do not change with the loop are designated static addresses.
According to the dynamic/static distinction, instructions are divided into the following classes:
Dynamic instructions:
(1) instructions that access a dynamic address;
(2) instructions whose data all come from dynamic addresses.
Static instructions:
(1) instructions that access static memory;
(2) instructions whose data are all immediates or come from static addresses.
Instructions that can become dynamic:
(1) the result of the register they define will be stored to a static address;
(2) the instruction's values come only from static addresses or from undefined registers.
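A minimal sketch of the dynamic/static split above, under stated assumptions: instruction sources are modeled as register names or immediate integers, and the dynamic/static address sets are given. The three-way classification mirrors the classes listed; the data layout is invented for illustration.

```python
def classify_insn(srcs, dynamic_regs, static_regs):
    """Classify an instruction by where its source values come from.

    srcs: list of register names (str) or immediates (int).
    """
    if srcs and all(s in dynamic_regs for s in srcs):
        return "dynamic"                 # all data come from dynamic addresses
    if all(isinstance(s, int) or s in static_regs for s in srcs):
        return "static"                  # immediates or static addresses only
    return "convertible"                 # "can become dynamic"

assert classify_insn(["r1", "r2"], {"r1", "r2"}, set()) == "dynamic"
assert classify_insn(["r3", 5], set(), {"r3"}) == "static"
assert classify_insn(["r1", "r4"], {"r1"}, set()) == "convertible"
```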
Further, in step 5, when translating between ARM and NEON instructions, use the fact that the ARM core registers and the extended registers are both numbered 0 to 15: with the compilation result of the original ARM program, find the NEON instructions whose structure and function both correspond exactly to the ARM instruction and translate them directly. When translation must map onto the high and low double-word halves of an extended register, core registers 0 to 15 are each mapped to the low double-word half of the corresponding extended register for translation.
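The one-to-one register-number mapping above can be written out directly. This sketch assumes the standard NEON register-file aliasing, where Qn overlaps D(2n) (low double word) and D(2n+1) (high double word); the helper names are invented.

```python
def r_to_q(n):
    """Core register r<n> maps to extended register q<n> (same number)."""
    return f"q{n}"

def q_low_d(n):
    """Low double-word half of q<n>: d(2n)."""
    return f"d{2 * n}"

def q_high_d(n):
    """High double-word half of q<n>: d(2n+1)."""
    return f"d{2 * n + 1}"

assert r_to_q(3) == "q3"
assert q_low_d(3) == "d6"
assert q_high_d(3) == "d7"
```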
Further, idle-resource scheduling comprises the scheduling of idle q registers and of idle r registers.
Idle q register scheduling:
First mark all extended registers as idle and available.
Pass 1: traverse all instructions in the current target basic block; if the current instruction simultaneously satisfies:
(1) it does not take part in offset-address computation;
(2) it does not take part in implementing the loop;
(3) it defines a register;
then the extended register with the same number as the register defined by the current instruction is marked busy.
Pass 2: traverse all instructions in the current target basic block; if the current instruction simultaneously satisfies:
(1) it is not an access instruction;
(2) it is not an instruction computing the address offset of some access instruction;
(3) it is not an instruction taking part in the loop;
then the extended registers with the same numbers as the registers used by the current instruction are marked busy.
Pass 3: traverse all instructions in the current target basic block; if the current instruction is left untranslated, the extended registers with the same numbers as the registers it uses are marked idle.
Pass 4: traverse all access instructions; if several access instructions have been combined into one vldn/vstn instruction, the extended registers with the same numbers as the registers those instructions define are idle.
Idle r register scheduling:
First mark all r registers as available.
Pass 1: traverse all access instructions in the basic block and mark the registers they use as busy.
Pass 2: traverse all instructions in the basic block; if the current instruction takes part in access-address offset computation or in implementing the loop, all registers it defines or uses are busy.
Finally mark registers r14 and r15 as busy.
The idle r and q registers marked by the scheduling method above are the registers that are idle throughout the whole basic block.
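The multi-pass scan above can be sketched as follows. This is a hedged illustration of the first three q-register passes only (the fourth, vldn/vstn grouping, is omitted); the instruction-record fields and the two-instruction example are invented, not the patent's data structures.

```python
def idle_q_registers(insns, num_q=16):
    """Return the set of q-register numbers idle across the basic block."""
    busy = set()
    # Pass 1: q registers named after registers *defined* by instructions
    # that are neither offset-address computations nor loop control.
    for i in insns:
        if not i["offset_calc"] and not i["loop_ctrl"]:
            busy.update(i["defs"])
    # Pass 2: q registers named after registers *used* by instructions that
    # are not memory accesses, offset computations, or loop control.
    for i in insns:
        if not i["mem_access"] and not i["offset_calc"] and not i["loop_ctrl"]:
            busy.update(i["uses"])
    # Pass 3: registers used by untranslated instructions become free again.
    for i in insns:
        if i["untranslated"]:
            busy.difference_update(i["uses"])
    return set(range(num_q)) - busy

# Example block: a load defining r0, then an ADD computing r2 = r0 + r1.
insns = [
    {"offset_calc": False, "loop_ctrl": False, "mem_access": True,
     "defs": {0}, "uses": {1}, "untranslated": False},
    {"offset_calc": False, "loop_ctrl": False, "mem_access": False,
     "defs": {2}, "uses": {0, 1}, "untranslated": False},
]
free = idle_q_registers(insns)
```

With this example, q0, q1 and q2 end up busy, leaving q3 through q15 as the on-chip cache candidates.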
Compared with the prior art, the present invention has the following beneficial effects:
1. Without access to the source program, the invention analyzes the ARM binary code and, using the SIMD unit and the corresponding SIMD instructions, optimizes and translates the ARM binary code into mixed ARM and NEON assembly. The optimization target is the binary code of the most deeply nested loop. During optimization, the register mapping between the ARM core registers and the extended registers is exploited, and the behavior of the ARM instructions is summarized, abstracted and modeled, so that ARM instructions are translated into mixed ARM/NEON instructions. At the same time, the extended and core registers of ARM are scheduled, so that idle on-chip extended registers are used to the maximum and the number of memory accesses is reduced; the original ARM program is thus accelerated while the NEON parallelization of part of the ARM instructions preserves the logical results of the original program.
2. By scheduling the on-chip registers, the results of repeated memory accesses are kept temporarily in idle registers, so that later accesses to the same memory read the buffered data in the extended registers instead of going to memory again. This reduces the memory accesses inside the loop and greatly reduces the time overhead. Under the mapping relation mentioned in claim 1, the occupation of the extended registers (q registers) is obtained by judging how the ARM instructions in the current basic block occupy the core registers (r registers) and mapping that occupation onto the extended registers.
The mapping relation uses the existing ARM compilation result: the registers used and defined by the hot computation instructions are mapped to the extended registers with the same numbers.
3. Some ARM instructions cannot be processed by SIMD parallelization because their access pattern involves conditional jumps, or because the address does not increase or decrease according to a regular law. Therefore, before ARM is translated into NEON instructions, the instruction behavior of the target program must be analyzed to determine whether the access patterns of the access instructions (including ldr/str, ldrb/strb, ldrh/strh) allow NEON translation. Memory accessed by several instructions is optimized according to the idle on-chip register scheduling principle proposed in claim 2; memory accessed by only one instruction is translated from ARM into NEON instructions according to the translation rules for special access patterns and the translation scheme proposed in claim 1.
[Description of the drawings]
The present invention is described in further detail below with reference to the accompanying drawings and embodiments.
Fig. 1 is a schematic diagram of on-chip resource scheduling;
Fig. 2 is a schematic diagram of direct translation from ARM instructions to NEON instructions;
Fig. 3 is a schematic diagram of NEON instructions that need data post-processing after translation;
Fig. 4 is a schematic diagram of general ARM instruction translation;
Fig. 5 is the module diagram of the NEON vectorization conversion of ARM binary code;
Fig. 6 is the overall flow chart of the NEON vectorization conversion of ARM binary code;
Fig. 7 is the flow chart of the disassembler;
Fig. 8 is the flow chart of control-flow-graph construction and reaching definitions;
Fig. 9 is the flow chart of nested-loop identification;
Fig. 10 is the flow chart of memory analysis and memory optimization;
Fig. 11 is the flow chart of the translation process;
Fig. 12 is the flow chart of the assembler.
[Embodiment]
The core of the NEON vectorization conversion method for ARM binary code of the present invention is:
1. In the target ARM program, some access instructions cannot be translated into NEON instructions because of their access pattern: non-constant-stride accesses, accesses whose stride exceeds the stepping limit of NEON access instructions, conditionally executed accesses (i.e. conditional access instructions), accesses whose address grows or shrinks non-linearly, or accesses whose address does not monotonically increase or decrease. Before translating ARM instructions into NEON instructions, the present invention first analyzes the loop-forming instructions and the instructions that must be translated as a group, and peels them off to lighten the subsequent access-pattern analysis. It then collects all access instructions in the target program, judges from the information of each access instruction and the behavior of its related instructions whether access instructions address the same memory, and merges the access instructions that access the same physical memory. Then, on the basis of these statistics, certain special access patterns are analyzed, identifying three common and special access-pattern classes: negative-stride access, odd/even alternating access, and interleaved (gapped) access.
2. ARM has several special-purpose registers that ordinary ARM instructions do not touch, such as the PC (r15), LR (r14) and SP (r13) registers. The core registers (r registers) are few, and r0–r3 are often reserved for passing values, so a survey of ARM instruction behavior readily shows that among the extended registers (q registers) that correspond via NEON to the core registers used by the ARM instructions there are idle ones. Using these idle registers as a cache for part of the computation or access results within the current basic block reduces the memory-access time and thus accelerates the program, as sketched in Fig. 1.
3. The register numbers used by ARM instructions and NEON instructions both range over 0–15, so registers can be mapped one-to-one during translation. Moreover, ARM and NEON have instructions of identical function: when the instruction formats correspond and the functions also correspond exactly, the instruction can be translated directly, as shown in Fig. 2. The format and function of NEON instructions are, however, heavily constrained, so one ARM instruction may be translated into several NEON instructions, as shown in Fig. 3. To keep data reads and computations correct, some instructions need additional NEON instructions for post-processing, as shown in Fig. 4. According to the different functions of ARM instructions in a program, there are the following six classes of translation rules:
(1) Loop-instruction translation: only the immediate of the stepping part is changed to the degree of parallelism; all other instructions remain unchanged.
(2) Grouped-instruction translation: a grouped instruction means at least two instructions that must be translated together; at present the only grouped instructions handled are the comparison of a computed result with 255 in the YUV2RGB process. This comparison is generalized to the general pattern, i.e. comparing an arbitrary register with an arbitrary value.
In the general case, i.e. the parallelized comparison outside the YUV pattern, the optimization uses as few q registers as possible so as to keep the translation rule generally applicable, so the translation result is long. Measurements show that this translation is indeed very time-consuming; if many such statements occur in the source program, the speedup drops markedly or disappears. 1) Conditionally executed statements
CMP Rn, #imme1 and MOV<cond> Rm, #imme2 are translated as:
VMOV Q15, #imme1
VC<cond> Q11, Q15, Qn
VMOV Q15, #imme2
VAND Q15, Q15, Q11
VMVN Q11, Q11
VAND Qm, Qm, Q11
VORR Qm, Qm, Q15
When the vector length of the translated instruction is 16 and the immediate is 255, the maximum value of 8 bits (as in the YUV2RGB example), the length-narrowing trick can be used and this case is translated as:
CMP Rn, #255 and MOVGE Rn, #255 are translated as: VQMOVUN.vector_size D(n*2), Qn
2) Conditional store
CMP Rn, Rm/imme, STR (greater-or-equal) Rn1, address and STR (less) Rn2, address are translated as:
ADD Rx, address
VMAX Q15, Q(n1), Q(n2)
VST1 Q15, [Rx]
(3) General memory translation rule model (part of the translation process is given here):
1) For instructions without an address-offset computation instruction group, the translation rules are:
1> STR/LDR Rd, [Rn, Rm]
is translated as: ADD Rn, Rn, Rm (the ADD Rn, Rn, Rm is hoisted ahead)
VST1/VLD1 Qd, [Rn]!
2> STR/LDR Rd, [Rn, imme]
is translated as: ADD Rx, Rn, imme (the ADD Rx, Rn, imme is hoisted ahead)
VST1/VLD1 Qd, [Rx]!
2) For instructions with an address-offset computation instruction group (named mem_calcu_list), the translation rules are:
1> STR/LDR Rd, [Rn, Rm]
is translated as: ADD Rn, Rn, Rm (mem_calcu_list and the ADD Rn, Rn, Rm are hoisted ahead)
VST1/VLD1 Qd, [Rn]!
2> STR/LDR Rd, [Rn, imme]
is translated as: ADD Rx, Rn, imme (mem_calcu_list and the ADD Rx, Rn, imme are hoisted ahead)
VST1/VLD1 Qd, [Rx]!
3) Constant-stride access:
STR/LDR Rd, [Rn], imme is translated as:
MOV Rx, Rn
MOV Ry, imme (the two MOVs above are hoisted ahead)
VLD1.vector_size {D(x*2)[0]}, [Rx], Ry
VLD1.vector_size {D(x*2)[1]}, [Rx], Ry
VLD1.vector_size {D(x*2)[2]}, [Rx], Ry
VLD1.vector_size {D(x*2)[3]}, [Rx], Ry
(4) Special memory translation model (part of the translation process is given here)
1) Odd/even alternating access similar to YUV 422 access
If, in the loop-header basic block, the BIC instruction is left untranslated and the address-offset computation instructions can be hoisted:
LDR/STR Rn, [Rm, #1] and LDR/STR Rd, [Rt, Rf] are translated as:
VLD1.u8 {D(d*2)}, [Rm]
VMOV.s16 Qn, Qd
VUZP.s16 Qd, Qn
VZIP.s16 D(d*2), D(d*2+1)
VZIP.s16 D(n*2), D(d*2+1)
VLD1.u8 {D(d*2)}, [Rm]!
VMOV.u8 {D(d*2+1)}, #0
VZIP.u8 {D(d*2)}, {D(d*2+1)}
2) Negative stepping
STR/LDR Rn, [Rm], #-imme is translated as:
MOV Rn, #-imme*NEON_size
ADD Rm, Rm, (NEON_size-1)*(#-imme)
(the two instructions above are hoisted ahead)
VLD4.s(vector_size) Qn, [Rm], Rn
VREV.s(vector_size) Qn, Qn
VEXT.s(vector_size) Qn, Qn, Qn, #1
3) Several data stored at once (vstn), where the computed vector length is twice the access stride length.
1> vst2
STR Rx, [Rn], STR Ry, [Rn, #stride] and ADD Rm, Rm, #stride*2 are translated as:
VQMOVUN.(vector_size/2) D(k*2), Qx
VQMOVUN.(vector_size/2) D(k*2+2), Qy
VST2.(vector_size/2) {D(k*2), D(k*2+2)}, [Rx]!
2> vst3
STR Rx, [Rm], STR Ry, [Rm, #stride], STR Rz, [Rm, #stride*2] and ADD Rm, Rm, #stride*3 are translated as:
VQMOVUN.(vector_size/2) D(k*2), Qx
VQMOVUN.(vector_size/2) D(k*2+2), Qy
VQMOVUN.(vector_size/2) D(k*2+4), Qz
VST3.(vector_size/2) {D(k*2), D(k*2+2), D(k*2+4)}, [Rx]!
3> vst4
STR Rw, [Rm], STR Rx, [Rm, #stride], STR Ry, [Rm, #stride*2], STR Rz, [Rm, #stride*3] and ADD Rm, Rm, #stride*4 are translated as:
VQMOVUN.(vector_size/2) D(k*2), Qw
VQMOVUN.(vector_size/2) D(k*2+2), Qx
VQMOVUN.(vector_size/2) D(k*2+4), Qy
VQMOVUN.(vector_size/2) D(k*2+6), Qz
VST4.(vector_size/2) {D(k*2), D(k*2+2), D(k*2+4), D(k*2+6)}, [Rx]!
4) Several data stored at once (vstn), where the computed vector length equals the access stride length.
1> vst2
STR Rx, [Rn], STR Ry, [Rn, #stride] and ADD Rm, Rm, #stride*2 are translated as:
ADD Rx, Rn, #imme
VMOV.vector_size D(k*2), Qx
VMOV.vector_size D(k*2+2), Qy
VST2.vector_size {D(k*2), D(k*2+2)}, [Rx]!
2> vst3
STR Rx, [Rm], STR Ry, [Rm, #stride], STR Rz, [Rm, #stride*2] and ADD Rm, Rm, #stride*3 are translated as:
ADD Rx, Rm, #0 (hoisted ahead)
VMOV.vector_size D(k*2), Qx
VMOV.vector_size D(k*2+2), Qy
VMOV.vector_size D(k*2+4), Qz
VST3.vector_size {D(k*2), D(k*2+2), D(k*2+4)}, [Rx]!
3> vst4
STR Rw, [Rn], STR Rx, [Rn, #stride], STR Ry, [Rn, #stride*2], STR Rz, [Rn, #stride*3] and ADD Rm, Rm, #stride*4 are translated as:
ADD Rx, Rn, #0 (hoisted ahead)
VMOV.vector_size D(k*2), Qw
VMOV.vector_size D(k*2+2), Qx
VMOV.vector_size D(k*2+4), Qy
VMOV.vector_size D(k*2+6), Qz
VST4.vector_size {D(k*2), D(k*2+2), D(k*2+4), D(k*2+6)}, [Rx]!
5) Several data loaded at once (vldn), where the memory stride length is half the computed vector length.
1> A special vld2
If imme1 is greater than zero, the memory stride is 8 bits and the computed vector is 16 bits, then Rdx is the high register, and
LDRB Rdx, [Rn, imme1] and LDRB Rdy, [Rn], imme2 are translated as:
VLD2.u8 {D(x*2), D((x+1)*2)}, [Rn]
VMOV.u8 D(x*2+1), #0
VZIP.u8 D(x*2), D(x*2+1)
VMOV.u8 D((x+1)*2+1), #0
VZIP.u8 D((x+1)*2), D((x+1)*2+1)
If imme1 is less than zero, Rdx is the low register, and
LDRB Rdx, [Rn, imme1] and LDRB Rdy, [Rn], imme2 are translated as:
ADD Rx, Rn, imme1 (this statement must be hoisted ahead)
VLD2.(vector_size/2) {D(x*2), D((x+1)*2)}, [Rn]
VMOV.(vector_size/2) D(x*2+1), #0
VZIP.(vector_size/2) D(x*2), D(x*2+1)
VMOV.(vector_size/2) D((x+1)*2+1), #0
VZIP.(vector_size/2) D((x+1)*2), D((x+1)*2+1)
2> General vld2
LDR Rw, [Rn]
LDR Ry, [Rn, #stride]
ADD Rm, Rm, #stride*2
are translated as:
ADD Rt, Rn, #0 (this statement must be hoisted ahead)
VLD2.(vector_size/2) {D(x*2), D((x+1)*2)}, [Rt]
VMOV.(vector_size/2) D(x*2+1), #0
VZIP.(vector_size/2) D(x*2), D(x*2+1)
VMOV.(vector_size/2) D((x+1)*2+1), #0
VZIP.(vector_size/2) D((x+1)*2), D((x+1)*2+1)
3> General vld3
LDR Rw, [Rn]
LDR Rx, [Rn, #stride]
LDR Ry, [Rn, #stride*2]
ADD Rm, Rm, #stride*3
are translated as:
ADD Rt, Rn, #0 (this statement must be hoisted ahead)
VLD3.(vector_size/2) {D(k*2), D(k*2+2), D(k*2+4)}, [Rt]
VMOV.(vector_size/2) D(k*2+1), #0
VZIP.(vector_size/2) D(k*2), D(k*2+1)
VMOV.(vector_size/2) D((k+1)*2+1), #0
VZIP.(vector_size/2) D((k+1)*2), D((k+1)*2+1)
VMOV.(vector_size/2) D((k+2)*2+1), #0
VZIP.(vector_size/2) D((k+2)*2), D((k+2)*2+1)
4> General vld4
LDR Rw, [Rn]
LDR Rx, [Rn, #stride]
LDR Ry, [Rn, #stride*2]
LDR Rz, [Rn, #stride*3]
ADD Rm, Rm, #stride*4
are translated as:
ADD Rt, Rn, #0 (this statement must be hoisted ahead)
VLD4.(vector_size/2) {D(k*2), D(k*2+2), D(k*2+4), D(k*2+6)}, [Rt]
VMOV.(vector_size/2) D(k*2+1), #0
VZIP.(vector_size/2) D(k*2), D(k*2+1)
VMOV.(vector_size/2) D((k+1)*2+1), #0
VZIP.(vector_size/2) D((k+1)*2), D((k+1)*2+1)
VMOV.(vector_size/2) D((k+2)*2+1), #0
VZIP.(vector_size/2) D((k+2)*2), D((k+2)*2+1)
VMOV.(vector_size/2) D((k+3)*2+1), #0
VZIP.(vector_size/2) D((k+3)*2), D((k+3)*2+1)
6) Several data loaded at once (vldn), where the memory stride length equals the computed vector length.
1> General vld2
LDR Rw, [Rn]
LDR Ry, [Rn, #stride]
ADD Rm, Rm, #stride*2
are translated as:
ADD Rt, Rn, #0 (this statement must be hoisted ahead)
VLD2.vector_size {D(k*2), D(k*2+2)}, [Rt]
VLD2.vector_size {D(k*2+1), D(k*2+2+1)}, [Rt]
2> General vld3
LDR Rw, [Rn]
LDR Rx, [Rn, #stride]
LDR Ry, [Rn, #stride*2]
ADD Rm, Rm, #stride*3
are translated as:
ADD Rt, Rn, #0 (this statement must be hoisted ahead)
VLD3.vector_size {D(k*2), D(k*2+2), D(k*2+4)}, [Rt]
VLD3.vector_size {D(k*2+1), D(k*2+2+1), D(k*2+4+1)}, [Rt]
3> General vld4
LDR Rw, [Rn]
LDR Rx, [Rn, #stride]
LDR Ry, [Rn, #stride*2]
LDR Rz, [Rn, #stride*3]
ADD Rm, Rm, #stride*4
are translated as:
ADD Rt, Rn, #0 (this statement must be hoisted ahead)
VLD4.vector_size {D(k*2), D(k*2+2), D(k*2+4), D(k*2+6)}, [Rt]
VLD4.vector_size {D(k*2+1), D(k*2+2+1), D(k*2+4+1), D(k*2+6+1)}, [Rt]
(5) Static instruction translation
When Rd is non-dynamic while the instruction is otherwise a dynamic instruction, the current program recognizes four cases:
OP Rd(non-dynamic), Rn, Rm, Ra, imme
OP Rd(non-dynamic), Rd(non-dynamic), Rm, Ra, imme
OP Rd(non-dynamic), Rn, Rd(non-dynamic), Ra, imme
OP Rd(non-dynamic), Rn, Rm, Rd(non-dynamic), imme
By traversing all basic blocks, find which instructions outside the currently optimized basic block use the variable; if some basic block needs the result Rd of the current static computation, judge whether that use is an STR or a "hot computation instruction".
If it is a "hot computation instruction" and the vector size is 32 bits, the current instruction is translated as normal NEON, and a "reduction" operation is added in the following basic block:
VADD.s32 D31, D(Rd*2), D(Rd*2+1)
VADD.s32 D30, D(Rd*2), D(Rd*2+1)
VTRN.s32 D30, D31
VADD.s32 D(Rd*2), D30, D31
VMOV.s32 R(Rd), D(Rd*2)[0]
If it is an "STR" instruction, the current instruction is not translated; after adding the "reduction" operation in the following basic block, the original STR instruction is appended:
VADD.s32 D31, D(Rd*2), D(Rd*2+1)
VADD.s32 D30, D(Rd*2), D(Rd*2+1)
VTRN.s32 D30, D31
VADD.s32 D(Rd*2), D30, D31
VMOV.s32 R(Rd), D(Rd*2)[0]
STR Rd, [mem_address]
If no later basic block uses the computed result, again judge whether it is an STR or a "hot computation instruction":
1) if it is a "hot computation instruction", the current instruction is translated as normal NEON;
2) if it is an "STR" instruction, the current instruction is not translated; after adding the "reduction" operation in the following basic block, the original STR instruction is appended.
(6) Hot computation instruction translation
1) If a NEON instruction can be found whose function and operand structure exactly match the ARM instruction, the register numbers stay unchanged: each r register becomes the q register of the same number, and the ARM mnemonic becomes the NEON mnemonic.
2) For an instruction with a shifted operand, if the ARM mnemonic is OP, then
OP Rd, Rn, Rm, shift, imme is translated as: SHIFT.vector_size Q15, Qm, imme
VOP.vector_size Qd, Qn, Q15
3) If the instruction contains an immediate:
OP Rd, Rn, imme is translated as: VMOV.vector_size Q15, imme
followed by the NEON instruction corresponding to OP: OP Qd, Qn, Q15
4) Special ARM instruction translations:
1> MLA Rd, Rn, Rm, Ra is translated as: VMUL.vector_size Q15, Qn, Qm
VADD.vector_size Qd, Q15, Qa
2> RSB Rd, Rn, Rm is translated as: VSUB.vector_size Qd, Qm, Qn
3> RSB Rd, Rn, imme is translated as: VMOV Q15, imme
VSUB Qd, Q15, Qn
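The special-case rules 1>–3> preserve semantics rather than syntax: MLA's multiply-accumulate splits into a VMUL through a scratch register plus a VADD, and RSB's reversed subtraction becomes a VSUB with swapped operands. The operand order of rule 2> can be checked with a small Python model (the helper names are hypothetical; lanes are plain lists):

```python
def arm_rsb(rn, rm):
    """ARM RSB Rd, Rn, Rm computes Rm - Rn (reverse subtract)."""
    return rm - rn

def neon_vsub(qn, qm):
    """NEON VSUB Qd, Qn, Qm computes Qn - Qm, element-wise."""
    return [a - b for a, b in zip(qn, qm)]

# RSB Rd, Rn, Rm -> VSUB Qd, Qm, Qn : the operands must be swapped
# so that each lane still computes Rm - Rn.
qn, qm = [5, 7], [2, 3]
lanes = neon_vsub(qm, qn)  # swapped, per translation rule 2>
print(lanes)               # → [-3, -4]
print(all(l == arm_rsb(n, m) for l, n, m in zip(lanes, qn, qm)))  # → True
```

Without the swap, each lane would compute Rn - Rm and the translated loop would silently negate its results.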
The core of the present invention is to identify optimizable memory accesses, translate ARM instructions into NEON instructions according to the memory analysis results, simultaneously apply redundant-access optimization and loop-instruction-count optimization to the translation result, and rationally schedule the idle ARM extended registers (q registers) so as to make full use of the high-speed on-chip resources. At the same time, based on the mapping between ARM program register numbers and NEON program register numbers, and on the abstraction and modeling of the behavior of part of the ARM instructions, the ARM instructions are translated into a mixture of ARM and NEON instructions. The modules of the NEON vectorization conversion design for ARM binary code are shown in Figure 5, and its overall flow chart in Figure 6.
The NEON vectorization conversion method for ARM binary code of the present invention comprises the following steps:
Step 1: disassemble the input binary code file of the ARM program to obtain the original ARM instruction information corresponding to each binary code; the flow chart of this part is shown in Fig. 7. Specifically: read the TXT file containing the ARM binary code line by line, decode the instruction information of each binary code, and obtain the original ARM instruction corresponding to it.
Step 2: to effectively describe the ARM program execution flow, a control flow graph of basic blocks is built on the basis of the disassembled ARM instructions, and reaching definition analysis is performed on each basic block on top of the existing control flow graph; the program flow chart of this part is shown in Fig. 8. Specifically: divide the obtained ARM instructions into basic blocks, establish the jump relations between the basic blocks, and perform reaching definition analysis on each basic block to obtain the ARM instruction control flow graph.
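The basic-block division in step 2 follows the classic leader rule: the first instruction, every branch target, and every instruction following a branch start a new block. A minimal sketch under an illustrative instruction encoding (a `(mnemonic, branch_target_index_or_None)` tuple; the real procedure works on decoded ARM instructions):

```python
def split_basic_blocks(insns):
    """Split a linear instruction list into basic blocks.
    Each instruction: (mnemonic, branch_target_index_or_None).
    Returns a list of (start, end) index ranges, end exclusive."""
    leaders = {0}
    for i, (_, target) in enumerate(insns):
        if target is not None:
            leaders.add(target)      # a branch target starts a block
            if i + 1 < len(insns):
                leaders.add(i + 1)   # so does the fall-through after a branch
    starts = sorted(leaders)
    return [(s, e) for s, e in zip(starts, starts[1:] + [len(insns)])]

# A three-instruction loop body followed by a store: the backward
# branch at index 2 makes index 0 a leader and index 3 a fall-through.
prog = [("mov", None), ("cmp", None), ("bne", 0), ("str", None)]
print(split_basic_blocks(prog))  # → [(0, 3), (3, 4)]
```

The jump relations between the resulting blocks are what the later steps treat as edges of the control flow graph.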
Step 3: abstract each basic block in the above control flow graph as a node and the jump relations between basic blocks as directed edges; the control flow graph thus becomes a directed graph. Find all cycles in the directed graph and judge whether each cycle found corresponds to a loop rather than being composed of simple jump instructions. At the same time, merge loops whose entry basic block and exit basic block are identical. After merging, determine the nesting relations between the loops, obtain the loop with the largest nesting depth, i.e., the innermost loop, and mark which part of the loop each constituent basic block belongs to. The nested-loop identification process is shown in Fig. 9. Specifically: find cycles in the ARM instruction control flow graph, and merge loops with identical "entry" and "exit"; judge whether basic blocks other than the found loop exit basic block also contain jumps leaving the loop; if they do, abandon optimizing this loop; otherwise, find the innermost nested loop, mark the loop part of each basic block in the found loop, and obtain the target optimization basic block.
Step 4: memory analysis. Some ARM program instructions cannot be SIMD-parallelized because their memory access pattern involves conditional jumps, or because the accesses do not advance addresses in a monotonically increasing or decreasing fashion. Therefore, before ARM is translated into NEON instructions, the instruction behavior of the target program must be analyzed to determine whether the access pattern of each memory access instruction in the program (ldr/str, ldrb/strb, ldrh/strh) can be NEON-translated. Memory accessed by multiple instructions is optimized with the scheduling method that exploits idle on-chip registers; memory accessed by only one instruction is translated from ARM into NEON instructions according to the translation scheme for its particular access pattern. Before the memory analysis, to make it more accurate: 1. first mark the assembly instructions that implement the loop in the target ARM program; 2. then, according to the recognition rules for group-translated instructions, find and mark the instructions requiring combined translation.
After the above marking, exclude the loop-implementing instructions and the instructions requiring combined translation, and count all memory access instructions in the target optimization program. According to the information carried by each access instruction and the semantic analysis of the address-offset computation instructions preceding it, merge the access instructions (both loads and stores) that access the same memory. Then analyze the access pattern of each memory address; if the stepping form of an access pattern meets at least one of the following three conditions:
(1) an equidistant-stepping access pattern (forward or backward);
(2) access to contiguously changing memory addresses;
(3) 2 to 4 instructions that together read or write memory contiguously;
then the original ARM access instructions can be replaced with the NEON access instructions (vld1/vst1, vld2/vst2, vld3/vst3, vld4/vst4). At the same time, identify the special access patterns (2 to 4 instructions combined into contiguous reads or writes, backward-stepping accesses, odd-count accesses) and mark their translation modes. The above memory analysis process is shown in Figure 10. Specifically: for the target optimization basic block, find the defining and using instructions of every register of every instruction; then find the loop instructions; then find the group-translated instructions; then, among all access instructions other than the loop instructions and group-translated instructions, find the address-offset computation instructions; count the access instructions and optimize those accessing identical addresses; identify dynamic addresses (addresses that change with the loop); identify the special access patterns (2 to 4 instructions combined into contiguous reads or writes, backward-stepping accesses, odd-count accesses); analyze the step count of each memory address; synchronize the analysis across all instructions; and obtain the optimized basic block. Among the instructions other than loop instructions, group-translated instructions, access instructions, and address-offset computation instructions, find the ARM instructions that have a corresponding function in the NEON instruction set and mark them as hot computation instructions. All remaining instructions are marked as instructions that do not change with the loop.
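Conditions (1)–(3) can be checked directly from the concrete address sequence an access instruction produces across loop iterations: a single constant difference between consecutive addresses means equidistant stepping, and a difference equal to the element size means contiguous access. A sketch of that classification (the category names are this sketch's, not the patent's):

```python
def classify_stride(addresses, elem_size):
    """Classify a per-iteration access-address sequence.
    Returns 'contiguous', 'equidistant-forward',
    'equidistant-backward', or 'irregular'."""
    diffs = {b - a for a, b in zip(addresses, addresses[1:])}
    if len(diffs) != 1:
        return "irregular"          # no single stride: not vectorizable
    step = diffs.pop()
    if step == elem_size:
        return "contiguous"         # vld1/vst1 candidate
    if step > 0:
        return "equidistant-forward"
    if step < 0:
        return "equidistant-backward"
    return "irregular"              # step == 0: same address each time

print(classify_stride([0, 4, 8, 12], 4))   # → contiguous
print(classify_stride([100, 92, 84], 4))   # → equidistant-backward
```

Only the first three outcomes admit replacement by NEON access instructions; an irregular sequence keeps its original ARM loads and stores.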
Step 5: according to the above memory analysis, the translation process first performs idle-resource scheduling on the on-chip r (core) registers and q (extended) registers, marking registers that are idle or can become idle. Then, according to the five classes of translation rules above, translate the loop instructions, group-translated instructions, access instructions and address-offset computation instructions, hot computation instructions, and instructions that do not change with the loop, as found in step 4. The flow of this module is shown in Figure 11.
Idle-resource scheduling comprises idle q register scheduling and idle r register scheduling.
Idle q register scheduling process:
First, mark all extended registers as idle and available.
First pass: traverse all instructions in the current target optimization basic block; if the current instruction simultaneously meets these three conditions:
(1) it does not participate in offset-address computation;
(2) it does not participate in implementing the loop;
(3) it defines a register;
then mark the extended register with the same number as the register defined by the current instruction as busy.
Second pass: traverse all instructions in the current target optimization basic block; if the current instruction simultaneously meets these three conditions:
(1) it is not a memory access instruction;
(2) it is not an instruction computing the address offset of some access instruction;
(3) it is not an instruction participating in the loop function;
then mark the extended register with the same number as the register used by the current instruction as busy.
Third pass: traverse all instructions in the current target optimization basic block; if the current instruction is one that will not be translated, mark the extended register with the same number as the register it uses as idle.
Fourth pass: traverse all memory access instructions; if several access instructions have been combined into one vldn/vstn, the extended registers with the same numbers as the registers defined by these instructions are idle.
Idle r register scheduling process:
First, mark all r registers as available.
First pass: traverse all access instructions in the basic block and mark the registers they use as busy.
Second pass: traverse all instructions in the basic block; if the current instruction participates in access-address offset computation, or participates in implementing the loop function, mark all registers it defines and uses as busy.
Finally, mark registers r14 and r15 as busy.
The idle r and q registers marked by the above scheduling method are the registers that are idle throughout the whole basic block.
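The four q-register passes can be expressed as successive set operations over the instructions of the basic block. A compressed Python sketch (the instruction record fields are illustrative; the real procedure works on decoded ARM instructions and their def/use registers):

```python
def schedule_idle_q(insns, num_q=16):
    """Mark q registers busy/idle following the four passes.
    Each insn: dict with 'defs', 'uses' (register-number lists) and
    boolean flags 'is_offset_calc', 'is_loop', 'is_access',
    'untranslated', 'folded_into_vldn'."""
    busy = set()
    # Pass 1: defs of ordinary computation instructions occupy q regs.
    for i in insns:
        if not i["is_offset_calc"] and not i["is_loop"] and i["defs"]:
            busy |= set(i["defs"])
    # Pass 2: uses of ordinary computation instructions occupy q regs.
    for i in insns:
        if not i["is_access"] and not i["is_offset_calc"] and not i["is_loop"]:
            busy |= set(i["uses"])
    # Pass 3: registers used only by untranslated instructions free up.
    for i in insns:
        if i["untranslated"]:
            busy -= set(i["uses"])
    # Pass 4: loads folded into a vldn free their scalar destinations.
    for i in insns:
        if i["folded_into_vldn"]:
            busy -= set(i["defs"])
    return sorted(set(range(num_q)) - busy)   # the idle q registers

mk = lambda **kw: {"defs": [], "uses": [], "is_offset_calc": False,
                   "is_loop": False, "is_access": False,
                   "untranslated": False, "folded_into_vldn": False, **kw}
block = [mk(defs=[0], uses=[1, 2]),
         mk(defs=[3], is_access=True, folded_into_vldn=True)]
print(schedule_idle_q(block))  # q0..q2 busy; q3..q15 idle
```

The registers the sketch reports as idle are exactly those free across the whole block, which is what lets the translator park repeated-access results in them.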
Finally, all translation results pass through the assembler, which produces real assembly language and writes it to output.txt. The flow chart of the assembler is shown in Figure 12. Specifically: the final translation results go through ARM instruction assembly and NEON instruction assembly. ARM instruction assembly handles ARM data-processing instructions, ARM access instructions, ARM jump instructions, and block-operation instructions, yielding the ARM assembly instructions. NEON instruction assembly translates the non-access and access instructions, yielding the NEON assembly instructions. Mixing the ARM assembly instructions with the NEON assembly instructions yields the mixed ARM/NEON assembly; executing it effectively accelerates the execution of the ARM program.
Test results:
The program test environment is a Cortex-A8 UTV210 development board; the instruction set is ARMv7.
Table 1: Test result comparison
As can be seen from Table 1, programs converted by the vectorization conversion method of the present invention show a definite acceleration effect.

Claims (7)

  1. A NEON vectorization conversion method for ARM binary code, characterized by comprising the following steps:
    Step 1, disassembly: disassemble the binary code file of the ARM program to obtain the original ARM instruction information corresponding to each binary code;
    Step 2, flow graph generation: establish a control flow graph of basic blocks on the basis of the ARM instruction disassembly, and perform reaching definition analysis on each basic block on top of the existing control flow graph;
    Step 3, loop detection: find the loop with the largest nesting depth through loop detection, and mark the basic blocks composing its innermost loop;
    Step 4, memory analysis: find the ARM instructions in the optimization target that must be translated in groups and mark them; marked instructions do not take part in memory optimization; find the instructions implementing the loop function and likewise mark them as not taking part in memory optimization; on this basis, find all memory access instructions and merge the access instructions that access the same memory; after merging, mark the access type of each access instruction according to the memory analysis scheme;
    Step 5, instruction translation: translate the loop instructions, the group-translated instructions, the access instructions and address-offset computation instructions, the hot computation instructions, and the instructions that do not change with the loop;
    Step 6, assembly instruction output: assemble and output the translation results of step 5.
  2. The NEON vectorization conversion method for ARM binary code according to claim 1, characterized in that step 3 specifically comprises the following steps: abstract each basic block in the control flow graph as a node and the jump relations between basic blocks as directed edges, so that the control flow graph becomes a directed graph; find all cycles in the directed graph and judge whether each cycle found corresponds to a loop rather than being composed of simple jump instructions; at the same time, merge loops whose entry basic block and exit basic block are identical; after merging, determine the nesting relations between the loops, obtain the loop with the largest nesting depth, i.e., the innermost loop, and mark which part of the loop each constituent basic block belongs to.
  3. The NEON vectorization conversion method for ARM binary code according to claim 1, characterized in that step 4 specifically comprises the following steps:
    Before the memory analysis, first mark the assembly instructions implementing the loop in the target ARM program; then, according to the recognition rules for group-translated instructions, find and mark the instructions requiring combined translation;
    After the above marking, exclude the loop-implementing instructions and the instructions requiring combined translation, and count all memory access instructions in the target program; according to the information carried by each access instruction and the semantic analysis of the address-offset computation instructions preceding it, merge the access instructions that access the same memory; then analyze the access pattern of each memory address; if the stepping form of an access pattern meets at least one of the following three conditions:
    (1) an equidistant-stepping access pattern (forward or backward);
    (2) access to contiguously changing memory addresses;
    (3) 2 to 4 instructions that together read or write memory contiguously;
    then replace the original ARM access instructions with NEON access instructions; at the same time, identify the special access patterns and mark their translation modes.
  4. The NEON vectorization conversion method for ARM binary code according to claim 3, characterized in that, after the memory analysis, the three classes of addresses that advance with the loop — linearly contiguous stepping, equidistant stepping, and odd-count accesses — are set as dynamic addresses, while addresses that do not change with the loop are set as static addresses;
    According to dynamic and static addresses, instructions are divided into the following classes:
    Dynamic instructions:
    (1) instructions accessing a dynamic address;
    (2) instructions whose data all come from dynamic addresses;
    Static instructions:
    (1) instructions accessing static memory;
    (2) instructions whose data are all immediates or come from static addresses;
    Instructions that can become dynamic:
    (1) the result of the defined register will be stored to a static address;
    (2) only some of the values in the instruction come from a static address or from an undefined register.
  5. The NEON vectorization conversion method for ARM binary code according to claim 1, characterized in that, in step 5, when translating between ARM and NEON instructions, the feature that both the ARM core register numbers and the extended register numbers run from 0 to 15 is exploited: using the compilation result of the original ARM program, NEON instructions whose structure and function correspond exactly to the ARM instructions are found and translated directly; when performing the translation mapping onto the high and low double-word parts of the extended registers, core registers 0 to 15 are each mapped to and translated with the low double-word part of the corresponding extended register.
  6. The NEON vectorization conversion method for ARM binary code according to claim 1, characterized in that idle-resource scheduling comprises idle q register scheduling and idle r register scheduling;
    Idle q register scheduling process:
    First, mark all extended registers as idle and available;
    First pass: traverse all instructions in the current target optimization basic block; if the current instruction simultaneously meets these three conditions:
    (1) it does not participate in offset-address computation;
    (2) it does not participate in implementing the loop;
    (3) it defines a register;
    then mark the extended register with the same number as the register defined by the current instruction as busy;
    Second pass: traverse all instructions in the current target optimization basic block; if the current instruction simultaneously meets these three conditions:
    (1) it is not a memory access instruction;
    (2) it is not an instruction computing the address offset of some access instruction;
    (3) it is not an instruction participating in the loop function;
    then mark the extended register with the same number as the register used by the current instruction as busy;
    Third pass: traverse all instructions in the current target optimization basic block; if the current instruction is one that will not be translated, mark the extended register with the same number as the register it uses as idle;
    Fourth pass: traverse all memory access instructions; if several access instructions have been combined into one vldn/vstn, the extended registers with the same numbers as the registers defined by these instructions are idle;
    Idle r register scheduling process:
    First, mark all r registers as available;
    First pass: traverse all access instructions in the basic block and mark the registers they use as busy;
    Second pass: traverse all instructions in the basic block; if the current instruction participates in access-address offset computation, or participates in implementing the loop function, mark all registers it defines and uses as busy;
    Finally, mark registers r14 and r15 as busy;
    The idle r and q registers marked by the above scheduling method are the registers that are idle throughout the whole basic block.
  7. The NEON vectorization conversion method for ARM binary code according to claim 1, characterized in that idle-resource scheduling manages the on-chip registers so that the results of repeated memory accesses are temporarily kept in idle registers; when the same memory is accessed again later, the access need not go back to memory but instead reads the temporary data held in the extended registers.
CN201510574950.XA 2015-09-10 2015-09-10 The NEON vectorization conversion methods of ARM binary codes Active CN105242907B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510574950.XA CN105242907B (en) 2015-09-10 2015-09-10 The NEON vectorization conversion methods of ARM binary codes

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510574950.XA CN105242907B (en) 2015-09-10 2015-09-10 The NEON vectorization conversion methods of ARM binary codes

Publications (2)

Publication Number Publication Date
CN105242907A true CN105242907A (en) 2016-01-13
CN105242907B CN105242907B (en) 2018-01-19

Family

ID=55040566

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510574950.XA Active CN105242907B (en) 2015-09-10 2015-09-10 The NEON vectorization conversion methods of ARM binary codes

Country Status (1)

Country Link
CN (1) CN105242907B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108304218A (en) * 2018-03-14 2018-07-20 郑州云海信息技术有限公司 A kind of write method of assembly code, device, system and readable storage medium storing program for executing
CN109814869A (en) * 2018-12-03 2019-05-28 珠海格力电器股份有限公司 Applied to the analytic method of robot, system and computer readable storage medium
CN110806897A (en) * 2019-10-29 2020-02-18 中国人民解放军战略支援部队信息工程大学 Multi-code-granularity-oriented vector parallelism mining method
CN112445661A (en) * 2019-08-29 2021-03-05 无锡江南计算技术研究所 Automatic parallel memory access assembly program generating system and memory consistency testing method
CN113157321A (en) * 2021-02-05 2021-07-23 湖南国科亿存信息科技有限公司 Erasure encoding and decoding method and device based on NEON instruction acceleration under ARM platform

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103440229A (en) * 2013-08-12 2013-12-11 浪潮电子信息产业股份有限公司 Vectorizing optimization method based on MIC (Many Integrated Core) architecture processor
CN104641351A (en) * 2012-10-25 2015-05-20 英特尔公司 Partial vectorization compilation system

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104641351A (en) * 2012-10-25 2015-05-20 英特尔公司 Partial vectorization compilation system
CN103440229A (en) * 2013-08-12 2013-12-11 浪潮电子信息产业股份有限公司 Vectorizing optimization method based on MIC (Many Integrated Core) architecture processor

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
YIN Wenjian et al.: "Design and implementation of the instruction decoder in ARM decompilation", Computer Engineering and Design *
XIN Naijun: "Design and implementation of a vectorizing compiler for the high-performance DSP Matrix", China Master's Theses Full-text Database *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108304218A (en) * 2018-03-14 2018-07-20 郑州云海信息技术有限公司 A kind of write method of assembly code, device, system and readable storage medium storing program for executing
CN109814869A (en) * 2018-12-03 2019-05-28 珠海格力电器股份有限公司 Applied to the analytic method of robot, system and computer readable storage medium
CN112445661A (en) * 2019-08-29 2021-03-05 无锡江南计算技术研究所 Automatic parallel memory access assembly program generating system and memory consistency testing method
CN112445661B (en) * 2019-08-29 2022-09-13 无锡江南计算技术研究所 Automatic parallel memory access assembly program generating system and memory consistency testing method
CN110806897A (en) * 2019-10-29 2020-02-18 中国人民解放军战略支援部队信息工程大学 Multi-code-granularity-oriented vector parallelism mining method
CN110806897B (en) * 2019-10-29 2022-02-01 中国人民解放军战略支援部队信息工程大学 Multi-code-granularity-oriented vector parallelism mining method
CN113157321A (en) * 2021-02-05 2021-07-23 湖南国科亿存信息科技有限公司 Erasure encoding and decoding method and device based on NEON instruction acceleration under ARM platform
CN113157321B (en) * 2021-02-05 2022-02-08 湖南国科亿存信息科技有限公司 Erasure encoding and decoding method and device based on NEON instruction acceleration under ARM platform

Also Published As

Publication number Publication date
CN105242907B (en) 2018-01-19

Similar Documents

Publication Publication Date Title
CN105242907A (en) NEON vectorization conversion method for ARM (Advanced RISC Machine) binary code
CN101268444B (en) For the data transaction of Stream Media Application on multiprocessor
JP5717015B2 (en) Architecture optimizer
CN114995823A (en) Deep learning compiler optimization method for special accelerator for CNN
CN108921188B (en) Parallel CRF method based on Spark big data platform
US20210390460A1 (en) Compute and memory based artificial intelligence model partitioning using intermediate representation
CN114995822A (en) Deep learning compiler optimization method special for CNN accelerator
US20150220315A1 (en) Method and apparatus for compiling
CN103329097A (en) Tool generator
CN105224452A (en) A kind of prediction cost optimization method for scientific program static analysis performance
US20230297375A1 (en) Hardware accelerator, data processing method, system-level chip, and medium
CN110149801A (en) System and method for carrying out data flow diagram conversion in the processing system
CN102622334B (en) Parallel XSLT (Extensible Style-sheet Language Transformation) conversion method and device for use in multi-thread environment
CN116227565A (en) Compiling optimization system and neural network accelerator with variable precision
CN103559069B (en) A kind of optimization method across between file processes based on algebra system
CN115016938A (en) Calculation graph automatic partitioning method based on reinforcement learning
Manor et al. Using HW/SW codesign for deep neural network hardware accelerator targeting low-resources embedded processors
CN110210046B (en) Application program and special instruction set processor integrated agility design method
CN107729118A (en) Towards the method for the modification Java Virtual Machine of many-core processor
CN101727513A (en) Method for designing and optimizing very-long instruction word processor
Sun et al. Accelerating HMMer on FPGAs using systolic array based architecture
CN111949269A (en) Method for generating symbolic table and static data flow graph in COStream syntax analysis process
US20230116546A1 (en) Method for compilation, electronic device and storage medium
CN110704193A (en) Method and device for realizing multi-core software architecture suitable for vector processing
US20220129772A1 (en) System and method having the artificial intelligence (ai) algorithm of k-nearest neighbors (k-nn)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant