CN107729118A - Method for modifying a Java virtual machine for many-core processors - Google Patents


Info

Publication number
CN107729118A
CN107729118A (application CN201710871869.7A)
Authority
CN
China
Prior art keywords
java
instruction
vector
virtual machine
many
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710871869.7A
Other languages
Chinese (zh)
Inventor
张为华
李弋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fudan University
Original Assignee
Fudan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fudan University filed Critical Fudan University
Priority to CN201710871869.7A priority Critical patent/CN107729118A/en
Publication of CN107729118A publication Critical patent/CN107729118A/en
Pending legal-status Critical Current

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45504Abstract machines for programme code execution, e.g. Java virtual machine [JVM], interpreters, emulators
    • G06F9/45508Runtime interpretation or emulation, e.g. emulator loops, bytecode interpretation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/40Transformation of program code
    • G06F8/41Compilation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/40Transformation of program code
    • G06F8/41Compilation
    • G06F8/44Encoding
    • G06F8/443Optimisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45504Abstract machines for programme code execution, e.g. Java virtual machine [JVM], interpreters, emulators
    • G06F9/45516Runtime code conversion or optimisation
    • G06F9/4552Involving translation to a different instruction set architecture, e.g. just-in-time translation in a JVM
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533Hypervisors; Virtual machine monitors
    • G06F9/45558Hypervisor-specific management and integration aspects

Abstract

The invention belongs to the field of computer technology, and specifically relates to a method of designing a Java virtual machine (JVM) for many-core processors. The invention enables the Java virtual machine to make full use of the computing power of many-core platforms, thereby improving the performance of Java programs. The invention designs a semi-automatic vectorization model which, combined with a modified Java front-end compiler, discovers code in the program that can be handled by vector computation units. To make full use of memory access bandwidth and reduce latency, the invention also designs a data prefetching method in the Java virtual machine for many-core processors.

Description

Method for modifying a Java virtual machine for many-core processors
Technical field
The invention belongs to the field of computer technology, and specifically relates to a method for modifying a Java virtual machine for many-core processors.
Background technology
Java is one of the most popular programming languages and is widely used in practical fields of every scale, from small embedded applications to large applications on compute clusters. Java programs run on the Java virtual machine (JVM), a cross-platform intermediate runtime layer. Because of its rich and complete functionality, many languages other than Java also execute on the JVM.
Today, Java attracts growing attention in fields such as numerical computing, scientific computing, and distributed computing. This is inseparable from the JVM's steadily improving computational performance and its excellent built-in multithreading and network communication mechanisms. Specifically, compared with traditional high-performance languages such as C/C++ and Fortran, Java has several advantages for HPC (High Performance Computing) development, including portability, reliability, development efficiency, and adaptive runtime optimization.
Because the JVM keeps evolving and Java is increasingly competitive in high-performance computing, high-performance research and engineering projects in many practical fields choose to develop in Java, including cosmology and astronomy, molecular physics, and biomedicine.
To raise HPC computing power, many supercomputers widely adopt accelerators dedicated to computation, such as many-core compute chips like GPGPUs (general-purpose GPUs) and Intel Xeon Phi. Research in the Java high-performance computing field has so far failed to make full use of this rapidly developing hardware; on the existing mainstream many-core platforms, including the mature GPGPU and the emerging MIC, research on Java HPC, or even on Java itself, is lacking, so high-performance computing projects developed in Java lose the chance to improve their performance further.
Nvidia developed the CUDA programming model to support applications that compute on GPUs; to let Java programs compute on GPUs, jCUDA, a Java access interface to CUDA, is provided. A Java program only needs to call the functions jCUDA provides to offload computation to the GPU. On Intel Xeon Phi, Java needs to call functions in the Intel MKL (Math Kernel Library) through the JNI (Java Native Interface) interface in order to use Phi's computing power.
By calling underlying native libraries, Java applications can use the computing power of a many-core platform, but this also limits how fully the application exploits that power. Many-core platforms put more emphasis on memory bandwidth when designing the memory system, and inevitably introduce higher memory access latency (memory latency) as a trade-off.
Java high-performance computing programs often contain many compute-intensive loops over arrays, and prefetching schemes are generally adopted to overcome memory access latency. Most existing prefetching solutions determine the concrete prefetch behavior through static compiler analysis; such methods struggle to obtain important runtime data and therefore cannot accurately control the parameters of the issued prefetches. Other dynamic prefetching methods are not tailored to Java's characteristics, or do not consider the relationship between the stride and the data at program run time, so they cannot provide the optimal prefetching scheme for different Java runtime scenarios.
Targeting many-core processors used as compute accelerators, the present invention proposes a method for modifying the Java virtual machine, so that Java applications can make full use of the computing power of many-core platforms.
Summary of the invention
The object of the present invention is to provide a method for modifying a Java virtual machine for many-core processors, so as to improve the performance of the Java virtual machine.
The Java virtual machine is the core execution engine of the Java platform, and it is also the key functional component that keeps upper-layer Java applications independent of the underlying environment. The JVM defines its own instruction set, Java bytecode. A Java program must be compiled into Java bytecode before it can execute on the JVM. The basic structure of the JVM is shown in Figure 1. Java bytecode is ultimately either interpreted by the interpreter or translated into machine code by the just-in-time (JIT) compiler and executed. Java provides a standard programming interface (JNI) through which the virtual machine can directly invoke platform-specific native methods (usually written in languages such as C, C++, or assembly).
The runtime environment of a Java program consists of the virtual machine and the Java core class libraries. The JVM is an engine that mixes an interpreted mode of execution with a compiled mode. Whichever mode the JVM executes in, the system must ultimately run platform-specific native code. For a given architecture, the virtual machine contains an assembly library (Assembly Library), which defines the binary machine-code implementations of the assembly instructions of that architecture.
The multithreading mechanism built into the JVM can make full use of the computing resources on a many-core platform. However, some single-threaded Java programs show a huge performance gap between many-core platforms and CPUs, exposing deficiencies of the JVM on many-core architectures. First, the JVM's support for SIMD optimization has significant defects, which prevents many Java HPC programs from making full use of the vector units on many-core platforms. Second, the relatively high data access latency of many-core platforms also reduces the utilization of their computing power.
To make full use of the computing power of many-core platforms, the present invention proposes a method for modifying the Java virtual machine, comprising two parts: 1. design a semi-automatic vectorization model and modify the Java front-end compiler to discover code in the program that can be handled by vector computation units; 2. use dynamic prefetching to make full use of memory bandwidth and reduce latency.
1. Design the semi-automatic vectorization model and modify the Java front-end compiler to discover code in the program that can be handled by vector computation units. The specific flow is as follows:
Many-core processors aimed at compute acceleration usually provide ultra-wide vectors. For example, the Intel MIC platform provides 512-bit-wide vectors that implement single-instruction-multiple-data (SIMD) operations, thereby improving processing efficiency.
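As a quick arithmetic sketch of what this vector width means for loop code (a minimal illustration, not part of the patent; the class and method names are ours):

```java
public class VectorWidth {
    static final int VECTOR_BITS = 512; // Intel MIC / Xeon Phi vector width

    // Number of elements one vector register holds for a given element size in bits.
    static int lanes(int elementBits) {
        return VECTOR_BITS / elementBits;
    }

    public static void main(String[] args) {
        // A 512-bit register packs 8 doubles (64-bit) or 16 floats (32-bit),
        // so a vectorized loop advances 8 or 16 elements per iteration.
        System.out.println("double lanes: " + lanes(64)); // prints 8
        System.out.println("float lanes: " + lanes(32));  // prints 16
    }
}
```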
Some JVM implementations introduce a "SuperWord" mechanism, in which the JIT compiler automatically identifies at run time array loops suitable for vectorization and compiles them into platform-specific native vector instructions. But a mechanism implemented this way can typically only recognize simple array loop operations; the JIT has no way to recognize forms that deviate somewhat from them.
To use the vector processing units on many-core processors more effectively, the present invention first designs a semi-automatic vectorization model: when coding, the programmer adds an annotation before an array loop that is to be vectorized. The model hands the vectorization decision to the programmer, so that SIMD optimization opportunities in the code can be recognized more comprehensively.
To support the semi-automatic vectorization model, the Java front-end compiler must be modified. When processing Java source code, the compiler identifies vector loops from the annotations and then compiles the array operations in them into a series of new bytecodes, which we call vector bytecodes. The stride of the loop iteration, i.e., the element increment, is controlled according to the vector width. In addition, because the vector units on many-core processors require aligned addresses when accessing data, while the JVM imposes no strict alignment requirement, when executing a high-performance application containing vector bytecodes it must be ensured that the starting address of each array satisfies the alignment requirement of the vector processing unit.
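The programmer-facing side of this model might look as follows. This is a hedged sketch: the patent does not name its annotation, so the @Vectorize marker and the alignUp helper are hypothetical illustrations of ours (Java annotations attach to declarations, so the marker sits on the method containing the loop):

```java
import java.lang.annotation.*;

public class SemiAutoVectorize {
    // Hypothetical marker the modified front-end compiler would look for.
    @Retention(RetentionPolicy.CLASS)
    @Target(ElementType.METHOD)
    @interface Vectorize {}

    // A loop the modified compiler would turn into vector bytecodes: with
    // 512-bit vectors over doubles, the iteration stride would become 8.
    @Vectorize
    static void scale(double[] a, double[] b, double k) {
        for (int i = 0; i < a.length; i++) {
            b[i] = a[i] * k;
        }
    }

    // Vector units require aligned start addresses while the JVM guarantees
    // none, so a runtime would round a base address up to the alignment.
    static long alignUp(long addr, long alignment) {
        return (addr + alignment - 1) & -alignment;
    }

    public static void main(String[] args) {
        double[] a = {1, 2, 3, 4};
        double[] b = new double[4];
        scale(a, b, 2.0);
        System.out.println(b[3]);             // prints 8.0
        System.out.println(alignUp(100, 64)); // prints 128
    }
}
```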
Scientific computing uses floating-point computation heavily. In the native assembly library, floating-point operations use the CPU's vector instructions, such as the Intel SSE and AVX instructions. Different many-core platforms define their own platform-specific vector instruction sets. The present invention modifies the floating-point instruction encodings in the JVM's native assembly library mainly in the following two ways, in order to generate the platform's custom vector instructions:
(1) Equivalent substitution: look up instructions in the platform's vector instruction set with the same function and semantics, to substitute for the original SSE and AVX instructions. If no single instruction fully meets the requirement, consider implementing it with a combination of instructions;
(2) Software implementation: floating-point arithmetic operations that the platform instruction set cannot cover are implemented with the platform's floating-point instructions (in the JVM specification, all floating-point operations follow the IEEE 754 standard).
2. Use dynamic prefetching to make full use of memory bandwidth and reduce latency. The details are as follows:
Dynamic prefetching is divided into two parts: stride profiling in the Java interpreter, and loop-node traversal in the JIT compiler. When a Java program runs into a core computation loop, a profiler is triggered and starts to dynamically collect the loop iteration count. After the profiler obtains the loop length, an initial prefetch stride must further be determined; this stride depends on the characteristics of the specific platform. The profiler's trial-run mechanism then receives the initial stride, derives from it a series of different prefetch stride values, and inserts the corresponding prefetch operations into the next several groups of iterations. The range of each group's operating stride follows the equations:
S2' = [S2 * (1 - P), S2 * (1 + P)]
S1' = [S1 * (1 - P), S1 * (1 + P)]
Here S2 and S1 are the initial prefetch strides for MEM->L2 and L2->L1 respectively, and the value of P depends on the specific many-core platform. The trial runner executes a loop test for every stride combination with S1' < S2'. When the test loops complete, the profiler records the stride combination with the shortest run time and passes it to the JIT compiler that will next perform the compilation optimization.
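Under these definitions, the trial-run search can be sketched as below. The candidate ranges implement S' = [S*(1-P), S*(1+P)] for each level, and only combinations with S1' < S2' are tested; the cost model standing in for the measured loop run time is an assumption of ours:

```java
import java.util.Arrays;

public class PrefetchTuner {
    // Stand-in for timing one test loop with a given stride pair.
    interface Cost { long time(int s1, int s2); }

    // Integer candidate strides in [s*(1-p), s*(1+p)].
    static int[] candidates(int s, double p) {
        int lo = (int) Math.floor(s * (1 - p));
        int hi = (int) Math.ceil(s * (1 + p));
        int[] out = new int[hi - lo + 1];
        for (int i = 0; i < out.length; i++) out[i] = lo + i;
        return out;
    }

    // Try every pair with s1' < s2' and keep the fastest combination.
    static int[] tune(int s1, int s2, double p, Cost model) {
        long best = Long.MAX_VALUE;
        int[] bestPair = null;
        for (int a : candidates(s1, p)) {
            for (int b : candidates(s2, p)) {
                if (a >= b) continue;      // only combinations with S1' < S2' are run
                long t = model.time(a, b); // stand-in for timing the test loop
                if (t < best) { best = t; bestPair = new int[]{a, b}; }
            }
        }
        return bestPair;
    }

    public static void main(String[] args) {
        // Toy cost model that makes (s1=4, s2=16) the fastest combination.
        int[] best = tune(4, 16, 0.25, (a, b) -> Math.abs(a - 4) + Math.abs(b - 16));
        System.out.println(Arrays.toString(best)); // prints [4, 16]
    }
}
```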
When the program enters the JIT compiler's optimization process, the loop-node traversal module added by the present invention traverses all nodes of the loop before the instruction generation phase (the optimization is based on the ideal graph structure). During the traversal, the JIT first identifies all memory read and write operations in the loop and inserts prefetch nodes around them; duplicate prefetch instructions that prefetch the same address are found and eliminated by the traverser, which also decides, through iterative scanning, which memory access patterns need exclusive-mode prefetch operations. After the traversal ends, dynamic compilation enters the code generation phase, where the JIT generates the native instruction stream, prefetch instructions included, from the optimized node graph, completing the runtime formulation of the dynamic prefetch policy and the generation of its instructions.
Brief description of the drawings
Fig. 1 shows the structure of the Java virtual machine.
Embodiments
To make the objects, technical solutions, and advantages of the present invention clearer, preferred embodiments of the invention are described below with reference to the drawings. It should first be noted that the terms and words used in the specification and claims must not be interpreted restrictively by their ordinary or dictionary meanings; rather, based on the principle that the inventor may appropriately define the concepts of terms in order to explain the invention in the best way, they should be construed as meanings and concepts that conform to the technical idea of the present invention. Accordingly, the structures described in the embodiments and drawings of this specification are only preferred embodiments and cannot fully represent the technical idea of the invention; it should therefore be understood that various equivalents and variations capable of replacing them may exist.
The invention takes the open-source Java virtual machine HotSpot on Intel's Xeon Phi many-core processor as an example, to illustrate how the method of the present invention improves the performance of the Java virtual machine.
Xeon Phi is a many-core platform with a new architecture aimed at high-performance computing. Xeon Phi employs a set of extended vector instructions whose encoding format differs from common x86 platforms. Xeon Phi has up to 60 in-order physical cores, each integrating a scalar processing unit and a vector processing unit. The scalar unit (including the floating-point unit, FPU) is compatible with the traditional x86-64 processor architecture, while the vector unit uses a brand-new 512-bit instruction set called IMCI (Intel Initial Many Core Instructions). The IMCI vector instructions are an extension of the x86-64 instruction set, but they differ from AVX-512, the other 512-bit extension instruction set that Intel released in 2013: IMCI does not support traditional SIMD instructions, including Intel MMX, AVX, and every version of SSE. Compared with common x86-64 instructions, IMCI vector instructions mainly differ as follows:
(1) It uses 32 brand-new ZMM vector registers. Each ZMM register is 512 bits wide and can pack 16 32-bit or 8 64-bit floating-point or integer elements. Compared with the 16 128-bit XMM / 256-bit YMM registers on common x86-64 platforms, the Xeon Phi instruction set design can provide efficient instruction-level parallelism and reduce latency in the instruction pipeline;
(2) It supports ternary operation instructions, which contain two source operands and a different, third destination operand. In addition, the IMCI instruction set supports fused multiply-add (FMA) operations: each such instruction has three source operands, one of which is simultaneously the destination operand;
(3) It adds 8 vector mask registers (Vector Mask Register), which control the conditional execution over the different elements of the same ZMM register when a vector instruction executes.
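The fused multiply-add form in point (2), where one source operand is also the destination, has a scalar analogue in Java's Math.fma (available since Java 9), which likewise rounds once, matching the fused hardware operation; relating the two is our illustration, not the patent's:

```java
public class FmaDemo {
    // Accumulate acc = acc + a[i] * b[i] using a fused multiply-add per element.
    static double dot(double[] a, double[] b) {
        double acc = 0.0;
        for (int i = 0; i < a.length; i++) {
            acc = Math.fma(a[i], b[i], acc); // one instruction on FMA hardware
        }
        return acc;
    }

    public static void main(String[] args) {
        double[] a = {1, 2, 3};
        double[] b = {4, 5, 6};
        System.out.println(dot(a, b)); // prints 32.0
    }
}
```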
Beyond the instruction architecture, the encoding format of the MIC extended vector instructions also differs very visibly from common Intel x86 platforms. IMCI instructions introduce two special prefixes, MVEX and VEX. The VEX prefix applies to instructions that operate on the vector mask registers, while the MVEX prefix is for instructions that operate on the ZMM vector registers; MVEX further specifies a series of important pieces of instruction information, including the ternary register operands and the mask register values.
In addition, the memory-address offset rules of IMCI vector instructions are entirely different from those of common x86 platforms. Because MIC vector operations place very strict alignment requirements on memory addresses (for example, the memory address in a double-precision floating-point instruction must be 64-byte aligned, and single precision requires 32-byte alignment), the instruction set design also compresses the addressing offset, a scheme referred to as the disp8*N offset mode. For example, in the memory-operand address of a double vector instruction, the offset is encoded as the original offset divided by 64; for float, the original offset divided by 32 is used, and so on, which saves the pointless encoding of the low bits of an aligned address.
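The disp8*N rule above amounts to simple integer arithmetic, sketched here with names of our own choosing:

```java
public class Disp8N {
    // Encode an aligned byte offset as a compressed displacement: the low
    // bits of an aligned offset carry no information, so divide by N.
    static int encode(int byteOffset, int n) {
        if (byteOffset % n != 0)
            throw new IllegalArgumentException("offset not aligned to " + n);
        return byteOffset / n;
    }

    // Decoding multiplies the stored displacement back by N.
    static int decode(int disp8, int n) {
        return disp8 * n;
    }

    public static void main(String[] args) {
        // A 256-byte offset in a double vector instruction (N = 64) encodes as 4.
        System.out.println(encode(256, 64)); // prints 4
        System.out.println(decode(4, 64));   // prints 256
    }
}
```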
Implementation of the MIC vector instructions
Table 1 gives the instruction extension scheme of the HotSpot virtual machine's assembly library on the MIC many-core platform. For some common operations the MIC extended instruction set contains corresponding vector instructions, such as floating-point add, subtract, and multiply, and software prefetch instructions.
When extending HotSpot's assembly library, IMCI vector instructions require memory address alignment, while the x86 instruction set and HotSpot's memory management impose no such restriction. To satisfy MIC's alignment requirement, the present invention implements it at the instruction level.
Classified by operand type, binary and higher-arity instructions fall into two classes: one contains only register operands; the other contains both register and memory operands. For the former, alignment need not be considered when substituting vector instructions; for the latter, the present invention uses idle ZMM vector registers as temporary storage, as follows:
(1) copy the floating-point memory operand into a temporary ZMM register;
(2) invoke the original vector instruction with the temporary ZMM register as the operand.
Because the MIC platform has twice as many ZMM registers as the x86 platform has SIMD registers, the compatibility scheme has enough temporary registers.
However, for some other operations that are less common in vectorization scenarios, such as floating-point type conversion and conditional data copy instructions, MIC provides no corresponding support. For instructions of this kind, the present invention designs suitable instruction sequences according to their actual instruction logic.
Here we give the implementation of the floating-point magnitude comparison instruction on the MIC platform. To compare with x87 instructions, the two operands must first be loaded into the floating-point unit (FPU) register stack, after which the x87 floating-point compare instruction is invoked. That instruction sets the FPU condition code according to the comparison result, and we then need to set the EFLAGS status register manually according to that condition code; the specific code is shown in Annex 1.
The instruction generation modules of the HotSpot interpreter and of the JIT compiler can then be modified on the basis of the HotSpot assembly library made compatible with the MIC instruction set.
The HotSpot interpreter is not written directly by hand; it uses a template-based implementation. HotSpot's template interpreter (Template interpreter) provides a corresponding instruction template for every Java bytecode. The code shown in Annex 2 gives the interpretation templates of the three bytecodes dadd, dsub, and dmul (which perform double-precision floating-point addition, subtraction, and multiplication respectively).
On the x86 platform, the templates of these three bytecodes use the addsd, movdbl (a wrapper of movsd), subsd, and mulsd instructions, all x86 assembly instructions provided by HotSpot's assembly library. To let the template interpreter use the updated vector instructions in the HotSpot assembly library, we only need to change these four instructions to vaddpd, vmovapd, vsubpd, and vmulpd respectively to complete the replacement of the bytecode templates.
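Collecting the four substitutions named above into a lookup table (the Map representation is our illustration; it is not how HotSpot stores its templates):

```java
import java.util.Map;

public class TemplateSubstitution {
    // Scalar x86 instruction -> MIC vector instruction, per the text above.
    static final Map<String, String> X86_TO_MIC = Map.of(
            "addsd", "vaddpd",  // double add
            "movsd", "vmovapd", // move (movdbl wraps movsd)
            "subsd", "vsubpd",  // double subtract
            "mulsd", "vmulpd"); // double multiply

    static String rewrite(String scalarOp) {
        return X86_TO_MIC.getOrDefault(scalarOp, scalarOp);
    }

    public static void main(String[] args) {
        System.out.println(rewrite("mulsd")); // prints vmulpd
        System.out.println(rewrite("nop"));   // prints nop (no vector counterpart)
    }
}
```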
From this point on, the HotSpot virtual machine initializes the interpreter at startup from the updated assembly templates. When the virtual machine interprets a Java method, it enters and jumps into the method through the interpreter, then executes one by one the new MIC instruction sequences that correspond to the method in the template interpreter.
While HotSpot interprets a Java method, JIT compilation is triggered once the method's invocation count exceeds a certain threshold. The JIT that follows compiles and optimizes the bytecodes of the Java method into a series of native assembly codes, and finally generates the corresponding binary machine code using the encoding rules of HotSpot's assembly library.
The conversion from the low-level IR to native code depends on a platform-specific architecture description file (Architecture Description File, AD file). The AD file resembles a template: it defines the matching rules from low-level machine-code nodes to HotSpot assembly-library instructions.
Annex 3 gives the implementation of a rule description in the AD file. In the code, Set dst (MulD dst (LoadD src)) represents a pattern in which a LoadD node serves as the input of a MulD node. On the x86 platform, the matched instruction rule is a mulsd instruction that takes an XMM register and a memory address as operands. According to the scheme given in Table 1, the instruction corresponding to mulsd is vmulpd. Therefore, we only need to change the code in ins_encode to vmulpd for the JIT to generate the corresponding MIC vector instruction for double-precision floating-point multiplication.
By modifying all the relevant matching rules in the AD file, the customization of the JIT native code generation system for the MIC architecture is completed.
Semi-automatic vectorization
The semi-automatic vector model introduces new vector bytecodes into the HotSpot virtual machine. Therefore, the present invention needs to modify the verifier (Verifier) to handle the verification of the Java bytecodes in class files; meanwhile, the corresponding vector bytecodes and MIC native vector instructions are added to HotSpot's bytecode library and assembly library.
HotSpot's template interpreter uses a stack-based architecture. To improve interpreter performance, HotSpot's implementation employs a technique called top-of-stack caching (Top-of-Stack Caching, ToS), which keeps the frequently accessed top-of-stack element in a hardware register to reduce the cost of accessing the in-memory operand stack.
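The idea behind top-of-stack caching can be sketched with a toy interpreter in which a local variable stands in for the hardware register; this models the concept only and is not HotSpot's implementation:

```java
public class TosInterpreter {
    private final double[] stack = new double[64];
    private int sp = 0;      // number of elements below the cached top
    private double tos;      // cached top-of-stack "register"
    private boolean cached;  // ToS state: is the top held in the register?

    void push(double v) {
        if (cached) stack[sp++] = tos; // spill the old top to the memory stack
        tos = v;
        cached = true;
    }

    double pop() {
        if (cached) { cached = false; return tos; }
        return stack[--sp];
    }

    // dmul with ToS caching: the right operand is already in the register,
    // so the in-memory operand stack is touched once, not three times.
    void dmul() {
        double r = pop();
        double l = pop();
        push(l * r);
    }

    public static void main(String[] args) {
        TosInterpreter in = new TosInterpreter();
        in.push(3.0);
        in.push(4.0);
        in.dmul();
        System.out.println(in.pop()); // prints 12.0
    }
}
```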
MIC vector bytecodes typically operate in units of 8 doubles or 16 floats, so they differ from ordinary bytecodes in both the size and the type of the top-of-stack element. Taking the double type as an example, we additionally add vdtos in HotSpot to represent the ToS state in which the stack top caches a double vector operand. Meanwhile, we add the template definition of the new bytecode vdmul to the interpreter's template table (TemplateTable), including vdmul's sequence number in the bytecode library, its corresponding handler routine, and its ToS states on entry and exit:
def(Bytecodes::_vdmul,____|____|____|____,vdtos,vdtos,vdop2,mul);
Analogously to the handling of the vdmul bytecode, the semi-automatic vector model can introduce other vector bytecodes.
To add vector bytecodes to the JIT, the following must each be modified: 1) in the compiler's common service module, add handling methods specifically for the vector bytecodes; 2) in the JIT optimization module, implement the JIT's parsing of vector bytecodes and the optimization work that follows; the vectorization model targets compute-intensive array operation loops, and this module must ensure that the new vector nodes are correctly recognized during loop optimization; 3) in the native code generation module, the JIT completes instruction selection through BURS (bottom-up rewrite system), which involves the platform-specific architecture description file (the AD file) and the platform-independent architecture description language compiler (ADLC). To generate native vector instructions, we must first add the parsing logic for the ideal-graph vector nodes to ADLC, and then add the corresponding matching rules to the AD file.
Data dynamic prefetches
The semi-automatic vectorization model generates new vector bytecodes vif_icmpge and vgoto to represent the designated loop level. Based on the new vector bytecodes, the system computes the loop length of the vector. The interpreter then triggers the HotSpot virtual machine's safepoint mechanism, dynamically replaces the original bytecode table with a new bytecode table that can generate prefetch instructions, and then triggers the trial-run mechanism, which runs the loop with different stride combinations and records the execution times; the optimal prefetch stride is obtained from the shortest time and passed to the JIT compiler.
The JIT optimizes bytecode on the basis of the ideal graph. To add the prefetch operations formulated for the MIC platform into the JIT's dynamic compilation process, we add new prefetch nodes and corresponding machine nodes (Machine Node) to the ideal graph; these nodes contain the specific information of the prefetch operations. Meanwhile, based on the information passed on by the dynamic profiler, such as the prefetch stride, we finally determine the prefetch instructions and parameters inside the prefetch nodes. After the prefetch nodes have been added, the newly generated ideal graph proceeds through the JIT's remaining optimization phases, and finally, in the code generation phase, the native code containing the prefetch instructions is generated according to the newly added machine nodes and the corresponding rules.
Table 1
Annex 1
Annex 2
Annex 3

Claims (3)

1. A method of modifying a Java virtual machine for many-core processors, characterized by comprising two parts: (1) designing a semi-automatic vectorization model and modifying the Java front-end compiler to discover code in the program that can be handled by the vector computation units; (2) using dynamic prefetching to make full use of memory access bandwidth and reduce latency; wherein:
In step (1), the semi-automatic vectorization model is designed and the Java front-end compiler is modified to discover code in the program that can be handled by the vector computation units; the specific procedure is as follows:
A semi-automatic vectorization model is designed: when coding, the programmer may add an annotation in front of an array loop that is to be vectorized. This hands the vectorization decision to the programmer, so that SIMD optimization opportunities in the code are recognized more fully;
To support the semi-automatic vectorization model, the Java front-end compiler must be modified. When processing Java source code, the compiler identifies vector loops by their annotations and compiles the array operations inside them into a series of new bytecodes, called vector bytecodes. The stride of the loop iteration, i.e. the number of elements advanced per iteration, is controlled according to the vector width. Furthermore, because the vector units on a many-core processor require aligned addresses for data accesses while the Java virtual machine imposes no strict alignment requirement, when executing high-performance applications containing vector bytecodes it must be ensured that the starting address of each array satisfies the alignment requirement of the vector processing unit.
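As an illustration of the stride change, a loop whose iteration step equals the vector width can be sketched in Java as follows. The value of VECTOR_WIDTH is a hypothetical example, the inner lane loop merely models what a single vector operation would do, and alignment handling is omitted:

```java
public class StridedLoop {
    static final int VECTOR_WIDTH = 8; // hypothetical: elements per vector on the target

    // Original scalar loop: one element per iteration.
    static float sumScalar(float[] a) {
        float s = 0;
        for (int i = 0; i < a.length; i++) s += a[i];
        return s;
    }

    // After the front-end rewrite: the iteration stride equals the vector width,
    // so each trip stands in for one vector operation; a scalar tail loop
    // handles the leftover elements.
    static float sumStrided(float[] a) {
        float s = 0;
        int i = 0;
        for (; i + VECTOR_WIDTH <= a.length; i += VECTOR_WIDTH)
            for (int lane = 0; lane < VECTOR_WIDTH; lane++) // models one vector add
                s += a[i + lane];
        for (; i < a.length; i++) s += a[i];                // scalar tail
        return s;
    }

    public static void main(String[] args) {
        float[] a = new float[19];
        for (int i = 0; i < a.length; i++) a[i] = i;
        System.out.println(sumScalar(a) + " " + sumStrided(a)); // both 171.0
    }
}
```

Both versions compute the same result; only the iteration structure changes, which is exactly what the vector bytecodes encode.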
2. The method of modifying a Java virtual machine for many-core processors according to claim 1, characterized in that in step (1), for native libraries that were compiled using the CPU's vector instructions for floating-point operations, the floating-point instruction encodings in the Java virtual machine's native assembly libraries are modified in the following two ways to generate the platform's custom vector instructions:
(1) Equivalent substitution: look up, in the platform's vector instruction set, instructions with the same function and semantics to substitute for the original SSE and AVX instructions; where no single instruction fully meets the requirement, the substitution is realized with a combination of several instructions;
(2) Software implementation: floating-point arithmetic operations that the platform instruction set cannot cover are implemented in software using the platform's floating-point instructions.
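The two rewriting strategies can be sketched as a rewrite table in Java. The mnemonics and the software-fallback naming are made up for the example; real substitution operates on instruction encodings in the native assembly library, not on strings:

```java
import java.util.List;
import java.util.Map;

public class EquivalentSubstitution {
    // Hypothetical rewrite table: original SSE/AVX mnemonic -> platform replacement(s).
    // One entry = direct equivalent; several entries = instruction combination;
    // empty list = not covered by the platform ISA, fall back to software.
    static final Map<String, List<String>> REWRITE = Map.of(
        "addps",  List.of("vaddps_plat"),
        "haddps", List.of("vshuf_plat", "vaddps_plat"),
        "rcpps",  List.of()
    );

    static List<String> rewrite(String mnemonic) {
        List<String> r = REWRITE.get(mnemonic);
        if (r == null) throw new IllegalArgumentException("unknown mnemonic: " + mnemonic);
        // Empty list: emit a call to a software routine built from scalar FP instructions.
        return r.isEmpty() ? List.of("call __soft_" + mnemonic) : r;
    }

    public static void main(String[] args) {
        System.out.println(rewrite("haddps"));
    }
}
```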
3. The method of modifying a Java virtual machine for many-core processors according to claim 1 or 2, characterized in that step (2) uses dynamic prefetching to make full use of memory access bandwidth and reduce latency; the details are as follows:
Dynamic prefetching is divided into two parts: dynamic step-size analysis in the Java interpreter and loop-node traversal in the JIT compiler. When the Java program runs into a core computation loop, the analyzer is triggered and begins dynamically collecting the loop iteration count. After the analyzer has obtained the trip count, an initial prefetch step size must be determined; this step size depends on the characteristics of the specific platform. The pre-runner inside the analyzer then receives the initial step size, derives from it a series of different prefetch step values, and inserts the corresponding prefetch instructions into the next several groups of iterations. The range of the run step sizes in each group obeys the following equations:
S2′ = [S2∗(1−P), S2∗(1+P)]
S1′ = [S1∗(1−P), S1∗(1+P)]
Here S2 and S1 denote the initial prefetch step sizes for the MEM->L2 and L2->L1 levels respectively, and the setting of P depends on the specific many-core platform. The pre-runner executes the loop test for every step-size combination with S1′ < S2′. When the test loops are finished, the analyzer records the step-size combination with the shortest run time and passes it to the JIT compiler, which will perform dynamic compilation next;
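The pre-run search can be sketched as follows. The cost function below is a stand-in for actually timing the instrumented loop, and the sampling of three points per range and the value of P are arbitrary choices for the example:

```java
import java.util.ArrayList;
import java.util.List;

public class PreRunSearch {
    // Candidate step sizes drawn from [s*(1-P), s*(1+P)], sampled at a few points.
    static List<Integer> candidates(int s, double p) {
        List<Integer> out = new ArrayList<>();
        for (int k = -1; k <= 1; k++)                    // low end, initial, high end
            out.add((int) Math.round(s * (1 + k * p)));
        return out;
    }

    // Stand-in for timing the loop with prefetch steps (s1, s2); a real pre-runner
    // would execute the instrumented loop body and measure its run time.
    static double cost(int s1, int s2) {
        return Math.abs(s1 - 64) + Math.abs(s2 - 256);   // pretend (64, 256) is optimal
    }

    // Try every combination with s1' < s2' and keep the cheapest, as the pre-runner does.
    static int[] best(int s1, int s2, double p) {
        int[] best = null;
        double bestCost = Double.MAX_VALUE;
        for (int a : candidates(s1, p))
            for (int b : candidates(s2, p)) {
                if (a >= b) continue;                    // only combinations with S1' < S2'
                double c = cost(a, b);
                if (c < bestCost) { bestCost = c; best = new int[] { a, b }; }
            }
        return best;
    }

    public static void main(String[] args) {
        int[] r = best(64, 256, 0.25);
        System.out.println(r[0] + " " + r[1]);           // 64 256 under this toy cost function
    }
}
```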
When the program enters the JIT compiler's dynamic compilation, the added loop-node traversal module walks all nodes of the loop before the instruction generation phase. During the traversal, the JIT first identifies every memory read and write operation in the loop and inserts prefetch nodes around them; repeated prefetch instructions that prefetch the same address are found and eliminated by the traverser's iterative scan, which at the same time determines which memory access patterns require exclusive prefetch operations. After the loop traversal ends, dynamic compilation enters the code generation phase, and the JIT generates a native instruction stream containing the prefetch instructions from the optimized node graph, thereby completing both the formulation of the dynamic prefetching policy and the instruction generation at run time.
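The elimination of repeated prefetches of the same address can be sketched in Java as a single scan over the loop's nodes (the Node record is a hypothetical stand-in for the JIT's node graph):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PrefetchDedup {
    // Hypothetical stand-in for a node in the loop's node graph.
    record Node(String kind, int addr) { }

    // Drop any Prefetch node whose address was already prefetched earlier in
    // the same loop body; all other nodes pass through unchanged.
    static List<Node> dedup(List<Node> nodes) {
        Set<Integer> seen = new HashSet<>();
        List<Node> out = new ArrayList<>();
        for (Node n : nodes) {
            if (n.kind().equals("Prefetch") && !seen.add(n.addr()))
                continue; // same address already prefetched: eliminate the repeat
            out.add(n);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Node> loop = List.of(
            new Node("Prefetch", 64), new Node("Load", 0),
            new Node("Prefetch", 64), new Node("Load", 0));
        System.out.println(dedup(loop).size()); // 3: the second Prefetch of 64 is removed
    }
}
```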
CN201710871869.7A 2017-09-25 2017-09-25 Towards the method for the modification Java Virtual Machine of many-core processor Pending CN107729118A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710871869.7A CN107729118A (en) 2017-09-25 2017-09-25 Towards the method for the modification Java Virtual Machine of many-core processor


Publications (1)

Publication Number Publication Date
CN107729118A true CN107729118A (en) 2018-02-23

Family

ID=61207889

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710871869.7A Pending CN107729118A (en) 2017-09-25 2017-09-25 Towards the method for the modification Java Virtual Machine of many-core processor

Country Status (1)

Country Link
CN (1) CN107729118A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103729235A (en) * 2013-12-24 2014-04-16 华为技术有限公司 Java virtual machine (JVM) and compiling method thereof
CN107077364A (en) * 2014-09-02 2017-08-18 起元科技有限公司 The compiling of the program specification based on figure of the automatic cluster of figure component is used based on the identification that specific FPDP is connected
CN107145344A (en) * 2014-09-02 2017-09-08 起元科技有限公司 The assignment component in the program based on figure
US20160274996A1 (en) * 2015-03-19 2016-09-22 International Business Machines Corporation Method to efficiently implement synchronization using software managed address translation

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
YANG YU et al.: "OpenJDK Meets Xeon Phi: A Comprehensive Study of Java HPC on Intel Many-core Architecture", 2015 44TH INTERNATIONAL CONFERENCE ON PARALLEL PROCESSING WORKSHOPS *
YU Yang, ZANG Binyu: "Research and Optimization of Dynamic Data Prefetching for the Java Virtual Machine on the Intel Many-core Architecture", Journal of Chinese Computer Systems (《小型微型计算机系统》) *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110275713A (en) * 2019-07-02 2019-09-24 四川长虹电器股份有限公司 A kind of improved method of Java Virtual Machine rear end compiling
CN110399124A (en) * 2019-07-19 2019-11-01 浪潮电子信息产业股份有限公司 A kind of code generating method, device, equipment and readable storage medium storing program for executing
CN110399124B (en) * 2019-07-19 2022-04-22 浪潮电子信息产业股份有限公司 Code generation method, device, equipment and readable storage medium
CN111443947A (en) * 2020-03-24 2020-07-24 山东大学 Sequence comparison method and system for next-generation sequencing data based on many-core platform

Similar Documents

Publication Publication Date Title
Frison et al. BLASFEO: Basic linear algebra subroutines for embedded optimization
Ho et al. Exploiting half precision arithmetic in Nvidia GPUs
US8312424B2 (en) Methods for generating code for an architecture encoding an extended register specification
US9720708B2 (en) Data layout transformation for workload distribution
Porpodas et al. PSLP: Padded SLP automatic vectorization
US8893104B2 (en) Method and apparatus for register spill minimization
BR102020019657A2 (en) apparatus, methods and systems for instructions of a matrix operations accelerator
Anderson et al. Checked load: Architectural support for javascript type-checking on mobile processors
Clark et al. Liquid SIMD: Abstracting SIMD hardware using lightweight dynamic mapping
CN107729118A (en) Towards the method for the modification Java Virtual Machine of many-core processor
Metcalf The seven ages of fortran
Hallou et al. Dynamic re-vectorization of binary code
Hong et al. Improving simd parallelism via dynamic binary translation
Rohou et al. Vectorization technology to improve interpreter performance
US9298630B2 (en) Optimizing memory bandwidth consumption using data splitting with software caching
Khaldi et al. Extending llvm ir for dpc++ matrix support: A case study with intel® advanced matrix extensions (intel® amx)
Khan et al. RT-CUDA: a software tool for CUDA code restructuring
Engelke et al. Robust Practical Binary Optimization at Run-time using LLVM
Liu et al. Exploiting SIMD asymmetry in ARM-to-x86 dynamic binary translation
Singh An Empirical Study of Programming Languages from the Point of View of Scientific Computing
Kawakami et al. A Binary Translator to Accelerate Development of Deep Learning Processing Library for AArch64 CPU
WO2022174542A1 (en) Data processing method and apparatus, processor, and computing device
Tian et al. Inside the Intel 10.1 Compilers: New Threadizer and New Vectorizer for Intel Core2 Processors.
Mendelson The Architecture
Tartara et al. Parallelism and Retargetability in the ILDJIT Dynamic Compiler

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20180223