CN1173262C - Optimized bytecode interpreter of virtual machine instructions

Info

Publication number: CN1173262C
Application number: CNB008029741A
Authority: CN
Inventors: F. Riccardi
Original assignee: Koninklijke Philips Electronics NV
Current assignee: Koninklijke Philips NV
Priority date: 1999-09-21
Filing date: 2000-09-13
Publication date: 2004-10-27
Anticipated expiration: 2020-09-13
Also published as: JP2003510681A; KR20010080525A; WO2001022213A3; EP1183598A2; CN1347525A; WO2001022213A2

Abstract

The invention relates to a method of optimizing interpreted programs in a virtual machine interpreter of a bytecode-based language, comprising means for dynamically reconfiguring said virtual machine with macro operation codes by replacing an original sequence of simple operation codes with a new sequence of said macro operation codes. The virtual machine interpreter is coded as an indirect threaded interpreter by means of a translation table that contains the implementation addresses of the operation codes and is used to translate the bytecodes into those implementation addresses. Application: embedded systems using any bytecode-based programming language, such as set-top boxes for interactive video transmission.

Description

Optimized bytecode interpreter of virtual machine instructions

Field of the invention

The present invention relates to the run-time optimization of interpreted programs. More particularly, it relates to a method of optimizing interpreted programs by means of a virtual machine that dynamically reconfigures itself with new macro operation codes. The invention applies to any bytecode-based programming language.

Background of the invention

Bytecode-based languages with programmer-visible stacks are widely used as intermediate languages between compilers and machine-independent executable program representations. Such languages offer significant advantages for network computing. The article by L. Piumarta and F. Riccardi, "Optimizing direct threaded code by selective inlining", published in the proceedings of the ACM SIGPLAN '98 conference on Programming Language Design and Implementation (PLDI), Montreal, Canada, June 17, 1998, pages 291-300, describes in its opening paragraphs a technique for optimizing interpreted programs. Programs are interpreted on a virtual machine (VM) by means of a VM interpreter. A VM is a software implementation that represents a virtual processor architecture, on which applications compiled for that architecture are executed. The instructions of this virtual processor/machine are called bytecodes. The VM interpreter is the part of the VM that represents the bytecode execution mechanism. The bytecodes are said to be interpreted by the VM interpreter. The bytecode execution mechanism is currently implemented as an infinite loop containing a switch-case block. The technique described in the above article applies to direct threaded interpreters. A threaded code interpreter executes the bytecodes in-line. The translation of each bytecode includes a reference to the next bytecode. Bytecode translation performed by a threaded interpreter therefore does not involve an infinite loop. Although threaded interpreters have performance advantages, the direct threaded code interpreter mentioned in the above article is too slow and requires too much memory, and is not suitable for most embedded systems. In a direct threaded code interpreter, each VM bytecode is represented by the address of its implementation, so that execution can jump directly to the implementation of the next bytecode. Before translation, a table is initialized with the address of each bytecode of the application, so that the physical address of a bytecode's implementation can be accessed quickly when the bytecode is translated. This table makes it possible to switch from one bytecode to another. Although direct threaded interpreters are quite fast, they involve code expansion. Converting bytecode into direct threaded code increases the code size by about 150%, because the opcodes are replaced by the addresses of their implementation code. In general, an address requires 4 bytes, whereas an opcode requires only 1 byte. Direct threaded interpreters therefore increase memory consumption and are not well suited to embedded systems.

Summary of the invention

The invention aims to provide a method of optimizing the run time of interpreted programs that is well suited to embedded systems. Such a system can be, for example, a satellite or cable communication system embedded in a digital video receiver, commonly called a set-top box. However, the invention is equally suitable for any product whose operating system is based on a bytecode programming language. The invention also saves memory and CPU resources and can thus improve the performance of the system.

According to the invention, a method is described for optimizing interpreted programs in a virtual machine interpreter of a bytecode-based language, wherein the virtual machine dynamically reconfigures itself by replacing sequences of simple bytecodes with new macro bytecodes (or opcodes), and wherein the virtual machine interpreter is coded as a threaded code interpreter for translating the bytecodes into their implementation code. According to the invention, the threaded code interpreter is coded as an indirect threaded code interpreter by means of a reference table that contains the implementation addresses of the bytecodes, so that the address of the next bytecode can be fetched during the translation of a bytecode, which makes it possible to jump to the next bytecode.

Brief description of the drawings

The invention, and additional features that may optionally be used to implement it, will become apparent with reference to the following drawings.

Fig. 1 is a block diagram illustrating the features of a method according to the invention.

Fig. 2 is a block diagram illustrating the features of a method according to a preferred embodiment of the invention.

Fig. 3 is a schematic diagram illustrating an example of a receiver according to the invention.

Detailed description of the invention

The invention will now be explained in detail, taking the Java language as an example, to present a novel run-time optimization strategy for virtual machines that is suitable for any bytecode-based language.

The approach usually taken by just-in-time (JIT) compilers is to discard the Java Virtual Machine (VM) interpreter altogether and to translate the bytecode of the application into native machine code just before it is executed (hence the name JIT). This approach involves understanding the semantics of the original application and re-expressing it in a more convenient native form. Although it can be an effective way to gain performance, it comes at the cost of high memory consumption, because bytecode-based languages are more compact than native code, and of considerable CPU (central processing unit) resources, because re-mapping Java bytecode onto the target machine is not an easy task.

The present invention is also based on a kind of dynamic code generation, but its aim is not to translate the Java bytecode of the application into native machine code. Instead, it dynamically adapts the Java VM to the execution of the specific bytecode sequences of the application. The original Java bytecode of the application is therefore retained, while the VM is dynamically enriched with new bytecodes, or operation codes (opcodes), that improve its execution efficiency.

This approach has several advantages:

It does not increase the size of the executable code: the application stays in the memory-efficient Java bytecode representation.

The VM execution mechanism is economical: there is only one execution mechanism, so the mechanism that executes the application does not have to handle multiple code representations, which reduces its size and improves its reliability.

Code generation is quite simple: the VM optimizer has a very simple structure, and the analysis of the application's bytecode is a single-pass, table-driven process that uses very little CPU resource and directly drives the synthesis of the new bytecodes.

These characteristics make the invention suitable for embedded systems. The optimization technique of the invention is based on a study of the cost of the most basic mechanisms of the interpreter with respect to a class of "typical" applications. The profile of such applications determines the potential gain obtainable from the various optimization techniques that can be considered. Since the target is embedded applications, the programs that can be regarded as "typical" are, for example, control applications, graphical user interfaces, and the like.

The target applications are assumed to map fully onto the primitives provided by the underlying VM (object management). They will therefore benefit little from basic code transformations and will benefit mostly from an overall improvement of the VM execution mechanism. To understand how to improve the efficiency of the VM, Amdahl's law was used. In the formulation given by Hennessy and Patterson, Amdahl's law states that the performance improvement gained from using some faster mode of execution is limited by the fraction of the time during which the faster mode can be used; more concisely, "make the common case fast".

The performance of an interpreter depends on the chosen representation of the executable code and on the mechanism used to dispatch the bytecodes. A first way to reduce execution overhead is to reduce the cost of instruction dispatch, since the core of an interpreter is its instruction dispatch mechanism. The typical interpreter is called a pure bytecode interpreter. Its execution mechanism resembles an emulation of a processor: a large switch statement inside a closed loop dispatches instructions to their implementations. The inner loop of a pure bytecode interpreter is therefore very simple: fetch the next bytecode and dispatch it to its implementation with the switch statement. The interpreter is an infinite loop containing a switch statement that dispatches successive bytecodes; each case ends with a break out of the switch, which returns control to the start of the infinite loop, where the next bytecode is fetched. The following set of instructions shows the implementation of a typical bytecode interpreter.

loop {
  op = *pc++;
  switch (op) {
    case op_1:
      // op_1's implementation
      break;
    case op_2:
      // op_2's implementation
      break;
    case op_3:
      // op_3's implementation
      break;
  }
}

Assuming that the compiler optimizes the chain of jumps from the break at the end of each case, through the implicit jump at the end of the loop, back to the start of the loop, the overhead associated with this approach is:

Increment the instruction pointer pc,

Fetch a bytecode from memory,

Perform a redundant range check on the switch argument,

Fetch the address of the destination case label from a table,

Jump to that address,

and, at the end of each bytecode:

Jump back to the start of the loop to fetch the next bytecode.

In this example, ignoring other sources of inefficiency in the actual implementation of the switch statement, the cost of dispatching an instruction comprises:

2 memory accesses: one to fetch the value of the next instruction and one to fetch the address of its implementation,

plus 2 jumps: one to jump to the implementation of the bytecode and one to return to the start of the loop. On modern architectures, jumps are the most expensive instructions.
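The dispatch steps listed above can be seen in a minimal, runnable sketch of a pure bytecode interpreter. The three-opcode instruction set (OP_HALT, OP_PUSH, OP_ADD), the function name run_switch and the 16-slot stack are assumptions made for illustration only; they do not come from the text.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical three-opcode instruction set, for illustration only. */
enum { OP_HALT = 0, OP_PUSH = 1, OP_ADD = 2 };

/* Pure bytecode interpreter: an infinite loop around a switch.
 * Returns the value on top of the stack when OP_HALT is reached. */
int run_switch(const uint8_t *pc)
{
    int stack[16];
    int sp = 0;                            /* number of values on the stack */

    for (;;) {
        uint8_t op = *pc++;                /* fetch the bytecode, increment pc */
        switch (op) {                      /* range check, table lookup, jump */
        case OP_PUSH:
            assert(sp < 16);
            stack[sp++] = (int8_t)*pc++;   /* one-byte signed operand */
            break;                         /* jump back to the start of the loop */
        case OP_ADD:
            assert(sp >= 2);
            sp--;
            stack[sp - 1] += stack[sp];
            break;
        case OP_HALT:
            assert(sp >= 1);
            return stack[sp - 1];
        default:
            assert(0 && "unknown opcode");
            return 0;
        }
    }
}
```

With this sketch, the program { OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_HALT } leaves 5 on top of the stack. Every bytecode pays for the switch lookup and for the jump back to the top of the loop, even though OP_ADD itself is only a couple of machine operations.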

Pure bytecode interpreters are easy to write and to understand. They are also very compact, but quite slow. They are therefore not well suited to embedded systems. Where most bytecodes perform simple operations, as in the example above, most of the execution time is wasted on instruction dispatch. To appreciate the real cost of this mechanism, it is worth comparing it with the execution cost of the individual bytecodes. Java bytecodes have very low-level semantics, and their implementation is usually trivial. Consequently, the most frequently executed bytecodes are actually cheaper than the dispatch mechanism itself.

As shown in the following set of instructions, the first improvement obtained according to the invention is the use of indirect threaded code:

op_1_lbl:

// op_1's implementation

goto opcode_table(*pc++);

op_2_lbl:

// op_2's implementation

goto opcode_table(*pc++);

op_3_lbl:

// op_3's implementation

goto opcode_table(*pc++);

Here op_1_lbl, op_2_lbl and op_3_lbl represent three different opcodes interpreted by the VM interpreter.

In this implementation, called indirect threaded code, the VM interpreter is coded as an indirect threaded code interpreter. The address of the next bytecode is resolved during the translation of the current bytecode. A reference table, denoted opcode_table, contains the implementation addresses of the bytecodes. This reference table is accessed using the pointer expression (*pc++) as an index. In order to jump to the next bytecode, its address is fetched each time a bytecode is translated. In this way, the implementation of each bytecode jumps directly to the implementation of the next one, which saves one jump as well as the unnecessary inefficiencies of the outer loop and of the switch statement (range check and default case handling).

According to a preferred embodiment of the invention, the translation is implemented by exploiting those bytecodes that are unused in the VM implementation of the bytecode-based language.
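A minimal runnable sketch of this indirect threaded dispatch is given below. It relies on the "labels as values" extension of GCC (also accepted by Clang), so it is not standard C. The opcode set, the function name run_threaded and the DISPATCH macro are assumptions made for illustration; only the name opcode_table is taken from the text.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical three-opcode instruction set, for illustration only. */
enum { OP_HALT = 0, OP_PUSH = 1, OP_ADD = 2 };

/* Indirect threaded interpreter: each implementation ends by looking up
 * the next bytecode in opcode_table and jumping straight to it. */
int run_threaded(const uint8_t *pc)
{
    /* Reference table: bytecode value -> address of its implementation.
     * The order of the entries must match the enum above. */
    static void *const opcode_table[] = { &&op_halt, &&op_push, &&op_add };
    int stack[16];
    int sp = 0;                        /* number of values on the stack */

/* Fetch the next bytecode, look up its implementation, jump there.
 * There is no range check here: the bytecode is assumed to be valid. */
#define DISPATCH() goto *opcode_table[*pc++]

    DISPATCH();                        /* start with the first bytecode */

op_push:
    assert(sp < 16);
    stack[sp++] = (int8_t)*pc++;       /* one-byte signed operand */
    DISPATCH();

op_add:
    assert(sp >= 2);
    sp--;
    stack[sp - 1] += stack[sp];
    DISPATCH();

op_halt:
    assert(sp >= 1);
    return stack[sp - 1];

#undef DISPATCH
}
```

Compared with a loop around a switch, there is no outer loop and no default case: each implementation performs one table lookup and one jump. The bytecode itself stays one byte per opcode, which is what distinguishes indirect from direct threaded code.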

Fig. 1 summarizes the main steps of the method according to the invention, in which an indirect threaded code interpreter translates a bytecode, for example the bytecode bipush, into native instructions:

Step K0 = BIPUSH: start of the translation of the bytecode bipush, which consists in pushing a half word onto the stack, this half word being the parameter (par) of bipush,

Step K1 = PAR: fetch the parameter (par) of bipush,

Step K2 = PUT: put the parameter of bipush onto the stack,

Step K3 = GOTO: jump to the next bytecode by looking it up in the reference table (opcode_table), which contains the implementation address of the next bytecode (goto opcode_table(*pc)).

The performance of the VM can be doubled merely by adopting threaded code, but, as will be seen below, threaded code also opens up other interesting optimization opportunities. Statistics on Java bytecode show that, on average, there is about one branch for every 5-6 instructions. On any modern CPU, branches are inherently expensive instructions, because they can cause pipeline stalls and/or trigger external bus activity. Moreover, little can be done about this by loop unrolling or method call in-lining. Even when the code is compiled into a native representation, the control statements will still be there.

Recent studies of CPU usage by object-oriented applications on high-end workstations show that, owing to effects such as mispredicted branch instructions, the CPU can spend 70% of its clock cycles recovering from pipeline stalls and waiting for data and instructions from main memory (cache misses). In addition, the CPUs available in embedded systems have very small caches, no hardware support for dynamic branch prediction, and a slow and/or narrow memory interface to the L2 cache. These additional constraints further reduce CPU utilization and performance.

Java bytecodes can be divided into two classes:

simple opcodes (load, store, arithmetic and control statements) and complex opcodes (memory management, synchronization, etc.).

Simple bytecodes are typically cheaper than the dispatch mechanism. Complex bytecodes are much more expensive, and the dispatch cost represents only a very small part of their total execution cost. Simple bytecodes are executed far more frequently than complex ones (by about an order of magnitude), which means that a classic Java interpreter spends most of its time dispatching bytecodes rather than doing any useful work. It can therefore be concluded that reducing the dispatch cost is more effective for simple bytecodes than for complex ones.

Translating bytecode into indirect threaded code also provides the opportunity to apply arbitrary transformations to the executable code. One such transformation is to detect common sequences of bytecodes and translate them into a single threaded "macro opcode". The macro opcode performs the work of the whole original bytecode sequence. Therefore, according to a preferred embodiment of the invention, it is proposed to replace sequences of simple bytecodes with equivalent "macro opcodes". For example, as described in the article cited above, the bytecodes "push literal, push variable, add, store variable" can be translated into a single threaded "add-literal-to-variable" macro opcode. Such optimizations are effective because they avoid the overhead of the multiple dispatches implied by the original bytecodes, which are eliminated within the macro opcode. A single macro opcode translated from a sequence of N original bytecodes avoids N-1 bytecode dispatches when it is executed. Details on how to generate macro opcodes can be found in the article cited above. Such macro opcodes must satisfy the following criteria:

Macros must be generated from sequences of simple bytecodes, because nothing can be gained from reducing the dispatch cost of complex bytecodes.

A macro must never contain an instruction that may be a branch target; otherwise, major changes would have to be made to the VM execution mechanism. The macro opcode itself can be a branch target.

Macros must end with a control statement or a method call, because the cost of a native branch is equivalent to the cost of a dispatch operation.

For practical implementation, the maximum length of a macro should be about 15 bytecodes. The "natural" average macro length is 4-5 bytecodes. Given these criteria and constraints, such macro sequences can be built very easily and with very little CPU time. The bytecode of a method only needs a simple scan, and most of the analysis can be table-driven and based on single bytecodes.
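The criteria above can be sketched as a single forward scan over per-instruction flags. The InsnInfo structure, its field names and the function macro_run_length are hypothetical; the sketch also assumes that a simple control statement is kept as the last element of the macro and that a run of fewer than two bytecodes is not worth a macro, neither of which is stated in the text.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_MACRO_LEN 15   /* maximum macro length suggested in the text */

/* Hypothetical per-instruction flags, e.g. looked up in a table per opcode. */
typedef struct {
    unsigned simple : 1;         /* load, store, arithmetic or control statement */
    unsigned control : 1;        /* control statement or method call */
    unsigned branch_target : 1;  /* some branch may jump to this instruction */
} InsnInfo;

/* Returns the length of the macro candidate starting at insn[start],
 * or 0 if fewer than two bytecodes can be merged there. */
size_t macro_run_length(const InsnInfo *insn, size_t count, size_t start)
{
    size_t len = 0;

    while (start + len < count && len < MAX_MACRO_LEN) {
        const InsnInfo *cur = &insn[start + len];

        if (!cur->simple)
            break;               /* complex bytecodes stay outside macros */
        if (len > 0 && cur->branch_target)
            break;               /* only the first bytecode may be a target */
        len++;
        if (cur->control)
            break;               /* a control statement closes the macro */
    }
    return len >= 2 ? len : 0;   /* a single bytecode saves no dispatch */
}
```

A run of length N found this way replaces N dispatches with one, so it saves N-1 dispatches each time the macro is executed.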

According to a particular variant of this preferred embodiment, since unused bytecodes are few (30-40 on average), a 2-byte representation can be used to encode the new bytecodes that represent the new macro instructions. The operands of the original sequence are grouped immediately after the new sequence, so that they can easily be accessed by incrementing the virtual machine program counter.
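One possible reading of this 2-byte representation is an escape prefix: a single unused opcode value announces a macro, a second byte selects which macro, and the operands of the original sequence follow in order. This layout, the value of MACRO_PREFIX and the function encode_macro are assumptions of the sketch below; the text does not specify the encoding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: one unused opcode value reserved as the macro prefix. */
#define MACRO_PREFIX 0xFE

/* Writes the 2-byte macro opcode followed by the operands of the original
 * sequence. Returns the number of bytes written, or 0 if out is too small. */
size_t encode_macro(uint8_t *out, size_t cap, uint8_t macro_id,
                    const uint8_t *operands, size_t n_operands)
{
    size_t need = 2 + n_operands;

    if (cap < need)
        return 0;
    out[0] = MACRO_PREFIX;             /* first byte: "a macro follows" */
    out[1] = macro_id;                 /* second byte: which macro */
    for (size_t i = 0; i < n_operands; i++)
        out[2 + i] = operands[i];      /* operands, in their original order */
    return need;
}
```

Because the operands sit contiguously after the opcode, the macro implementation can read them one by one with *pc++, exactly as an ordinary bytecode reads its own operand.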

Once a method has been scanned, the macro instructions can be built by simply cutting and pasting the binary code that the compiler produced for the threaded code interpreter. The threaded dispatcher then treats the macro instructions as normal bytecodes.

Fig. 2 summarizes a preferred embodiment of a virtual machine according to the invention. The VM is implemented to load programs containing bytecodes that are interpreted by the VM interpreter. The main steps of the method are as follows:

Step K0 = INIT: the process carried out by the VM is initialized by loading a program containing bytecodes,

Step K1 = OPCODE: the bytecodes to be interpreted are fetched,

Step K2 = MACRO: sequences of simple bytecodes are replaced with macro bytecodes,

Step K3 = TRANS: the macro bytecodes are interpreted using the indirect threaded interpreter described with reference to Fig. 1,

Step K4 = RES: the result is obtained and the method ends.

Trace statistics from the execution of real Java applications show that the typical macro instruction length is 4-5 bytecodes; consequently, after the code transformation, macros are usually executed more than 5 times as often as the residual bytecodes. Residual bytecodes are those whose implementation is too complex to be worth in-lining and those that were discarded by the branch-target analysis. The total bytecode dispatch cost can therefore be reduced by more than a factor of 4. If the dispatch cost initially accounts for about 50% of the total execution cost, applying the invention reduces it significantly.
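As a rough worked illustration of these figures (the arithmetic below is an illustration under the stated assumptions, not a result reported in the text), Amdahl's law can be applied with a dispatch fraction f = 0.5 of the total execution cost and a dispatch reduction factor s = 4:

```latex
\[
  S \;=\; \frac{1}{(1 - f) + \dfrac{f}{s}}
    \;=\; \frac{1}{0.5 + \dfrac{0.5}{4}}
    \;=\; \frac{1}{0.625}
    \;=\; 1.6
\]
```

Under these assumptions the overall speed-up S is about 1.6. Even if dispatch became free, the speed-up would be bounded by 1/(1 - f) = 2, which is why further reductions of the dispatch cost bring diminishing returns.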

The invention also brings some additional benefits. Processor branch instructions are also reduced by a factor of about 5. Because the executed code is linearized, the performance of the processor pipeline and of the memory subsystem can improve significantly. The actual degree of improvement depends on the processor architecture, for the pipeline stall cost, and on the architecture of the memory subsystem, for cache line fills. In systems with "high memory requirements", such as most embedded systems, this kind of cost is quite high and is certainly worth reducing. The remaining dispatch cost depends essentially on the control statements in the Java code. If the bytecode were translated entirely into binary code, as typical dynamic recompilers do, jump statements would have to be introduced into the executable code. These would have roughly the same cost as the residual dispatches that are left.

One advantage of macros is that they are common bytecode sequences, so the likelihood of finding the same sequence elsewhere, in another method or even in a different context of the same method, is quite high. Experiments were carried out on Java bytecode, and it was found that most macro instructions can be reused. Taking this reuse factor into account, the memory footprint of the macro instruction code can be reduced. Translating everything into binary code would consume at least twice as much memory and would probably yield only a negligible performance advantage. For example, supposing that the dispatch overhead could be cut by a further factor of two, the overall visible gain in speed would be very small, and probably not worth doubling the memory footprint.

Another advantage of macros is that they have no effect on the normal bytecode dispatch mechanism. Macros fit into the existing VM with no need for any additional execution mechanism. There is no need to distinguish between compiled and non-compiled methods, nor to bear the burden and overhead of a native code interface.

Object-oriented languages such as Java are characterized by very small code fragments. Java methods are also hard to in-line, because they are almost always potentially polymorphic. Consequently, even though a fully optimizing compiler could map the semantics of a method better onto the underlying processor architecture, the overhead of the prologue and epilogue of a binary-translated method often outweighs any advantage.

To improve execution efficiency, a stack caching technique is used which keeps three locations of the Java stack in the processor register file, thereby significantly reducing the number of memory accesses. This technique takes advantage of the fact that the target processor is itself a stack machine. The implementation of the original bytecodes is replaced with equivalent sequences of processor instructions. A very fast and efficient compilation technique can be implemented by means of a complex translation table and a simple cost function (the number of memory references). According to another alternative embodiment of the invention, the reduction of the memory input/output cost will now be described, taking Java as an example.

Java is a stack-based language: bytecodes communicate with one another through memory. The execution of every single bytecode implies at least one memory access, which is very expensive. Consider, for example, the following simple expression:

c = a + b;

In a stack-based language, this is translated into:

push a - 1 read, 1 write

push b - 1 read, 1 write

add - 2 reads, 1 write

store c - 1 read, 1 write

The above represents an operation with 9 memory accesses, whereas a CPU with a minimum of internal state can achieve the same result with only 3 memory accesses. On modern processor architectures, memory references are the most expensive operations, which makes them an ideal area for optimization. With very little additional coding effort, a version of the Java bytecodes can be produced in which data is exchanged through machine registers rather than through external memory. Macros can thus be built from these special bytecodes, referred to as elements, and the number of memory accesses within one macro instruction can be reduced by more than a factor of 2.
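The count of 9 versus 3 memory accesses can be checked with a small instrumented sketch. The memory layout, the helper names load and store, and the two functions below are assumptions made for illustration; local variables stand in for machine registers.

```c
#include <assert.h>

/* Instrumented memory: every load and every store is counted. */
static int mem[8];
static int accesses;

static int load(int addr)            { accesses++; return mem[addr]; }
static void store(int addr, int val) { accesses++; mem[addr] = val; }

enum { A = 0, B = 1, C = 2, STACK = 3 };   /* hypothetical memory layout */

/* c = a + b on a stack machine whose operand stack lives in memory.
 * Returns the number of memory accesses; the result goes to *c_out. */
int stack_machine_accesses(int a, int b, int *c_out)
{
    int sp = STACK;
    int x, y;

    mem[A] = a;                      /* set-up, not counted */
    mem[B] = b;
    accesses = 0;

    store(sp++, load(A));            /* push a:  1 read, 1 write  */
    store(sp++, load(B));            /* push b:  1 read, 1 write  */
    y = load(--sp);                  /* add:     2 reads, 1 write */
    x = load(--sp);
    store(sp++, x + y);
    store(C, load(--sp));            /* store c: 1 read, 1 write  */

    *c_out = mem[C];
    return accesses;
}

/* Same expression with the operands exchanged through locals, which
 * stand in for machine registers. */
int register_machine_accesses(int a, int b, int *c_out)
{
    int ra, rb;

    mem[A] = a;                      /* set-up, not counted */
    mem[B] = b;
    accesses = 0;

    ra = load(A);                    /* 1 read  */
    rb = load(B);                    /* 1 read  */
    store(C, ra + rb);               /* 1 write */

    *c_out = mem[C];
    return accesses;
}
```

Both functions compute the same value of c; the first performs 9 counted accesses and the second performs 3, matching the per-bytecode tally given above.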

The implementation of the "Macroizer" and of the bytecode "Standifier" does not require many lines of code. The partial rewrite of the interpreter loop can be estimated at, for example, a few thousand lines of C code. Implementing simple threaded code requires only a few lines of assembly language, and the "Standifier" requires a few hundred lines.

Run time was tested; the test did not take into account the time spent on bytecode pasting and on generating the new macro bytecodes. The run time was nevertheless measured with a native code profile. When large applications such as a crawler are run, the total time spent on "macroization" remains a very small percentage of the total execution time.

Fig. 3 shows an example of a receiver according to the invention. It is a set-top box receiver 20 for interactive video transmission. It comprises a decoder, compatible for example with the MPEG-2 recommendation (Moving Picture Experts Group, ISO/IEC 13818-2), for receiving coded signals over a cable transmission channel 23 and decoding the received signals so as to extract the data transmitted by a video transmitter 24, for display on a video display 25. The functions of the set-top box can be implemented efficiently in software by a system that executes an interpreted language in bytecode form, such as Java. This system comprises a main processor CPU and a memory MEM for storing software code portions representing instructions that cause the main processor CPU to carry out the method according to the invention as described with reference to Fig. 1 or Fig. 2.

According to another embodiment of the invention, the set-top box 20 can receive Java applications whose bytecodes form part of the received signal. In this case, the set-top box should comprise a loader for loading the received bytecode-based programs, which originate from the remote transmitter.

Claims

1. A method of optimizing interpreted programs in a virtual machine interpreter of a bytecode-based language, wherein the virtual machine dynamically reconfigures itself by replacing an original sequence of simple bytecodes with a new sequence of macro bytecodes, wherein the virtual machine interpreter is coded as a threaded code interpreter for translating bytecodes into their implementation code and comprises a reference table containing references to the addresses of the implementations of the bytecodes, such that, during the translation of a current bytecode, the address of the implementation of the next bytecode is retrieved so that it is possible to jump to the next bytecode;

wherein the virtual machine interpreter comprises a predetermined set of bytecodes, some of which are unused, and wherein the new sequence of macro operation codes is implemented by exploiting said unused bytecodes.

2. A method according to claim 1, wherein the bytecodes of the original sequence are grouped immediately after the new sequence of said macro operation codes.

3. A method according to claim 1, wherein the unused bytecodes are encoded with a representation of at least 2 bytes.

4. A method of optimizing interpreted programs in a virtual machine using a bytecode-based language, the method comprising the following steps:

initializing by loading a program comprising bytecodes,

replacing sequences of simple bytecodes with macro codes,

interpreting the macro bytecodes with an indirect threaded interpreter, so as to translate the bytecodes into their implementation code, the interpreter comprising a reference table containing references to the addresses of the implementations of the bytecodes, such that, during the interpretation of a current bytecode, the address of the implementation of the next bytecode is retrieved so that it is possible to jump to the next bytecode;