WO2002046923A2

WO2002046923A2 - Method of developing a software program for a target platform

Info

Publication number: WO2002046923A2
Application number: PCT/GB2001/005355
Authority: WO
Inventors: Martin Sean Kelly
Original assignee: Smart Card Solutions Limited
Priority date: 2000-12-07
Filing date: 2001-12-04
Publication date: 2002-06-13
Also published as: GB2372853A; GB0129021D0; GB0029796D0; WO2002046923A3

Abstract

The present invention enables an incremental approach to optimisation by allowing a programmer to incrementally re-define functional definitions into native code; as no modifications or code changes are performed to the reference implementation, the environment remains fully functional, enabling the programmer to concentrate on optimising a function at a time by writing into native code, with high visibility given to the effects of the native coding. As the functionality remains the same, the profile of the program remains unchanged. Allowing native code to override higher level representations in this way enables target specific implementations to be generated rapidly and more accurately; key advantages where a fast port of a program is needed in order to utilise a fast approaching mask manufacturing window.

Description

METHOD OF DEVELOPING A SOFTWARE PROGRAM FOR A TARGET PLATFORM

FIELD OF THE INVENTION

This invention relates to a method of developing a software program for a target platform. It is based upon a new programming language for developing portable applications.

DESCRIPTION OF THE PRIOR ART

Software programs are generally written to run on a specific device, such as a CPU, taking advantage of instruction sets and control registers which are dedicated to that particular target device. The typical design sequence commences with a program written in source code, which is then compiled into binary object code using a compiler specific to the target and debugged. Once debugged, the binary code may be masked into ROM.

This close linkage between a program and the target device or platform it is designed to run on makes the process of porting a software application (so that it can run on different devices from the one which it was originally designed for) a complex and often slow one. Slowness is a critical disadvantage, particularly where an application has to be masked into ROM, largely because there is currently very limited ROM masking capacity; when occasional windows of capacity open up, typically several months away, they can only be exploited if it is possible to rapidly develop a port and mask. This places extreme pressure on engineers porting software. Hence, in all situations where masked code is used, e.g. micro-controllers and CPUs, the time it takes to port an application to a target platform is critical.

One approach to porting software across platforms without simply re-writing the code from the beginning is to use a cross-compiler. In theory, a cross-compiler can deliver executable binary code for multiple target platforms. But in practice, cross-compilers frequently fail to perform that well; where faults arise, it can be very difficult to debug code which has been generated by a cross-compiler. Further, cross-compilers tend to generate code which is not that efficient or compact, and this can be a major disadvantage when masking into a capacity-restricted ROM. Overall, cross-compilers still leave software engineers with a great deal to do in order to deliver properly debugged, appropriately optimised code for a new target. As noted above, the consequence of this might be that ROM masking manufacturing capacity which opens up through cancellations etc. cannot be exploited.

Other attempts have been made to solve the problem of rapid, accurate porting of software with 'universal' computer languages such as 'C'. This has had limited success due to inefficiencies in the generated code. Bulky and/or slow code is generally unacceptable in resource limited environments such as smart cards. The advantage of portability is lost when the resulting code is unusable.

A more recent approach, as demonstrated by JavaCard and Multos, has been to implement a virtual machine (VM) or 'abstract processor'. Implementing this VM on a variety of target chips, it becomes possible to write portable applications. The disadvantage of this approach is that the VMs themselves are not trivial and must be largely written in the assembly code of the target chip. This means that porting the VM to different chips can be a time-consuming and error prone process.

SUMMARY OF THE INVENTION

In a first aspect of the invention, there is provided a method of developing a software program for a target platform comprising the following steps:

(i) generating a reference implementation of the software program in source code, the reference implementation not being specific to the target platform;

(ii) porting the reference implementation to the target platform to produce a ported program;

(iii) overriding some or all of the function definitions in the reference implementation with versions of those definitions in code which is native to the target platform in order to optimise the performance of the ported program.

Hence, in one implementation, the present invention enables an incremental approach to optimisation by allowing a programmer to translate any individual function into native code. As no modifications or code changes are performed to the reference implementation, the environment remains fully functional, enabling the programmer to concentrate on optimising a function at a time by writing into native code, with high visibility given to the effects of the native coding. As the functionality remains the same, the profile of the program also remains unchanged. Allowing native code to override higher level representations in this way enables target specific implementations to be generated rapidly and more accurately; key advantages where a fast port of a program is needed in order to utilise a fast approaching mask manufacturing window. The present invention shifts the porting process away from getting a full program to run successfully to the far simpler and faster process of incrementally optimising it.

Another advantage of this approach is that the initial code generation may be performed within a development environment and not on the target platform. Developing and debugging on, for example, a PC, with its extensive range of available tools, is far preferable to doing so on a typical micro-controller target. This reference implementation is threading model invariant and therefore target independent.

In a second aspect, there is a software program developed using the inventive method defined above.

In a third aspect, there is a masked ROM including a software program developed using the above inventive method.

In a fourth aspect, there is a method of modifying the behaviour of a CPU from one operating mode to at least one other operating mode, in which the applicable operating mode is determined by the address from which an instruction is fetched.

Further specifics of the invention are particularised in the appended claims.

DETAILED DESCRIPTION

The present invention will be described with reference to an implementation called Quarter™, from Smart Card Solutions Limited ('SCS') of Cambridge, England.

The Quarter language ("Quarter") is SCS's proprietary format programming language for the development of applications in a uniquely portable format. This format is used by SCS for implementing complex programs. It simplifies and accelerates the task of porting an application to a new platform.

After achieving the initial port of the application, the only task remaining for the Quarter programmer is to optimise the program for performance. This task can be time consuming and requires a certain technical ability. However, the presence of SCS's profiling tools and testing environment makes this task significantly easier than the initial problem of implementing the application.

Quarter's advantages are:-

• Rapid porting between platforms.

• Porting no longer requires an in depth knowledge of the application by the programmer.

• More compact than native machine code.

• Optimisation is simplified and may easily be distributed to several programmers who do not need to 'understand' the whole application in order to achieve their task.

The net effect of this is rapid availability of applications on multiple target platforms whilst avoiding the need for re-engineering. It also gives relatively high performance and code density when compared to traditional high-level development tools such as 'C' compilers.

The portable language: Quarter

At the heart of the portable format is the Quarter computer programming language; this has much in common with Forth.

Quarter has many features that make it well suited to its purpose.

• Quarter permits the mixing of native code with the higher level representation.

• The code generation differs from traditional Forth in that the code generation is entirely within the development environment and not on the target.

• Quarter's structure lends itself to simple and effective compile time optimisations.

• Quarter permits redefinition of code subroutines; this means that it is possible to develop a single 100% Quarter implementation of a program and subsequently override function definitions specifically for the target platform without having to modify the original portable form.

Porting Quarter programs

A Quarter program consists of a set of words. Each word describes a set of actions and a sequence in which to perform them. Each action is itself a reference to another word.

Quarter is threading model invariant. The mechanism used to call words and the representation of words in memory is not defined by Quarter. This task is left to the programmer who is targeting a particular platform for his program. Leaving this choice to the programmer rather than the compiler ensures that the Quarter-developed code can be ported to CPUs with a wide variety of architectures. In other words, Quarter is target independent.

Quarter permits a mixture of word types. Words may be native code or a mechanism to process a list of actions. Quarter permits the overriding of a word with a new definition. This facilitates optimisation. The starting point is a 100% Quarter representation of a program. Once it is debugged and tested, definitions of words can be incrementally overridden with new native code versions without having to modify the debugged Quarter representation. This way a target specific implementation can be generated without modifying the generic implementation. Future ports of the program become faster and accuracy is guaranteed.
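The override mechanism described above can be pictured as two layers of definitions: a generic layer that is never edited, and a target layer that shadows individual words. The following Python sketch is illustrative only. The word names (SQUARE, FOURTH-POWER), the helper functions and the dictionary layout are assumptions made for the example and are not taken from the Quarter tools.

```python
# Illustrative model of word overriding. A word is either a Python callable
# (standing in for native code) or a list of names of other words.

def run(name, words, stack):
    """Execute the word `name` using the definitions in `words`."""
    definition = words[name]
    if callable(definition):          # native word: run it directly
        definition(stack)
    else:                             # list word: perform each action in turn
        for action in definition:
            run(action, words, stack)

def dup(stack):
    stack.append(stack[-1])

def mul(stack):
    b, a = stack.pop(), stack.pop()
    stack.append(a * b)

# Generic (reference) layer: only the primitives are native.
REFERENCE = {
    "DUP": dup,
    "*": mul,
    "SQUARE": ["DUP", "*"],
    "FOURTH-POWER": ["SQUARE", "SQUARE"],
}

def native_square(stack):
    """Hypothetical hand-optimised replacement for SQUARE."""
    stack.append(stack.pop() ** 2)

# Target layer: shadows SQUARE without touching REFERENCE.
TARGET_OVERRIDES = {"SQUARE": native_square}

def build(reference, overrides):
    """Return the dictionary used for a port; override definitions win."""
    return {**reference, **overrides}

def result(words, value):
    """Run FOURTH-POWER on `value` and return the final stack."""
    stack = [value]
    run("FOURTH-POWER", words, stack)
    return stack

generic = build(REFERENCE, {})
ported = build(REFERENCE, TARGET_OVERRIDES)
```

Both `result(generic, 3)` and `result(ported, 3)` leave the same value on the stack, while `REFERENCE` itself is unchanged; this is the property that lets the same system test be re-run after each word is overridden.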

The ideal porting sequence, illustrated in Figure 1, is:-

1. Take (or develop) the Generic Quarter representation of your program.

2. Implement the threading code suitable for the target CPU (Typically 15-30 machine instructions)

3. Implement the native words. (Typically 20 very short subroutines)

4. Compile and Test. With minimal debugging this will now be a fully functional and accurate port of the application, suitable for undergoing a full system test. However, the performance may leave a bit to be desired.

5. From now until the delivery date, the programmer can take one word at a time and rewrite it in native code, repeatedly using the system test suite to ensure no bugs have been introduced. This task is simplified by various support tools that can perform static and dynamic profiling of the application, thus directing the programmer to features most in need of optimisation.

In addition to the flexibility of target CPU, this approach has many advantages over traditional porting methodologies. It is widely accepted that it is significantly more difficult to get a full program running than it is to optimise it. Quarter guarantees the ported program will run correctly, leaving only the simpler optimisation task to the programmer.

Because Quarter's program representation is relatively simple, it is easy for the compiler to perform optimisations on the source code. These include:-

• Code in-lining: Code from subroutines can be placed in line with the program when doing so does not cause an unacceptable expansion in the size of the generated code. Doing this will remove the calling and returning overhead from function invocations.

• Common code extraction: When many words contain the same sequence of actions the compiler can automatically construct a suitable subroutine, thus reducing the overall code volume at the expense of a subroutine call. This process is known as factoring.

• Redundant Code removal: Quarter can recognise words that are not called within the program. They can be removed and program space saved.

• Static profiling: The compiler can inform the programmer about functions that are heavily referenced. Such functions are ideal candidates for future translation into native code.
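As an illustration of redundant code removal, the sketch below treats the program as a mapping from each word to the words it references and keeps only those reachable from an entry word. The program contents, the entry word MAIN and the function names are hypothetical; native words are modelled as having an empty body.

```python
def reachable_words(dictionary, entry):
    """Return the set of words reachable from `entry`.

    `dictionary` maps a word name to the list of words it references.
    """
    seen = set()
    pending = [entry]
    while pending:
        name = pending.pop()
        if name in seen:
            continue
        seen.add(name)
        pending.extend(dictionary[name])
    return seen

def remove_redundant(dictionary, entry):
    """Return a copy of `dictionary` without unreachable words."""
    keep = reachable_words(dictionary, entry)
    return {name: body for name, body in dictionary.items() if name in keep}

PROGRAM = {
    "MAIN": ["INIT", "LOOP"],
    "INIT": ["CLEAR"],
    "LOOP": ["STEP", "STEP"],
    "STEP": [],
    "CLEAR": [],
    "DEBUG-DUMP": ["STEP"],   # never referenced from MAIN
}

pruned = remove_redundant(PROGRAM, "MAIN")
```

This analysis relies on the whole program being known before anything is executed, which is the situation the text describes for Quarter.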

Threading

The threading model is at the heart of Quarter. The choice of model affects the representation of the program, the size of the program and its performance.

All Words have a common structure. They consist of an invoke function and a body of program details, as shown in Figure 2. The invoke function will manage the stack and registers to permit a return to the caller; it will also process the body's program details.

Again, Quarter places no constraints on the contents of the body. The choice of invoke function and body representation is known as the threading model. Four common variants are described below. This two part word structure also facilitates the optimisation process.

By replacing the body of a word with native code, and replacing the invoke function with code to directly 'run' the body of the word, we are able to freely mix native and Quarter words within a program. The mechanism required to call words of either type is the same.

Direct Threading (DTC)

In direct threading, the word, illustrated in Figure 3, is invoked by directly jumping to the word header. Typically this will contain a native instruction to transfer control to the usual list processor that will process the word's body.

The body will contain a list of addresses of the words that need to be called. The Invoke function will read each one in turn and jump to the word at that address. Native code words are efficient because the invoke function does not need to do anything. Jump to body can be optimised away and the word made to execute its own body.

Indirect Threading (ITC)

In this model the head of the word contains the address of the invoke function instead of code to call the invoke function, as shown in Figure 4. The body remains the same. In this model the invoke function has to read the head of the word to determine where to call in order to process the word's body. This format is slightly more compact than direct threading. In Native words the head must contain the address of the start of the body.
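A minimal simulation of indirect threading is sketched below, under several assumptions: memory is a Python list whose indices stand in for addresses, a Python callable stored in a head cell stands in for the address of machine code, and the word names and helper functions (`docol`, `colon`, `native`, `execute`) are invented for the example.

```python
# Sketch of indirect threading over a flat memory. All names are illustrative.

mem = []                 # flat memory; a list index stands in for an address
pstack = []              # parameter (data) stack
rstack = []              # system (return) stack
state = {"ip": None}     # instruction pointer into the current body

def docol(word_addr):
    """Invoke function for list words: save IP, then enter the body."""
    rstack.append(state["ip"])
    state["ip"] = word_addr + 1

def do_exit(word_addr):
    """Native word: return to the calling word."""
    state["ip"] = rstack.pop()

def do_dup(word_addr):
    pstack.append(pstack[-1])

def do_mul(word_addr):
    b, a = pstack.pop(), pstack.pop()
    pstack.append(a * b)

def native(routine):
    """Lay down a native word; its head refers straight to its own code."""
    mem.append(routine)
    return len(mem) - 1

def colon(*body):
    """Lay down a list word: head, body of word addresses, then EXIT."""
    addr = len(mem)
    mem.append(docol)
    mem.extend(body)
    mem.append(EXIT)
    return addr

def execute(word_addr):
    """Run one word to completion."""
    state["ip"] = None
    mem[word_addr](word_addr)         # read the head to find the invoke function
    while state["ip"] is not None:
        target = mem[state["ip"]]     # next word address in the body
        state["ip"] += 1
        mem[target](target)           # indirect call through that word's head

EXIT = native(do_exit)
DUP = native(do_dup)
MUL = native(do_mul)
SQUARE = colon(DUP, MUL)
FOURTH_POWER = colon(SQUARE, SQUARE)

def demo(value):
    """Push `value`, run FOURTH_POWER, and return the parameter stack."""
    pstack.clear()
    pstack.append(value)
    execute(FOURTH_POWER)
    return list(pstack)
```

Native and list words are called through exactly the same step (`mem[target](target)`); only the contents of the head differ, which is what allows the two kinds of word to be mixed freely.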

Doubly indirect threading (DITC)

This format, shown in Figure 5, is akin to Indirect threading except the body contains short tokens for each word to be called. The invoke function must translate this token into an address before calling the indicated word, typically via a lookup table. Native word management is the same as indirect threaded code.
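The token translation step can be sketched as follows. This is a simplified model rather than a description of any Quarter implementation: bodies are Python `bytes` objects holding one-byte tokens, the lookup table is a Python list, and host-language recursion takes the place of an explicit return stack. All names are hypothetical.

```python
pstack = []              # parameter (data) stack

def w_dup():
    pstack.append(pstack[-1])

def w_mul():
    b, a = pstack.pop(), pstack.pop()
    pstack.append(a * b)

TOKEN_TABLE = []         # index = token, value = the word it denotes

def define(word):
    """Register a word and return its token."""
    TOKEN_TABLE.append(word)
    return len(TOKEN_TABLE) - 1

def invoke(token):
    """Translate a token through the lookup table, then run the word."""
    word = TOKEN_TABLE[token]         # the extra level of indirection
    if callable(word):                # native word
        word()
    else:                             # body: one token per byte
        for t in word:
            invoke(t)

DUP = define(w_dup)
MUL = define(w_mul)
SQUARE = define(bytes([DUP, MUL]))
FOURTH_POWER = define(bytes([SQUARE, SQUARE]))

def demo(value):
    """Push `value`, run FOURTH_POWER, and return the parameter stack."""
    pstack.clear()
    pstack.append(value)
    invoke(FOURTH_POWER)
    return list(pstack)
```

With one-byte tokens, the body of FOURTH_POWER occupies two bytes, where a body of two 16-bit addresses would occupy four. The cost is the table lookup performed on every call.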

Subroutine threading (STC)

This model is shown in Figure 6. Here all code is native. Quarter words are simply defined as a list of native call instructions and the invoke function is redundant, as with all native words.

Quarter tools and techniques: a re-statement

As noted above, Quarter is a programming language that is compiled to the 'C' or assembly source code of the target chip. This is achieved using a Forth-like model: the source code consists of words (which resemble functions). Most of these words are themselves written in Quarter; only a small number need be written in the native language (usually those that interface with the target hardware). Thus the vast majority of the code is portable, but as it is assembled and executes on the card as machine code it does not have the same speed penalties associated with a VM or compiled 'C' code. The unique insight that allows this to happen is that the generated code is threading model independent. When the code is finally assembled for a chosen target, an appropriate threading architecture has to be added by the programmer. This means there is no runtime kernel (as Forth has). The approach does impose a small speed penalty compared with well-written assembly language programs, but the result is typically more efficient than code generated by traditional compilers.

This allows rapid porting of code from one target to another but also controlled and incremental optimisation. Normally, optimising source code can lead to unexpected side effects and the code being unstable whilst the changes are taking place. With Quarter, words can be changed one at a time without affecting the overall program. Using profiling tools, the major bottlenecks of the Quarter system can be determined. These words can be rewritten as native words and replaced singly giving no opportunity for side effects and no 'downtime'.

Quarter compiler

Here we briefly describe the features of the Quarter compiler. Basically, this tool takes a text file as input and generates a program file as output. The output file format is generic for practically any macro-assembler. This way, Quarter is isolated from the intricacies of specific assembler syntaxes. Quarter may also generate a 'C' language output. This is typically used in the development of the initial 100% Quarter version of a program. Debugging on a PC within a full IDE is preferable to developing directly on a typical micro-controller.

An important difference between Quarter and Forth is that all of the dictionary generation is on the development platform (PC) not the target (e.g. SmartCard). Typically Forth will 'learn' new words or execute known words in response to the input stream. Quarter on the other hand does all the 'learning' before attempting to execute anything. This gives Quarter the opportunity to optimise the dictionary in ways not possible within standard Forth, e.g. Forth cannot delete unused words because it does not know that the word will not be used in the future.

A further advantage of this off target dictionary generation is that the Target system (Forth Kernel) is stripped down to an absolute minimum. It needs no mechanism to extend its dictionary or process input. In fact all it requires is a threading model and an indication of which word to execute first. This amounts to approximately twenty instructions in a typical micro processor.

Optimisations

Pre-compilation of the entire dictionary permits the compiler to analyse the relationships between all of the known words. It is then possible to transform the dictionary into something functionally identical but structurally different. Judicious choice of transformation can lead to programs that are smaller, or faster or both. Simple optimisations that can be attempted on a complete dictionary include:

• Redundant code removal. Words that are never accessed may be deleted from the dictionary to conserve space.

• Common code extraction (factoring). If multiple words contain identical sequences of code, a new word can be created to perform this functionality and all the instances of the repetition can be replaced with a reference to the new word. This too conserves space at the expense of a small run-time overhead.

• Code in-lining. Functions that are infrequently used may be removed as long as we replace all the references to the word with its definition. Removal of the threading overhead leads to faster code execution. Code in-lining will either save space or increase program size depending on the size of the word and the number of references which it has. For example, a saving will occur with words of any size that are called only once, but space will be lost when the word size is larger than one and it is called many times.
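The size trade-off for in-lining can be made concrete with a small cost model. The sketch below counts memory cells and assumes that keeping a word costs its body plus a fixed overhead (taken here as two cells, for the head and the exit) plus one reference cell per call site, whereas in-lining costs one copy of the body per call site. The overhead figure and the function names are assumptions for illustration; the real figures depend on the threading model chosen.

```python
def size_with_call(body_cells, references, overhead_cells=2):
    """Cells used when the word is kept as a separate definition."""
    return body_cells + overhead_cells + references

def size_inlined(body_cells, references):
    """Cells used when every reference is replaced by the body."""
    return body_cells * references

def inlining_delta(body_cells, references, overhead_cells=2):
    """Change in program size caused by in-lining; negative means a saving."""
    return size_inlined(body_cells, references) - size_with_call(
        body_cells, references, overhead_cells
    )

called_once = inlining_delta(body_cells=10, references=1)
one_cell_word = inlining_delta(body_cells=1, references=50)
large_and_popular = inlining_delta(body_cells=5, references=20)
```

Under these assumptions a word called once always shrinks the program when in-lined, as does a one-cell word however often it is called, while a five-cell word with twenty references grows the program substantially.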

Command line options in the Quarter compiler permit selective application of the optimisations. In-lining and factoring are parameterised so the programmer may optimise for speed or program size independently.

Static profiling

Because the compiler has generated the full dictionary and knows all the calling dependencies between words it is a trivial task to deliver this information to the programmer. Knowledge of this static profile enables a programmer to intelligently target his optimisation efforts. For example frequently referenced functions are prime candidates for hand translation into native code for the target or excessive reliance on a few related functions may suggest a more optimal breakdown of word definitions.
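A static profile of this kind amounts to counting references across the complete dictionary. The sketch below shows one way to compute it; the dictionary contents and the function name are hypothetical.

```python
from collections import Counter

def static_profile(dictionary):
    """Count how often each word is referenced in the bodies of other words.

    `dictionary` maps a word name to the list of words it references.
    """
    counts = Counter()
    for body in dictionary.values():
        counts.update(body)
    return counts

PROGRAM = {
    "MAIN": ["INIT", "LOOP"],
    "INIT": ["STEP"],
    "LOOP": ["STEP", "STEP", "CHECK"],
    "STEP": [],
    "CHECK": [],
}

profile = static_profile(PROGRAM)
most_referenced = profile.most_common(1)
```

Here STEP is referenced three times and would be the first candidate for hand translation. The count says nothing about how often a word actually runs, which is the separate question answered by dynamic profiling.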

Runtime emulation

The run time emulator takes the 'C' output of the Quarter compiler and permits it to be executed on a PC. The 'C' run time environment provides the minimum set of primitives for a program to execute. This permits an effectively 'pure' Quarter implementation of the program to be developed and debugged.

Dynamic profiling

The emulator also profiles the program at runtime. This 'dynamic profiling' information indicates which words are most frequently executed and therefore which ones should be considered for optimisation. The generation of call trees and execution trace information assists the programmer in his optimisation task by calculating run-time details such as maximum stack utilisation. All of this is readily available and valuable information for optimising the code.
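The kind of instrumentation involved can be sketched by adding counters to a simple interpreter. The example below is an assumption-laden illustration, not the SCS emulator: words are modelled only by the list of words they call, native words have empty bodies, and call nesting depth is used as a stand-in for system stack utilisation.

```python
from collections import Counter

def profile(dictionary, entry):
    """Run `entry` and return (execution counts, maximum call depth)."""
    counts = Counter()
    max_depth = 0

    def run(name, depth):
        nonlocal max_depth
        counts[name] += 1                    # instrumentation: executions
        max_depth = max(max_depth, depth)    # instrumentation: nesting depth
        for action in dictionary[name]:
            run(action, depth + 1)

    run(entry, 1)
    return counts, max_depth

PROGRAM = {
    "MAIN": ["INIT", "LOOP", "LOOP"],
    "INIT": ["STEP"],
    "LOOP": ["STEP", "STEP"],
    "STEP": [],
}

counts, max_depth = profile(PROGRAM, "MAIN")
```

In this program STEP appears three times in the source but executes five times, which shows how the dynamic figures can differ from the static reference counts.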

Debugging

The emulator enables the programmer to set break points, single step his program and analyse execution trace information. This is achieved by simple instrumentation of the threading code. These relatively sophisticated concepts are easily supported within the 'C' runtime environment and permit sophisticated Quarter code debugging even in environments without 'C' debugging tools.

A Hardware Quarter engine

The inherent simplicity of the threading models allows optimisation to be facilitated by hardware. Here we discuss a generic modification that could be made to practically any microprocessor CPU that would optimise the execution of Quarter programs. This optimisation will reduce the effort required to port applications, thus reducing the time to market for any Quarter program and reducing the probability of bugs being introduced during the optimisation phase of the project. The ability to add this design onto a standard CPU architecture removes the need to develop and market specific development tools for the native portion of the CPU. In this way trusted and supported compilers and assemblers can be employed as well as the bespoke Quarter tools.

General features

A typical micro-processor CPU will fetch an opcode from store. It will modify its internal state depending on the value fetched. It will repeat this fetch and execute operation ad infinitum. We propose to modify this standard behaviour to make the interpretation of the opcode dependent upon the address from which it was fetched. This way we have a dual (or multi) mode processor that can be made to switch between operating modes invisibly (i.e. without any special commands or register setting.) In this way we propose to have a very small instruction set optimised for Quarter that can be implemented in very few hardware gates and seamlessly co-exist with various CPUs.

The mechanism

When the opcode fetch occurs within some predefined range the CPU will operate in Quarter Mode. This effectively makes the Fetch/Decode logic behave as follows.

    Loop {
        If (IP is in Quarter range) {
            QuarterInst = *InstructionPointer++;
            DecodeQuarter(QuarterInst);
        } else {
            OpCode = *InstructionPointer++;
            Decode(OpCode);
        }
    }
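The selection step in the loop above can be simulated in a few lines. In this sketch the Quarter range (0x8000 up to but not including 0x9000), the memory contents and the function names are all assumed for illustration, and the two decoders are reduced to labels so that only the address-based mode selection is shown.

```python
QUARTER_START = 0x8000   # assumed start of the Quarter range
QUARTER_END = 0x9000     # assumed end of the Quarter range (exclusive)

def in_quarter_range(address):
    """True when an instruction fetched from `address` is a Quarter one."""
    return QUARTER_START <= address < QUARTER_END

def fetch_decode(memory, ip, steps):
    """Fetch `steps` values starting at `ip`.

    Returns a list of (mode, value) pairs, where mode records which
    decoder would have been applied to the fetched value.
    """
    trace = []
    for _ in range(steps):
        value = memory[ip]
        mode = "quarter" if in_quarter_range(ip) else "native"
        trace.append((mode, value))
        ip += 1
    return trace

MEMORY = {0x7FFE: 0x12, 0x7FFF: 0x34, 0x8000: 0x12, 0x8001: 0x34}
trace = fetch_decode(MEMORY, 0x7FFE, 4)
```

The same stored values (0x12, 0x34) are routed to different decoders purely because of where they were fetched from; no mode register is written at any point.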

Some of the Host CPU registers will be utilised by both Quarter and the Native instruction set. For example register X in the native mode may be the data pointer in Quarter Mode. This double utilisation of registers permits the Host CPU to set up the environment for the Quarter mode operation. Whilst not strictly necessary it is a simple and efficient mechanism.

The Quarter engine's operating environment

The Quarter engine has a minimal set of support registers. These are:

• IP: The instruction pointer.

• SSP: System stack pointer.

• PSP: Parameter/data stack pointer.

• W: Working register.

It is almost certain that the host CPU will have IP and SSP and it is also very likely that registers suitable for PSP and W exist.

An ideal design would also include:

• TOS: a cached top-of-stack register holding the top element from the parameter stack.

• Two memory zones where Quarter is active. This way we can have Quarter code regions in the ROM and Non-Volatile memory of the Host platform.

Example implementation on a typical 16 bit CPU

Using the Hitachi H8 architecture as a typical host platform we will define a typical implementation of the accelerator and define its behaviour.

Architecture

ROM memory from 0 to X-1 and EEPROM memory from 8000 to Y-1 will be assumed to contain Quarter code or data, as shown in Figure 7. The rest of the memory will contain native code or data. Internal registers from R4 to R7 will have significance in both Quarter and native mode.

Behaviour (Very simple Quarter accelerator)

When the IP does not point to a Quarter execution region then the CPU behaves like a standard H8. When the IP does point to a Quarter execution region then the CPU behaves as follows.

SSP' = SSP-2 ; make space on the SStack

[SSP']= IP+2 ; save address of next on SStack

IP' = [IP] ; goto new word

This simple mechanism optimises the single most expensive step in a Quarter program: the invocation of a word. All other operations including return from a word must ultimately be coded in native code.
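The three register transfers above can be traced with a short simulation. The sketch assumes 16-bit cells at even byte addresses held in a Python dictionary; the addresses used and the function name are hypothetical.

```python
def quarter_step(mem, ip, ssp):
    """Perform one step of the very simple accelerator.

    Pushes the address of the next cell on the system stack, then
    threads to the word whose address is stored at `ip`.
    Returns the new (ip, ssp).
    """
    ssp = ssp - 2            # SSP' = SSP-2   make space on the system stack
    mem[ssp] = ip + 2        # [SSP'] = IP+2  save address of next cell
    ip = mem[ip]             # IP' = [IP]     go to the new word
    return ip, ssp

# The cell at 0x0100 (inside the Quarter region) holds the address 0x0200.
mem = {0x0100: 0x0200}
new_ip, new_ssp = quarter_step(mem, ip=0x0100, ssp=0xFF00)
```

After the step, execution continues at 0x0200 and the return address 0x0102 sits on top of the system stack at 0xFEFE.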

Behaviour (Simple (but better) Quarter accelerator)

Instead of treating all elements in the Quarter region as addresses to thread to we could make a special case of a specific value to trigger the second most code intensive Quarter operation; the word exit.

For example when the IP does point to a Quarter execution region then the CPU behaves as follows.

    if ([IP] == 0) {        ; treat zero as a special case
        IP' = [SSP]         ; pop return address
        SSP' = SSP+2        ; exit to caller
    } else {                ; normal case of enter next word
        SSP' = SSP-2        ; make space on stack
        [SSP'] = IP+2       ; push return address
        IP' = [IP]          ; enter called word
    }

In this example all aspects of threading are managed within the accelerator. The return from word is coded as opcode zero.
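A short trace shows how enter and exit interact. As before, the sketch assumes 16-bit cells at even byte addresses in a Python dictionary; the two words at 0x0010 and 0x0020, the initial stack contents and the function names are invented for the example.

```python
def quarter_step(mem, ip, ssp):
    """One step of the accelerator with zero treated as 'exit'."""
    if mem[ip] == 0:             # special case: return from word
        ip = mem[ssp]            # pop return address
        ssp = ssp + 2
    else:                        # normal case: enter the next word
        ssp = ssp - 2            # make space on the stack
        mem[ssp] = ip + 2        # push return address
        ip = mem[ip]             # enter the called word
    return ip, ssp

def trace(mem, ip, ssp, steps):
    """Return the sequence of IP values visited, plus the final SSP."""
    visited = []
    for _ in range(steps):
        ip, ssp = quarter_step(mem, ip, ssp)
        visited.append(ip)
    return visited, ssp

mem = {
    0x0010: 0x0020,   # word A: call word B ...
    0x0012: 0x0000,   # ... then exit
    0x0020: 0x0000,   # word B: exit immediately
    0x0100: 0x0F00,   # return address left on the stack by the caller of A
}
visited, final_ssp = trace(mem, ip=0x0010, ssp=0x0100, steps=3)
```

The trace enters B, returns into A just after the call, and then returns to the original caller. One consequence of this coding is that no word can be placed at address zero.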

Behaviour (General Quarter accelerator)

Extending the above idea permits a set of common actions to be treated as Quarter special case words. This will permit the execution of elementary Quarter words without the need to pass control to the host processor. It will remove a call and return overhead from the implementation of the elementary words and significantly optimise a Quarter program and simplify the implementation by removing the need to implement these key words. An example is when the IP points to a Quarter execution region then the CPU behaves as shown in Figure 8.

In this example all aspects of threading and the common opcodes are managed within the accelerator.

It will also be possible to exploit features of the address representation to generate alternative instruction coding schemes. For example all addresses may be even. Hence the redundant bit may then flag instruction or address mode for the thread engine. Alternatively the address may be assumed to be even and therefore the 16 bits may represent 128K of program space. Such coding decisions would be left to the specific implementation.
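Both coding schemes mentioned above can be illustrated with simple bit arithmetic. In the sketch below the choice of bit 0 as the flag, the meaning given to a set flag, and the function names are assumptions; as the text notes, such decisions would be left to the specific implementation.

```python
def decode_flagged(cell):
    """Scheme 1: addresses are even, so bit 0 can flag an instruction."""
    if cell & 1:
        return ("instruction", cell >> 1)
    return ("address", cell)

def encode_scaled(address):
    """Scheme 2: store an even byte address in 16 bits by halving it."""
    if address % 2 != 0:
        raise ValueError("address must be even")
    if not 0 <= address < 2 ** 17:
        raise ValueError("address out of range")
    return address >> 1

def decode_scaled(cell):
    """Recover the byte address from a scaled 16-bit cell."""
    return cell << 1

highest = decode_scaled(0xFFFF)          # highest addressable even byte
program_space = highest + 2              # bytes reachable with 16-bit cells
```

With the scaled scheme the highest cell value 0xFFFF maps to byte address 0x1FFFE, so 16-bit cells span 131072 bytes, which is the 128K figure given in the text.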

Quarter envisages the following instructions should be considered for optimisation this way.

TOS: Top of stack

NOS: Next on stack

NNOS: Next next on stack

Summary of off target/code generation advantages of Quarter

• Incremental development/optimisation through the redefinition of Quarter 'words'. This approach permits a 100% Quarter program to be developed as a reference implementation. This code can then be ported and incrementally optimised whilst retaining a fully functional environment. No modifications or code changes are required within the reference implementation during this process. Optimisation is easier than development and testing.

• Target independent compiler. By virtue of the program representation and the compiler's independence from threading model, the generated code is platform independent. The simplest port involves implementation of the code to perform the threading and a base set of primitive words.

• Ability to perform static and dynamic profiling and know the data generated is platform independent.

• Off target code generation and optimisation. This differs from Forth where the on-target interpreter is responsible for compilation and therefore global optimisations are impossible.

Summary of Hardware acceleration aspects

• The ability to modify the behaviour of a CPU without manipulation of status registers, e.g. using the instruction pointer's value to determine operating mode. This permits seamless transition between operating modes whilst avoiding performance overheads.

• The addition of an alternative operating mode to an existing CPU architecture, thereby enabling the utilisation of third party (high quality and manufacturer supported) tools.

Quarter's tool set remains platform independent.

• The platform dependent elements (i.e. host CPU) require platform specific tools. Integration within Quarter and across a range of CPUs is impractical. Here Quarter offers the best of both worlds: good tools for a CPU supported by a specialist third party and standard tools for Quarter code suitable for all platforms.

Claims

1. Method of developing a software program for a target platform comprising the following steps: (i) generating a reference implementation of the software program in source code, the reference implementation not being specific to the target platform; (ii) porting the reference implementation to the target platform to produce a ported program; (iii) overriding some or all of the function definitions in the reference implementation with versions of those definitions in code which is native to the target platform in order to optimise the performance of the ported program.

2. The method of Claim 1 comprising the further step of masking the ported program into ROM.

3. The method of Claim 1 in which all code generation is performed within a development environment and not on the target platform.

4. The method of Claim 1 in which elements of the ported program are automatically re-structured to provide code which is functionally identical but optimised.

5. The method of Claim 4 in which optimisation results from a factoring process.

6. The method of Claim 4 in which optimisation results from an inlining process.

7. The method of Claim 1 in which the reference implementation is threading model invariant.

8. The method of Claim 7 in which the reference implementation comprises a set of words, each word including an invoke function and a body of program details.

9. The method of Claim 8 in which the body of program details is replaced by native code for optimisation.

10. The method of Claim 8 or 9 in which the invoke function is replaced by native code for optimisation.

11. The method of Claim 8 - 10 in which a word is invoked by directly jumping to the word header.

12. The method of Claim 8 - 10 in which a word header contains the address of the invoke function and not code to call the invoke function.

13. The method of Claim 12 in which the body contains tokens for each word to be called, a token being translated by the invoke function into an address before calling the indicated word.

14. The method of Claims 8 - 10 in which all code in a word is native code.

15. The method of any preceding claim in which all dictionary definitions are generated off-target.

16. A software program developed using the method of any preceding Claim.

17. A masked ROM including a software program developed using the method of Claim 1 - 15.

18. A method of modifying the behaviour of a CPU from one operating mode to at least one other operating mode, in which the applicable operating mode is determined by the address from which an instruction is fetched.

19. The method of Claim 18 in which an opcode is interpreted depending upon the address from which it is fetched.

20. The method of Claim 18 in which the register which is read is the instruction pointer, so that the value of the instruction pointer determines the applicable operating mode.

21. The method of Claim 18 which is performed to optimise the execution of a software program developed using the methods of Claim 1 - 15.

22. The method of Claim 20 in which word invocation is optimised when the instruction pointer value is within a pre-defined range.

23. A CPU comprising at least two operating modes, which can be selected between using the method of Claims 18 - 22.