WO2007099324A1 - Duplicate code detection - Google Patents

Duplicate code detection

Info

Publication number
WO2007099324A1
Authority
WO
WIPO (PCT)
Prior art keywords
code
sequence
repeated
sequences
computer executable
Prior art date
Application number
PCT/GB2007/000716
Other languages
French (fr)
Inventor
Mel Pullen
Christopher Redpath
Andrew Langstaff
Andrew Sizer
Mark Divall
Original Assignee
Symbian Software Limited
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from GB0604136A
Application filed by Symbian Software Limited
Publication of WO2007099324A1

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/40Transformation of program code
    • G06F8/41Compilation
    • G06F8/44Encoding
    • G06F8/443Optimisation
    • G06F8/4434Reducing the memory space required by the program code
    • G06F8/4436Exlining; Procedural abstraction

Definitions

  • the present invention relates to a method and apparatus for identifying repeats in computer executable code.
  • the term 'computing device' is intended to include, without being limited to, Desktop and Laptop computers, Personal Digital Assistants (PDAs), Mobile Telephones, Smartphones, Digital Cameras and Digital Music Players. It also includes converged devices incorporating the functionality of one or more of the classes of device already mentioned, together with many other industrial and domestic electronic appliances.
  • The masked ROMs used in older computing devices are relatively expensive and inflexible, and modern computing devices typically rely on less expensive and more easily updated technologies, such as NAND Flash, to provide a persistent store for embedded software.
  • NAND flash is unlike the older type of ROM in that it does not permit code to be executed directly from persistent store. It has to be copied to a different type of storage, such as Random Access Memory (RAM) to be executed.
  • RAM: Random Access Memory
  • embedded program code on NAND Flash behaves in the same way as hard disk drives on conventional desktop computers.
  • Portable computing devices, which typically run off batteries, are limited both by their size (which has to be small) and by their power consumption (which needs to be minimal). Furthermore, many of these devices, such as mobile telephones and MP3 players, sell in quantities that are orders of magnitude greater than any model of PC; this means that small differences in the price of their components are of far greater importance.
  • Resource-constrained devices, which are built to a particular price point and hardware specification, commonly need to include as much functionality and performance as possible.
  • One key technique for achieving this aim is to budget memory requirements, particularly of embedded software, very carefully. Saving memory translates directly into either lower prices or increased functionality.
  • Once the prepared code is embedded in the computing device, it needs to be retrieved and decompressed before use, with the original version being regenerated and placed in eXecute In Place (XIP) memory before it is run.
  • XIP: eXecute In Place
  • LZW compression scans sequentially through the data to be compressed and looks for repeated sequences. If it is possible to replace a repeated sequence with a token that occupies less space than the original, then the replacement is made and the original sequence and the token are saved in a compression dictionary. When decoding, tokens are looked up in the dictionary and replaced with the corresponding sequence, enabling the source material to be recovered from the compressed text.
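For readers unfamiliar with the algorithm, the sequential scan described above can be sketched as a minimal LZW codec in Python. This is illustrative only: it works on strings and never bounds the dictionary, whereas production codecs operate on byte streams with dictionary limits.

```python
def lzw_compress(data: str) -> list[int]:
    """Classic LZW: grow a dictionary of seen sequences, emit tokens."""
    dictionary = {chr(i): i for i in range(256)}
    current, out = "", []
    for ch in data:
        if current + ch in dictionary:
            current += ch              # extend the current repeated sequence
        else:
            out.append(dictionary[current])          # emit token for it
            dictionary[current + ch] = len(dictionary)  # learn new sequence
            current = ch
    if current:
        out.append(dictionary[current])
    return out

def lzw_decompress(tokens: list[int]) -> str:
    """Rebuild the dictionary while decoding, mirroring the compressor."""
    dictionary = {i: chr(i) for i in range(256)}
    prev = dictionary[tokens[0]]
    out = [prev]
    for token in tokens[1:]:
        if token in dictionary:
            entry = dictionary[token]
        else:  # special case: token refers to the sequence being built
            entry = prev + prev[0]
        out.append(entry)                       # token -> original sequence
        dictionary[len(dictionary)] = prev + entry[0]
        prev = entry
    return "".join(out)
```

Because the scan is strictly sequential and literal, two binary sequences that differ only in an embedded constant are treated as unrelated — the weakness the invention addresses.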
  • a method for identifying repeats in a set of computer executable code comprising a plurality of instructions defined by constituent operations and constants, the method comprising: analysing the set of code to identify one or more sequences of instructions that is repeated within the set of code, wherein the step of identifying one or more repeated sequences comprises, for each sequence of instructions: analysing only the constituent operations of the sequence, and ignoring any constants included in the sequence, and determining the sequence to be repeated if multiple instances of that sequence in the set of code each contain the same constituent operations in the same order.
  • the step of analysing the set of code may comprise assessing in turn each sequence occurring in the set of code, generating an identifier for the sequence, checking whether the identifier exists in a data structure, and if so, determining the sequence to be repeated.
  • the method may further comprise the step of, if the identifier does not exist in the data structure, adding the identifier to the data structure.
  • the identifier could be a hash generated from the respective sequence.
  • the method could further comprise determining, for each sequence in the set of code, the number of times the sequence is repeated within the set of code; and it may optionally include recording in the data structure, for each repeated sequence, the determined number of times the sequence is repeated.
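The identification steps above might be sketched as follows. The representation of instructions as (operation, constant) pairs, the SHA-1 hash, and the fixed-length window are illustrative assumptions, not part of the claims:

```python
import hashlib

def sequence_id(instructions):
    """Identifier built from constituent operations only; constants ignored."""
    ops = " ".join(op for op, _const in instructions)
    return hashlib.sha1(ops.encode()).hexdigest()

def count_repeats(code, window):
    """Count fixed-length instruction sequences by operation order alone."""
    counts = {}
    for i in range(len(code) - window + 1):
        key = sequence_id(code[i:i + window])
        counts[key] = counts.get(key, 0) + 1   # data structure of identifiers
    return counts

# hypothetical Thumb-like code: (operation, constant) pairs whose
# constants differ between the two functions
func_a = [("PUSH", None), ("LSL", 0x13), ("LDR", 0), ("CMP", 1)]
func_b = [("PUSH", None), ("LSL", 0x38), ("LDR", 4), ("CMP", 2)]
counts = count_repeats(func_a + func_b, window=4)
```

The two functions hash to the same identifier despite their differing constants, so the sequence is determined to be repeated — exactly the property that defeats a literal scanner such as LZW.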
  • the method could further comprise the steps of: for each repeated sequence, determining whether an overall reduction in the size of the set of code could be achieved by modifying the sequence, and, if so: generating a core function that is common to each instance of the repeated sequence, and storing the core functions in a pattern dictionary associated with the set of code.
  • a reduced set of computer executable code may be generated by replacing each instance of each repeated sequence in the set of code with the respective abbreviated sequence, whereby on execution of the set of code, each abbreviated sequence will call to an associated core function stored in the pattern dictionary.
  • the pattern dictionary could be a part of the data structure.
  • the set of computer executable code could be the result of a compilation step performed by a given compiler, and the data structure could be used in identifying repeated sequences in further sets of computer executable code generated by the same compiler.
  • a data carrier carrying the said computer program.
  • a computing device having stored thereon the reduced set of computer executable code produced by the method outlined above.
  • the computing device could also have stored thereon the pattern dictionary comprising core functions associated with repeated sequences within the said set of code.
  • the pattern dictionary is preferably stored in such a way that it is always available to programs running on the device. It may be stored in a dynamic link library, and/or in execute-in- place memory.
  • an operating system comprising the reduced set of computer executable code.
  • a data carrier carrying the reduced set of computer executable code.
  • Embodiments of the present invention may thus be used to remedy the deficiencies outlined above.
  • Duplicate code is preferably detected by analysis of the binary that makes up each item of executable code; therefore, duplicated executable code can be detected when the source code is either unavailable or is in different languages.
  • the term 'executable' is used herein to refer to code that is in a form such that it can be acted upon by a CPU. This could be compiled, binary code, or it could be code that needs to be interpreted by an interpreter before it can be processed by a CPU.
  • embodiments of the present invention also allow the identification of near-duplicates; that is, code sequences which are nearly identical to each other can be found.
  • since executable code is generated by a compiler, it consists of a finite number of code generation sequences or patterns that are modified by parameters specifying limited variability, such as the value of a constant or the identity of a register. Because the patterns are modified by parameters each time they are generated by the compiler, they cannot be recognised by common algorithms such as LZW that rely on sequential scanning. However, the inventors have identified that the underlying patterns may be identified in the executable code to advantageous effect, as described fully below.
  • some embodiments of this invention also offer methods for the removal or refactoring of both duplicates and near-duplicates. Such embodiments build on the perception that executable code consists of a finite number of code generation sequences or patterns.
  • Figure 1 shows a set of executable code
  • Figure 2 shows the code of Figure 1 reduced in size in accordance with an embodiment of the invention
  • FIGS 3 and 4 show further example sets of code
  • Figure 5 shows a list of instructions
  • Figure 6 shows the code of Figure 5 reduced in size in accordance with an embodiment of the invention.
  • FIGS. 7-10 illustrate exemplary techniques utilising the concepts of the invention.
  • Different instruction sets have instructions of different forms, and while the Thumb instruction set is discussed here by way of example, the concepts of the invention are equally applicable to other instruction sets such as, for example, the Intel instruction set (http://www.jegerlehner.ch/intel/opcode.html).
  • Instructions in the Intel instruction set can be of different lengths, and may include constant values.
  • a MOV operation can refer to a particular value that is to be "moved" into a register.
  • the term 'instruction' is intended to include the constants that may be used by an operation, as well as the operation itself.
  • Figure 1 shows an example of the phenomenon of repeated sequences including different parameters. It shows the executable code generated by a compiler for three different functions used by Symbian OS. This code will be described in some detail for the benefit of those unfamiliar with assembly language.
  • Figure 1 shows a conventionally compiled version of a set of computer executable code.
  • the code includes three functions, whose entry point addresses are indicated in the first line of each function (00005888 etc.).
  • Each function includes 12 instructions: PUSH, LSL, LDR, etc.
  • the definitions of these instructions can be found in the ARM Architecture Reference Manual edited by Dave Jaggar and published by Prentice Hall, or at www.arm.com.
  • a specific embodiment of the present invention aims to reduce the size of code such as that shown in Figure 1 by taking advantage of the similarities in the structure of the different functions, even though the binary sequences are non-identical. This is achieved by extracting sequences of instructions that are repeated within the set of code, and replacing them with abbreviated sequences that can each call to the common elements of the repeated sequences.
  • the repeated sequence LSL, LDR, CMP, B, LSL, ADD, BL/BLX, BL, MOV, STR, POP is extracted from each of the three functions and placed in a separate commonly accessible function, referred to here as a "core" function.
  • Figure 2 shows a set of executable code that is identical in functionality to the code of Figure 1, but has been reduced in size in accordance with the specific embodiment.
  • Each of the three original functions now exists in an abbreviated form, which is referred to here as a "shim" function.
  • the first, second and third functions each now contain only three instructions: PUSH, MOV, MOV, together with a branch instruction, B, in the case of the first and second functions, as discussed further below.
  • Each shim function thus has the effect of moving the values of the constant parameters referred to in the original function into registers.
  • registers 05 and 06 are unused in the original functions, and they are therefore available as temporary stores for holding the parameters referred to by the functions.
  • the first shim function has the effect of moving the value 4c into register 05. 4c is 4 multiplied by the constant value 13, which was referenced in line 3 of the first function of Figure 1. Thus, the first shim multiplies the first referenced parameter by 4, and moves it into one of the pre-defined available registers. Additionally, the first shim moves the value e0, which is 4 times the constant 38, into register 06.
  • the core function is the fourth function shown in Figure 2. It contains the instructions which make up the sequence found in each of the original three functions: LSL, LDR, CMP, B, LSL, ADD, BL/BLX, BL, MOV, STR, POP.
  • the shims each call to the core function. In the case of the first and second shims, this is achieved by means of the branch instruction at the end of the shim. In the case of the third shim, the end of the shim is adjacent in terms of its memory address to the beginning of the core function, and therefore the core function runs automatically when the final instruction of the third shim is executed.
  • the code for the three functions has been rearranged in such a way that their previously differing constant parameters are instead stored in the computing device's registers. Since code to achieve this is available, the constants embedded in the code are no longer required, and not merely the instruction sequences but the entire binary sequences become identical, enabling the previous three separate sequences to be replaced with one single common sequence. We refer to this rearrangement as refactoring the code.
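As a toy Python model of this refactoring (not the actual ARM-level transformation): it assumes each function is a list of (operation, constant) pairs sharing one operation pattern, and that registers from r5 onwards are free to hold the parameters, mirroring the example of Figures 1 and 2:

```python
def refactor(functions):
    """Split functions that share one operation pattern into per-function
    shims (constant loads) plus a single shared core (the operations).
    Each function is a list of (operation, constant) pairs."""
    ops = [op for op, _ in functions[0]]
    assert all([op for op, _ in f] == ops for f in functions)
    shims = []
    for f in functions:
        consts = [c for _, c in f if c is not None]
        # a shim only loads this function's constants into spare registers
        shims.append([("MOV", reg, c) for reg, c in enumerate(consts, start=5)])
    return shims, ops          # ops = the one common core sequence

# two hypothetical functions: identical operations, different constants
funcs = [
    [("LSL", 0x13), ("LDR", None), ("STR", 0x38)],
    [("LSL", 0x2A), ("LDR", None), ("STR", 0x07)],
]
shims, core = refactor(funcs)
```

After refactoring, the constants live in registers and only one copy of the operation sequence remains; each shim ends by branching into the shared core.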
  • the code of Figure 1 contains three functions, each having 12 16-bit Thumb instructions. This gives a total code size of 3x12x2 bytes, i.e. 72 bytes.
  • the code of Figure 2 contains three shim functions, having 4, 4 and 3 instructions respectively, and a core function having 11 instructions. This gives a total code size of (4x2 + 4x2 + 3x2 + 11x2) bytes, i.e. 44 bytes.
  • the technique exemplified in Figures 1 and 2 turns a 72 byte set of code into a 44 byte set of code having three entry points. This constitutes a code saving of almost 40%.

Near duplicates in multiple executables
  • FIGS. 3 and 4 show how this modified technique can be implemented, using as an example some further code from Symbian OS compiled for an ARM microprocessor. It should be noted that this specific code is shown merely to facilitate explanation of this exemplary embodiment, and is not intended to be limiting.
  • the first shim is specific to each function in a particular executable, while the second shim is for use by all the functions in that executable.
  • Figure 4 shows the same techniques applied to the function from the executable BT.DLL.
  • the specific shim for the function CL2CAPProtocol::RemoveIdleTimerEntry branches to a shim for all the 14 functions in that executable which use the same pattern; it loads the parameter 018d8 (as used by BT.DLL) into a register before branching to the same common core function (found in a location such as an XIP core OS image) as used by IR.DLL, which appears at the bottom of Figure 4.
  • the technique can be used to identify such duplicated code when it is only available in the form of a binary executable, delivered from a different author or vendor, or when the corresponding source code has been written using a different language.
  • a further application of the concepts of this invention enables functions that differ only in small sub-sequences to be rewritten to remove the common code sequences and turn them into subfunctions. While this type of optimisation can certainly be carried out by hand on source code, the impenetrability of the compilation process means that identifying such common code sequences in binary code cannot be carried out by conventional means.
  • the first step in this application is for the functions to have any differences identified and removed; this is very similar to the parameterisation technique previously discussed.
  • Figure 5 shows two functions, which differ only in a few opcodes, but are otherwise identical. For ease of reading, the sections with differences are shown in italics in the figure; they can be found in the 8th and the last lines of the functions. Using the techniques described above, each function is rewritten to call a common subfunction for those portions of the original code that are identical. The resulting code is shown in Figure 6; it can be clearly seen that subfunctioning results in smaller code size.
  • Figure 7 shows the basic flow of control for a stand-alone duplicate code detection engine based on the application of the technology here.
  • This duplicate code detection engine has, as its end product, a database which consists of each duplicated code sequence, together with identification information identifying the location of each duplicate in the codebase (executable identity and address), and a repeat count showing the number of times each sequence is duplicated.
  • the engine begins by initialising the duplicates database. Then it loads each binary executable from the codebase in turn, and for each binary executable, it identifies the code objects (functions and methods) contained therein. The techniques for doing this are well-known to those skilled in the art of disassembling binary code and are outside the scope of this invention.
  • For each code object, the engine identifies each candidate code sequence in that object which may qualify either as a near duplicate or as a full duplicate, and checks to see if that sequence is already contained in the database. The preferred way of doing this involves calculating a unique mathematical hash or digest of the code sequence for use as an index. It will generally be prudent to impose a lower limit on the size of each identified sequence, as very small duplicated sequences cannot be eliminated without replacing them with larger code sequences, such that no code saving could ultimately be achieved.
  • Each sequence which isn't already in the database is added to it, together with its type (either as candidate for near or full duplicate), and its unique location in the codebase (executable identity and offset address). If a sequence is already present in the database, then only the unique location of each repeated occurrence needs to be stored. As an optimisation, a repeat count can be used to store the number of times a duplicate has occurred; for subsequent processing, this saves time that would otherwise need to be spent counting the locations to work out the repeat count.
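The database-building loop described in the last two steps might be sketched as follows; the fixed candidate window, the SHA-1 digest, and the dictionary layout are illustrative choices rather than anything mandated by the engine:

```python
import hashlib

MIN_LEN = 4  # lower limit: very short duplicates yield no net code saving

def digest(seq):
    """Unique mathematical hash of a code sequence, used as the index."""
    return hashlib.sha1(" ".join(seq).encode()).hexdigest()

def build_database(codebase):
    """codebase maps executable name -> {object name: [opcode, ...]}."""
    db = {}
    for exe, objects in codebase.items():
        for obj, opcodes in objects.items():
            for i in range(len(opcodes) - MIN_LEN + 1):
                seq = opcodes[i:i + MIN_LEN]
                entry = db.setdefault(
                    digest(seq), {"seq": seq, "locations": [], "count": 0})
                entry["locations"].append((exe, obj, i))  # identity + offset
                entry["count"] += 1   # repeat count saves recounting later
    return db

# two hypothetical executables sharing an opcode run
codebase = {
    "IR.DLL": {"f": ["PUSH", "LSL", "LDR", "CMP", "B"]},
    "BT.DLL": {"g": ["PUSH", "LSL", "LDR", "CMP", "POP"]},
}
db = build_database(codebase)
```

A later pass would discard entries whose count is 1, leaving only the duplicated sequences and their locations.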
  • the database is then processed for each sequence it contains. Non- duplicated sequences are discarded, as are all sub-sequences that aren't duplicated once the repeats of their main sequence are taken into account.
  • the engine inspects the type and constructs subfunction calls for the fully duplicated candidates, with additional shims for the nearly duplicated candidates. If it is determined that there would be no code saving once these overheads are taken into account, the sequence is discarded; otherwise, the appropriate shim or call is added to the database.
  • the duplicate code detection engine will have produced a database containing a list of all the duplicate code sequences in each binary executable in the database, together with all the locations where they can be found.
  • the database can form the basis of an automatic method of removing the duplicated code. Where the source code is available, access to the compiler and linker output generated during the production of the binary can simplify this task. However, a method which does not require access to the source code is shown in Figure 8:
    o A library of callable subroutines is created from the code in the duplicates database.
    o For each binary in the codebase, the database can be interrogated to give a list of all the duplicated sequences.
    o Each duplicated sequence in the executable is either replaced with a library call to a subfunction (for full duplicates) or with a shim that calls a subfunction (for near duplicates).
  • a compiler is modified so as to generate compressed code directly, without requiring separate stages of compiling and compression of the binary code. This may be achieved by means of referencing two pre-existing dictionaries, which are included together with the generated code in the NAND flash ROM. The first of these is the pattern dictionary, containing items of duplicated binary code, which is copied into XIP memory and only needs to be referenced when the compressed code is run. The second of these dictionaries is a more conventional tokenised dictionary which needs to be referenced during a decompression stage in order to regenerate executable code when the compressed code is loaded from NAND Flash ROM to XIP memory.
  • a dictionary for an executable code file to be compressed can be constructed from a parameterised set of the patterns utilised by the compiler which was used to generate the executable code from source code.
  • This technique can result in greater space savings over the original binary code format than any technique based on sequential scanning. For example, an executable file which contained multiple calls to a subroutine, each of which passed a different parameter, would appear as different sequences if scanned in a linear fashion; but using an embodiment of this invention, the repetitive nature of the pattern (albeit with a different parameter) would be recognised.
  • a pattern dictionary can be generated for a given codebase after a normal compilation of all or part of the program files comprising that codebase.
  • There are many possible ways of constructing and using a pattern dictionary. In particular, the techniques described in relation to Figures 1 and 2 (Near Duplicates) and Figures 5 and 6 (Full Duplicates) may be used.
  • the codebase is compiled normally in the first instance (either in its entirety or using a representative subset) using a conventional compiler. However, the objects contained in each binary module are then scanned by a pattern generator tool that generates opcode sequences and associated hashes for each opcode sequence contained in that binary module. It should be noted that the term hash includes any type of unique digest mathematically obtainable from any discrete data item such as an opcode sequence.
  • the pattern generator tool maintains a record of hashes and sequences, together with a count of the number of times that each sequence repeats, in a tree structure, in such a way that sub-sequences of opcodes that already appear in the list are stored as branches from that opcode sequence.
  • the pattern generator tool traverses it and:
    a. discards all sequences that have no repeats;
    b. discards all sub-sequences that aren't repeated once the repeats of their main sequence are taken into account (so if the sequence PUSH SUB MOV LSL ADD ADD STRH BL/BLX BLX LDR STRH occurred 10 times, the sequence PUSH SUB MOV LSL ADD ADD STRH BL/BLX BLX LDR would be discarded if it did not appear more than 11 times).
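A sketch of such a tree of opcode sequences with repeat counts, together with the pruning rule of steps a and b: a sub-sequence survives only if it is repeated, and repeated strictly more often than its longer extension. The Node class and tuple encoding are illustrative assumptions:

```python
class Node:
    """One opcode in the tree; sub-sequences hang off as branches."""
    def __init__(self):
        self.children = {}
        self.count = 0

def insert(root, seq):
    """Record seq, counting every prefix of it along the way."""
    node = root
    for op in seq:
        node = node.children.setdefault(op, Node())
        node.count += 1

def survivors(node, prefix=()):
    """Yield sequences worth keeping: those with repeats, excluding
    sub-sequences repeated no more often than their longer extension."""
    for op, child in node.children.items():
        seq = prefix + (op,)
        ext = max((c.count for c in child.children.values()), default=0)
        if child.count >= 2 and child.count > ext:
            yield seq, child.count
        yield from survivors(child, seq)

root = Node()
for _ in range(10):
    insert(root, ("LSL", "LDR", "CMP"))
for _ in range(2):            # two extra, independent uses of the prefix
    insert(root, ("LSL", "LDR"))
kept = dict(survivors(root))
```

Here ("LSL", "LDR") survives because its 12 occurrences exceed the 10 repeats of its extension, while ("LSL",) is discarded: it never appears outside the longer sequence.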
  • the pattern generator tool then traverses the tree a second time, and creates templated shims for the core common functions, and subfunction calls for the remaining sequences, as appropriate. These can be added to each branch and leaf in the tree. During this step, those sequences for which the addition of these overheads would result in no code savings are discarded.
  • the pattern generator tool traverses the tree a final time, and generates three persistent objects:
    a. an executable pattern dictionary consisting of those core common functions and subfunctions which result in a code saving for the codebase, which will later be placed in the NAND Flash ROM as a callable library. As such, an export table for each pattern is built and included with the dictionary as if it were a conventional library;
    b. an indexed list of templates for the shims to be used with the core common functions in the pattern dictionary;
    c. a hash lookup array in which each element contains:
       i. the hash for each sequence in the pattern dictionary (which is the element used to order the array);
       ii. the ordinal of each matching pattern in the dictionary;
       iii. either 0 (for a subfunction) or the index to the appropriate shim template (for the core common functions), where the first function has an index of 1.
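The hash lookup array (persistent object c) might be built and searched as follows; the SHA-1 digest and the (hash, ordinal, shim index) tuple layout are illustrative assumptions:

```python
import bisect
import hashlib

def h(seq):
    return hashlib.sha1(" ".join(seq).encode()).hexdigest()

def build_lookup(patterns):
    """patterns: list of (opcode sequence, shim index); shim index 0 means
    a plain subfunction, otherwise it indexes a shim template (1-based).
    Ordinals are 1-based positions in the pattern dictionary."""
    return sorted((h(seq), ordinal, shim)           # ordered by hash
                  for ordinal, (seq, shim) in enumerate(patterns, start=1))

def lookup(array, seq):
    """Binary-search the array by hash; return (ordinal, shim index)."""
    key = h(seq)
    i = bisect.bisect_left(array, (key,))
    if i < len(array) and array[i][0] == key:
        return array[i][1], array[i][2]
    return None

patterns = [(("LSL", "LDR", "CMP"), 1),   # core common function, shim 1
            (("PUSH", "POP"), 0)]         # plain subfunction
array = build_lookup(patterns)
```

Ordering the array by hash is what allows the compiler's later lookups to be a binary search rather than a linear scan of the dictionary.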
  • the compiler starts by generating binary code for each object or function in the source code as normal; but, for each object so generated, it calculates hashes for the opcode sequences it contains and checks the hash lookup array to see if the sequence is a repeated one.
  • the compiler looks up the index to the appropriate shim template and the ordinal for the matched pattern in the executable pattern dictionary.
  • If the index is 0, the sequence is replaced by a call or branch to the matched subfunction, referenced by its ordinal in the pattern dictionary. If the index is not 0, the sequence is replaced by the shim for the matched core common function followed by a call or branch to that common function, again referenced by its ordinal in the pattern dictionary.
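This replacement step might be sketched as follows; the CALL/SHIM/BRANCH pseudo-instructions, the single fixed window length, and the plain dictionary standing in for the hash lookup array are all simplifying assumptions:

```python
def reduce_object(opcodes, window, matches):
    """Replace each matched opcode window with either a plain CALL (full
    duplicate) or a SHIM plus BRANCH (near duplicate). `matches` maps an
    opcode tuple to (ordinal, shim index); shim index 0 = full duplicate."""
    out, i = [], 0
    while i < len(opcodes):
        hit = matches.get(tuple(opcodes[i:i + window]))
        if hit:
            ordinal, shim = hit
            if shim == 0:
                out.append(("CALL", ordinal))        # subfunction call
            else:
                out.append(("SHIM", shim))           # load the parameters
                out.append(("BRANCH", ordinal))      # then enter the core
            i += window
        else:
            out.append(opcodes[i])                   # unmatched opcode kept
            i += 1
    return out

matches = {("LSL", "LDR", "CMP", "B"): (1, 0)}       # a full duplicate
reduced = reduce_object(["PUSH", "LSL", "LDR", "CMP", "B", "POP"], 4, matches)
```

The four matched opcodes collapse to a single call, while the surrounding instructions pass through unchanged.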
  • Once the compiler has checked the hashes for all the opcode sequences in the object, it proceeds in the same manner with all the other objects; and the generated object module is then written in the normal way.
  • the linker fixes up the ordinals in the pattern dictionary with their actual addresses in the usual way; linking other modules not included in the ROM image is similarly a conventional operation.
  • the extra compression step follows the code reduction techniques described above. Each remaining function can be further broken down into small commonly occurring sequences, which are then added to an additional conventional compression dictionary.
  • the second function contributes two sequences:
  • the third function contributes one sequence:
  • the fourth function doesn't contribute anything, as it can be completely built from entries already in the dictionary.
  • the entries for each of these functions are constructed from the parameterisation block (that holds all the constant values) together with the list of dictionary entries that comprise each function.
  • Tokenised dictionary compression is relatively straightforward. As each token-compressed binary executable (including the pattern dictionary) is copied into XIP memory, a decompressor scans it for the tokens inserted during the compression stage. Whenever a token is found, it is replaced with the corresponding sequence from the tokenised dictionary in order to reconstruct the uncompressed version.
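A minimal sketch of this decompression scan, representing tokens as ("TOK", index) tuples; this encoding, and treating the stream as a list rather than raw bytes, are illustrative assumptions:

```python
def decompress(stream, dictionary):
    """Scan a token-compressed stream; expand each token from the
    tokenised dictionary and copy everything else through unchanged."""
    out = []
    for item in stream:
        if isinstance(item, tuple) and item[0] == "TOK":
            out.extend(dictionary[item[1]])   # token -> original sequence
        else:
            out.append(item)                  # literal opcode passes through
    return out

tokenised = {0: ["LSL", "LDR", "CMP"]}        # hypothetical dictionary entry
original = decompress(["PUSH", ("TOK", 0), "POP"], tokenised)
```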
  • the tokenised compression step can be combined with other techniques described herein to generate the code that makes up the embedded software of a computing device.
  • devices which utilise NAND Flash ROM for this purpose are of particular interest in this context.
  • the techniques are applicable without limitation to any type of binary executable for any computing device.
  • duplicate code detection and removal can allow an increased number of software modules to be run effectively in a given amount of XIP memory such as RAM.
  • compilers are able to directly generate compressed code, leading to a quicker, more efficient and more streamlined software manufacturing process.
  • code generated by a compiler with reference to a pattern dictionary remains executable without the need for prior decompression; it retains all the functionality of the original but with a considerably reduced size.
  • Since the pattern dictionary contains sequences of executable code, it can be regarded as a library of common actions that other executable programs can request. This perception lends itself to many other uses for the invention:
    o When porting the codebase to a different environment (such as new hardware) where the logic of each executable remains unchanged but where various common actions need to be handled in a different way, provision of an alternative pattern dictionary enables this to be achieved with less rewriting and recompilation of programs. Replacement of the relevant patterns in the dictionary is all that is required to ensure compliance with the new environment.
  • the conventional compression dictionary needs to be stored only once, in the NAND Flash, together with the compressed program files; when the contents of the ROM are copied to XIP memory (either as a core OS image at boot time or as single executables, on demand, at run time), the decompressor references the conventional compression dictionary to rebuild the original executables.
  • These executables may themselves be compressed with reference to the pattern dictionary, which will itself have been included in the body of embedded software and copied to XIP memory, either at boot time or on demand.

Examples

Abstract

A method for identifying repeats in a set of computer executable code comprising a plurality of instructions defined by constituent operations and constants, the method comprising: analysing the set of code to identify one or more sequences of instructions that is repeated within the set of code, wherein the step of identifying one or more repeated sequences comprises, for each sequence of instructions: analysing only the constituent operations of the sequence, and ignoring any constants included in the sequence, and determining the sequence to be repeated if multiple instances of that sequence in the set of code each contain the same constituent operations in the same order.

Description

DUPLICATE CODE DETECTION
The present invention relates to a method and apparatus for identifying repeats in computer executable code.
In the present discussion the term 'computing device' is intended to include, without being limited to, Desktop and Laptop computers, Personal Digital Assistants (PDAs), Mobile Telephones, Smartphones, Digital Cameras and Digital Music Players. It also includes converged devices incorporating the functionality of one or more of the classes of device already mentioned, together with many other industrial and domestic electronic appliances.
It is very common for computing devices, especially portable ones, to embed their controlling software as well as some or all of their application software in a persistent memory store, which is normally protected in some way from accidental erasure. Storage media for such embedded software stores are known generically as ROMs, as Read-Only Memory was historically the first type to be used.
The masked ROMs used in older computing devices are relatively expensive and inflexible, and modern computing devices typically rely on less expensive and more easily updated technologies, such as NAND Flash, to provide a persistent store for embedded software. However, NAND flash is unlike the older type of ROM in that it does not permit code to be executed directly from persistent store. It has to be copied to a different type of storage, such as Random Access Memory (RAM), to be executed. In this respect, embedded program code on NAND Flash behaves in the same way as hard disk drives on conventional desktop computers.
Because of the expense of ROMs, and the continual downward pressure on the price of those computing devices marketed as consumer products (such as mobile telephones and PDAs), there has always been a pressure to use some type of compression on part or all of their embedded software; a computing device that compresses its embedded software needs a smaller ROM and is therefore less expensive to manufacture. The fact that this meant a loss of the ability to execute code in place was often sufficient reason not to compress embedded software. But on modern computing devices which make use of NAND Flash to store embedded software, nothing can be executed in place in any case; so the arguments for compressing its contents are much more compelling. Apart from the saving of space, it can be the case, if the processor is fast and the NAND flash is slow, that the time saved from having to read less content from the ROM compensates for the extra time spent on decompression.
The growth in the number of computing devices aimed at general consumers, which began in the late 1970s with the widespread availability of microprocessors, has been paralleled by an equivalent growth in both the amount of software in these devices, and in the size of each item. It is now very common for computing devices to contain large amounts of software. For many classes of computing device, this does not present any problems, as their hardware has more than kept pace with the demands placed upon it. When the original IBM Personal Computer (PC) was released in 1981, both the size of its memory and its storage capacity were measured in kilobytes, and its software came on floppy disks; twenty-five years later, state of the art PCs measure their memory in gigabytes and their storage capacity in terabytes. Clearly, personal computers are not generally resource constrained.
This is not, however, true of other types of computing device. Portable computing devices, which typically run off batteries, are limited both by their size (which has to be small) and by their power consumption (which needs to be minimal). Furthermore, many of these devices, such as mobile telephones and MP3 players, sell in quantities that are orders of magnitude greater than any model of PC; this means that small differences in the price of their components are of far greater importance.
Resource-constrained devices, which are built to a particular price point and hardware specification, commonly need to include as much functionality and performance as possible. One key technique for achieving this aim is to budget memory requirements, particularly of embedded software, very carefully. Saving memory translates directly into either lower prices or increased functionality.
It is known to those skilled in the art that one method of reducing the memory used by a body of software is to identify duplicated and repeated code, and replace such occurrences with calls to subroutines or library functions. There are many commercially available tools for performing the task of identifying and removing duplicated code (see http://www.redhillconsulting.com.au/products/simian/ for instance). However, they are sub-optimal in terms of memory reduction for a number of reasons.
One reason is that the motivation behind many tools is not that removing duplicated code saves memory, but that removing duplicated code lowers maintenance costs. One manufacturer writes:
"Large software systems typically contain 10-25% redundant code ... If redundant code were removed, the IT organization could spend the budget savings on solving new problems, enhancing the efficiency of the parent organization. The savings can be significant, because it typically costs at least $1.00 per source line per year to maintain application software in running order."
(http://www.semanticdesigns.com/Products/Clone/index.html)
The emphasis on lowering maintenance costs has led to a methodology by which existing tools for detecting and removing repeated code examine only the source code.
Examination only of the source code may well be the best way of reducing maintenance costs, but it is not always the case that this is the best way of reducing the memory footprint of a body of code (or codebase) embedded in or used in a computing device.
• Not all the source code is necessarily available in practice. Especially in the case of a complex consumer device, different software components may be obtained from different vendors, and the terms of the licenses granted do not always offer access to the source code.
• Different software modules may be written in different computer languages. In such circumstances, analysis of two items of source code written in different languages will not be able to tell whether they perform the same task and compile down to similar binary sequences.
• Features in some programming languages (for example, templates in C++) are often used to reduce apparent source code length and hence the amount of code required to be maintained; however, this does not usually lead to a corresponding reduction in the size of the binary code.
Currently, the preparation of code to be embedded in NAND Flash ROMs has two phases. These are: 1. The compilation of the code to be used in the device. Compilers take source code, analyse it for syntactic and semantic correctness, and then generate the binary executable code that is eventually loaded by the processor in the computing device.
2. The binary executable code from phase 1 above is compressed to a smaller size; it is then placed into the NAND Flash ROM in its compressed form.
Once the prepared code is embedded in the computing device, it needs to be retrieved and decompressed before use, with the original version being regenerated and placed in eXecute In Place (XIP) memory before it is run.
The most common compression mechanism in use is some type of Lempel Ziv Welch (LZW) compression, as described in the Unisys patent US 4,558,302. LZW compression scans sequentially through the data to be compressed and looks for repeated sequences. If it is possible to replace a repeated sequence with a token that occupies less space than the original, then the replacement is made and the original sequence and the token are saved in a compression dictionary. When decoding, tokens are looked up in the dictionary and replaced with the corresponding sequence, enabling the source material to be recovered from the compressed text.
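The sequential scanning behaviour just described can be illustrated with a minimal sketch of LZW-style dictionary compression (this is a simplified illustration, not the patented Unisys implementation; details such as code widths are omitted):

```python
# Minimal LZW-style compression sketch: repeated sequences are replaced
# by integer codes recorded in a dictionary built during a single
# sequential scan of the data.
def lzw_compress(data: bytes) -> list[int]:
    # Seed the dictionary with all single-byte sequences.
    dictionary = {bytes([i]): i for i in range(256)}
    next_code = 256
    result = []
    current = b""
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in dictionary:
            current = candidate  # keep extending the longest known match
        else:
            result.append(dictionary[current])  # emit code for longest match
            dictionary[candidate] = next_code   # record the new sequence
            next_code += 1
            current = bytes([byte])
    if current:
        result.append(dictionary[current])
    return result
```

Because codes for previously seen sequences replace the sequences themselves, repetitive input compresses to fewer symbols than it contains bytes.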
As described above, the current technological process for building a NAND flash ROM requires:
1. that uncompressed binaries be built by a compiler; and
2. that the binaries be compressed by a compression program.
If it were possible for a compiler to generate compressed code directly, then the second step in the process described above could be missed out, which would result in a simplified manufacturing process with many advantages.
According to a first aspect of the present invention there is provided a method for identifying repeats in a set of computer executable code comprising a plurality of instructions defined by constituent operations and constants, the method comprising: analysing the set of code to identify one or more sequences of instructions that are repeated within the set of code, wherein the step of identifying one or more repeated sequences comprises, for each sequence of instructions: analysing only the constituent operations of the sequence, and ignoring any constants included in the sequence, and determining the sequence to be repeated if multiple instances of that sequence in the set of code each contain the same constituent operations in the same order.
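The comparison described in this aspect can be sketched in miniature, modelling each instruction as an (operation, constant) pair; two sequences then count as repeats of one another when their operations match in the same order, whatever their constants (all names here are illustrative):

```python
# Sketch of the claimed analysis: only the constituent operations of a
# sequence are considered, and any constants are ignored.
def opcode_signature(sequence):
    """Return only the constituent operations, ignoring any constants."""
    return tuple(op for op, _const in sequence)

def is_repeat(seq_a, seq_b):
    # Two sequences are repeats if their operations match in order.
    return opcode_signature(seq_a) == opcode_signature(seq_b)
```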
The step of analysing the set of code may comprise assessing in turn each sequence occurring in the set of code, generating an identifier for the sequence, checking whether the identifier exists in a data structure, and if so, determining the sequence to be repeated.
The method may further comprise the step of, if the identifier does not exist in the data structure, adding the identifier to the data structure.
The identifier could be a hash generated from the respective sequence.
The method could further comprise determining, for each sequence in the set of code, the number of times the sequence is repeated within the set of code; and it may optionally include recording in the data structure, for each repeated sequence, the determined number of times the sequence is repeated.
The method could further comprise the steps of: for each repeated sequence, determining whether an overall reduction in the size of the set of code could be achieved by modifying the sequence, and, if so: generating a core function that is common to each instance of the repeated sequence, and storing the core functions in a pattern dictionary associated with the set of code.
It could also comprise the step of, for each instance of each repeated sequence, generating an abbreviated sequence arranged to call to an associated core function, wherein the combination of the abbreviated sequence and the associated core function provides the same functionality as the instance of the repeated sequence.
A reduced set of computer executable code may be generated by replacing each instance of each repeated sequence in the set of code with the respective abbreviated sequence, whereby on execution of the set of code, each abbreviated sequence will call to an associated core function stored in the pattern dictionary. The pattern dictionary could be a part of the data structure. The set of computer executable code could be the result of a compilation step performed by a given compiler, and the data structure could be used in identifying repeated sequences in further sets of computer executable code generated by the same compiler.
According to a second aspect of the present invention there is provided a computer program for performing the method set out above.
According to a third aspect, there is provided a data carrier carrying the said computer program.
According to a fourth aspect there is provided a computing device having stored thereon the reduced set of computer executable code produced by the method outlined above.
The computing device could also have stored thereon the pattern dictionary comprising core functions associated with repeated sequences within the said set of code. The pattern dictionary is preferably stored in such a way that it is always available to programs running on the device. It may be stored in a dynamic link library, and/or in execute-in-place memory.
According to a fifth aspect of the invention there is provided the reduced set of computer executable code produced by the method outlined above.
According to a sixth aspect there is provided an operating system comprising the reduced set of computer executable code.
According to a seventh aspect of the invention there is provided a data carrier carrying the reduced set of computer executable code.
Embodiments of the present invention may thus be used to remedy the deficiencies outlined above. Duplicate code is preferably detected by analysis of the binary that makes up each item of executable code; therefore, duplicated executable code can be detected when the source code is either unavailable or is in different languages. The term "executable" is used herein to refer to code that is in a form such that it can be acted upon by a CPU. This could be compiled, binary code, or it could be code that needs to be interpreted by an interpreter before it can be processed by a CPU.
In common with some known duplicate code detection methods used on source code, embodiments of the present invention also allow the identification of near-duplicates; that is, code sequences which are nearly identical to each other can be found.
In particular, the inventors of the present invention have appreciated that since executable code is generated by a compiler, it consists of a finite number of code generation sequences or patterns that are modified by parameters specifying limited variability, such as the value of a constant or the identity of a register. Because the patterns are modified by parameters each time they are generated by the compiler, they cannot be recognised by common algorithms such as LZW that rely on sequential scanning. However, the inventors have identified that the underlying patterns may be identified in the executable code to advantageous effect, as described fully below.
As well as detecting duplicated code, some embodiments of this invention also offer methods for the removal or refactoring of both duplicates and near-duplicates. Such embodiments build on the perception that executable code consists of a finite number of code generation sequences or patterns.
• Where analysis of binary sequences reveals patterns are completely identical, we have an instance of full duplication and each occurrence of the sequence can be replaced by a call to a single instance of the pattern recoded as a subroutine.
• Where analysis of binary sequences reveals that while operation code (or opcode) sequences are identical but that there are differences in the accompanying parameters that specify limited variability, such as the value of a constant or the identity of a register, we have an instance of near duplication where the code can usually be refactored into a small shim sequence enabling the nearly duplicated sequences to call a single instance of the pattern codes as a common subroutine, as explained below.
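The two cases above can be modelled in miniature, again treating each instruction as an (opcode, constant) pair: the common opcode pattern is hoisted into a single "core", and each near-duplicate is reduced to its own constants, which are recombined with the core at call time (a toy model; all names are illustrative, and real shims also deal with register allocation and calling conventions):

```python
# Toy model of refactoring near duplicates: split each function into a
# per-function "shim" (its constants) and a shared "core" (the common
# opcode pattern found in every instance).
def refactor(functions):
    core = tuple(op for op, _c in functions[0])            # common opcode pattern
    shims = [tuple(c for _op, c in f) for f in functions]  # per-function constants
    return core, shims

def execute(core, shim):
    # Recombine a shim's constants with the shared core pattern,
    # reproducing the original instruction sequence.
    return [(op, const) for op, const in zip(core, shim)]
```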
The present invention will now be described by way of example, with reference to the accompanying drawings, in which: Figure 1 shows a set of executable code;
Figure 2 shows the code of Figure 1 reduced in size in accordance with an embodiment of the invention;
Figures 3 and 4 show further example sets of code;
Figure 5 shows a list of instructions;
Figure 6 shows the code of Figure 5 reduced in size in accordance with an embodiment of the invention; and
Figures 7-10 illustrate exemplary techniques utilising the concepts of the invention.
The following description of an embodiment of this invention makes use of binary executable code developed for Symbian OS, the advanced operating system for mobile telephones developed by Symbian Ltd, London, UK, and compiled for the ARM microprocessor using the Thumb instruction set (which contains 16-bit instructions). It will be readily appreciated by those skilled in the art that the principles underlying this invention are applicable to any operating system and any instruction set for any microprocessor.
Different instruction sets have instructions of different forms, and while the Thumb instruction set is discussed here by way of example, the concepts of the invention are equally applicable to other instruction sets such as, for example, the Intel instruction set (http://www.jegerlehner.ch/intel/opcode.html). In the Intel instruction set, instructions can be of different lengths. They may include constant values. For example, a MOV operation can refer to a particular value that is to be "moved" into a register.
In the context of this invention, the term "instruction" is intended to include the constants that may be used by an operation, as well as the operation itself.
We consider firstly instances of near duplication.
Near Duplicates
Figure 1 shows an example of the phenomenon of repeated sequences including different parameters. It shows the executable code generated by a compiler for three different functions used by Symbian OS. This code will be described in some detail for the benefit of those unfamiliar with assembly language.
Figure 1 shows a conventionally compiled version of a set of computer executable code. The code includes three functions, whose entry point addresses are indicated in the first line of each function (00005888 etc.). Each function includes 12 instructions: PUSH, LSL, LDR, etc. The definitions of these instructions can be found in the ARM Architectural Reference Manual edited by Dave Jaggar and published by Prentice Hall, or at www.arm.com.
It can be seen from a review of the three functions shown in Figure 1 that the instructions in each function are of the same form, and that they differ only in the constant parameters to which they refer. For example, in line 3 of the first function, the LDR (Load Register) instruction references the parameter 13 (shown in italics and underlined). This indicates that the constant value 13 is to be loaded into a register. In contrast, the instruction at line 3 of the second function references 0d and the instruction at line 3 of the third function references 07. In all other respects the instructions at the third lines of each function are identical. Similar differences in the constant parameters referenced by the functions are indicated by italic font and underlining in Figure 1.
It can thus be seen from a comparison of the object code for the three functions shown in Figure 1 that while the form of the instructions generated by the compiler are identical, the binary sequences are non-identical: different constant parameters accompany the LDR, ADD and STR instructions in the three functions.
A specific embodiment of the present invention aims to reduce the size of code such as that shown in Figure 1 by taking advantage of the similarities in the structure of the different functions, even though the binary sequences are non-identical. This is achieved by extracting sequences of instructions that are repeated within the set of code, and replacing them with abbreviated sequences that can each call to the common elements of the repeated sequences. Thus, in the example of Figure 1, the repeated sequence: LSL, LDR, CMP, B, LSL, ADD, BL/BLX, BL, MOV, STR, POP is extracted from each of the three functions and placed in a separate commonly accessible function, referred to here as a "core" function. Figure 2 shows a set of executable code that is identical in functionality to the code of Figure 1, but has been reduced in size in accordance with the specific embodiment. Each of the three original functions now exists in an abbreviated form, which is referred to here as a "shim" function. Thus, the first, second and third functions each now contain only three instructions: PUSH, MOV, MOV, together with a branch instruction, B, in the case of the first and second functions, as discussed further below. Each shim function thus has the effect of moving the values of the constant parameters referred to in the original function into registers.
An analysis of the original code reveals that registers 05 and 06 are unused in the original functions, and they are therefore available as temporary stores for holding the parameters referred to by the functions. On inspection of Figure 2, it can be seen that the first shim function has the effect of moving the value 4c into register 05. 4c is 4 multiplied by the constant value 13, which was referenced in line 3 of the first function of Figure 1. Thus, the first shim multiplies the first referenced parameter by 4, and moves it into one of the pre-defined available registers. Additionally, the first shim moves the value e0, which is 4 times the constant 38, into register 06. The reason for multiplying the constants by 4 before storing them in the registers derives from a characteristic of ARM processors: it is only possible to access those addresses that are "word aligned", that is, at the start of a word boundary, with a word being defined as 4 bytes. Thus, to ensure that a value added to a register is a whole number of words, the original value (here, the values 13 and 38) is multiplied by 4.
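The word-alignment arithmetic above can be checked directly (constants are hexadecimal, as in the figures):

```python
# Before a constant is parked in a spare register, the shim multiplies
# it by the 4-byte ARM word size so that any address derived from it
# remains word aligned.
WORD_BYTES = 4

def word_scaled(constant: int) -> int:
    return constant * WORD_BYTES
```

For the first shim of Figure 2, 0x13 scales to 0x4c and 0x38 scales to 0xe0, matching the values moved into registers 05 and 06.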
Having loaded the constant parameters referenced in the original function into the predefined registers, the shim can then perform its second role: enabling the use of a further block of code, here a common core function. The core function is the fourth function shown in Figure 2. It contains the instructions which make up the sequence found in each of the original three functions: LSL, LDR, CMP, B, LSL, ADD, BL/BLX, BL, MOV, STR, POP. In order to provide this functionality to each of the three reduced functions, the shims each call to the core function. In the case of the first and second shims, this is achieved by means of the branch instruction at the end of the shim. In the case of the third shim, the end of the shim is adjacent in terms of its memory address to the beginning of the core function, and therefore the core function runs automatically when the final instruction of the third shim is executed.
By virtue of this arrangement, when the core function comes to be executed one of the shim functions has already run and caused the parameters specific to the corresponding original function to be available in pre-defined registers (05 and 06) which are then accessed by appropriate instructions in the core function. In this example, the core function cannot be accessed directly; it can only run by being linked to from another set of code, in this case the three shims.
Thus, in this example of the invention, the code for the three functions has been rearranged in such a way that their previously differing constant parameters are instead stored in the computing device's registers. Since code to achieve this is available, the constants embedded in the code are no longer required, and not merely the instruction sequences but the entire binary sequences become identical, enabling the previous three separate sequences to be replaced with one single common sequence. We refer to this rearrangement as refactoring the code.
It can be seen that the code of Figure 1 contains three functions, each having 12 16-bit Thumb instructions. This gives a total code size of 3x12x2 bytes, i.e. 72 bytes. The code of Figure 2 contains three shim functions, having 4, 4 and 3 instructions respectively, and a core function having 11 instructions. This gives a total code size of (4x2 + 4x2 + 3x2 + 11x2) bytes, i.e. 44 bytes. In other words, the technique exemplified in Figures 1 and 2 turns a 72 byte set of code into a 44 byte set of code having three entry points. This constitutes a code saving of almost 40%.
Near duplicates in multiple executables
While the example above shows how the size of the code in a single module can be substantially reduced, it is important to note that repetitive code patterns such as these are a feature of compiler technology rather than a feature of individual programs. Consequently, all executable code that has been generated by the same compiler is likely to include the same code patterns.
Where the same patterns occur in multiple executables, it can often be extremely efficient to refactor them together and implement any necessary core functions in a central location, such as a DLL (Dynamic Link Library) or an XIP (eXecute In Place) core OS (Operating System), which is guaranteed to be always available.
However, when refactoring functions in different executables, a modified technique has to be used to refactor them to make use of the common pattern. Figures 3 and 4 show how this modified technique can be implemented, using as an example some further code from Symbian OS compiled for an ARM microprocessor. It should be noted that this specific code is shown merely to facilitate explanation of this exemplary embodiment, and is not intended to be limiting.
It will be noted immediately by those skilled in the art that the function CL2CAPProtocol::RemoveIdleTimerEntry, which is taken from one of 14 similar functions in the executable BT.DLL and appears at the start of Figure 3, is clearly an identical pattern to all three of the functions taken from the executable IR.DLL which appeared in Figure 1. All have the same checksum (which appears as RoXoR: 583c5b09 at the end of each function). However, it will also be noted that the BL (Branch with Link) instruction that appears in the 9th line of the functions in Figure 1 takes the parameter 0b0c0, while the same BL instruction that appears in the 9th line of function CL2CAPProtocol::RemoveIdleTimerEntry takes the parameter 018d8.
To take account of this difference, we need to use two shims rather than one. The first shim, as before, is specific to each function in a particular executable, while the second shim is for use by all the functions in that executable.
This can be seen in the case of the functions from IR.DLL, where the new shim structure appears in Figure 3. The same shims appear as in Figure 2, but now they branch not to the common core function, but instead to a second shim, which loads the parameter 0b0c0 into a register before branching off to a common core function located not within the same executable, but in a commonly accessible location such as an XIP core OS image. Note that an import stub would be required were this to be made to look like another DLL.
Figure 4 shows the same techniques applied to the function from the executable BT.DLL. The specific shim for the function CL2CAPProtocol::RemoveIdleTimerEntry branches to a shim for all the 14 functions in that executable which use the same pattern; it loads the parameter 018d8 (as used by BT.DLL) into a register before branching to the same common core function (found in a location such as an XIP core OS image) as used by IR.DLL, which appears at the bottom of Figure 4.
With this modified technique, it can be seen that once a pattern dictionary has been constructed, including the patterns used by a compiler, all the binary executables in any collection that were developed with that compiler may make use of the same dictionary; this is because the pattern dictionary is derived from the characteristics of the compiler used to generate the code.
This is a clear improvement on standard LZW compression techniques; with a single pattern dictionary in an accessible location such as XIP memory, there is no need to store a separate dictionary in each compressed file, and the compressed code does not need to be decompressed before it can be executed.
Furthermore, since the source code is not required, the technique can be used to identify such duplicated code when it is only available in the form of a binary executable, delivered from a different author or vendor, or when the corresponding source code has been written using a different language.
Full duplication
A further application of the concepts of this invention enables functions that differ only in small sub-sequences to be rewritten to remove the common code sequences and turn them into subfunctions. While this type of optimisation can certainly be carried out by hand on source code, the impenetrability of the compilation process means that identifying such common code sequences in binary code cannot be carried out by conventional means.
The first step in this application is for the functions to have any differences identified and removed; this is very similar to the parameterisation technique previously discussed.
Figure 5 shows two functions, which differ only in a few opcodes, but are otherwise identical. For ease of reading, the sections with differences are shown in italics in the figure; they can be found in the 8th and the last lines of the functions. Using the techniques described above, each function is rewritten to call a common subfunction for those portions of the original code that are identical. The resulting code is shown in Figure 6; it can be clearly seen that subfunctioning results in a smaller code size.
It should be noted that the technique only works a) if the data and registers involved in the common sequences are similar enough; and b) if there are no direct dependencies between the common sequences and the differences.
It should also be remembered that we have previously made clear that the use of the ARM instruction set in our examples is for illustration, and that the principles behind this invention are applicable to all instruction sets for all microprocessors. Those familiar with ARM assembly language programming will note that Figure 6 includes the use of a BL/BX R14 pair to call the common subfunctions. While this is a common way of calling a subfunction in ARM Thumb code, it will be appreciated that because this alters the stack frame, it should be modified if the register pair needs to be preserved.
Example implementation of Duplicate Code Detection
For the purposes of illustration only, a sample implementation is shown in the flowchart contained in Figure 7. This sample implementation is not intended to restrict the application of the principles underlying this invention in any way; indeed, those skilled in the art will be able to read the flowcharts and readily recognise the applicability of those principles to a wide variety of circumstances.
Figure 7 shows the basic flow of control for a stand-alone duplicate code detection engine based on the application of the technology here.
This duplicate code detection engine has, as its end product, a database which records each duplicated code sequence, together with information identifying the location of each duplicate in the codebase (executable identity and address) and a repeat count showing the number of times each sequence is duplicated. The engine begins by initialising the duplicates database. Then it loads each binary executable from the codebase in turn, and for each binary executable, it identifies the code objects (functions and methods) contained therein. The techniques for doing this are well-known to those skilled in the art of disassembling binary code and are outside the scope of this invention.
For each code object, the engine identifies each candidate code sequence in that object which may qualify either as a near duplicate or as a full duplicate, and checks to see if that sequence is already contained in the database. The preferred way of doing this involves calculating a unique mathematical hash or digest of the code sequence for use as an index. It will generally be prudent to impose a lower limit on the size of each identified sequence, as very small duplicated sequences cannot be eliminated without replacing them with larger code sequences, such that no code saving could be ultimately achieved.
Each sequence which isn't already in the database is added to it, together with its type (either as candidate for near or full duplicate), and its unique location in the codebase (executable identity and offset address). If a sequence is already present in the database, then only the unique location of each repeated occurrence needs to be stored. As an optimisation, a repeat count can be used to store the number of times a duplicate has occurred; for subsequent processing, this saves time that would otherwise need to be spent counting the locations to work out the repeat count.
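The inner loop just described — digest as index, a lower size limit on candidates, and a repeat count maintained as an optimisation — might be sketched as follows (the minimum length, the use of SHA-256, and the database layout are all illustrative assumptions, and near-duplicate candidates would be digested on their opcode signatures rather than their raw bytes):

```python
import hashlib

MIN_SEQUENCE_LEN = 4  # illustrative lower limit on candidate size

# Sketch of the detection engine's inner loop: candidates shorter than
# the limit are skipped (replacing them could not save space); the rest
# are indexed by a digest of their opcode sequence, and each entry keeps
# its locations plus a repeat count.
def index_candidates(database, executable, code_objects):
    for offset, seq in code_objects:
        if len(seq) < MIN_SEQUENCE_LEN:
            continue
        digest = hashlib.sha256(" ".join(seq).encode()).hexdigest()
        entry = database.setdefault(digest, {"locations": [], "count": 0})
        entry["locations"].append((executable, offset))  # unique location
        entry["count"] += 1  # repeat count saves re-counting later
    return database
```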
This is repeated for each sequence in the code object, and for each code object in the executable, and for each executable in the codebase.
Following this, the database is then processed for each sequence it contains. Non-duplicated sequences are discarded, as are all sub-sequences that aren't duplicated once the repeats of their main sequence are taken into account. For the sequences that remain, the engine inspects the type and constructs subfunction calls for the fully duplicated candidates, with additional shims for the nearly duplicated candidates. If it is determined that there would be no code saving once these overheads are taken into account, the sequence is discarded; otherwise, the appropriate shim or call is added to the database. At the conclusion of this process, the duplicate code detection engine will have produced a database containing a list of all the duplicate code sequences in each binary executable in the codebase, together with all the locations where they can be found.
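The keep-or-discard decision in this pass amounts to a simple size calculation, which might look like this (the call and shim overhead figures are illustrative assumptions, not values from the specification):

```python
CALL_BYTES = 4   # assumed size of a subroutine call
SHIM_BYTES = 8   # assumed extra size of a shim for a near duplicate

# A duplicated sequence is worth refactoring only if replacing every
# occurrence with a call (plus any shim) still shrinks the code overall.
def net_saving(seq_bytes, repeat_count, near_duplicate):
    overhead = CALL_BYTES + (SHIM_BYTES if near_duplicate else 0)
    saved = repeat_count * seq_bytes            # all occurrences removed
    cost = repeat_count * overhead + seq_bytes  # calls/shims plus one core copy
    return saved - cost
```

Sequences for which the result is not positive are the ones the engine discards.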
Once all the duplicate code has been identified, there are many practical options for utilising the database.
• The database can easily be interrogated to find out what the total possible saving might be for the entire codebase were the duplicates to be eliminated.
• Identification of a small set of duplicates that would enable a large percentage of the potential code savings becomes possible.
• The suppliers of each binary can be notified of the number of duplicates in their code and requested to remove them.
• The database can form the basis of an automatic method of removing the duplicated code. Where the source code is available, access to the compiler and linker output generated during the production of the binary can simplify this task. However, a method which does not require access to the source code is shown in Figure 8:
o A library of callable subroutines is created from the code in the duplicates database.
o For each binary in the codebase, the database can be interrogated to give a list of all the duplicated sequences.
o Each duplicated sequence in the executable is either replaced with a library call to a subfunction (for full duplicates) or with a shim that calls a subfunction (for near duplicates).
o As sequences that don't save code never appear in the duplicates database in the preferred embodiment, it is certain that the sequence will always be larger than its replacement. The remainder of the space taken up by the original sequence is replaced by NOP (no operation) opcodes. Processor instruction sets which don't have NOP opcodes need to use a completely neutral operation instead. Those familiar with the ARM instruction sets, for example, will be aware that an instruction such as MOV r0,r0 can be used in place of a NOP.
o When all the duplicated sequences in a binary have been dealt with, the code will be the same size as the original, but will include a large number of NOPs. These can now be eliminated to produce a smaller binary executable.
This will cause all references to addresses in the binary (such as references to methods and subroutines) to be incorrect, so it needs to be accompanied by fixing up such references to use the new addresses; any export table also needs to be patched with new addresses. As with the identification of code objects, the techniques for dealing with this addressing problem are well known to those skilled in the art of disassembling binary code and are outside the scope of this invention.
o Finally, the new binary can be written.
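The replace-then-strip approach of Figure 8 can be sketched on a toy byte-level representation; the NOP and call encodings used here are invented for illustration and are not real ARM opcodes.

```python
# Illustrative sketch of the in-place replacement step: a duplicated
# sequence is overwritten with a call, the remainder is NOP-padded so
# the binary keeps its size, and a second pass strips the padding.
NOP, CALL = 0x00, 0xC3   # invented encodings for the example

def replace_with_call(code, start, length, subfunc_id):
    """Overwrite a duplicated sequence with a 2-byte call plus padding."""
    assert length >= 2, "replacement must not be larger than the original"
    code[start] = CALL
    code[start + 1] = subfunc_id
    for i in range(start + 2, start + length):
        code[i] = NOP

def strip_nops(code):
    """Second pass: drop the padding to produce the smaller binary.
    (A real tool must also fix up branch targets and export tables,
    and must not confuse data bytes with NOP padding.)"""
    return bytearray(b for b in code if b != NOP)

binary = bytearray([0x10, 0x21, 0x32, 0x43, 0x54, 0x65])
replace_with_call(binary, 1, 4, subfunc_id=7)   # bytes 1..4 were a duplicate
assert binary == bytearray([0x10, CALL, 7, NOP, NOP, 0x65])
assert strip_nops(binary) == bytearray([0x10, CALL, 7, 0x65])
```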
In an especially preferred embodiment, a compiler is modified so as to generate compressed code directly, without requiring separate stages of compiling and compression of the binary code. This may be achieved by means of referencing two pre-existing dictionaries, which are included together with the generated code in the NAND flash ROM. The first of these is the pattern dictionary, containing items of duplicated binary code, which is copied into XIP memory and only needs to be referenced when the compressed code is run. The second of these dictionaries is a more conventional tokenised dictionary which needs to be referenced during a decompression stage in order to regenerate executable code when the compressed code is loaded from NAND Flash ROM to XIP memory.
In the present discussion we follow the standard practice in referring to a collection of repeated patterns of code as a dictionary, since this is a term commonly used in the literature on compression for collections of instances of repetitive patterns. However, it should be noted that there are considerable differences between pattern dictionaries of the type referred to in the following specific description, and the dictionaries generated and used by prior art compression methods. In particular, it should be noted that the use of a conventional compression dictionary requires the recognition by a decompressor of some type of distinctively recognizable token which was inserted into the compressed bytestream by the compressor. When the decompressor comes across such a token, it looks for accompanying index data that can be directly translated into an entry in the conventional dictionary, with that entry then being substituted for both the token and the index data. Many such conventional compression techniques are lossless, in that the original can be completely reconstituted from the compressed version.
However, with the use of some embodiments of the present invention no compression dictionary is required. The selection of a specific pattern in the collection that makes up the pattern dictionary is made by embedding a conventional branch, call or jump instruction in the code. Furthermore, while the compressed code is functionally equivalent to the uncompressed code, the compression itself is lossy, as no guarantees can be offered that the original code can be reconstituted from the compressed version.
In the following description of embodiments of the present invention, we explain how a dictionary for an executable code file to be compressed can be constructed from a parameterised set of the patterns utilised by the compiler which was used to generate the executable code from source code. This technique can result in greater space savings over the original binary code format than any technique based on sequential scanning. For example, an executable file which contained multiple calls to a subroutine, each of which passed a different parameter, would appear as different sequences if scanned in a linear fashion; but using an embodiment of this invention, the repetitive nature of the pattern (albeit with a different parameter) would be recognised.
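The key to recognising such parameterised repeats is to hash only the operations and ignore the constants, as the following sketch illustrates; the (mnemonic, operand) instruction representation is a simplification assumed for the example.

```python
import hashlib

# Minimal sketch of parameter-insensitive hashing: only the opcode
# mnemonics are fed to the digest, so two call sequences that differ
# solely in the parameter they pass hash identically.
def sequence_hash(instructions):
    """Digest over opcodes only; operands/constants are ignored."""
    h = hashlib.sha256()
    for opcode, _operand in instructions:
        h.update(opcode.encode())
        h.update(b"\x00")   # separator so 'MOV'+'SUB' != 'MOVS'+'UB'
    return h.hexdigest()

# Two calls to the same subroutine, each passing a different parameter,
# are recognised as the same pattern.
a = [("MOV", "r0,#1"), ("BL", "0x8000"), ("ADD", "r0,r0,#4")]
b = [("MOV", "r1,#99"), ("BL", "0x9F00"), ("ADD", "r2,r2,#8")]
assert sequence_hash(a) == sequence_hash(b)
```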
Building a pattern dictionary
As indicated in the above discussion, a pattern dictionary can be generated for a given codebase after a normal compilation of all or part of the program files comprising that codebase. There are many possible ways of constructing and using a pattern dictionary. In particular, the techniques described in relation to Figures 1 and 2 (Near Duplicates) and Figures 5 and 6 (Full Duplicates) may be used.
One possible implementation is outlined below and in Figure 9. This is provided for illustration and is not intended to limit the scope of this invention in any way, as those skilled in the art will readily appreciate that there are many equivalent ways of implementing the general method.
1. The codebase is compiled normally in the first instance (either in its entirety or using a representative subset) using a conventional compiler. However, the objects contained in each binary module are then scanned by a pattern generator tool that generates opcode sequences and associated hashes for each opcode sequence contained in that binary module. It should be noted that the term hash includes any type of unique digest mathematically obtainable from any discrete data item such as an opcode sequence.
Note that no sequence would span more than one compiler-generated object (such as a function).
Furthermore, as mentioned above, it would normally be efficient to enforce a lower limit on the length of a sequence, as there will always be a length below which no saving can be made. The exact figure will vary, depending on the architecture of the processor and the overhead incurred in constructing shims and subfunction calls.
2. The pattern generator tool maintains a record of hashes and sequences, together with a count of the number of times that each sequence repeats, in a tree structure arranged so that sub-sequences of an opcode sequence already in the list appear as branches from that opcode sequence.
3. Once the pattern generator tool has built the tree, it traverses it and
a. discards all sequences that have no repeats;
b. discards all sub-sequences that aren't repeated once the repeats of their main sequence are taken into account (so if the sequence PUSH SUB MOV LSL ADD ADD STRH BL/BLX BLX LDR STRH occurred 10 times, the sub-sequence PUSH SUB MOV LSL ADD ADD STRH BL/BLX BLX LDR would be discarded if it did not appear more than 11 times).
4. The pattern generator tool then traverses the tree a second time, and creates templated shims for the core common functions, and subfunction calls for the remaining sequences, as appropriate. These can be added to each branch and leaf in the tree. During this step, those sequences for which the addition of these overheads would result in no code savings are discarded.
5. The pattern generator tool traverses the tree a final time, and generates three persistent objects:
a. an executable pattern dictionary consisting of those core common functions and subfunctions which result in code saving for the codebase, which will later be placed in the NAND Flash ROM as a callable library. As such, an export table for each pattern is built and included with the dictionary as if it were a conventional library;
b. an indexed list of templates for the shims to be used with the core common functions in the pattern dictionary;
c. a hash lookup array in which each element contains
i. the hash for each sequence in the pattern dictionary (which is the element used to order the array);
ii. the ordinal of each matching pattern in the dictionary;
iii. either 0 (for a subfunction) or the index to the appropriate shim template (for the core common functions), where the first function has an index of 1.
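A brute-force sketch of steps 2 and 3 (counting windows and pruning sub-sequences) is shown below; the tree structure is replaced here with a flat table for clarity, MIN_LEN is an illustrative lower bound, and the pruning threshold mirrors the "more than 11 times when the main sequence occurs 10 times" example given in step 3.

```python
from collections import Counter

MIN_LEN = 3  # illustrative lower bound on a worthwhile sequence length

def count_windows(functions):
    """Count every contiguous opcode window of length >= MIN_LEN.
    No window spans more than one function (compiler-generated object)."""
    counts = Counter()
    for ops in functions:
        for i in range(len(ops)):
            for j in range(i + MIN_LEN, len(ops) + 1):
                counts[tuple(ops[i:j])] += 1
    return counts

def prune(counts):
    """Keep repeated sequences; discard a sub-sequence unless it occurs
    at least twice beyond the repeats of its enclosing sequence."""
    kept = {s: n for s, n in counts.items() if n >= 2}
    def contained(sub, sup):
        return len(sub) < len(sup) and any(
            sup[i:i + len(sub)] == sub
            for i in range(len(sup) - len(sub) + 1))
    return {s: n for s, n in kept.items()
            if not any(contained(s, t) and n < kept[t] + 2 for t in kept)}

funcs = [["PUSH", "SUB", "MOV", "LSL"],
         ["PUSH", "SUB", "MOV", "LSL"],
         ["PUSH", "SUB", "MOV"]]
# PUSH SUB MOV occurs 3 times, but only as part of the longer repeat
# plus one extra instance, so it is pruned; the full sequence survives.
assert set(prune(count_windows(funcs))) == {("PUSH", "SUB", "MOV", "LSL")}
```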
Once these three objects have been generated from an item of binary code, a modified version of a conventional compiler is used to directly generate compressed code. One possible implementation is outlined below and in Figure 6. This is provided for illustration and is not intended to limit the scope of this invention in any way, as those skilled in the art will readily appreciate that there are many equivalent ways of implementing the general method.
1. The compiler starts by generating binary code for each object or function in the source code as normal; but, for each object so generated, it calculates hashes for the opcode sequences it contains and checks the hash lookup array to see if the sequence is a repeated one.
2. For each sequence matched, the compiler looks up the index to the appropriate shim template and the ordinal for the matched pattern in the executable pattern dictionary.
3. If the index to the appropriate shim template is 0, the sequence is replaced by a call or branch to the matched subfunction, referenced by its ordinal in the pattern dictionary. If the index is not 0, the sequence is replaced by the shim for the matched core common function followed by a call or branch to that common function, again referenced by its ordinal in the pattern dictionary.
Once the compiler has checked the hashes for all the opcode sequences in the object, it proceeds in the same manner with all the other objects; and the generated object module is then written in the normal way. When object modules are linked to create a ROM image, the linker fixes up the ordinals in the pattern dictionary with their actual addresses in the usual way; linking other modules not included in the ROM image is similarly a conventional operation.
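Putting the lookup array and the replacement rule together, the compiler-side matching of steps 1 to 3 might look like the following sketch; the entry layout (hash, ordinal, shim index) follows the description above, while the symbolic SHIM/CALL output and all names are illustrative assumptions.

```python
import bisect
import hashlib

# Sketch of how a modified compiler consumes the hash lookup array:
# entries are (hash, ordinal, shim_index) tuples sorted by hash, so a
# binary search finds a match; shim_index 0 means a plain subfunction.
def seq_hash(ops):
    return hashlib.sha256(" ".join(ops).encode()).hexdigest()

def make_lookup(patterns):
    """patterns: iterable of (opcode_tuple, ordinal, shim_index)."""
    return sorted((seq_hash(ops), ordinal, shim)
                  for ops, ordinal, shim in patterns)

def compress_sequence(lookup, shim_templates, ops):
    """Return the replacement for ops, or None if it is not a repeat."""
    digest = seq_hash(ops)
    i = bisect.bisect_left(lookup, (digest,))
    if i == len(lookup) or lookup[i][0] != digest:
        return None                                # not a repeated sequence
    _h, ordinal, shim_index = lookup[i]
    if shim_index == 0:
        return [("CALL", ordinal)]                 # plain subfunction call
    shim = shim_templates[shim_index - 1]          # shim indices are 1-based
    return [("SHIM", shim), ("CALL", ordinal)]     # core common function

lookup = make_lookup([(("ADD", "POP"), 1, 0), (("PUSH", "SUB"), 2, 1)])
templates = ["load-args-shim"]                     # hypothetical template
assert compress_sequence(lookup, templates, ("ADD", "POP")) == [("CALL", 1)]
assert compress_sequence(lookup, templates, ("PUSH", "SUB")) == [
    ("SHIM", "load-args-shim"), ("CALL", 2)]
assert compress_sequence(lookup, templates, ("MOV", "STR")) is None
```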
As intimated above, those skilled in the art can readily modify the above processes for different compiling and linking conventions; for example, systems that link by name rather than ordinal can readily be accommodated. If preferred, the modifications which in the above description are made to the compiler could instead be made to the linker.
Tokenised dictionary compression
Should a greater degree of code compression be desirable, an extra step may be added to the processes described above. However, a consequence of this further compression is that the code is no longer executable and has to be decompressed before it can be run. It is therefore not necessarily a suitable step to take should the executable compressed code already reside in XIP memory, as XIP memory is often expensive and the requirement to have two copies of code (the compressed and the uncompressed versions) would increase the cost of manufacture of the computing device.
However, where the executable compressed code resides in non-XIP memory (such as NAND Flash) and would in any case have to be copied to XIP memory in order to be run, there are clear cost advantages in the extra compression step. There may also be speed advantages in circumstances where the time taken to decompress the code is less than the extra time it would take to read a larger code block from relatively slow NAND flash.
The extra compression step follows the code reduction techniques described above. Each remaining function can be further broken down into small commonly occurring sequences, which are then added to an additional conventional compression dictionary.
We illustrate the extra compression step with four small functions shown below:
TAgnException::SetDate
PUSH SUB ADD LSL MOV BL/BLX BL MOV LSL MOV BL/BLX BLX MOV STR ADD POP
TAgnInstanceDateTimeId::SetStartDate
PUSH SUB ADD LSL MOV BL/BLX BL MOV LSL ADD MOV BL/BLX BLX B
CAgnRptDef::SetStartDate
PUSH SUB ADD LSL MOV BL/BLX BL LSL MOV BL/BLX BL ADD POP
CAgnRptDef::SetEndDate
PUSH SUB ADD LSL MOV BL/BLX BL LSL MOV BL/BLX BL B
When we remove the commonly occurring sequences in these four functions, we can generate a dictionary of small code sequences. The first function contributes four such sequences:
1 PUSH SUB ADD LSL MOV BL/BLX BL
2 MOV LSL MOV BL/BLX BLX
3 MOV STR
4 ADD POP
The second function contributes two sequences:
5 MOV LSL ADD MOV BL/BLX BLX
6 B
The third function contributes one sequence:
7 LSL MOV BL/BLX BL
The fourth function doesn't contribute anything, as it can be completely built from entries already in the dictionary.
When the compressed code block is built, the entry for each of these functions is constructed from the parameterisation block (which holds all the constant values) together with the list of dictionary entries that comprise the function.
TAgnException::SetDate
// constant data 1 2 3 4
TAgnInstanceDateTimeId::SetStartDate
// constant data 1 5 6
CAgnRptDef::SetStartDate
// constant data 1 7 4
CAgnRptDef::SetEndDate
// constant data 1 7 6
It should be noted that the exact nature of the splits between the common code sequences depends on the contents of the entire body of compressed code, not simply on these four functions. So, for example, in the context of the complete code, it might be more efficient to split sequence 1 into
1a PUSH SUB ADD LSL
1b MOV BL/BLX BL
even though every one of our four functions contains it, if the two sequences appear separately a large number of times but appear less frequently together. Similarly, it may be the case that the sequence
3+4 MOV STR ADD POP
may appear often enough to warrant its inclusion as a separate entry in the dictionary in addition to the original sequences 3 and 4.
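A greedy tokeniser over the worked dictionary above reproduces the splits shown for the four functions; greedy longest-match is an assumption made for simplicity, since, as just noted, the optimal split depends on the whole body of code rather than on any one function.

```python
# The seven dictionary entries derived from the four example functions.
DICTIONARY = {
    1: ["PUSH", "SUB", "ADD", "LSL", "MOV", "BL/BLX", "BL"],
    2: ["MOV", "LSL", "MOV", "BL/BLX", "BLX"],
    3: ["MOV", "STR"],
    4: ["ADD", "POP"],
    5: ["MOV", "LSL", "ADD", "MOV", "BL/BLX", "BLX"],
    6: ["B"],
    7: ["LSL", "MOV", "BL/BLX", "BL"],
}

def tokenise(ops):
    """Greedily cover an opcode sequence with dictionary entry numbers,
    preferring the longest entry that matches at each position."""
    tokens, i = [], 0
    while i < len(ops):
        best = max((e for e, seq in DICTIONARY.items()
                    if ops[i:i + len(seq)] == seq),
                   key=lambda e: len(DICTIONARY[e]), default=None)
        assert best is not None, "function not coverable by dictionary"
        tokens.append(best)
        i += len(DICTIONARY[best])
    return tokens

# CAgnRptDef::SetEndDate is built entirely from existing entries.
set_end_date = ["PUSH", "SUB", "ADD", "LSL", "MOV", "BL/BLX", "BL",
                "LSL", "MOV", "BL/BLX", "BL", "B"]
assert tokenise(set_end_date) == [1, 7, 6]
```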
Decompression of tokenised code is relatively straightforward. As each token-compressed binary executable (including the pattern dictionary) is copied into XIP memory, a decompressor scans it for the tokens inserted during compression. Whenever a token is found, it is replaced with the corresponding sequence from the tokenised dictionary in order to reconstruct the uncompressed version.
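The decompression pass can be sketched as follows; the stream layout (a sentinel token followed by index data, mixed with literal opcodes) is an invented illustration of the token-plus-index scheme described earlier.

```python
# Hedged sketch of the decompressor: the compressed stream mixes
# literal opcodes with (TOKEN, index) markers; each marker is replaced
# by the dictionary entry it names.
TOKEN = object()   # sentinel standing in for the distinctive token byte

def decompress(stream, dictionary):
    out = []
    it = iter(stream)
    for item in it:
        if item is TOKEN:
            index = next(it)               # index data follows the token
            out.extend(dictionary[index])  # substitute the dictionary entry
        else:
            out.append(item)               # literal opcode, copied through
    return out

dictionary = {1: ["PUSH", "SUB"], 4: ["ADD", "POP"]}
compressed = [TOKEN, 1, "MOV", TOKEN, 4]
assert decompress(compressed, dictionary) == ["PUSH", "SUB", "MOV",
                                              "ADD", "POP"]
```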
Thus the tokenised compression step can be combined with the other techniques described herein to generate the code that makes up the embedded software of a computing device. As described above, devices which utilise NAND Flash ROM for this purpose are of particular interest in this context. However, the techniques are applicable without limitation to any type of binary executable for any computing device.
Note that other compression techniques (such as standard LZW) may still be used to compress embedded components such as non-executable data files wherever that method provides superior performance. Should further development in compression technology warrant it, it is also possible for such techniques to be used together with the pattern dictionary in place of the tokenised compression described above.
It should be noted that the construction of a pattern dictionary according to embodiments of the invention only needs to be performed once for a given codebase; subsequent generations of any portions of the code will make use of the same pattern dictionary. Construction of a new pattern dictionary would only be required should the programming paradigms in the codebase substantially change consequent on a major revision which introduced pervasive new Application Programming Interfaces (APIs).
The following set of advantages may result from using embodiments of the invention:
• Detection of duplicated code in binaries can enable computing devices to be manufactured with less memory. This in turn means lower cost to the consumer, and also a reduction in the resources needed to manufacture a device and in the power needed to keep it running.
• Alternatively, duplicate code detection and removal can allow an increased number of software modules to be run effectively in a given amount of XIP memory such as RAM.
• Removal of duplicated code in executable binaries makes them smaller; therefore they will load faster, and the computing device will become more responsive.
• The ability to identify duplicated code in binary modules from different suppliers, when no source code is available, can be used to encourage cooperation and partnership between organisations that may not have realised how much they have in common.
• Once a pattern dictionary is constructed, compilers are able to generate compressed code directly, leading to a quicker, more efficient and more streamlined software manufacturing process.
• Unlike other methods of compressing software, code generated by a compiler with reference to a pattern dictionary remains executable without the need for prior decompression; it retains all the functionality of the original but with a considerably reduced size.
• Generation of smaller executables by a compiler means that the executables can load faster. As well as providing an enhanced user experience, this saves power. On mobile devices, this gives increased battery life; on devices obtaining power from conventional sources, reduced power consumption equates to less pollution and a lower carbon footprint, thereby diminishing (albeit by a small amount) global warming.
• Because the pattern dictionary contains sequences of executable code, it can be regarded as a library of common actions that other executable programs can request. This perception lends itself to many other uses for the invention:
o When porting the codebase to a different environment (such as new hardware) where the logic of each executable remains unchanged but where various common actions need to be handled in a different way, provision of an alternative pattern dictionary enables this to be achieved with less rewriting and recompilation of programs. Replacement of the relevant patterns in the dictionary is all that is required to ensure compliance with the new environment.
o When updating a device that has already been delivered to an end-user, smaller amounts of code need to be transmitted; one update to a pattern dictionary can serve to update many individual programs, and where an individual program needs to be updated, this can be done without needing to retransmit the unchanged common functions.
• There are numerous benefits that arise from the more efficient and economical storage of software in such devices. The conventional compression dictionary needs to be stored only once, in the NAND Flash, together with the compressed program files; when the contents of the ROM are copied to XIP memory (either as a core OS image at boot time or as single executables, on demand, at run time), the decompressor references the conventional compression dictionary to rebuild the original executables. These executables may themselves be compressed with reference to the pattern dictionary, which will itself have been included in the body of embedded software and copied to XIP memory, either at boot time or on demand.

Examples
In the following, some specific examples of code reduction are considered in order to further illustrate the principles of the invention.
[The examples referred to above are presented as images imgf000028_0001 to imgf000031_0001 in the published application.]
From the above description, it will be apparent to the skilled person that various modifications of the specifically described methods and implementations may be made within the scope of the invention, as defined by the appended claims.

Claims

1. A method for identifying repeats in a set of computer executable code comprising a plurality of instructions defined by constituent operations and constants, the method comprising: analysing the set of code to identify one or more sequences of instructions that is repeated within the set of code, wherein the step of identifying one or more repeated sequences comprises, for each sequence of instructions: analysing only the constituent operations of the sequence, and ignoring any constants included in the sequence, and determining the sequence to be repeated if multiple instances of that sequence in the set of code each contain the same constituent operations in the same order.
2. A method according to claim 1 wherein the step of analysing the set of code comprises assessing in turn each sequence occurring in the set of code, generating an identifier for the sequence, checking whether the identifier exists in a data structure, and if so, determining the sequence to be repeated.
3. A method according to claim 2 comprising the step of, if the identifier does not exist in the data structure, adding the identifier to the data structure.
4. A method according to claim 2 or claim 3 wherein the identifier is a hash generated from the respective sequence.
5. A method according to any of claims 2 to 4 further comprising determining, for each sequence in the set of code, the number of times the sequence is repeated within the set of code.
6. A method according to claim 5 further comprising recording in the data structure, for each repeated sequence, the determined number of times the sequence is repeated.
7. A method according to any preceding claim further comprising the steps of: for each repeated sequence, determining whether an overall reduction in the size of the set of code could be achieved by modifying the sequence, and, if so: generating a core function that is common to each instance of the repeated sequence, and storing the core functions in a pattern dictionary associated with the set of code.
8. A method according to claim 7 further comprising the step of, for each instance of each repeated sequence, generating an abbreviated sequence arranged to call to an associated core function, wherein the combination of the abbreviated sequence and the associated core function provides the same functionality as the instance of the repeated sequence.
9. A method according to claim 8 further comprising generating a reduced set of computer executable code by replacing each instance of each repeated sequence in the set of code with the respective abbreviated sequence, whereby on execution of the set of code, each abbreviated sequence will call to an associated core function stored in the pattern dictionary.
10. A method according to any of claims 7 to 9 as dependent on claim 2, wherein the pattern dictionary is a part of the data structure.
11. A method according to claim 2 or any of claims 3 to 10 as dependent on claim 2, wherein the set of computer executable code is the result of a compilation step performed by a given compiler, and wherein the data structure can be used in identifying repeated sequences in further sets of computer executable code generated by the same compiler.
12. A computer program for performing the method of any preceding claim.
13. A data carrier carrying the computer program of claim 12.
14. A computing device having stored thereon the reduced set of computer executable code produced by the method of claim 9.
15. A computing device according to claim 14 having further stored thereon the pattern dictionary comprising core functions associated with repeated sequences within the said set of code.
16. A computing device according to claim 15 wherein the pattern dictionary is stored in such a way that it is always available to programs running on the device.
17. A computing device according to claim 15 or claim 16 wherein the pattern dictionary is stored in a dynamic link library.
18. A computing device according to any of claims 15 to 17 wherein the pattern dictionary is stored in execute-in-place memory.
19. A reduced set of computer executable code produced by the method of claim 9.
20. An operating system comprising the reduced set of computer executable code of claim 19.
21. A data carrier carrying the reduced set of computer executable code of claim 19.
PCT/GB2007/000716 2006-03-01 2007-03-01 Duplicate code detection WO2007099324A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
GB0604136.2 2006-03-01
GB0604136A GB0604136D0 (en) 2006-03-01 2006-03-01 Improvements related to the delivery of embedded software and usage of memory in a computing device
GB0703895.3 2007-02-28
GB0703895A GB2435703A (en) 2006-03-01 2007-02-28 Reducing the length of program code by identifying and removing repeated instruction sequences

Publications (1)

Publication Number Publication Date
WO2007099324A1 true WO2007099324A1 (en) 2007-09-07

Family

ID=38069337

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB2007/000716 WO2007099324A1 (en) 2006-03-01 2007-03-01 Duplicate code detection

Country Status (1)

Country Link
WO (1) WO2007099324A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013164306A1 (en) * 2012-05-03 2013-11-07 Gemalto Sa Method of loading an application in a secure device
US9110769B2 (en) 2010-04-01 2015-08-18 Microsoft Technology Licensing, Llc Code-clone detection and analysis
US10133559B2 (en) 2016-06-03 2018-11-20 International Business Machines Corporation Generating executable files through compiler optimization

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2000019309A2 (en) * 1998-09-30 2000-04-06 Conexant Systems, Inc. Dynamic microcode for embedded processors
FR2785695A1 (en) * 1998-11-06 2000-05-12 Bull Cp8 METHOD FOR COMPACTING AN EXECUTABLE INTERMEDIATE OBJECT-CODE PROGRAM IN AN ON-BOARD SYSTEM PROVIDED WITH DATA PROCESSING RESOURCES, COMPACTOR SYSTEM AND CORRESPONDING MULTI-APPLICATION ONBOARD SYSTEM
WO2003088039A2 (en) * 2002-04-15 2003-10-23 Giesecke & Devrient Gmbh Optimisation of a compiler generated program code
US20050060697A1 (en) * 2003-09-17 2005-03-17 Nokia Corporation Method and a device for abstracting instruction sequences with tail merging


Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
BESZEDES A ET AL: "Survey of code-size reduction methods", ACM COMPUTING SURVEYS, ACM, NEW YORK, NY, US, US, vol. 35, no. 3, September 2003 (2003-09-01), pages 223 - 267, XP002425677, ISSN: 0360-0300 *
COOPER K D ET AL: "ENHANCED CODE COMPRESSION FOR EMBEDDED RISC PROCESSORS", ACM SIGPLAN NOTICES, ACM, ASSOCIATION FOR COMPUTING MACHINERY, NEW YORK, NY, US, vol. 34, no. 5, May 1999 (1999-05-01), pages 139 - 149, XP000832669, ISSN: 0362-1340 *
DE SUTTER B ET AL: "Link-time binary rewriting techniques for program compaction", ACM TRANSACTIONS ON PROGRAMMING LANGUAGES AND SYSTEMS ACM USA, vol. 27, no. 5, September 2005 (2005-09-01), pages 882 - 945, XP002439517, ISSN: 0164-0925 *
DEBRAY S K ET AL: "COMPILER TECHNIQUES FOR CODE COMPACTION", ACM TRANSACTIONS ON PROGRAMMING LANGUAGES AND SYSTEMS, ACM, NEW YORK, NY, US, vol. 22, no. 2, March 2000 (2000-03-01), pages 378 - 415, XP007900422, ISSN: 0164-0925 *
ZASTRE M J: "COMPACTING OBJECT CODE VIA PARAMETERIZED PROCEDURAL ABSTRACTION", THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE IN THE DEPARTMENT OF COMPUTER SCIENCE UNIVERSITY OF VICTORIA, 1995, pages I - VII,1, XP001199580 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9110769B2 (en) 2010-04-01 2015-08-18 Microsoft Technology Licensing, Llc Code-clone detection and analysis
WO2013164306A1 (en) * 2012-05-03 2013-11-07 Gemalto Sa Method of loading an application in a secure device
US10133559B2 (en) 2016-06-03 2018-11-20 International Business Machines Corporation Generating executable files through compiler optimization
US10599406B2 (en) 2016-06-03 2020-03-24 International Business Machines Corporation Generating executable files through compiler optimization


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 07705301

Country of ref document: EP

Kind code of ref document: A1