GB2552773A - Optimisation - Google Patents


Info

Publication number
GB2552773A
GB2552773A
Authority
GB
United Kingdom
Prior art keywords
optimiser
instruction
function
processor
inspected
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
GB1612035.4A
Other versions
GB201612035D0 (en)
Inventor
Storey Gregory
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Aspartech Ltd
Original Assignee
Aspartech Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Aspartech Ltd filed Critical Aspartech Ltd
Priority to GB1612035.4A priority Critical patent/GB2552773A/en
Publication of GB201612035D0 publication Critical patent/GB201612035D0/en
Publication of GB2552773A publication Critical patent/GB2552773A/en
Withdrawn legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45504Abstract machines for programme code execution, e.g. Java virtual machine [JVM], interpreters, emulators
    • G06F9/45516Runtime code conversion or optimisation
    • G06F9/45525Optimisation or modification within the same instruction set architecture, e.g. HP Dynamo
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/40Transformation of program code
    • G06F8/41Compilation
    • G06F8/44Encoding
    • G06F8/443Optimisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/40Transformation of program code
    • G06F8/41Compilation
    • G06F8/44Encoding
    • G06F8/443Optimisation
    • G06F8/4441Reducing the execution time required by the program code
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/3017Runtime instruction translation, e.g. macros
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45504Abstract machines for programme code execution, e.g. Java virtual machine [JVM], interpreters, emulators
    • G06F9/45516Runtime code conversion or optimisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Advance Control (AREA)

Abstract

The present teachings can provide an execution optimiser for a programmable processor. The optimiser can comprise: a code inspector configured to access an instruction compiler to inspect an instruction for the processor being handled by the compiler; a comparator configured to compare the inspected instruction to a record of previously defined optimiser functions; and an interceptor configured to, if the comparator identifies the inspected instruction as corresponding to an optimiser function from the record of previously defined optimiser functions, intercept the inspected instruction and to perform the corresponding optimiser function instead of allowing the inspected instruction to reach the processor and return the result of the optimiser function as the result of the inspected instruction.

Description

(54) Title of the Invention: Optimisation
Abstract Title: Processor optimisation using stored optimised functions
(57) The present teachings can provide an execution optimiser for a programmable processor. The optimiser can comprise: a code inspector configured to access an instruction compiler to inspect an instruction for the processor being handled by the compiler; a comparator configured to compare the inspected instruction to a record of previously defined optimiser functions; and an interceptor configured to, if the comparator identifies the inspected instruction as corresponding to an optimiser function from the record of previously defined optimiser functions, intercept the inspected instruction and to perform the corresponding optimiser function instead of allowing the inspected instruction to reach the processor and return the result of the optimiser function as the result of the inspected instruction.
Figure GB2552773A_D0001 (Sheet 1/11), FIG 1: schematic of elements of a programmable computer (PROCESSOR, CACHE, MEMORY, FAST I/O, CHIPSET, STORAGE, BIOS).
Figure GB2552773A_D0002 (Sheet 2/11), FIG 2: logical elements for utilising an optimiser function (including MEMORY).
Figure GB2552773A_D0003 (Sheet 3/11), FIG 3: flow chart of representative steps for utilising an optimiser function (S3-1 onwards).
Figure GB2552773A_D0004 (Sheet 4/11), FIG 4: flow chart of representative steps for inspecting a compiler.
Figure GB2552773A_D0005 (Sheet 5/11), FIG 5: schematic of an optimiser function library.
FIG 6 (Sheet 6/11): utilising an optimiser function; optimiser function entries L0-LF are combined with operand entries 00-0F to produce the result/output.
Figure GB2552773A_D0006 (Sheet 7/11), FIG 7: function creator; a NEW FUNCTION is created from an INSPECTED INSTRUCTION AND OUTCOME.
Figure GB2552773A_D0007 (Sheet 8/11), FIG 8: logical elements for utilising a divider and combiner (PROCESSOR CORE, MEMORY, COMBINER).
Figure GB2552773A_D0008 (Sheet 9/11), FIG 9: schematic of an instruction.
FIG 10 (Sheet 10/11): schematic of a divided instruction.
Figures GB2552773A_D0009 and GB2552773A_D0010 (Sheet 11/11), FIG 11: logical elements for utilising a divider and combiner, and an optimiser function (APPLICATION, OPERATING SYSTEM, INTERCEPTOR, OPTIMISER FUNCTION LIBRARY, FUNCTION CREATOR, PROCESSOR CORE, MEMORY, COMBINER).
Application No. GB1612035.4
RTM
Date: 30 November 2016
Intellectual Property Office
The following terms are registered trade marks and should be read as such wherever they occur in this document:
Macos
Windows
Linux
ARM
Zilog
MIPS
Intel Core
Radeon
GeForce
Nvidia
Nvidia Quadro
Intellectual Property Office is an operating name of the Patent Office www.gov.uk/ipo
OPTIMISATION
FIELD AND BACKGROUND

[0001] The present disclosure relates to optimisation, and in particular but not exclusively to optimisation of processor execution efficiency.
[0002] In a programmable computer, one or more processors may receive instructions relating to an operating system of the programmable computer or instructions relating to an application running in such an operating system. Typically the instructions relating to the application or operating system are converted from source code or object code of the operating system or application into machine code or assembler code of the processor.
[0003] The machine code or assembler code instructions are then carried out by one or more processor cores and/or processor pipelines of the one or more processors according to the instruction handling capabilities of the one or more processors. The result of carrying out the instructions is then written to a memory location, from which it can be utilised in further instructions and/or passed to one or more other hardware elements of the computer, such as a display, audio output device or data output device.
SUMMARY

[0004] Particular aspects and embodiments are set out in the appended claims.
[0005] Viewed from one perspective, there can be provided an execution optimiser for a programmable processor, the optimiser comprising: a code inspector configured to access an instruction compiler to inspect an instruction for the processor being handled by the compiler; a comparator configured to compare the inspected instruction to a record of previously defined optimiser functions; and an interceptor configured to, if the comparator identifies the inspected instruction as corresponding to an optimiser function from the record of previously defined optimiser functions, intercept the inspected instruction and to perform the corresponding optimiser function instead of allowing the inspected instruction to reach the processor and return the result of the optimiser function as the result of the inspected instruction. Thereby, the present approach can provide for a processor to be bypassed in the carrying out of an operation, thus replacing an execution delay for the instruction through the processor with an optimiser function duration, which is expected to be shorter than the execution delay.
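The comparator/interceptor arrangement described above can be sketched minimally in Python. All names here (the record keys, the example opcodes, the helper functions) are illustrative assumptions for exposition, not taken from the patent:

```python
# Hypothetical sketch of the comparator/interceptor arrangement.
# Record of previously defined optimiser functions, keyed by the
# inspected instruction's opcode and operand count.
OPTIMISER_RECORD = {
    ("SQUARE", 1): lambda x: x * x,            # replaces a multiply sequence
    ("SUM_N", 1): lambda n: n * (n + 1) // 2,  # replaces an add loop
}

def execute_on_processor(opcode, *args):
    """Fallback: the (slower) path through the processor itself."""
    if opcode == "SQUARE":
        return args[0] * args[0]
    if opcode == "SUM_N":
        return sum(range(args[0] + 1))
    raise ValueError(opcode)

def optimised_execute(opcode, *args):
    """Comparator + interceptor: if the inspected instruction matches a
    recorded optimiser function, run that instead of the processor and
    return its result as the result of the inspected instruction."""
    fn = OPTIMISER_RECORD.get((opcode, len(args)))
    if fn is not None:
        return fn(*args)  # intercepted: the processor is bypassed
    return execute_on_processor(opcode, *args)
```

Both paths produce the same outcome for the same arguments; the intercepted path simply avoids the instruction reaching the (simulated) processor.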
[0006] In some examples, the code inspector is configured to inspect the instruction at each of multiple stages of the instruction compiler and to use the inspection result at one stage of the compiler to identify one or more candidate optimiser functions from the record of previously defined optimiser functions and subsequently to use the inspection result at a subsequent stage of the compiler to reduce the number of candidate optimiser functions. Thereby a suitable optimiser function can be uniquely selected for use before or at the same time that the compiler finishes preparing the instruction for the processor.
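The staged narrowing of candidates can be illustrated with a small filter over a candidate record; the field names and candidate entries below are assumptions for illustration only:

```python
# Illustrative candidate record of optimiser functions with the
# properties each one requires of the inspected instruction.
CANDIDATES = [
    {"name": "square_int", "opcode": "MUL", "operand_width": 32},
    {"name": "square_float", "opcode": "MUL", "operand_width": 64},
    {"name": "sum_loop", "opcode": "ADD", "operand_width": 32},
]

def narrow(candidates, **stage_facts):
    """Keep only the candidates consistent with what the current
    compiler stage has revealed about the instruction."""
    return [c for c in candidates
            if all(c.get(k) == v for k, v in stage_facts.items())]

# Stage 1 of the compiler reveals the opcode: two candidates remain.
stage1 = narrow(CANDIDATES, opcode="MUL")
# A later stage reveals the operand width: one candidate remains,
# uniquely selecting an optimiser function before compilation completes.
stage2 = narrow(stage1, operand_width=32)
```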
[0007] In some examples, each optimiser function of the previously defined optimiser functions describes an operation that provides the same outcome for any given argument data as one or more processor instructions provided with the same argument data. Thereby, the optimiser function can be applied regardless of the argument data upon which the processor instruction is to operate.
[0008] In some examples, each optimiser function of the previously defined optimiser functions comprises a definition that describes a functionality of the optimiser function and a logic sequence that can be implemented to carry out the optimiser function. Thereby the optimiser function can both identify its function and provide the means for carrying out that function in an efficient manner.
[0009] In some examples, the interceptor is configured to intercept the inspected instruction by intercepting a sequence of processor operations output by the compiler based on the inspected instruction. Thereby the interceptor can accurately prevent the processor from receiving the instruction after handling of the instruction by the compiler.
[0010] In some examples, the interceptor is configured to perform an optimiser function by causing both a logic sequence of the optimiser function and argument data for the inspected instruction to be written to a common location in a volatile memory connected to the processor when the processor performs a memory refresh cycle for that location in volatile memory. Thereby the optimiser function can be carried out automatically when a memory refresh cycle occurs as part of the normal execution of the processor.
[0011] In some examples, the optimiser is configured to return the result of the optimiser function by causing the output of both the logic sequence and argument data for the inspected instruction to be written to a location in a volatile memory to which the outcome of the inspected instruction would have been written if carried out by the processor. Thereby the optimiser function can provide a result that is indistinguishable from a processor-produced result.
[0012] In some examples, the execution optimiser further comprises a function creator configured to, if the comparator identifies the inspected instruction as not corresponding to an optimiser function from the record of previously defined optimiser functions, analyse the intercepted instruction to determine whether a new optimiser function can be defined therefrom. In some examples, the function creator is configured to, if it is determined that a new optimiser function can be defined from the inspected instruction, define an optimiser function therefrom. Thereby a new optimiser function can be prepared when no suitable function exists already.
[0013] In some examples, the function creator is configured to create a new optimiser function only if the new optimiser function is different to all existing optimiser functions. Thereby a set of optimiser functions which is unique for the particular processor and implementation can be created based upon the activities requested of the processor.
[0014] In some examples, the function creator is configured to create a new optimiser function that replaces a previous optimiser function and corresponds to the intercepted instruction. Thereby the set of optimiser functions may be refined over time to both improve upon existing optimiser functions to achieve further optimisation and to replace an optimiser function which has been determined to produce an error result.
[0015] In some examples, the function creator is configured to create a new optimiser function by combining a group of function building blocks representative of a rule, a nature of the calculation and an adaptive output to create both a definition of the optimiser function and a logic sequence of the optimiser function. Thereby the optimiser function can be built from a toolkit of function components which are representative of the instruction set operations to which the processor instructions conform.
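Composition from building blocks can be sketched as a function factory that joins a rule (when the function applies), the nature of the calculation, and an adaptive output step. The particular blocks below (a closed-form sum and a 32-bit clamp) are hypothetical examples, not blocks defined by the patent:

```python
# Hypothetical function creator: combine building blocks into both a
# usable optimiser function and, implicitly, its definition.
def make_optimiser_function(rule, calculation, adapt_output):
    def optimiser_function(operand):
        if not rule(operand):  # the rule says when this function applies
            raise ValueError("rule does not apply to this operand")
        return adapt_output(calculation(operand))
    return optimiser_function

# Example building blocks (assumed for illustration):
rule = lambda n: isinstance(n, int) and n >= 0  # non-negative integers only
calculation = lambda n: n * (n + 1) // 2        # closed form for 1 + 2 + ... + n
adapt_output = lambda r: r & 0xFFFFFFFF         # adapt to a 32-bit result

sum_to_n = make_optimiser_function(rule, calculation, adapt_output)
```

A call such as `sum_to_n(100)` then bypasses the loop of additions the processor would otherwise perform.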
[0016] In some examples, the compiler is an operating system compiler and wherein the inspected instruction is an instruction from the operating system or an instruction from an application running on the operating system. In some examples, the instruction conforms to one selected from the group comprising: a CISC instruction set instruction; a RISC instruction set instruction; a x86 instruction set instruction; an ARM instruction set instruction; and a GPU instruction set instruction. Thus the present approaches can be applied on a platform-independent basis to optimise according to the context in which it is deployed.
[0017] In some examples, the execution optimiser comprises a code module configured to run from within a computer BIOS, EFI, UEFI or OpenFirmware, or to run from within a hardware driver. Thereby the present approaches may be run at a very low level within the system and thus interact with the compiler and processor in an efficient manner.
[0018] Viewed from a second perspective, there can be provided an execution optimiser for a programmable processor, comprising: a code inspector configured to access an instruction compiler to inspect an instruction for the processor being handled by the compiler; a divider configured to identify parts of the inspected instruction that can be executed as separate operations and to create a set of separate operations for execution by the processor; and a combiner configured to combine processor outcomes from the set of separate operations into a result corresponding to a result that would have been returned by the processor carrying out the inspected instruction. Thereby an operation stream may be divided in such manner that the overall set executes faster than had the operation scheduling been based directly upon operations as output from the compiler.
[0019] In some examples, the divider is further configured to identify parts of the inspected instruction by identifying connectives between parts of the inspected instruction. Thereby the present approaches can divide the operations while taking operation linkages into account.
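The divider and combiner of this perspective can be sketched minimally. In this illustration the `+` connective separates independent parts of a toy arithmetic instruction; the instruction encoding and all function names are assumptions for exposition, not from the patent:

```python
# Illustrative divider: split an instruction into separate operations
# at its connectives, execute each part, then combine the outcomes.
def divide(instruction):
    """Identify parts of the instruction separated by the '+' connective."""
    return [part.strip() for part in instruction.split("+")]

def execute_part(part):
    """Stand-in for the processor executing one separate operation."""
    a, op, b = part.split()
    a, b = int(a), int(b)
    return a * b if op == "*" else a - b

def combine(outcomes):
    """Combiner: assemble the separate outcomes into the overall result
    that the undivided instruction would have produced."""
    return sum(outcomes)

parts = divide("2 * 3 + 10 - 4 + 5 * 5")
result = combine(execute_part(p) for p in parts)
```

Because the three parts share no data, they could equally be handed to separate pipelines before the combiner runs.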
[0020] In some examples, the divider is further configured to identify a presence or absence of dependencies between the parts of the inspected instruction. Thereby the present approaches can divide the operations while taking operation dependencies into account.
[0021] In some examples, the divider is further configured to schedule parts of the inspected operation having no dependency therebetween as separate operations which can be performed in series or in parallel by the processor. Thereby, the operations can be scheduled to be performed as soon as there is a space in a processor pipeline.
[0022] In some examples, the divider is further configured to schedule parts of the inspected operation having a dependency therebetween as separate operations which can be performed on one or more pipelines of the processor based upon a scheduling constraint to cause the result of a first separate operation required as an input for a second separate operation to be available to the second separate operation. Thereby, the operations can be scheduled to be performed in a relevant order that preserves outcome dependency, while at the same time using earliest available gaps in a processor pipeline.
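Dependency-respecting scheduling of this kind amounts to grouping operations into waves: operations with no dependency between them share a wave (and could run in parallel), while a dependent operation waits until every operation it needs has completed. A minimal sketch, with illustrative operation names:

```python
# Illustrative wave scheduler for the separate operations produced by
# the divider. ops is a list of operation names; deps maps an operation
# to the set of operations whose results it requires first.
def schedule(ops, deps):
    remaining = set(ops)
    done, waves = set(), []
    while remaining:
        # Every operation whose prerequisites are all done joins this wave.
        wave = sorted(op for op in remaining if deps.get(op, set()) <= done)
        if not wave:
            raise ValueError("cyclic dependency between operations")
        waves.append(wave)
        done |= set(wave)
        remaining -= set(wave)
    return waves

# Operations a and b are independent; c needs both of their results.
waves = schedule(["a", "b", "c"], {"c": {"a", "b"}})
```

Here `a` and `b` occupy the first wave and may fill pipeline gaps in parallel; `c` is constrained to the second wave so its inputs are available.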
[0023] Viewed from a third perspective, there can be provided an execution optimiser that combines the approaches outlined from the first and second perspectives above. In this arrangement, the optimiser is configured to operate the divider and combiner if the comparator identifies the inspected instruction as not corresponding to an optimiser function from the record of previously defined optimiser functions. Thereby multiple optimisation approaches can be used in a synergistic manner to deliver efficiency by applying an operation streamlining technique if an operation bypassing technique is not applicable in a given instance.
[0024] In some examples, where a function creator is deployed, the optimiser can be configured to operate the divider and combiner if the function creator determines that no new optimiser function can be defined from the inspected instruction. Thereby the approach allows for a most appropriate optimisation technique to be applied depending upon the circumstances.
[0025] Viewed from a fourth perspective, there can be provided a method of execution optimisation for a programmable processor, the method comprising: accessing an instruction compiler to inspect an instruction for the processor being handled by the compiler; comparing the inspected instruction to a record of previously defined optimiser functions; and if the comparing identifies the inspected instruction as corresponding to an optimiser function from the record of previously defined optimiser functions: intercepting the inspected instruction; performing the corresponding optimiser function instead of allowing the inspected instruction to reach the processor; and returning the result of the optimiser function as the result of the inspected instruction. Thereby, the present approach can provide for a processor to be bypassed in the carrying out of an operation, thus replacing an execution delay for the instruction through the processor with an optimiser function duration, which is expected to be shorter than the execution delay.
[0026] Viewed from a fifth perspective, there can be provided a method of execution optimisation for a programmable processor, comprising: accessing an instruction compiler to inspect an instruction for the processor being handled by the compiler; identifying parts of the inspected instruction that can be executed as separate operations and creating a set of separate operations for execution by the processor; and combining processor outcomes from the set of separate operations into a result corresponding to a result that would have been returned by the processor carrying out the inspected instruction. Thereby an operation stream may be divided in such manner that the overall set executes faster than had the operation scheduling been based directly upon operations as output from the compiler.
[0027] Viewed from a sixth perspective, there can be provided an execution optimisation method that combines the approaches outlined from the fourth and fifth perspectives above. In this approach, identifying and combining are performed if the comparing identifies the inspected instruction as not corresponding to an optimiser function from the record of previously defined optimiser functions. Thereby multiple optimisation approaches can be used in a synergistic manner to deliver efficiency by applying an operation streamlining technique if an operation bypassing technique is not applicable in a given instance.
[0028] Viewed from another perspective, there can also be provided a computer program product comprising processor-executable instructions for causing a programmable computer to become configured as the execution optimiser as outlined above, or to perform the method as outlined above.
BRIEF DESCRIPTION OF THE FIGURES

[0029] Embodiments of the present teachings will be described hereinafter, by way of example only, with reference to the accompanying drawings in which like reference signs relate to like elements and in which:
[0030] Figure 1 is a schematic representation of elements of a programmable computer;
[0031] Figure 2 is a schematic representation of logical elements for utilising an optimiser function;
[0032] Figure 3 is a flow chart illustrating representative steps for utilising an optimiser function;
[0033] Figure 4 is a flow chart illustrating representative steps for inspecting a compiler;
[0034] Figure 5 is a schematic representation of an optimiser function library;
[0035] Figure 6 is a schematic representation of utilising an optimiser function;
[0036] Figure 7 is a schematic representation of a function creator;
[0037] Figure 8 is a schematic representation of logical elements for utilising a divider and combiner;
[0038] Figure 9 is a schematic representation of an instruction;
[0039] Figure 10 is a schematic representation of a divided instruction;
[0040] Figure 11 is a schematic representation of logical elements for utilising a divider and combiner, and an optimiser function.
[0041] While the present teachings are susceptible to various modifications and alternative forms, specific embodiments are shown by way of example in the drawings and are herein described in detail. It should be understood, however, that the drawings and detailed description thereto are not intended to limit the scope to the particular form disclosed, but on the contrary, the scope is to cover all modifications, equivalents and alternatives falling within the spirit and scope defined by the appended claims.
DETAILED DESCRIPTION

[0042] Embodiments and examples are described hereafter, by way of example only, with reference to the accompanying drawings.
[0043] The present teachings present a number of approaches for increasing the rate at which a processor of a programmable computer can provide outputs from operations passed to the processor. The operations passed to the processor may, for example, relate to operation of the programmable computer (such as instructions arising from program elements within an operating system) or to an application running on an operating system of the programmable computer. The operating system could be any computer operating system, such as a Windows operating system, a MacOS operating system, a Unix or Unix-like operating system, a Linux or Linux-like operating system or the like.
[0044] As the skilled reader will appreciate, an operating system or application instruction such as to calculate a new icon location on a display or to perform a given calculation will be broken down by a compiler into a set of processor operations that when carried out together cause the operating system or application instruction to be performed. The processor operations output by the compiler will conform to at least a part of the instruction set which the processor is capable of performing. A variety of processor instruction sets exist, each corresponding to one or more hardware processor designs. Examples of such instruction sets include CISC (complex instruction set computing) instruction sets, such as the x86 instruction set and its extensions used by various Intel, AMD, VIA and Cyrix processors (although the processor cores of some x86 compatible processors run as a RISC core implementing instructions from the x86 instruction set), the 6800/68000 instruction set utilised by various Motorola processors, and the Z8/Z80/Z8000 instruction set utilised by various Zilog processors. Further example instruction sets may be RISC (reduced instruction set computing) instruction sets such as the ARM instruction sets utilised by various ARM Architecture processors from a number of manufacturers, SPARC instruction sets utilised by various SPARC architecture processors from a number of manufacturers, and MIPS instruction sets utilised by various MIPS Technologies processors.
[0045] These processor operations are then carried out by one or more processor cores, each of which may include one or more processor pipelines, providing a result that can be used by further instructions and/or passed back as a result to the operating system or application instruction. Thus the operating system or application instruction is carried out as a sequence of processor operations to provide the result requested by the operating system or application instruction. For each operation, the processor will fetch or receive the operation, fetch the operand (argument) for that operation from memory (cache or RAM), perform the operation upon the operand, and output the result to an instructed location in memory.
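The per-operation cycle just described (fetch the operation, fetch its operand, perform it, write the result to the instructed location) can be sketched as a toy interpreter; the opcodes and the addressing scheme are assumptions for illustration:

```python
# Minimal sketch of the fetch-operand / perform / write-back cycle.
# operations is a list of (opcode, operand_address, result_address);
# memory is a simple address-to-value mapping standing in for cache/RAM.
def run(operations, memory):
    for opcode, src, dst in operations:
        operand = memory[src]        # fetch the operand from memory
        if opcode == "INC":
            result = operand + 1     # perform the operation on the operand
        elif opcode == "DOUBLE":
            result = operand * 2
        else:
            raise ValueError(opcode)
        memory[dst] = result         # output result to the instructed location
    return memory

# The result of one operation can feed the next via its memory location.
mem = run([("INC", 0, 1), ("DOUBLE", 1, 2)], {0: 5, 1: 0, 2: 0})
```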
[0046] The present techniques provide methods for causing the result of this sequence of processor operations to be provided faster than would occur if the sequence of instructions were passed to the processor for execution of that sequence of instructions.
[0047] According to a first approach described herein, an execution optimiser can inspect the compiler to view upcoming instructions before they have actually been output by the compiler as a sequence of operations, use the inspected instruction to identify a predefined optimiser function that is known to provide the correct output for the inspected instruction and then prevent the sequence of operations corresponding to the inspected instruction from reaching the processor core. Instead of allowing that sequence of operations to reach the processor, the optimiser function is used to cause the output from that sequence of operations to be written directly to an appropriate location in memory. In some examples, the execution optimiser provides this functionality by causing a logic sequence of the optimiser function to be written to the same location as the operand for the sequence of operations during a memory refresh cycle of the memory location holding the operand. This causes the operand to be combined with the optimiser logic sequence such as to provide the result of the sequence of operations in the memory location that previously held the operand. In some examples, the optimiser function can cause the result of the operand and the optimiser function logic sequence to be written to a memory location other than the location that previously held the operand. Thus the processing of the operations by the processor can be bypassed and the result provided as a result of the regular memory refresh cycles.
[0048] According to another approach described herein, an execution optimiser can inspect the compiler to view upcoming instructions before they have actually been output by the compiler as a sequence of operations, use the inspected instruction to identify opportunities for separate operation of the operations making up the inspected instruction, replace the sequence of operations with the separate operations, wait for the processor to execute the separate operations and then assemble the output from the separate operations into the result that would have occurred had the separate operations been carried out as the sequence. This allows the separate operations to be executed in parallel such that the assembled overall result can be made available sooner than if all operations in the sequence were carried out as a single sequence.
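This divide-execute-assemble pattern can be illustrated with standard-library parallelism. The chunked-sum workload, pool size and function names below are illustrative assumptions, not the patent's mechanism:

```python
# Sketch of the second approach: run independent separate operations
# in parallel, then assemble their outputs into the result the original
# sequence of operations would have produced.
from concurrent.futures import ThreadPoolExecutor

def partial_sum(lo, hi):
    """One separate operation: the sum of the integers lo..hi-1."""
    return sum(range(lo, hi))

def parallel_sum(n, chunks=4):
    """Divide the range 0..n-1 into independent chunks, execute each
    chunk separately, and combine the partial outcomes."""
    step = n // chunks
    bounds = [(i * step, (i + 1) * step if i < chunks - 1 else n)
              for i in range(chunks)]
    with ThreadPoolExecutor(max_workers=chunks) as pool:
        partials = pool.map(lambda b: partial_sum(*b), bounds)
    return sum(partials)  # the combiner assembles the overall result
```

The assembled result is identical to the sequential one; any speed-up comes from the chunks occupying otherwise idle execution resources.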
[0049] With reference to Figure 1, there is shown a schematic illustration of elements of a programmable computer in which techniques of the present disclosure can be implemented. A processor 10 includes a cache memory 12. The processor 10 is also connected to a memory 14 such as a random access memory (RAM). The cache 12 and memory 14 are used to store both data and instructions/operations that are waiting for execution by the processor and recently executed/output by the processor. The cache 12 and memory 14 may also be used to store data and instructions/operations that are indicated by the operating system and/or an application as being likely to be needed again soon by the processor.
[0050] The cache 12 and the memory 14 are volatile memory in that they require power to be supplied to the memory for the memory to retain its content. In addition, they require regular memory refresh cycles to maintain the content in the memory. Such refresh cycles are commonly carried out by a memory management circuit and may be controlled by, or in synchronisation with, the processor. In the context of the memory refresh cycles common to DRAM memory, the memory refresh cycle used for this operation could be any or a combination of a RAS (Row Address Strobe, or row select) only refresh, a CAS (Column Address Strobe, or column select) before RAS refresh, or a hidden CAS before RAS refresh.
[0051] Also connected to the processor may be a fast I/O interface 16, such as a PCIe interface via a PCIe root complex. The processor may also be connected to a so-called chipset 18 which provides the processor with a single interface to slower access elements such as storage 20, BIOS 22 and regular I/O 24. The storage 20 is non-volatile storage such as a magnetic storage device or a FLASH storage device. The storage 20 is used for persistent storage of the operating system, applications and data when the programmable computer is turned off. Content from the storage 20 may be transferred to the cache 12 or memory 14 for use in operation of the operating system and applications and then written back to the storage 20 to enable changes to be recorded to persistent memory.
[0052] The BIOS (Basic Input/Output System) provides hardware initialisation when the programmable computer is started and provides a route for the processor to interact with hardware elements before the operating system has loaded the drivers for facilitating direct interaction with those hardware elements as a native property of the operating system. In some computer architectures, the BIOS may be replaced by another element with similar functionality such as EFI (extensible firmware interface), UEFI (unified extensible firmware interface) or OpenFirmware or similar.
[0053] Thus the approaches of the present teachings can be implemented to be stored in nonvolatile storage 20 for activation when the programmable computer is started. In some examples, part of the execution optimiser can be run from a location within the BIOS (or EFI, UEFI or OpenFirmware or similar) so as to be treated by the programmable computer as a hardware element. For example, parts of the optimiser that interact with data flows to and from a processor and/or memory may be implemented in this way.
[0054] In addition, part of the execution optimiser may be run from the cache 12 or memory 14 during operation of the programmable computer. For example, functionality to define new optimiser functions may be run from such locations.
[0055] Referring now to Figure 2, logical elements in the structure for implementing an execution optimiser according to the present techniques are shown. The execution optimiser of this example performs inspection of incoming instructions, uses those to identify optimiser function(s) that can be used to provide the same result as the processor operations generated to perform the incoming instructions, and replaces the processor operations with the optimiser function(s).
[0056] Thus, as shown in Figure 2, an optional application 30 may run on an operating system 32 of a programmable computer. The operating system 32 passes instructions, which may be from the operating system, an optional application or both, to a compiler 34. The compiler 34 interprets the operating system instructions into machine code or assembler code operations to be carried out by a processor.
[0057] According to the present approach, the activities of the compiler are watched by an inspector 36. The inspector 36 is able to determine by inspecting the compiler 34 what instructions are being processed into what sequences of processor operations. In one example, the inspector is, or is in communication with, a debug output of the compiler 34. Using this approach, the compiler outputs, via a debug output, information about the processing performed at each stage of the compiler. This enables the inspector to receive, in parallel with the operation of the compiler, information on the processor operations that the compiler will be outputting for execution by the processor. This part of the approach can be driven by cycles inherent to operation of the compiler such that the compiler neither works harder nor takes more time to perform its tasks while outputting the debug information.
[0058] The inspected instruction information from the inspector 36 is passed to a comparator 38. The comparator 38 compares the information about the inspected instruction to a plurality of predetermined optimiser functions stored in an optimiser function library 40. If the comparator identifies one or more optimiser functions that can be used to replace part or all of the sequence of processor operations output from the compiler based on the inspected instruction, the comparator selects that optimiser function and notifies an interceptor 42 to prevent the processor instructions that the optimiser function replaces from reaching the processor. Instead, the selected optimiser function is caused to operate on the data (operand or argument) that the intercepted processor operations were due to use.
[0059] In one example, the inspector 36 may be provided by the compiler 34 outputting debug information to a predetermined set of memory locations from which the comparator 38 can retrieve that debug information. The inspector 36 may also include a function to activate the debug output functionality of the compiler 34. In another example, the inspector 36 may be provided by a function that accesses or retrieves the debug output information from the compiler and extracts, from that debug output information, information describing the inspected instruction and/or processor operations being prepared by the compiler and makes this information available to the comparator 38 without other debug output information unrelated to the inspected instruction and/or processor operations.
[0060] If one or more processor operations output by the compiler are not to be replaced by an optimiser function, the interceptor allows these to pass and they are then executed by a processor core 44 in a conventional manner. The processor core 44 may correspond to a processor core or processor pipeline of the processor 10 illustrated in Figure 1. The results of such operations as executed by the processor core 44 are output by the processor core 44 to a memory 46. The memory 46 may correspond to either or both of the cache 12 and memory 14 illustrated in Figure 1.
[0061] If one or more processor operations output by the compiler are to be replaced by an optimiser function, the interceptor 42 prevents those processor operations from reaching the processor core 44 and instead writes a logic sequence of the optimiser function to a memory location in the memory 46. The memory location for the logic sequence of the optimiser function is chosen so that upon the next memory refresh cycle of the memory 46, the logic sequence and the data (operand or argument) that the intercepted processor operations were due to use are refreshed into the same memory location. This causes the logic sequence to act as a logic gate on the data, such that the data present in the memory after the memory refresh cycle is the data that the intercepted processor operations were due to use as operated on by the logic sequence of the optimiser function. Although it is likely in many implementations that the location of the operated-on data after the memory refresh would be the same as the location of the data prior to the refresh, the technique of applying the logic sequence in this manner would also work if these two locations are different memory locations.
[0062] Thus it is understood that a logic sequence of an optimiser function can be used to replace a sequence of processor operations to achieve the same result upon the data (operand or argument) upon which those processor operations were to operate without the processor needing to carry out any processor operation. The only involvement of the processor is any involvement that it would have had in any case in the memory refresh operation. Thereby the result of the processor operations can be obtained faster than waiting for the whole sequence to pass through the processor pipeline and the instruction workload for the processor is reduced. Accordingly, a faster execution time can be achieved for the inspected instruction and an instruction occurring after the inspected instruction can be executed sooner.
[0063] Furthermore, as is also illustrated in Figure 2, a function creator 48 may optionally be provided to enable new optimiser functions to be defined. Such new optimiser functions could be new optimiser functions for situations where no optimiser function yet exists and/or could be new optimiser functions that supersede previously defined optimiser functions. As shown in Figure 2, an output of the comparator 38 may be provided to the function creator 48. This may occur only in cases where the comparator finds no existing optimiser function in the library 40, or it may occur on an ongoing basis so as to provide for the possibility of creating new superseding functions. This is discussed in more detail with reference to Figure 6 below.
[0064] Turning now to Figure 3, there is shown an illustrative flowchart of steps carried out by the inspector/comparator/interceptor elements of the present approach.
[0065] The method starts at step S3-1 with the inspection of the instruction compiler. Then, at step S3-3 the information about the upcoming instruction/processor operations is compared to the optimiser functions already existing in the library. If the upcoming instruction/processor operations match (step S3-5) an existing optimiser function, then at step S3-7 the necessary processor operations are intercepted and at step S3-9 the matching optimiser function is caused to operate on the operand for the intercepted processor operations. If on the other hand it is seen at step S3-5 that there is no match, then processing continues at step S3-11 and no interception is made such that the processor operations are allowed to reach the processor.
[0066] Thus it is understood that the interception is made conditional on whether or not an existing optimiser function exists. As will be seen from the discussion below, in some circumstances no optimiser function exists because no optimiser function has yet been defined for the combination of processor operations, and in other circumstances no optimiser function exists because no optimiser function can be defined for the combination of processor operations. As will also be seen from the discussion below, in the present implementations only one optimiser function will match a given sequence of processor operations such that no disambiguation of which optimiser function to use need be performed. If the system were implemented such that multiple matches could occur, then a disambiguation process to choose between different possible optimiser functions would be added.
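The conditional interception of steps S3-1 to S3-11 can be sketched, purely for illustration, as follows. The class names and data shapes are assumptions made for the sketch, not part of the described implementation; the logic sequence is modelled as a bitwise XOR mask, as discussed with reference to Figure 6.

```python
from dataclasses import dataclass, field

@dataclass
class OptimiserFunction:
    definition: str      # describes the instruction sequence it replaces (item 54)
    logic_sequence: int  # bit pattern applied as a logic gate (item 56)

    def apply(self, operand: int) -> int:
        # Bitwise addition without carry, i.e. XOR (see Figure 6 discussion)
        return operand ^ self.logic_sequence

@dataclass
class InspectedInstruction:
    signature: str       # identity of the instruction/processor-operation sequence
    operand: int
    operations: list = field(default_factory=list)

def handle_inspected_instruction(inspected, library, processor_queue):
    """Steps S3-1..S3-11: compare the inspected instruction to the library;
    intercept and apply a matching optimiser function, else pass through."""
    match = library.get(inspected.signature)      # steps S3-3 / S3-5
    if match is not None:
        return match.apply(inspected.operand)     # steps S3-7 / S3-9
    processor_queue.extend(inspected.operations)  # step S3-11: no interception
    return None
```

A matching instruction is thus answered without its operations ever entering the queue, while an unmatched instruction passes through unchanged.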
[0067] Turning now to Figure 4, there is shown an illustrative flowchart of steps carried out by the inspector/comparator elements of the present approach to determine whether an inspected instruction (or the sequence of processor operations that the compiler will output to complete the inspected instruction) matches a predefined optimiser function. The example illustrated with respect to Figure 4 assumes a compiler with three debug output stages, but the present approach can be adapted to compilers having more or fewer compiler stages by adding or removing cycles of reviewing the debug output and comparing this to a shortlist of possible optimiser functions.
[0068] As shown in Figure 4, the method starts at step S4-1 with receiving the debug output from the compiler first stage and then comparing that first stage compiler debug output to the library of optimiser functions in the library at step S4-3. At step S4-5 a test is performed to determine whether any optimiser functions in the library could match an inspected instruction that starts with the information revealed in the first stage compiler debug output. If one or more potential matches exist, then at step S4-7 a shortlist of possible optimiser functions is created.
[0069] Then, at step S4-9 the method continues with receiving the debug output from the compiler second stage and thereafter comparing that second stage compiler debug output to the shortlist of optimiser functions at step S4-11. At step S4-13 a test is performed to determine whether any optimiser functions in the shortlist could match an inspected instruction that continues with the information revealed in the second stage compiler debug output. If one or more potential matches exist, then at step S4-15 the shortlist of possible optimiser functions is modified to remove any optimiser functions that were possible matches after the first stage compiler debug output but are no longer possible after the second stage compiler debug output.
[0070] Then, at step S4-17 the method continues with receiving the debug output from the compiler third stage and thereafter comparing that third stage compiler debug output to the shortlist of optimiser functions at step S4-19. At step S4-21 a test is performed to determine whether any optimiser functions in the shortlist could match an inspected instruction that ends with the information revealed in the third stage compiler debug output. If a potential match exists, then at step S4-23 the method returns a result identifying or forwarding the matching optimiser function.
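The stage-by-stage narrowing of steps S4-1 to S4-25 can be sketched, for illustration only, as a progressive filter over the library. The representation of a library entry as a tuple of expected per-stage debug outputs is an assumption made for the sketch and is not part of the described implementation.

```python
def match_across_stages(stage_outputs, library):
    """Narrow a shortlist of candidate optimiser functions as each compiler
    stage's debug output arrives (steps S4-1..S4-23). `library` maps a
    function name to the tuple of stage outputs it expects."""
    shortlist = dict(library)  # step S4-3: start from the full library
    for stage, observed in enumerate(stage_outputs):
        # Steps S4-11 / S4-19: prune functions inconsistent with this stage
        shortlist = {name: stages for name, stages in shortlist.items()
                     if stages[stage] == observed}
        if not shortlist:
            return None  # step S4-25: no matching optimiser function exists
    # Step S4-23: in the present implementations at most one function survives
    return next(iter(shortlist))
```

Because candidates are eliminated as each stage's debug output arrives, the answer is available as soon as the final stage output is seen.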
As noted above, in the present implementations only one optimiser function will match a given sequence of processor operations such that no disambiguation of which optimiser function to use need be performed once all stages of the compiler have been inspected. If the system were implemented such that multiple matches could occur, then a disambiguation process to choose between different possible optimiser functions would be added.
[0071] On the other hand, if the test for matching optimiser functions at any of S4-5, S4-13 or S4-21 indicates that no possible matches exist, then at step S4-25 the method returns a result indicating that no matching optimiser function can be found.
[0072] As will be understood from the above description, this process is described in Figure 4 in terms of the steps for tracking a single instruction through the various compiler stages (and the process is adapted to refer to the number of compiler stages present in the particular compiler). Thus, as a compiler may be expected to process a number of instructions at one time, the process is therefore carried out at the same time as many times as there are instructions being processed by the compiler at once.
[0073] Thus it is understood that by processing the debug output of the compiler it is possible to predict the likely outputs of the compiler before the compiler actually provides the processor operations as an output and thus determine alongside the processing by the compiler an iteratively improved list of possible optimiser functions so as to be able to determine as soon as the compiler outputs the final stage debug output whether a matching optimiser function exists such that an interception is appropriate.
[0074] With reference to Figure 5 there is shown a schematic illustration of elements of an optimiser function library 40. As shown, the optimiser function library 40 stores n optimiser functions 52a, 52b, ..., 52n. Each optimiser function includes a definition 54 and a logic sequence 56. The definition 54 describes the functionality of the optimiser function and the logic sequence 56 provides an opcode that can be implemented to carry out the optimiser function. The combination of these two elements provides for the optimiser function to provide the effect of replacing a number of processor operations with the single optimiser function.
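For illustration only, the pairing of definitions 54 with logic sequences 56 could be sketched as a simple keyed store. The encoding of a logic sequence as a bit string is an assumption of the sketch, not part of the described implementation.

```python
class OptimiserFunctionLibrary:
    """Stores optimiser functions 52a..52n, each pairing a definition (54)
    describing the replaced processor-operation sequence with a logic
    sequence (56), here encoded as a plain bit string."""

    def __init__(self):
        self._functions = {}

    def add(self, definition: str, logic_sequence: str) -> None:
        # Definitions are unique, so a new entry supersedes any previous one
        # (see the discussion of superseding optimiser functions above)
        self._functions[definition] = logic_sequence

    def lookup(self, definition: str):
        # Returns the logic sequence for a definition, or None if absent
        return self._functions.get(definition)
```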
[0075] Each optimiser function has the property that it replaces a sequence of processor operations with a single logic sequence that always provides the same answer as that sequence of processor operations, whatever the data input (operand/argument) to that sequence of processor operations. This property provides that the optimiser function is independent of the data which the processor operations are to use as their input. Accordingly, the optimiser function can be selected based solely upon the inspection of instructions being processed by the compiler without reference to the subject data to which the instructions are applied.
[0076] Thus the definition 54 of each optimiser function 52 is used to identify and describe the particular instruction (i.e. sequence of processor operations) to which it applies.
[0077] As mentioned briefly above, the logic sequence 56 is used as a logic gate array to apply to the operand data for the sequence of processor operations. In effect therefore, the logic sequence is the answer to the set of data operations performed by the sequence of processor operations.
[0078] The application of the optimiser function logic sequence 56 to the operand for a given instance of the sequence of processor operations which the optimiser function 52 replaces is illustrated in Figure 6. As shown in Figure 6, the optimiser function logic sequence 56 is applied to the operand 58. This provides the output 59. In this example, for simplicity of illustration, the logic sequence 56 and operand 58 are 16 bits in size. However the present approaches can be used with the logic sequence and operand being of any length. Common operand data sizes can include 4 bits, 8 bits, 16 bits, 32 bits, 64 bits, 128 bits or larger and the present technique is applicable to any such size.
[0079] As will be understood from the discussion above, the optimiser function logic sequence has bit values for each of bits L1 through LF that provide the end-to-end answer of the processor operations that the optimiser function represents. Also, the operand has any bit values for each of bits O0 through OF, as these are the bit values of the data upon which the sequence of processor operations was to operate. The output 59 is created by writing both the optimiser function logic sequence 56 and the operand 58 to the same memory location. This has the effect of performing a direct bit-wise logical addition without carry operation between the two bit sequences. Thus the optimiser function logic sequence is in effect used as a logic gate array through which the operand is passed. This provides that the result is the same result that would have been provided by causing the processor to carry out the sequence of processor operations using that operand.
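The bit-wise logical addition without carry described above is the exclusive-OR operation. A minimal sketch, with the 16-bit width chosen purely to mirror the illustration of Figure 6:

```python
def apply_logic_sequence(logic_sequence: int, operand: int, width: int = 16) -> int:
    """Combine the optimiser function's logic sequence (bits L1..LF) with the
    operand (bits O0..OF) by bitwise addition without carry, i.e. XOR -
    modelling the effect of writing both bit patterns to the same memory
    location as described above."""
    mask = (1 << width) - 1
    return (logic_sequence ^ operand) & mask
```

A logic sequence of all zeros leaves the operand unchanged, while each set bit in the logic sequence inverts the corresponding operand bit.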
[0080] Thus it is understood that the optimiser function both describes its own functionality and provides a logic sequence or opcode that can be applied to carry out that functionality.
[0081] In the present examples, each optimiser function is defined to be unique as between all other optimiser functions available in the library such that for any given inspected instruction/sequence of processor operations, there can only be a maximum of one optimiser function. This provides that the execution optimiser doesn’t need to spend additional time disambiguating between multiple possibilities and therefore that the acceleration of performance achieved is as great as possible (although such disambiguation could be implemented at a risk of reduced efficiency). As mentioned above, new optimiser functions can be created that supersede previous optimiser functions.
[0082] With reference to Figure 7 there is shown a schematic illustration of elements of an optimiser function creator 48. As shown in this figure, an inspected instruction/sequence of processor operations is provided as an input to a combiner 60 that examines the inspected instruction and combines a mixture of rules 62, calculations 64 and adaptive outputs 66 to create, if possible, a logic sequence that, when applied as a logic gate array directly to any possible operand for the inspected instruction, provides the same output as would have been achieved by presenting the same operand to the inspected instruction by performing the sequence of processor operations corresponding to the inspected instruction. It is expected and has been determined that it is likely that some inspected operations cannot be reduced to an optimiser function as the output varies depending upon the operand and thus cannot be uniquely described by a single logic sequence. However, for situations where a logic sequence can be so defined, a new optimiser function is defined and includes the description of the functionality as well as the logic sequence.
[0083] Determining an optimiser function is based upon determining from the instruction input to the compiler what the outputs to be sent to the processor will be. The inspector module knows both the instruction input to and the sequence of instructions output from the compiler and can also see what is passed back up from the processor to the operating system and/or application(s). All of these can be compared and this comparison enables the function creator 48 to work out what the optimiser function corresponding to the instruction can be. This provides that, for an instruction observed once, a first optimiser function may be created, and that subsequent observations of the same (or similar) instruction allow that optimiser function to be refined so as to be applicable to multiple (or all) instances of that instruction (and in some cases also similar instructions).
[0084] For example, in the case of additive sums, all additive sums use an addition function, but they have different inputs and outputs. The inputs and the function are used together to get the output. This knowledge can be used to create a logic sequence (optimiser function) that when applied to any input always provides the output that executing the full operations through the processor would have yielded.
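By way of a hedged illustration of this learning step: where an instruction's effect happens to be expressible as a single fixed bitwise logic sequence, each observed (operand, result) pair implies a candidate mask, and agreement across observations confirms it. The following sketch is one assumed way this could be done, not the described implementation:

```python
def derive_logic_sequence(observations):
    """Try to derive a single logic sequence from observed (operand, result)
    pairs. Works only when the instruction's effect is a fixed bitwise
    transform; returns None when no single mask explains every observation,
    i.e. when the output varies with the operand in a way a single logic
    sequence cannot capture."""
    masks = {operand ^ result for operand, result in observations}
    if len(masks) == 1:
        return masks.pop()  # one mask explains all observations
    return None             # not reducible to an optimiser function
```

This also reflects the refinement behaviour described below: a new observation that contradicts the derived mask reveals that the current optimiser function is incomplete.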
[0085] This can be analogised, for example, to a mechanism for sorting pool balls in a pay-per-game pool table. Such a table requires all colour balls (regardless of colour) to go to the same place once they enter any pocket, but the white ball to be returned to a different location when it enters a pocket. In such a mechanical implementation the white ball has a different size or weight which allows it to be separated from the colour balls. Considered in the context of the present techniques, all balls are handled by the same optimiser function - the ball handling is the function and the colour is the argument. When the optimiser function is first defined, if it only sees “white ball” instructions, then the optimiser function will be defined to handle all balls as white balls. If a “colour ball” instruction is then received the function will handle the “colour ball” instruction as though it were a “white ball” instruction. This would be expected to trigger an error as the computer will find that there is something wrong with a result which prevents continuing with a sequence of instructions. For example, an error log may be generated for the problem function from the compiler. Such an error log would typically be accessible from the compiler using the debug function already being used to access the workings of the compiler.
[0086] Errors may be generated by the compiler as the compiler typically expects certain properties of the response from the processor and will notice if the response from the processor (including a response actually created by an optimiser function) does not match this expectation. For example, the compiler generally knows what form a response should take and may for example just need the data sorted (i.e., the arguments operated on). When the compiler spots that the result is wrong, it generates an error log (which shows up in the debug report) and re-issues the same instruction to try again. The compiler may know the erroneous nature of the data itself or may obtain this information when the output is passed back up to the operating system/application(s) from which the instruction originated. Thus the operating system or application(s) may for example send an error to the compiler indicating that the result doesn't match the expected data shape (format) of the result from the sent instruction and refuse to accept that result. The optimiser system notices the exact duplicate, knows that this must be because of the error (from the error log), and allows the instruction to run without interception.
[0087] The execution optimiser therefore can spot this repeat instruction based upon the error log and let it pass unmodified by the interceptor to the processor in case the resent instruction is representative of the error (it may for example be that an error is spotted after a number of instructions and thus it may be that multiple optimiser functions have been used in achieving the error state). By watching the result returned by the processor to the operating system/application, the execution optimiser can see whether the resent instruction was a coincidence (i.e. the result from the processor is the same as the result that the optimiser function would have created) or due to an error (i.e. the result from the processor is different to the result that the optimiser function would have created). If the result from the processor is different to the result that the optimiser function would have created, then the new output result can be used to refine the optimiser function - in this example to add a “colour ball” instruction to the optimiser function that handles all “ball” instructions. Thus the execution optimiser is able to self-learn improvements to the optimiser functions.
[0088] Thus the optimiser function blocks illustrated in Figure 7 will be understood as conceptual building blocks for the optimiser function. The rules 62 relate to fundamental operations that an optimiser function can carry out. These may be broadly representative, individually or in groups, of types of sequences of processor operations that a processor can carry out. The calculations 64 relate to actual processor operations that fall within those types of instruction and may relate more directly to operations of that type that the processor can carry out according to its instruction set. The adaptive outputs 66 relate to handling the fact that different sequences of operations can result in different outcomes based upon different arguments and thus provide that an optimiser function provides the correct output for whatever argument data is being operated upon.
[0089] The set of building blocks is related to the instruction set(s) which the processor is configured to implement, also to other processor and memory-specific factors such as memory bus width and instruction bus width as implemented (which may be lower than the processor’s theoretical bus capabilities), and to the particular operating system running on the computer system.
[0090] Initially, in a given implementation, there is typically a set of building blocks that relate to individual processes which are combined to compose optimiser functions that can implement those processes in a meaningful way in the context of the hardware system. Over time, it is seen that existing optimiser functions may be treated as supplemental building blocks such that a new optimiser function can be defined that is a composite of one or more other optimiser functions and/or one or more other processes.
[0091] An illustrative example of creating an optimiser function is seen from looking at a computer system function that uses a defined input which is always the same but when repeated may use different argument data (e.g. a computer function that uses floating point maths capability of the processor). A suitable function to consider is a function to create a text box in a desktop publishing application such as MS Publisher. Such a function is activated by a user initiating a draw text box function (for example using a graphical user interface button) and then drawing the text box size using a mouse, trackball or trackpad input. This application function triggers an instruction which is passed down through the operating system to the compiler, to the processor and back up to the application. When the inspector and interceptor see both the downward flow to the processor and the return coming back from the processor, the system can learn the behaviour that it observes to create an optimiser function. In this case, it will be observed that the instruction includes a floating point math problem, so the optimiser system can identify the relevant source code classes (from among the source code classes that underlie the high level language of the operating system and/or application from which the instruction originates) that implement the math problem and notes the path that the instruction uses. The path describes how the data (argument) is inserted into the math problem (i.e. the mathematical operator that is used to insert the data, for example insertion by way of addition, subtraction, multiplication or division, or a sequence of two or more such operators). Thus it is understood that different paths have different sets of mathematical operators.
[0092] As will be understood from the discussion above, the aim of creating an optimiser function is that the optimiser function is to be adaptive to different input data (arguments). Thus, from the source code it is possible to break down the whole operation of the math problem, the corresponding source code classes, the path and the result, and to interpret from that the rule for the complete combined cycle that is being performed. From this complete overview a logic gate is defined which acts as a transform that shortcuts directly from the output of the compiler to the processor to the result of the entire combined cycle as output from the processor into memory for passing back to the application. Thus it is seen that, with the exact sequence depending on the output from the compiler, the rule making is to put together OR/NOR/AND/XOR Boolean logic gates to create the result using basic blocks of Boolean logic from a programming language that can be interpreted by the operating system and/or processor. This logic gate is then the opcode part of the optimiser function and is used when necessary to implement the same instruction the next time this is detected by the inspector. The description part of the optimiser function describes the functionality of the opcode in such manner as to make the optimiser function re-findable within the optimiser function library. As discussed above, if an optimiser function is mis-defined such that it returns an error result, the system can learn from a re-run of the same instruction for which an error result was returned and learn therefrom to redefine the optimiser function.
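For illustration only, the composition of OR/NOR/AND/XOR gates into a single transform could be sketched as follows. The representation of the rule as a list of gate/constant steps, and the 16-bit word width, are assumptions of the sketch, not part of the described implementation.

```python
# Basic Boolean building blocks over 16-bit words (width is illustrative)
WIDTH = 16
MASK = (1 << WIDTH) - 1

GATES = {
    "AND": lambda a, b: a & b,
    "OR":  lambda a, b: a | b,
    "XOR": lambda a, b: a ^ b,
    "NOR": lambda a, b: ~(a | b) & MASK,
}

def run_gate_sequence(sequence, operand):
    """Apply a sequence of (gate, constant) steps to the operand - one way
    the rule-making described above could chain OR/NOR/AND/XOR gates into
    a single transform serving as the optimiser function's opcode."""
    value = operand
    for gate, constant in sequence:
        value = GATES[gate](value, constant)
    return value & MASK
```

Chaining a mask-off step with an inversion step, for instance, yields one composite transform that can be stored and replayed whenever the inspector detects the same instruction.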
[0093] Thus it is understood that according to the above-described approach the present teachings provide an approach for effectively bypassing involvement of a processor in carrying out some processor instructions, thus enabling two potential sources of performance improvements for a programmable computer. The first is that the instructions which bypass the processor are likely to be completed faster than if those instructions had passed through the processor pipeline. The second is that instructions that cannot be bypassed in this way would be expected to have less of a wait for execution by the processor as the processor doesn’t spend time processing those instructions that can use the by-pass. Accordingly, the approach of the present examples is self-selecting in the sense that instructions that can use the by-pass will do so and instructions that cannot use the by-pass will not try to do so, but will benefit from any reduced overall processor loading that the bypass provides.
[0094] Turning now to Figure 8, logical elements in another structure for implementing an execution optimiser according to the present techniques are shown. The execution optimiser of this example performs inspection of incoming instructions, uses this information to break the instructions into parallelisable or part-parallelisable part instructions and then feeds processor operations corresponding to the part instructions to the processor for parallel or part-parallel operation.
[0095] Thus, as shown in Figure 8, an optional application 30 may run on an operating system 32 of a programmable computer. The operating system 32 passes instructions, which may be from the operating system, an optional application or both, to a compiler 34. The compiler 34 interprets the operating system instructions into machine code or assembler code operations to be carried out by a processor.
[0096] According to the present approach, the activities of the compiler are watched by an inspector 70. The inspector 70 is able to determine by inspecting the compiler 34 what instructions are being processed into what sequences of processor operations. In one example, the inspector is, or is in communication with, a debug output of the compiler 34. Using this approach, the compiler outputs, via a debug output, information about the processing performed at each stage of the compiler. This enables the inspector to receive, in parallel with the operation of the compiler, information on the processor operations that the compiler will be outputting for execution by the processor.
[0097] The inspected instruction information from the inspector 70 is passed to a divider 72. The divider 72 reviews the inspected instruction to find connectives between the different parts of the instruction. These connectives represent joining codes that link different processor operations together. In some cases, the connective simply provides a link to join the processor operations into a single queue. In other cases the connectives are indicative of a dependency of a following processor operation upon the completion of a preceding processor operation.
[0098] The divider 72 uses the information on dependencies (and absences of dependencies) to remove or alter the connectives such that the processor operations making up the parts of the inspected instruction are no longer presented as a single linear queue. Where dependencies exist, the dependency information is retained so that the instruction management of a processor will cause the dependencies to be honoured when carrying out the processor operations. The ordering to handle such dependencies is achieved by, when removing connectives, knowing how fast the processor works so as to know how fast those operations could be processed in a string. The execution optimiser then feeds the operations into the instruction queue of the processor in a sequence with very small time gaps so that the instructions can be fed into the processor without the processor taking a break to look at the dependencies, but so that the dependencies are accounted for. In other words, the processor doesn't manage the dependencies any more; rather, the system aligns the operations in the instruction queue such that they will necessarily be carried out with the dependencies observed. Thereby the execution optimiser effectively moves the inevitable inefficiencies associated with dependency management sideways out of the processor to a functionality where those inefficiencies have less or no effect on overall processor throughput performance.
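For illustration only, the division of a linear operation queue according to its dependency information could be sketched as follows. The encoding of connectives as a mapping from operation index to prerequisite indices is an assumption of the sketch, not part of the described implementation.

```python
def divide_instruction(operations, dependencies):
    """Split a linear queue of part-operations into groups that can run in
    parallel while keeping dependency order. `dependencies` maps an
    operation index to the set of indices it must wait for. Each returned
    group contains only operations whose prerequisites all appear in
    earlier groups, so feeding the groups in order honours every
    dependency without the processor managing them."""
    remaining = set(range(len(operations)))
    done, groups = set(), []
    while remaining:
        # Operations whose prerequisites have all been scheduled already
        ready = {i for i in remaining if dependencies.get(i, set()) <= done}
        if not ready:
            raise ValueError("cyclic dependency between part-operations")
        groups.append([operations[i] for i in sorted(ready)])
        done |= ready
        remaining -= ready
    return groups
```

Each group can then be issued to the instruction queue back-to-back, with dependent operations necessarily landing after the operations they depend upon.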
[0099] The divider then passes the revised processor operations with their revised dependency information to an interceptor 74 which replaces the sequence of processor operations output by the compiler 34 in response to the intercepted instruction with the divided processor operations from the divider 72.
[00100] In one example, the inspector 70 may be provided by the compiler 34 outputting debug information to a predetermined set of memory locations from which the divider 72 can retrieve that debug information. The inspector 70 may also include a function to activate the debug output functionality of the compiler 34. In another example, the inspector 70 may be provided by a function that accesses or retrieves the debug output information from the compiler, extracts from it the information describing the inspected instruction and/or the processor operations being prepared by the compiler, and makes this information available to the divider 72 without other debug output information unrelated to the inspected instruction and/or processor operations.
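As a rough model of the filtering described in [00100], the following sketch keeps only the debug records relating to the inspected instruction; the record format and the helper name `extract_inspected` are assumptions made for illustration:

```python
# Hypothetical sketch of the inspector's filtering step: pull the
# compiler's debug output and keep only records describing the
# inspected instruction, discarding unrelated debug information.
def extract_inspected(debug_records, instruction_id):
    """debug_records: list of dicts as assumed emitted by a debug output.
    Returns only the records tagged with the inspected instruction."""
    return [r for r in debug_records if r.get("instruction") == instruction_id]

records = [
    {"instruction": "add_vec", "ops": ["load", "add", "store"]},
    {"instruction": "unrelated", "ops": ["nop"]},
]
relevant = extract_inspected(records, "add_vec")
# relevant contains only the "add_vec" record, ready for the divider
```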
[00101] If one or more sequences of processor operations cannot be divided by the divider, the interceptor 74 allows these to pass and they are then executed by a processor core 44 in a conventional manner. The processor core 44 may correspond to a processor core or processor pipeline of the processor 10 illustrated in Figure 1.
[00102] Whether the processor core 44 is carrying out divided operations inserted by the interceptor 74 or unmodified operations output by the compiler 34, the processor core 44 writes the results of the operations to a memory 46. The memory 46 may correspond to either or both of the cache 12 and memory 14 illustrated in Figure 1.
[00103] Where divided operations are inserted in place of a sequence of operations, the processor core 44 may carry out those divided operations in a parallel or part-parallel manner. Thus the processor core may provide different ones of the divided instructions to different pipelines of the processor core and thus provide a lower inter-operation latency than had the same operations been fed as a sequence to the same processor pipeline. Alternatively or additionally, different ones of the divided operations may be fed to different processor cores 44. Further, in some examples different ones of the divided instructions may be handled simultaneously by the same pipeline of the processor core 44. This is possible where the pipeline has a higher bandwidth than the operation/data width of the operations being passed to that pipeline. For example, if multiple instances of the same operation are set to operate on different data, and the data width over which the processor core can carry out that operation provides for two separate operands to be handled together, those operations can be carried out in this parallel manner in the same processor pipeline.
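The same-pipeline parallelism just described, where two narrow operands share one wide pipeline slot, might be modelled as below; the helper `pack_pairs` and the lane width are illustrative assumptions, not the patent's mechanism:

```python
# Illustrative sketch: pairing instances of the same operation on
# narrow data so each pair occupies one double-width pipeline slot.
def pack_pairs(operands, lane_width=2):
    """Group operands for the same operation into lane_width-sized
    bundles; each bundle would be issued as one wide operation."""
    return [operands[i:i + lane_width]
            for i in range(0, len(operands), lane_width)]

bundles = pack_pairs([10, 20, 30, 40])
# bundles == [[10, 20], [30, 40]]: two wide issues instead of four narrow ones
```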
[00104] Where the divided operations have some dependencies therebetween, the operations can be fed to a single or multiple processor cores or pipelines in such manner that the operations pass through the processor pipeline or core overlapping in time, but arranged such that the output upon which the dependency relies is calculated and available before it is needed by the dependent operation. Multiple different options for such arrangements can be provided depending upon the nature of both the operations and the dependencies. If for example the dependency is that the results from two separate operations are required as inputs to a third operation, the two earlier operations can be carried out sequentially, overlapping or in parallel according to the availability of space in the processor execution pipeline(s). The third operation then follows at least far enough behind in time that both of the earlier results are available as inputs (operands) for the third operation by the time that the operands for that third operation are needed by the processor.
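The timing constraint described here, that a dependent operation starts only once both earlier results exist, can be sketched as a small earliest-start calculation; the function name and the latency figures are invented for illustration:

```python
# Sketch: compute the earliest cycle at which each operation may start,
# given which operations it consumes results from. Latencies are
# illustrative, not measured values.
def earliest_start(deps, latency):
    """deps maps op -> list of ops whose results it consumes.
    latency maps op -> cycles the op takes to complete.
    Returns the earliest start cycle for every op."""
    start = {}
    def resolve(op):
        if op not in start:
            start[op] = max((resolve(d) + latency[d]
                             for d in deps.get(op, [])), default=0)
        return start[op]
    for op in latency:
        resolve(op)
    return start

s = earliest_start({"op3": ["op1", "op2"]},
                   {"op1": 3, "op2": 5, "op3": 2})
# op1 and op2 may start at cycle 0 (in parallel); op3 at cycle 5,
# once the slower of its two operand-producing operations has finished.
```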
[00105] Where divided operations have been inserted by the interceptor 74, a combiner 76 then acts upon the results written to the memory 46 by the processor core 44 to combine or build those results into the combined output that would have been provided had the divided operations not replaced the sequence of operations from the compiler. The combiner 76 may be provided with information such as a destination memory address for the result of each divided operation from the divider 72 to facilitate this assembly of an overall result.
[00106] Thus it is understood that by breaking a sequence of processor operations into a set of independent and partially dependent operations, those operations can be scheduled to one or more processor execution pipeline(s) such that the overall result is available sooner than if that sequence had been executed as a purely linear sequence. Thereby the result of the processor operations can be obtained faster than by waiting for the whole sequence to pass through the processor pipeline in order. Accordingly, a faster execution time can be achieved for the inspected instruction.
[00107] Turning to Figures 9 and 10, these figures schematically illustrate a simple example of dividing a sequence of instructions in the manner employed by the present approach.
[00108] Figure 9 illustrates an instruction made up of four instruction parts 80, 82, 84 and 86, joined by connectives 88. In this example, it is assumed that instruction part 3 84 and instruction part 4 86 have a dependency from instruction part 2 82, that instruction part 2 82 itself has no dependencies, and that instruction part 1 80 has no dependencies and no other part is dependent upon it.
[00109] Thus in Figure 10 there is schematically illustrated the relationship with connectives removed, showing that instruction parts 1 and 2 80 and 82 have no dependencies and thus can be executed in parallel (or in sequence or overlapping, according to processor availability). Also, instruction parts 3 and 4 84 and 86 have a dependency from instruction part 2 82 and therefore can be executed overlapping (or in sequence, according to processor availability), but cannot be executed in parallel, with instruction part 2 82.
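The four-part example above can be reproduced as a small dependency-depth grouping; the function `parallel_levels` is an illustrative sketch, not the patent's implementation:

```python
# Sketch reproducing the example: parts 1 and 2 have no dependencies;
# parts 3 and 4 each depend on part 2. Grouping operations by their
# dependency depth yields the sets that may execute in parallel.
def parallel_levels(deps, ops):
    """deps maps op -> set of prerequisite ops.
    Returns lists of ops grouped by dependency depth."""
    level = {}
    def depth(op):
        if op not in level:
            level[op] = 1 + max((depth(d) for d in deps.get(op, ())),
                                default=-1)
        return level[op]
    groups = {}
    for op in ops:
        groups.setdefault(depth(op), []).append(op)
    return [sorted(groups[k]) for k in sorted(groups)]

levels = parallel_levels({"part3": {"part2"}, "part4": {"part2"}},
                         ["part1", "part2", "part3", "part4"])
# levels == [["part1", "part2"], ["part3", "part4"]]: parts 1 and 2 may
# run in parallel; parts 3 and 4 follow once part 2 has produced output.
```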
[00110] Thus it is understood that according to the above-described approach the present teachings provide an approach for allowing operations that are passed out from the compiler as a sequence but in fact have no or only limited sequential dependency to be carried out in parallel or overlapping in time (part-parallel) by removing unnecessary sequential connectives yet maintaining sequential dependencies. Using this approach, the set of operations from the sequence of operations may be performed sooner than if that same set were performed in sequence. In addition, by permitting parallelisation of multiple operations onto the same position in the processor execution pipeline, an overall acceleration of processor operation may be achieved by allowing otherwise wasted processor capability to be utilised. [00111] Although the two approaches of processor bypassing through use of optimiser functions and enhanced parallelisation have been described separately, they can be applied together as a unified approach. This is schematically illustrated in Figure 11.
[00112] Figure 11 illustrates in combination both the optimiser-function based inspector/comparator/library/interceptor structure described with reference to Figures 2 to 7 above and the inspector/divider/interceptor structure described with reference to Figures 8 to 10 above.
[00113] As shown in Figure 11, the inspector 36, 70 inspects the compiler 34 to obtain information describing the instructions that the compiler is processing into a sequence of processor operations. The inspector 36 provides this information to the comparator, which attempts to find a matching optimiser function from the library 38 as described above. If an optimiser function is found, then the interceptor 42, 74 intercepts the relevant processor operations and prevents those operations from reaching the processor, instead replacing them with the identified optimiser function. If no matching optimiser function is found, the divider 72 performs processing to divide the sequence of processor operations into divided operations. If such divided operations can be achieved, these are used by the interceptor 42, 74 to replace the sequence of operations so as to enable at least partial parallelisation of those operations through the processor. Where such divided operations are used, the combiner 76 operates to create the combined output that would have resulted from the sequence of operations output by the compiler. [00114] Although Figure 11 shows that the comparator 40 and divider 72 work in sequence, with the divider only dividing the operations if no matching optimiser function is found, it is possible for these to work in parallel. Thus the divider 72 can be determining divided operations while the comparator 40 is seeking a matching optimiser function 38. Thus if a matching optimiser function is found the divided operations are discarded, but if no optimiser function is found then the divided operations are ready for use without waiting for the divider to start working at that later time. The interceptor 42, 74 could be configured to provide the control as to which result is used, with a configuration to use an optimiser function when one is available and to use divided operations if no optimiser function is available.
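The parallel arrangement suggested in [00114] might be sketched with a thread pool, the library lookup and the divider running concurrently while the interceptor chooses which result to use; the function names here are assumptions made for illustration:

```python
# Hedged sketch: run the comparator's library lookup and the divider
# concurrently. The optimiser function wins when one is found;
# otherwise the already-computed division is used without delay.
from concurrent.futures import ThreadPoolExecutor

def intercept(instruction, lookup_optimiser, divide):
    with ThreadPoolExecutor(max_workers=2) as pool:
        match = pool.submit(lookup_optimiser, instruction)
        divided = pool.submit(divide, instruction)
        fn = match.result()
        if fn is not None:
            return ("optimiser", fn)       # divider's output is discarded
        return ("divided", divided.result())

kind, _ = intercept("mul_add",
                    lambda i: None,                  # no library hit
                    lambda i: [i + "_a", i + "_b"])  # divider's result
# kind == "divided": with no matching optimiser function, the divided
# operations are ready immediately.
```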
[00115] Whether to implement the comparator and divider working in sequence or in parallel may depend upon factors such as the relative speeds of operation of the comparator and divider and the resources available for implementing these functionalities. [00116] In testing using the CineBench™ test suite for evaluating the performance of programmable computer systems, the approach of the present techniques has been evaluated to determine the real-world impact of the approaches. The tests were performed using both the optimiser function and divider/combiner approaches discussed herein, as illustrated in Figure
11. The CineBench test suite tests both central processor (CPU) performance and graphics (GPU) performance. In these tests, the graphics processor of each test computer was left unchanged and the central processor performance was adjusted using the execution optimiser of the present techniques, such that any change in CineBench score is due to the operation of the execution optimiser on the central processor. The tests using the optimiser function were based upon a clean installation such that no established library of optimiser functions tailored to the test software had been pre-built within the system. The results of these tests are shown in Tables 1 and 2:
CPU CineBench test results
Test 1 Test 2 Test 3 Test Average
Intel Celeron N2830 61 63 61 61
Intel Core i5-3317U 173 174 174 174
Intel Core i7-4960X 1118 1119 1119 1119
Intel Core i7-5960X 1377 1379 1375 1377
Intel Xeon E5-2687W v2 2240 2243 2243 2242
Intel Xeon E5-2699 v3 4361 4635 4361 4362
Table 1: CineBench test without execution optimiser
CPU CineBench test results
Test 1 Test 2 Test 3 Test Average
Intel Celeron N2830 175 175 174 175
Intel Core i5-3317U 341 342 342 342
Intel Core i7-4960X 2094 2098 2095 2095
Intel Core i7-5960X 3569 3569 3570 3569
Intel Xeon E5-2687W v2 5224 5221 5224 5223
Intel Xeon E5-2699 v3 12692 12683 12695 12690
Table 2: CineBench test with execution optimiser
[00117] Thus it can be seen that a significant performance increase was observed with the execution optimiser active. In many cases, the improvement was at least a factor of 2. [00118] As an additional check, the temperatures of the processors were measured under heavy processing load to establish whether an increase in processor performance by the execution optimiser was causing the processor to work harder and thus consume (and thereby dissipate as thermal energy) more power. This test was performed using the same hardware as the CineBench tests (excluding the Intel Xeon E5-2699 v3 test machine due to access limitations to this hardware in the test environment). In each case the test computer was caused to run the Prime95 application, which attempts to calculate the value of Pi to an infinite number of decimal places. This ensures that the processor is running a full execution workload. Each test machine was left to run the software for half an hour per test so as to allow the impact of the test workload to be fully realised in the processor. The results of these tests are shown in Tables 3 and 4:
CPU Temperature average over half hour test (°C)
Test 1 Test 2 Test Average
Intel Celeron N2830 75 82 78.5
Intel Core i5-3317U 79 86 82.5
Intel Core i7-4960X 42 52 47
Intel Core i7-5960X 31 45 38
Intel Xeon E5-2687W v2 82 92 87
Table 3: Prime95 test without execution optimiser
CPU Temperature average over half hour test (°C)
Test 1 Test 2 Test Average
Intel Celeron N2830 74 80 77
Intel Core i5-3317U 78 87 82.5
Intel Core i7-4960X 47 52 49.5
Intel Core i7-5960X 31 45 38
Intel Xeon E5-2687W v2 82 93 87.5
Table 4: Prime95 test with execution optimiser [00119] Thus it is seen that the use of the execution optimiser appears to have little or no effect on the temperature of the test processor under full load. It is therefore expected that in real world usage of a programmable computer, the actual performance impact will be that a given operation can be completed faster such that over the time taken to perform the operation the power used by the processor will be lower. The actual speed and power experience will vary depending upon the type and load of computing operations demanded of the programmable computer.
[00120] Thus it is understood that the various approaches discussed herein may be synergistically applied as a combined approach. This provides that where optimiser functions can be used to perform the bypass of processor activity to carry out processor operations, this is performed. In addition, where no optimiser function is available, the operations that are passed to the processor may benefit from both the possibility of non-sequential operation through parallel or part-parallel (overlapping) execution of divided operations and the reduced processor workload provided by some operations bypassing the processor.
[00121] As will be appreciated, at the processor operations level, a given sequence of processor operations may be caused by a wide variety of operating system or application functionalities. For example, it may be found in practice that certain cursor movements caused by a mouse or touchpad input to the operating system actually trigger the same pattern of processor operations as an application functionality to create a new document or produce audio playback of certain frequency combinations. Thus each optimiser function defined and used according to the approach described herein may be used at a processor level to increase the responsiveness or processing speed of a wide variety of operating system and application tasks.
[00122] As a result of this disconnect between high level functionality and processor operation sequences, it is possible to optimise the performance of the execution optimiser described herein by running the optimiser through the optimiser. Thus acts of the execution optimiser may themselves be optimised by use of the approach of the execution optimiser. [00123] With reference to the discussion above in relation to execution of the execution optimiser, some parts of the execution optimiser may be stored and run from non-volatile storage for activation when the programmable computer and/or the execution optimiser is started. Also, some parts of the execution optimiser can be run from a location within the BIOS (or EFI, UEFI or OpenFirmware or similar) so as to be treated by the programmable computer as a hardware element.
[00124] In particular, in some examples, functionality corresponding to the interceptor 42, 74 can be implemented to run from the BIOS and thereby be treated by the programmable computer as though it were a hardware module. However implemented, the interceptor is able to access the processor instruction queues in memory of the programmable computer and thereby prevent certain operations from reaching the processor. The interceptor can also access a memory location in which to insert data of its choice (for example a logic sequence of the optimiser function). This logic sequence can therefore be inserted into a suitable memory location to cause the logic sequence to be used in operation of a memory refresh, as described in more detail below. As the skilled reader will appreciate, a module configured via the BIOS to be treated as a hardware module may run from the BIOS or may be initialised as a hardware module via the BIOS and then run from memory locations of the cache 12 or memory 14 during operation of the programmable computer.
[00125] The implementation of the modules of the present techniques allows those modules to interact as discussed herein, whether or not any modules are implemented to be run from or initialised via the BIOS.
[00126] Thus the comparator (for example running as a software module) may be provided by the inspector (for example running as a software module) with information from the debug outputs of the compiler; then, depending on which instruction sets are currently in memory to be operated on by the CPU, these inform the interceptor (for example running as a software implementation of a hardware module) on a per-instruction basis to control the passage or non-passage of operations to the processor. The interceptor can then apply the optimiser function to the correct location in memory to ensure that the optimiser function is applied at the next memory refresh cycle. Thus such an implementation effectively uses the BIOS-implemented module as a bridge between hardware and software realms. Access to the BIOS to create a module that is treated as a hardware module by the programmable computer can be provided using approaches such as the functionality that permits so-called flashing of the BIOS to provide BIOS updates or similar in the programmable computer.
[00127] In other implementations, rather than implement modules (such as the interceptor modules) as software that is treated as hardware by the running programmable computer, the interceptor could alternatively be implemented as a dedicated hardware module connected to the programmable computer. Such a dedicated module could therefore run as an actual hardware module rather than a software module being treated as a hardware module. Such a dedicated module could be implemented as one or more ICs and associated circuitry provided on a motherboard of the programmable computer or as one or more ICs and associated circuitry on a so-called daughterboard or riser-card connected to the motherboard of the programmable computer, connected via an interface such as PCIe, SATA Express, NVM
Express or M.2. Such a dedicated module could alternatively be implemented as one or more ICs and associated circuitry provided on a connectable module connectable to the motherboard of the programmable computer, connected via an interface such as SATA, SATA Express, or USB.
[00128] Although it has been described above that the optimisation system of the present examples can be implemented to optimise the performance of a computer CPU, the same or similar approach can also be applied to other processors. One example of another processor type for which the present techniques have been applied is the so-called "GPU" (graphics processing unit), which may be provided as part of a video-processing module for a computer system. Example GPUs include the Radeon series of GPUs from AMD (formerly from ATI), the GeForce series of GPUs from Nvidia, and the integrated GPU cores present in some Intel processors in the i3 and i5 ranges. A GPU may be a dedicated graphics processor or a so-called general purpose GPU (GPGPU), may be physically or logically constructed from a number of stream processors, and may be physically separate from or integrated with a CPU. [00129] This approach to optimisation of processor capability is implemented to work in largely the same way as the approaches discussed above for a central processor. The difference is that rather than obtaining access to the processor queues and memory by way of appearing to be or being a hardware module, for a GPU such as a Radeon or GeForce GPU it has been found that the same access can be obtained by implementing the relevant modules into or through the GPU driver which is used by the operating system to perform tasks using the GPU. Each actual implementation may be GPU-specific, but the principles apply to the different GPUs and the same module can be compiled into the relevant driver for the GPU being optimised.
[00130] In testing using the CineBench™ test suite, Performance Benchmark software, and tests under DirectCompute, the approach of the present techniques has also been evaluated to determine the real-world impact of the approaches on a GPU. The tests were performed using both the optimiser function and divider/combiner approaches discussed herein, as illustrated in Figure 11. The CineBench test suite tests both central processor (CPU) performance and graphics (GPU) performance. The Performance Benchmark software tests all aspects of a PC's speed. The DirectCompute test is used to change colours rapidly and without screen tearing in the test performed by Performance Benchmark. In these tests, the central processor configuration was left unchanged as a dual processor E5-2699 at standard clock speed of 2.3GHz with 16GB of RAM and the GPU was switched between an Nvidia Quadro M6000 (graphics board manufactured by PNY and equipped with 12GB of GDDR5 memory, manufacturer part no VCQM6000-PB) and an Nvidia GeForce Titan X (graphics board manufactured by EVGA and equipped with 12288MB GDDR5 memory, manufacturer part no 12G-P4-2992-KR) to test two different GPU models. Performance was adjusted using the execution optimiser of the present techniques, such that any change in benchmarking test score is due to the operation of the execution optimiser on the GPU. The tests using the optimiser function were based upon a clean installation such that no established library of optimiser functions tailored to the test software had been pre-built within the system. The results of these tests are shown in Graphs 1, 2 and 3 (in the graphs, the optimiser functions are identified as the "Aspartech Converter"):
Figure GB2552773A_D0011
Graph 1: Quadro test with and without execution optimiser
Figure GB2552773A_D0012
Graph 2: Titan X test with and without execution optimiser
Figure GB2552773A_D0013
Figure GB2552773A_D0014
Graph 3: DirectCompute test on Quadro and Titan X with and without execution optimiser [00131] Although it has been described above that a library of optimiser functions is maintained by the optimiser system such that the opcode for a required optimiser function can be written to a suitable memory location for use each time the optimiser function is required, the responsiveness of the optimiser system to write such opcodes into the correct memory location can be altered in situations where spare processor RAM is available (i.e. not allocated to other activities of the processor, operating system or applications).
[00132] In this approach, the optimiser system is able to access unallocated RAM for the processor (whether this is system RAM for a CPU or VRAM for a GPU) and use that spare RAM to cache some or all of the opcodes for the optimiser functions from the function library. This provides that when a particular optimiser function is called, the corresponding opcode can be copied from the cache copy in RAM to the correct memory location to be used by the optimiser system with the memory refresh cycle as discussed above. It is seen that such an approach will provide for the opcode to be available for use faster than if the opcode were being retrieved from a library in application storage of the computer system, as a RAM copy is typically a very rapid operation.
[00133] Such a caching approach can also be made dynamically aware of the likely efficiencies to be obtained therefrom when the spare RAM is, or becomes through allocation of RAM to other processes, insufficient to put the entire optimiser function library into the RAM cache. Thus the system can choose which optimiser function opcodes to cache into RAM based upon an order of preference related to efficiency gains. The order of preference may prioritise caching of optimiser function opcodes corresponding to the most complex processor operations (i.e., those that will take the processor longest to do) and/or the optimiser function opcodes that have been observed to be the most frequently called.
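The order of preference described in [00133] could be modelled as below; the scoring heuristic (complexity weighted by call frequency), the helper name `choose_cached`, and the sizes are illustrative assumptions, not details given in the patent:

```python
# Illustrative sketch: when spare RAM cannot hold every opcode, cache
# the functions with the highest expected payoff, scored here (an
# assumed heuristic) as operation complexity weighted by call count.
def choose_cached(functions, spare_bytes):
    """functions: list of dicts with name, size, complexity, calls.
    Returns the names of the opcodes selected for the RAM cache."""
    ranked = sorted(functions,
                    key=lambda f: f["complexity"] * f["calls"],
                    reverse=True)
    cached, used = [], 0
    for f in ranked:
        if used + f["size"] <= spare_bytes:
            cached.append(f["name"])
            used += f["size"]
    return cached

picked = choose_cached(
    [{"name": "fft", "size": 40, "complexity": 9, "calls": 120},
     {"name": "memcpy", "size": 10, "complexity": 1, "calls": 500},
     {"name": "crc", "size": 30, "complexity": 5, "calls": 50}],
    60)
# fft scores 1080, memcpy 500, crc 250; with 60 bytes spare,
# fft (40) and memcpy (10) fit but crc (30) does not.
```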
[00134] Accordingly, it will be appreciated that there have been described a number of approaches to flexible execution optimisation, such as to both reduce processor workload and make more efficient use of the processor's processing power for the workload that remains.
[00135] As will also be appreciated, various modifications and alternatives are possible while retaining the overall approach of the present teachings. For example, although it is discussed above that the optimiser functions use a memory refresh cycle that happens anyway to refresh the volatile memory, the present approach can still be used in a system that employs memory which does not require refresh cycles, or which requires only infrequent refresh cycles. For example, an instruction can be inserted into the processor pipeline to cause an operation equivalent to a memory refresh cycle to affect the memory location(s) used by the optimiser function. Even though this would be one processor operation that would otherwise not have occurred, this is still expected to be a lower processor burden than carrying out the whole sequence of operations that the optimiser function replaced, thus maintaining the above-described speed of obtaining the optimiser function result and reduction in processor workload. [00136] The presently described approaches can be implemented on a programmable computer. In order to provide the functionality to a programmable computer, a computer program product may be provided which can be provided or transmitted to the computer to deliver code that enables the computer to implement the functionality. The product may be distributed and/or stored by way of a physical medium such as a computer-readable disc or portable memory device, which may collectively be termed a non-transitory computer-readable medium. The product may also or alternatively be distributed and/or stored by way of a transient medium such as a transmission channel, communication signal or other propagating waveform, which may collectively be termed a transient computer-readable medium.
[00137] Although the embodiments above have been described in considerable detail, numerous variations and modifications will become apparent to those skilled in the art once the above disclosure is fully appreciated. It is intended that the following claims be interpreted to embrace all such variations and modifications.

Claims (48)

1. An execution optimiser for a programmable processor, the optimiser comprising:
a code inspector configured to access an instruction compiler to inspect an instruction for the processor being handled by the compiler;
a comparator configured to compare the inspected instruction to a record of previously defined optimiser functions; and an interceptor configured to, if the comparator identifies the inspected instruction as corresponding to an optimiser function from the record of previously defined optimiser functions, intercept the inspected instruction and to perform the corresponding optimiser function instead of allowing the inspected instruction to reach the processor and return the result of the optimiser function as the result of the inspected instruction.
2. The execution optimiser of claim 1, wherein the code inspector is configured to inspect the instruction at each of multiple stages of the instruction compiler and to use the inspection result at one stage of the compiler to identify one or more candidate optimiser functions from the record of previously defined optimiser functions and subsequently to use the inspection result at a subsequent stage of the compiler to reduce the number of candidate optimiser functions.
3. The execution optimiser of claim 1 or 2, wherein each optimiser function of the previously defined optimiser functions describes an operation that provides the same outcome for any given argument data as one or more processor instructions provided with the same argument data.
4. The execution optimiser of claim 1, 2 or 3, wherein each optimiser function of the previously defined optimiser functions comprises a definition that describes a functionality of the optimiser function and a logic sequence that can be implemented to carry out the optimiser function.
5. The execution optimiser of any preceding claim, wherein the interceptor is configured to intercept the inspected function by intercepting a sequence of processor operations output by the compiler based on the inspected instruction.
6. The execution optimiser of any preceding claim, wherein the interceptor is configured to perform an optimiser function by causing both a logic sequence of the optimiser function and argument data for the inspected instruction to be written to a common location in a volatile memory connected to the processor when the processor performs a memory refresh cycle for that location in volatile memory.
7. The execution optimiser of claim 6, wherein the optimiser is configured to return the result of the optimiser function by causing the output of both the logic sequence and argument data for the inspected instruction to be written to a location in a volatile memory to which the outcome of the inspected instruction would have been written if carried out by the processor.
8. The execution optimiser of any preceding claim, further comprising a function creator configured to, if the comparator identifies the inspected instruction as not corresponding to an optimiser function from the record of previously defined optimiser functions, analyse the intercepted instruction to determine whether a new optimiser function can be defined therefrom.
9. The execution optimiser of claim 8, wherein the function creator is configured to, if it is determined that a new optimiser function can be defined from the inspected function, define an optimiser function therefrom.
10. The execution optimiser of claim 9, wherein the function creator is configured to create a new optimiser function only if the new optimiser function is different to all existing optimiser functions.
11. The execution optimiser of claim 9 or 10, wherein the function creator is configured to create a new optimiser function that replaces a previous optimiser function and corresponds to the intercepted instruction.
12. The execution optimiser of claim 9, 10 or 11, wherein the function creator is configured to create a new optimiser function by combining a group of function building blocks representative of a rule, a nature of the calculation and an adaptive output to create both a definition of the optimiser function and a logic sequence of the optimiser function.
13. The execution optimiser of any preceding claim, wherein the compiler is an operating system compiler and wherein the inspected instruction is an instruction from the operating system or an instruction from an application running on the operating system.
14. The execution optimiser of any preceding claim, wherein the instruction conforms to one selected from the group comprising: a CISC instruction set instruction; a RISC instruction set instruction; a x86 instruction set instruction; an ARM instruction set instruction; and a GPU instruction set instruction.
15. The execution optimiser of any preceding claim, comprising a code module configured to run from within a computer BIOS, EFI, UEFI or OpenFirmware, or to run from within a hardware driver.
16. An execution optimiser for a programmable processor, comprising:
a code inspector configured to access an instruction compiler to inspect an instruction for the processor being handled by the compiler;
a divider configured to identify parts of the inspected instruction that can be executed as separate operations and to create a set of separate operations for execution by the processor; and a combiner configured to combine processor outcomes from the set of separate operations into a result corresponding to a result that would have been returned by the processor carrying out the inspected instruction.
17. The execution optimiser of claim 16, wherein the divider is further configured to identify parts of the inspected instruction by identifying connectives between parts of the inspected instruction.
18. The execution optimiser of claim 16 or 17, wherein the divider is further configured to identify a presence or absence of dependencies between the parts of the inspected instruction.
19. The execution optimiser of claim 18, wherein the divider is further configured to schedule parts of the inspected operation having no dependency therebetween as separate operations which can be performed in series or in parallel by the processor.
20. The execution optimiser of claim 18 or 19, wherein the divider is further configured to schedule parts of the inspected operation having a dependency therebetween as separate operations which can be performed on one or more pipelines of the processor based upon a scheduling constraint to cause the result of a first separate operation required as an input for a second separate operation to be available to the second separate operation.
21. The execution optimiser of any of claims 1 to 15, further comprising the execution optimiser of any of claims 16 to 20 and configured to operate the divider and combiner if the comparator identifies the inspected instruction as not corresponding to an optimiser function from the record of previously defined optimiser functions.
22. The execution optimiser of claim 21 as dependent from at least claim 8, configured to operate the divider and combiner if the function creator determines that no new optimiser function can be defined from the inspected instruction.
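By way of illustration only (not forming part of the claims), the comparator/interceptor flow of the apparatus claims — inspect an instruction, compare it to a record of previously defined optimiser functions, and, on a match, intercept it and return the optimiser function's result instead — might be sketched as below. All names (Instruction, OptimiserFunction, the record layout) are assumptions made for this sketch.

```python
# Hypothetical sketch of the comparator/interceptor flow of the apparatus
# claims. Each optimiser function pairs a definition describing its
# functionality with a logic sequence (here a callable) that carries it out.
from dataclasses import dataclass

@dataclass(frozen=True)
class Instruction:
    definition: str  # abstract description of what the instruction does

class OptimiserFunction:
    def __init__(self, definition, logic_sequence):
        self.definition = definition          # describes the functionality
        self.logic_sequence = logic_sequence  # callable carrying out the function

class ExecutionOptimiser:
    def __init__(self, record):
        # record: maps instruction definitions to OptimiserFunction objects
        self.record = record

    def handle(self, inspected, argument_data):
        # Comparator: does the inspected instruction match a recorded function?
        match = self.record.get(inspected.definition)
        if match is not None:
            # Interception: the instruction never reaches the processor; the
            # optimiser function's logic sequence produces the result instead.
            return match.logic_sequence(argument_data)
        # Otherwise the divider/combiner fallback of claim 21 would apply
        # (not sketched here).
        raise NotImplementedError("divide-and-combine fallback not shown")
```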
23. A method of execution optimisation for a programmable processor, the method comprising:
accessing an instruction compiler to inspect an instruction for the processor being handled by the compiler;
comparing the inspected instruction to a record of previously defined optimiser functions; and if the comparing identifies the inspected instruction as corresponding to an optimiser function from the record of previously defined optimiser functions:
intercepting the inspected instruction;
performing the corresponding optimiser function instead of allowing the inspected instruction to reach the processor; and returning the result of the optimiser function as the result of the inspected instruction.
24. The method of claim 23, wherein inspecting comprises:
inspecting the instruction at each of multiple stages of the instruction compiler;
using the inspection result at one stage of the compiler to identify one or more candidate optimiser functions from the record of previously defined optimiser functions; and subsequently using the inspection result at a subsequent stage of the compiler to reduce the number of candidate optimiser functions.
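By way of illustration only (not forming part of the claims), claim 24's multi-stage narrowing — using one compiler stage to identify candidate optimiser functions and subsequent stages to reduce them — might be sketched as below. The per-stage matching predicates are assumptions made for this sketch.

```python
# Hypothetical sketch of claim 24's candidate narrowing across compiler stages.
# record: iterable of optimiser-function definitions.
# stage_observations: one predicate per compiler stage; each returns True if a
# definition is still consistent with what that stage observed.

def narrow_candidates(record, stage_observations):
    candidates = list(record)
    for observed_at_stage in stage_observations:
        # Each later compiler stage prunes candidates that no longer match.
        candidates = [d for d in candidates if observed_at_stage(d)]
        if len(candidates) <= 1:
            break  # resolved (or no match); no need to inspect further stages
    return candidates
```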
25. The method of claim 23 or 24, wherein each optimiser function of the previously defined optimiser functions describes an operation that provides the same outcome for any given argument data as one or more processor instructions provided with the same argument data.
26. The method of claim 23, 24 or 25, wherein each optimiser function of the previously defined optimiser functions comprises a definition that describes a functionality of the optimiser function and a logic sequence that can be implemented to carry out the optimiser function.
27. The method of any of claims 23 to 26, wherein the intercepting comprises intercepting a sequence of processor operations output by the compiler based on the inspected instruction.
28. The method of any of claims 23 to 27, wherein the performing an optimiser function comprises causing both a logic sequence of the optimiser function and argument data for the inspected instruction to be written to a common location in a volatile memory connected to the processor when the processor performs a memory refresh cycle for that location in volatile memory.
29. The method of claim 28, wherein performing the optimiser function comprises causing the output of both the logic sequence and argument data for the inspected instruction to be written to a location in a volatile memory to which the outcome of the inspected instruction would have been written if carried out by the processor.
30. The method of any of claims 23 to 29, further comprising, if the comparing identifies the inspected instruction as not corresponding to an optimiser function from the record of previously defined optimiser functions, analysing the inspected instruction to determine whether a new optimiser function can be defined therefrom.
31. The method of claim 30, further comprising, if it is determined that a new optimiser function can be defined from the inspected instruction, defining an optimiser function therefrom.
32. The method of claim 31, further comprising creating a new optimiser function only if the new optimiser function is different to all existing optimiser functions.
33. The method of claim 31 or 32, further comprising defining a new optimiser function that replaces a previous optimiser function and corresponds to the intercepted instruction.
34. The method of claim 31, 32 or 33, wherein defining a new optimiser function comprises combining a group of function building blocks representative of a rule, a nature of the calculation and an adaptive output to create both a definition of the optimiser function and a logic sequence of the optimiser function.
35. The method of any of claims 23 to 34, wherein the compiler is an operating system compiler and wherein the inspected instruction is an instruction from the operating system or an instruction from an application running on the operating system.
36. The method of any of claims 23 to 35, wherein the instruction conforms to one selected from the group comprising: a CISC instruction set instruction; a RISC instruction set instruction; an x86 instruction set instruction; an ARM instruction set instruction; and a GPU instruction set instruction.
37. The method of any of claims 23 to 36, implemented at least partially by a code module configured to run from within a computer BIOS, EFI, UEFI or OpenFirmware, or to run from within a hardware driver.
38. A method of execution optimisation for a programmable processor, comprising: accessing an instruction compiler to inspect an instruction for the processor being handled by the compiler;
identifying parts of the inspected instruction that can be executed as separate operations and creating a set of separate operations for execution by the processor; and combining processor outcomes from the set of separate operations into a result corresponding to a result that would have been returned by the processor carrying out the inspected instruction.
39. The method of claim 38, further comprising identifying parts of the inspected instruction by identifying connectives between parts of the inspected instruction.
40. The method of claim 38 or 39, further comprising identifying a presence or absence of dependencies between the parts of the inspected instruction.
41. The method of claim 40, further comprising scheduling parts of the inspected operation having no dependency therebetween as separate operations which can be performed in series or in parallel by the processor.
42. The method of claim 40 or 41, further comprising scheduling parts of the inspected operation having a dependency therebetween as separate operations which can be performed on one or more pipelines of the processor based upon a scheduling constraint to cause the result of a first separate operation required as an input for a second separate operation to be available to the second separate operation.
43. The method of any of claims 23 to 37, further comprising the method of any of claims 38 to 42, wherein the identifying and combining are performed if the comparing identifies the inspected instruction as not corresponding to an optimiser function from the record of previously defined optimiser functions.
44. The method of claim 43 as dependent from at least claim 30, wherein the identifying and combining are performed if the analysing determines that no new optimiser function can be defined from the inspected instruction.
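By way of illustration only (not forming part of the claims), the scheduling of claims 40 to 42 — parts with no dependency between them may run in series or in parallel, while a part needing another part's result is scheduled after it — might be sketched as below. The wave-based grouping is an assumption made for this sketch.

```python
# Hypothetical sketch of dependency-aware scheduling per claims 40-42.
# parts: list of part names from the divided instruction.
# depends_on: dict mapping a part to the set of parts whose results it needs.
# Returns parts grouped into waves; every part within a wave is independent of
# the others, so a wave may be executed in parallel (or in series).

def schedule(parts, depends_on):
    done, waves = set(), []
    remaining = list(parts)
    while remaining:
        # A part is ready once all parts it depends on have completed.
        wave = [p for p in remaining if depends_on.get(p, set()) <= done]
        if not wave:
            raise ValueError("cyclic dependency between instruction parts")
        waves.append(wave)
        done.update(wave)
        remaining = [p for p in remaining if p not in done]
    return waves
```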
45. A computer program product comprising processor-executable instructions for causing a programmable computer to become configured as the apparatus of any of claims 1 to 22 or to perform the method of any of claims 23 to 44.
46. An execution optimiser, substantially as hereinbefore described with reference to any of the accompanying Figures.
47. An execution optimisation method, substantially as hereinbefore described with reference to any of the accompanying Figures.
48. A computer program product for execution optimisation, substantially as hereinbefore described with reference to any of the accompanying Figures.
GB1612035.4A 2016-07-11 2016-07-11 Optimisation Withdrawn GB2552773A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB1612035.4A GB2552773A (en) 2016-07-11 2016-07-11 Optimisation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB1612035.4A GB2552773A (en) 2016-07-11 2016-07-11 Optimisation

Publications (2)

Publication Number Publication Date
GB201612035D0 GB201612035D0 (en) 2016-08-24
GB2552773A true GB2552773A (en) 2018-02-14

Family

ID=56890750

Family Applications (1)

Application Number Title Priority Date Filing Date
GB1612035.4A Withdrawn GB2552773A (en) 2016-07-11 2016-07-11 Optimisation

Country Status (1)

Country Link
GB (1) GB2552773A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5696956A (en) * 1995-11-08 1997-12-09 Digital Equipment Corporation Dynamically programmable reduced instruction set computer with programmable processor loading on program number field and program number register contents
GB2386449A (en) * 2001-11-20 2003-09-17 Hewlett Packard Co Reconfigurable processor

Also Published As

Publication number Publication date
GB201612035D0 (en) 2016-08-24

Similar Documents

Publication Publication Date Title
US8726255B2 (en) Recompiling with generic to specific replacement
US10740152B2 (en) Technologies for dynamic acceleration of general-purpose code using binary translation targeted to hardware accelerators with runtime execution offload
EP2652600B1 (en) Virtual machine branching and parallel execution
TWI594117B (en) Profiling application code to identify code portions for fpga inplementation
US8843912B2 (en) Optimization of an application to reduce local memory usage
US8312455B2 (en) Optimizing execution of single-threaded programs on a multiprocessor managed by compilation
US8745360B2 (en) Generating predicate values based on conditional data dependency in vector processors
US9513915B2 (en) Instruction merging optimization
JP6122749B2 (en) Computer system
WO2017015071A1 (en) Incremental interprocedural dataflow analysis during compilation
US7613912B2 (en) System and method for simulating hardware interrupts
US20130339689A1 (en) Later stage read port reduction
US9396095B2 (en) Software verification
US9678792B2 (en) Shared resources in a docked mobile environment
WO2023107789A1 (en) Deterministic replay of a multi-threaded trace on a multi-threaded processor
US10528691B1 (en) Method and system for automated selection of a subset of plurality of validation tests
JP2011253253A (en) Computer testing method, computer testing device and computer testing program
US10133655B1 (en) Emulation of target system using JIT compiler and bypassing translation of selected target code blocks
US10248534B2 (en) Template-based methodology for validating hardware features
GB2552773A (en) Optimisation
US9753776B2 (en) Simultaneous multithreading resource sharing
US10776255B1 (en) Automatic verification of optimization of high level constructs using test vectors
US10949209B2 (en) Techniques for scheduling instructions in compiling source code
US11354126B2 (en) Data processing
KR20120062970A (en) Apparatus and method for analyzing assembly language code

Legal Events

Date Code Title Description
WAP Application withdrawn, taken to be withdrawn or refused ** after publication under section 16(1)