CN114375441A - Reducing compiler type checking cost through thread speculation and hardware transactional memory


Info

Publication number
CN114375441A
CN114375441A
Authority
CN
China
Prior art keywords
compiler output
compiler
code
output
type
Legal status
Pending
Application number
CN201980100216.6A
Other languages
Chinese (zh)
Inventor
张仕宇
丁俊勇
李天佑
M·R·哈格海特
Current Assignee
Intel Corp
Original Assignee
Intel Corp
Application filed by Intel Corp
Publication of CN114375441A


Classifications

    • G06F 8/443: Software engineering; transformation of program code; compilation; encoding; optimisation
    • G06F 8/437: Software engineering; transformation of program code; compilation; checking and contextual analysis; semantic checking; type checking
    • G06F 8/458: Software engineering; transformation of program code; compilation; exploiting coarse grain parallelism, i.e. parallelism between groups of instructions; synchronisation, e.g. post-wait, barriers, locks
    • G06F 9/4552: Program control; abstract machines for program code execution, e.g. Java virtual machine (JVM), interpreters, emulators; runtime code conversion or optimisation; involving translation to a different instruction set architecture, e.g. just-in-time translation in a JVM

Abstract

Systems, apparatuses, and methods may provide techniques for generating a first compiler output based on input code that includes variable information for a dynamic type and generating a second compiler output based on the input code, where the second compiler output includes type check code for verifying one or more type inferences associated with the first compiler output. The techniques may also execute the first compiler output and the second compiler output in parallel via different threads.

Description

Reducing compiler type checking cost through thread speculation and hardware transactional memory
Technical Field
Embodiments are generally related to compiler technology. More particularly, embodiments relate to reducing compiler type checking costs through thread speculation and hardware transactional memory.
Background
Computer programming languages provide for the use of variables to retrieve data, perform operations, output data, and so forth. During compilation of an application written in a given programming language, the types of the variables used in the application may affect certain code optimization decisions made by the compiler. Applications written in dynamically typed languages (such as JAVASCRIPT, PYTHON, and RUBY) typically have variable type information that is known only at runtime (e.g., and is unknown at compile time). The lack of variable type information at compile time may cause the application to run more slowly than an application written in a statically typed language. For example, conventional solutions may compile a dynamically typed application into a generic code path structured to handle strings, floating point numbers, double-precision numbers, and all other variable types, even when a code path optimized for integer addition would likely be more efficient (e.g., because the integer type information is not known at compile time).
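By way of a hedged illustration only, the following C++ sketch contrasts the two strategies; the tagged Value type and the function names are hypothetical stand-ins, not code from any actual engine or embodiment:

    #include <cstdint>
    #include <stdexcept>
    #include <string>

    // Hypothetical tagged value, as a dynamic-language runtime might represent it.
    enum class Tag { Int, Double, String };

    struct Value {
        Tag tag;
        std::int64_t i;
        double d;
        std::string s;
    };

    // Generic code path: branches on every combination of operand types.
    Value generic_add(const Value& a, const Value& b) {
        if (a.tag == Tag::Int && b.tag == Tag::Int)
            return {Tag::Int, a.i + b.i, 0.0, {}};
        if (a.tag == Tag::String && b.tag == Tag::String)
            return {Tag::String, 0, 0.0, a.s + b.s};              // string concatenation
        if (a.tag != Tag::String && b.tag != Tag::String) {
            double x = (a.tag == Tag::Double) ? a.d : static_cast<double>(a.i);
            double y = (b.tag == Tag::Double) ? b.d : static_cast<double>(b.i);
            return {Tag::Double, 0, x + y, {}};                   // mixed numeric addition
        }
        throw std::runtime_error("unsupported operand types");
    }

    // Specialized code path: valid only while the integer inference holds.
    std::int64_t int_add(std::int64_t a, std::int64_t b) { return a + b; }

If the compiler could prove at compile time that both operands are integers, it would emit only the one-line specialized path; for a dynamically typed language it cannot, so the branchy generic path is the conservative default.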
Although just-in-time (JIT) compilation techniques may be combined with type inference and/or speculation to increase execution speed, there is still considerable room for improvement. For example, current approaches for implementing type checking code (e.g., to verify type inferences made during compilation) may introduce processing overhead that offsets the benefits of JIT compilation and type inference/speculation.
Drawings
Various advantages of the embodiments will become apparent to those skilled in the art by reading the following specification and appended claims, and by referencing the following drawings, in which:
FIG. 1 is a comparative block diagram of an example of a conventional compiler execution architecture and a compiler execution architecture according to an embodiment;
FIG. 2 is a block diagram of an example of expected code according to an embodiment;
FIG. 3 is a flow diagram of an example of a method of operating a compiler apparatus according to an embodiment;
FIG. 4 is a flowchart of an example of a method of synchronizing communications between a first compiler output and a second compiler output, according to an embodiment;
FIG. 5 is a comparative illustration of an example of a conventional compiler output and a compiler output according to an embodiment;
FIG. 6 is an illustration of an example of an execution flow when a type check passes according to an embodiment;
FIG. 7 is an illustration of an example of an execution flow when a type check fails according to an embodiment;
FIG. 8 is a block diagram of an example of a performance enhanced computing system according to an embodiment;
FIG. 9 is an illustration of an example of a semiconductor device according to an embodiment;
FIG. 10 is a block diagram of an example of a processor according to an embodiment; and
FIG. 11 is a block diagram of an example of a multiprocessor-based computing system, according to an embodiment.
Detailed Description
Turning now to FIG. 1, a conventional execution architecture 20 is shown in which a processor core 22 ("core 1") executes compiler output 24 (24a-24c), the compiler output 24 including type check code 24a, "expected" code 24b (e.g., including payload code for retrieving data, performing operations, outputting data, etc.), and fallback code 24c. In the illustrated example, the type check code 24a validates one or more type inferences made by a compiler (not shown) to optimize the expected code 24b, which, when executed, generates the result 26. If the illustrated type check code 24a determines that at least one of the type inference(s) is incorrect, the type check code 24a may trigger execution of the fallback code 24c, which generates the result 28 without relying on the incorrect type inference(s).
It is particularly noted that the illustrated type check code 24a, expected code 24b, and fallback code 24c execute sequentially within the same processor core 22 (e.g., within the same thread and/or task). As a result, the conventional execution architecture 20 may execute the compiler output 24 relatively slowly, especially when the type check code 24a has relatively high processing overhead (e.g., due to a large number of branches). Indeed, if the type inference(s) are typically correct, the overhead/cost associated with the type check code 24a is largely unnecessary. In addition, because the type check code 24a is included in the same instruction execution stream as the expected code 24b, the compiler output 24 may consume a relatively large amount of cache space (e.g., first level/L1 instruction cache/I-cache).
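For orientation, the conventional layout reduces to a single-threaded sketch along the following lines (reusing the hypothetical Value type and helpers from the earlier example), with the check always paying its cost up front in the same thread:

    // Single-threaded layout of compiler output 24: the type check runs first
    // on every call, even when the inference is always correct, and all three
    // sections share one thread and one instruction cache.
    Value run_conventional(const Value& a, const Value& b) {
        if (a.tag == Tag::Int && b.tag == Tag::Int)            // type check code 24a
            return {Tag::Int, int_add(a.i, b.i), 0.0, {}};     // expected code 24b
        return generic_add(a, b);                              // fallback code 24c
    }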
In contrast, an enhanced execution architecture 30 may include a first processor core 32 ("core 1") that executes a first compiler output 34 (34a-34b, e.g., in a first thread), and a second processor core 36 that executes a second compiler output 38 (e.g., in a second thread) in parallel with the first compiler output 34. In an embodiment, the first compiler output 34 includes expected code 34a and fallback code 34b, while the second compiler output 38 includes type check code 40. Thus, the type check code 40 may validate one or more type inferences made by the compiler apparatus to optimize the expected code 34a, with the expected code 34a generating the result 42. In the illustrated example, the type check code 40 generates a result 44, the result 44 being synchronized with the result 42 of the expected code 34a via a shared memory address 46. If the illustrated result 44 of the type check code 40 indicates that at least one of the type inference(s) is incorrect, the expected code 34a may trigger execution of the fallback code 34b, which generates the result 48 without relying on the incorrect type inference(s). Thus, if validation of the type inference(s) fails, the fallback code 34b may transition the architecture 30 to an execution state unrelated to the type inference(s).
Indeed, via a unified programming model (such as, for example, ONEAPI), different threads may be dispatched to different hardware compute units in a heterogeneous system. More specifically, the unified programming model may be used to program a wide range of processor types, including CPUs (central processing units), GPUs (graphics processing units), FPGAs (field programmable gate arrays), and dedicated accelerators. For example, because the second compiler output 38 is typically short and contains no heavy computational work, the unified programming model may be used to dedicate the second compiler output 38 to a small CPU core. In contrast, the unified programming model may be used to 1) dedicate the first compiler output 34 to a relatively large CPU core if the first compiler output 34 contains heavy scalar computations; 2) dedicate the first compiler output 34 to the GPU if the first compiler output 34 contains heavy vector computations; 3) dedicate the first compiler output 34 to an Artificial Intelligence (AI) accelerator (e.g., an Intel MOVIDIUS accelerator) if the first compiler output 34 contains heavy matrix computations; or 4) dedicate the first compiler output 34 to the FPGA if the first compiler output 34 contains heavy spatial computations.
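The dispatch policy may be summarized by the following sketch; the enumerations and the pick_device helper are assumptions made for illustration, and a real system would classify tasks from profiling data and hand them off through the unified programming model rather than through a bare switch:

    // Hypothetical classification of a task's dominant computation, as might
    // be derived from profiling; the names are illustrative only.
    enum class ComputeKind { Light, Scalar, Vector, Matrix, Spatial };

    enum class Device { SmallCore, BigCore, Gpu, AiAccelerator, Fpga };

    Device pick_device(ComputeKind kind) {
        switch (kind) {
            case ComputeKind::Light:   return Device::SmallCore;     // e.g., the type check task
            case ComputeKind::Scalar:  return Device::BigCore;       // heavy scalar computations
            case ComputeKind::Vector:  return Device::Gpu;           // heavy vector computations
            case ComputeKind::Matrix:  return Device::AiAccelerator; // heavy matrix computations
            case ComputeKind::Spatial: return Device::Fpga;          // heavy spatial computations
        }
        return Device::BigCore; // unreachable; silences compiler warnings
    }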
Thus, the enhanced execution architecture 30 enables enhanced performance by executing the first compiler output 34 and the second compiler output 38 in parallel via different threads. More specifically, the illustrated expected code 34a is not forced to wait until the operations of the type check code 40 (e.g., which may include a large number of branches) are complete, as in the conventional execution architecture 20. Instead, the expected code 34a may execute speculatively in a separate thread so that the result 42 is generated more quickly, while the type check code 40 is still executing. In fact, experiments and investigations have shown that the dynamically typed V8 JAVASCRIPT engine achieves significant performance improvements without any hardware changes. The enhanced performance may translate into higher responsiveness and/or smoothness of applications (e.g., web applications, NODEJS applications), as well as an improved user experience. Moreover, because the type check code 40 is isolated in a separate thread, a system with a relatively small instruction cache (I-cache) may achieve a level of locality and performance that is on par with a system having a relatively large instruction cache.
FIG. 2 shows an example of expected code 50 (50a-50c) that may be readily substituted for the expected code 34a (FIG. 1). In the illustrated example, the expected code 50 includes: transactional execution code 50a having a section for loading data from memory and operating on the loaded data; spin lock code 50b to place the first compiler output in a wait state until verification of the type inference(s) is confirmed by the second compiler output; and memory store code 50c to store one or more results of the transactional execution code 50a to memory. The transactional execution code 50a may also include code for aborting the tasks (e.g., loads, operations) of the expected code 50 in the event of a verification failure for the type inference(s).
In this regard, Restricted Transactional Memory (RTM) may provide a software interface for transactional execution. In an embodiment, RTM provides three instructions, XBEGIN, XEND, and XABORT, to start, commit, and abort transactional execution, respectively. The XBEGIN instruction may be used to specify the start of a transactional code region, and the XEND instruction may be used to specify the end of the transactional code region. The processor may abort RTM transactional execution for a number of reasons (e.g., a type check failure). In such cases, the hardware automatically detects the transactional abort condition, resumes execution from the fallback instruction address with an architectural state corresponding to the state at the start of the XBEGIN execution, and updates the EAX register to describe the abort status. Accordingly, the XABORT instruction enables execution of an RTM region to be explicitly aborted. In an embodiment, the XABORT instruction takes an 8-bit immediate operand that is loaded into the EAX register and thus becomes available to software after an RTM abort.
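As a minimal sketch of how the expected code might combine these instructions with the spin lock and memory store described above, assuming the RTM intrinsics exposed through <immintrin.h> (e.g., compiled with -mrtm on a TSX-capable processor); the flag names, the integer payload, and the abort code 0x01 are illustrative assumptions, not generated code from any embodiment:

    #include <immintrin.h>
    #include <atomic>

    // Flags written by the type check thread; names and encoding are assumed.
    std::atomic<int> g_check_failed{0};  // set only when a type check fails (MUTEX1 role)
    std::atomic<int> g_check_done{0};    // set on failure or successful completion (MUTEX2 role)

    long fallback_add(long a, long b) { return a + b; }  // placeholder fallback code

    long expected_task(long a, long b, long* out) {
        unsigned status = _xbegin();                 // XBEGIN: enter the transactional region
        if (status == _XBEGIN_STARTED) {
            long result = a + b;                     // speculative, integer-specialized payload
            // Loading the flag adds it to the transaction's read set, so a failure
            // written by the checker before commit aborts the transaction in hardware.
            if (g_check_failed.load(std::memory_order_relaxed))
                _xabort(0x01);                       // XABORT: explicit abort, code lands in EAX
            _xend();                                 // XEND: commit the transaction
            while (!g_check_done.load(std::memory_order_acquire))
                _mm_pause();                         // spin lock until all type checks finish
            if (g_check_failed.load(std::memory_order_acquire))
                return fallback_add(a, b);           // failure arrived after commit
            *out = result;                           // memory store only after the spin lock
            return result;
        }
        return fallback_add(a, b);                   // transaction aborted: take fallback path
    }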
FIG. 3 illustrates a method 60 of operating a compiler apparatus. The method 60 may generally be implemented in an execution architecture such as, for example, the enhanced execution architecture 30 (FIG. 1). More specifically, the method 60 may be implemented as one or more modules as a set of logic instructions stored in a machine- or computer-readable storage medium such as Random Access Memory (RAM), Read Only Memory (ROM), programmable ROM (PROM), firmware, flash memory, etc.; in configurable logic such as, for example, Programmable Logic Arrays (PLAs), Field Programmable Gate Arrays (FPGAs), or Complex Programmable Logic Devices (CPLDs); in fixed-function logic hardware using circuit technology such as, for example, Application Specific Integrated Circuit (ASIC), Complementary Metal Oxide Semiconductor (CMOS), or Transistor-Transistor Logic (TTL) technology; or in any combination thereof.
For example, computer program code for carrying out operations shown in the method 60 may be written in any combination of one or more programming languages, including an object oriented programming language such as JAVA, SMALLTALK, C++, or the like, as well as conventional procedural programming languages, such as the "C" programming language or similar programming languages. Additionally, logic instructions may include assembler instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, state setting data, configuration data for an integrated circuit, or state information to personalize electronic circuitry and/or other structural components native to hardware (e.g., host processor, central processing unit/CPU, microcontroller, etc.).
The illustrated processing block 62 provides for generating a first compiler output based on input code that includes variable information of a dynamic type. In an embodiment, the input code is written in a language such as, for example, JAVASCRIPT, PYTHON, RUBY, etc., where the variable type information is unknown at compile time. Thus, block 62 may include making one or more type inferences and optimizing the first compiler output based on the type inference(s). For example, block 62 may include inferring that a portion of the input code involves integer addition operations and creating a code path in the first compiler output that is customized/optimized for the integer addition operations. At illustrated block 64, a second compiler output is generated, wherein the second compiler output includes type check code to verify the type inference(s) associated with the first compiler output.
In addition, block 66 executes (e.g., at runtime) the first compiler output and the second compiler output in parallel via different threads. Thus, block 66 may include executing the first compiler output in a first thread running on a first processor core and executing the second compiler output in a second thread running on a second processor core. As already mentioned, the threads may be dispatched to the appropriate hardware compute units via, for example, ONEAPI dynamic dispatch, which interfaces with the executing processor. Block 66 may also include synchronizing communications between the first compiler output and the second compiler output via one or more shared memory objects (e.g., in a shared memory address). Additionally, the first compiler output may include: transactional execution code to immediately abort one or more tasks of the first compiler output if the validation of the type inference(s) fails; spin lock code to place the first compiler output in a wait state until verification of the type inference(s) is confirmed by the second compiler output; and memory store code to store one or more results of the transactional execution code to memory. In an embodiment, the first compiler output further comprises fallback code to transition the compiler apparatus to an execution state unrelated to the type inference(s) if the validation of the type inference(s) fails.
Thus, the illustrated method 60 enables enhanced performance by executing the first compiler output and the second compiler output in parallel via different threads. More specifically, the expected code is not forced to wait until the operations of the type check code (e.g., which may be relatively "multi-branched") are complete. Instead, the expected code executes speculatively in a separate thread while the type check code executes. Indeed, the enhanced performance may translate into better responsiveness and/or smoothness of applications and an improved user experience. Moreover, because the type check code is isolated in a separate thread, a system with a relatively small instruction cache may achieve a level of locality and performance that is on par with a system having a relatively large instruction cache. Additionally, the use of transactional execution code (e.g., when the expected code is dispatched to a CPU core) may further enhance performance by enabling the tasks of the expected code to be immediately aborted upon detection of a type check failure.
FIG. 4 illustrates a method 70 of synchronizing communications between a first compiler output and a second compiler output. The method 70 may generally be incorporated into block 66 (FIG. 3), already discussed. More specifically, the method 70 may be implemented as one or more modules as a set of logic instructions stored in a machine- or computer-readable storage medium such as RAM, ROM, PROM, firmware, flash memory, etc.; in configurable logic such as, for example, PLAs, FPGAs, or CPLDs; in fixed-function logic hardware using circuit technology such as, for example, ASIC, CMOS, or TTL technology; or in any combination thereof.
Illustrated processing block 72 provides for notifying transactional execution code in the first compiler output via a first shared memory object (e.g., in a shared memory address) whether verification of the type inference(s) has been successfully performed by the second compiler output. Block 74 notifies spin lock code in the first compiler output via a second shared memory object whether verification of the type inference(s) has been successfully performed by the second compiler output. As already mentioned, the spin lock code may place the first compiler output in a wait state until verification of the type inference(s) is confirmed by the second compiler output. Thus, the illustrated method 70 further enhances performance by making communication between the first compiler output and the second compiler output more efficient.
FIG. 5 illustrates a conventional compiler output 80 and an enhanced compiler output 82 from a JIT compilation flow 86 that optimizes (e.g., based on type inference/speculation) input code 84 written in a dynamically typed programming language. In the illustrated example, a first shared memory object (MUTEX1) is used to notify the transactional execution code in the "expected" task (e.g., the first compiler output) whether verification of the type inference(s) by the "type check" task (e.g., the second compiler output) has succeeded. Additionally, a second shared memory object (MUTEX2) may be used to notify the spin lock code in the expected task whether verification of the type inference(s) by the type check task has succeeded. In the illustrated example, the type check task updates the MUTEX1 object only in response to a failed type check and updates the MUTEX2 object in response to either failure or successful completion of the type checks.
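A thread-level sketch of this protocol is shown below, with std::atomic flags standing in for the MUTEX1/MUTEX2 shared memory objects (the names, flag types, and placeholder payload are assumptions; the figures do not provide concrete code):

    #include <atomic>
    #include <thread>

    std::atomic<bool> mutex1{false};  // written only when a type check fails
    std::atomic<bool> mutex2{false};  // written on failure or successful completion

    void type_check_task(bool inferences_hold) {
        if (!inferences_hold)
            mutex1.store(true, std::memory_order_release);  // failure: update MUTEX1
        mutex2.store(true, std::memory_order_release);      // always: update MUTEX2
    }

    void expected_task() {
        long result = 40 + 2;                               // placeholder speculative payload
        while (!mutex2.load(std::memory_order_acquire))     // spin lock code
            std::this_thread::yield();
        if (mutex1.load(std::memory_order_acquire))
            return;                                         // fallback code would run here
        (void)result;                                       // memory store code would write result here
    }

    int main() {
        std::thread checker(type_check_task, /*inferences_hold=*/true);
        std::thread expected(expected_task);
        checker.join();
        expected.join();
    }

Updating MUTEX1 only on failure means that, in the common passing case, the checker never writes a location the expected task's transaction has read, so the transaction can commit without a conflict abort.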
FIG. 6 illustrates an execution flow 90 that may occur when the type checks pass (e.g., complete successfully). If all type checks pass, the type check task only updates the MUTEX2 object. The expected operations may be executed transactionally and committed successfully. Moreover, because the type checks are typically faster than the expected operations, the spin lock code is typically bypassed with little overhead (e.g., the "pass before transaction commit" case). If the type checks are slower than the expected operations, the spin lock code maintains the expected task in a wait state (e.g., causing the spin lock "S" to execute longer) until all type checks are complete and the result is obtained from the type check task via the MUTEX2 object (e.g., the "pass after transaction commit" case). In the illustrated example, the memory store is performed only after the spin lock code.
FIG. 7 illustrates an execution flow 100 that may occur when a type check fails. If a type check fails, the illustrated type check task updates both the MUTEX1 object and the MUTEX2 object. If the update occurs before the transaction in the expected task commits (e.g., the "fail before transaction commit" case), the failure causes the transaction to roll back and abort immediately. Alternatively, if the type check failure occurs after the transaction commits (e.g., the "fail after transaction commit" case), the transaction commits, but the spin lock code maintains the expected code in a wait state until the type check result is received. The fallback code then executes after the expected task obtains the type check failure result via the MUTEX2 object.
Turning now to FIG. 8, a performance enhanced computing system 150 is illustrated. The system 150 may generally be an electronic device/system with computing functionality (e.g., personal digital assistant/PDA, notebook computer, tablet computer, convertible tablet, server), communication functionality (e.g., smart phone), imaging functionality (e.g., camera, camcorder), media playing functionality (e.g., smart television/TV), wearable functionality (e.g., watch, eyewear, headwear, footwear, jewelry), in-vehicle functionality (e.g., car, truck, motorcycle), robotic functionality (e.g., autonomous robot), etc., or any combination thereof. In the illustrated example, the system 150 includes a host processor 152 (e.g., central processing unit/CPU) with an Integrated Memory Controller (IMC) 154 that is coupled to a system memory 156. In an embodiment, at least a portion of the system memory 156 operates as hardware Restricted Transactional Memory (RTM).
The illustrated system 150 also includes an Input Output (IO) module 158, the IO module 158 being implemented as a system on chip (SoC) on a semiconductor die 162 with the host processor 152 and the graphics processor 160. The illustrated IO module 158 communicates with, for example, a display 164 (e.g., touchscreen, liquid crystal display/LCD, light emitting diode/LED display), a network controller 166 (e.g., wired and/or wireless NIC), and mass storage 168 (e.g., hard disk drive/HDD, optical disk, solid state drive/SSD, flash memory). In an embodiment, the network controller 166 obtains input codes including variable information of a dynamic type.
In an embodiment, the host processor 152, the graphics processor 160, and/or the IO module 158 execute program instructions 170 retrieved from the system memory 156 and/or the mass storage 168 to perform one or more aspects of the method 60 (FIG. 3) and/or the method 70 (FIG. 4), already discussed. Thus, execution of the illustrated instructions 170 may cause the computing system 150 to generate a first compiler output based on the input code and generate a second compiler output based on the input code, wherein the second compiler output includes type check code to verify one or more type inferences associated with the first compiler output. Thus, the semiconductor die 162 may operate as a compiler apparatus (e.g., a JIT compiler that compiles at runtime).
Additionally, execution of the program instructions 170 may cause the computing system 150 to execute (e.g., at runtime) the first compiler output and the second compiler output in parallel via different threads. In an example, the semiconductor die 162 includes the enhanced execution architecture 30 (FIG. 1), already discussed. Thus, to execute the first compiler output and the second compiler output in parallel, the program instructions 170 may execute the first compiler output in a first thread running on a first processor core of the semiconductor die 162 and execute the second compiler output in a second thread running on a second processor core of the semiconductor die 162.
In one example, the first compiler output includes: transactional execution code to abort one or more tasks of the first compiler output in the event of a validation failure of the one or more type inferences; spin lock code to place the first compiler output in a wait state until verification of the one or more type inferences is confirmed by the second compiler output; and memory store code to store one or more results of the transactional execution code to the system memory 156 and/or the mass storage 168. The first compiler output may also include fallback code to transition the computing system 150 to an execution state unrelated to the type inference(s) if validation of the type inference(s) fails.
Moreover, execution of the program instructions 170 may cause the computing system 150 to synchronize communication between the first compiler output and the second compiler output via one or more shared memory objects. In such cases, the program instructions 170 may notify the transactional execution code in the first compiler output via a first shared memory object whether verification of the type inference(s) has been successfully performed by the second compiler output. Additionally, the program instructions 170 may notify the spin lock code in the first compiler output via a second shared memory object whether verification of the type inference(s) has been successfully performed by the second compiler output.
As already mentioned, the first compiler output (e.g., including the expected task) and the second compiler output (e.g., including the type check task) may be allocated to different hardware compute units via ONEAPI according to the profiled computation type of each task. For example, because the type check task is known to be short and to contain no heavy computational work, the second compiler output may be dedicated to a small core (not shown) of the host processor 152. Additionally, the first compiler output (e.g., containing the expected task) may be dispatched as follows: 1) if the expected task contains heavy scalar computations, the first compiler output is dispatched to a relatively large core (not shown) of the host processor 152; 2) if the expected task involves heavy vector computations, the first compiler output is dispatched to the graphics processor 160; 3) if the expected task involves heavy matrix computations, the first compiler output is dispatched to an AI accelerator (not shown, e.g., an Intel MOVIDIUS accelerator); or 4) if the expected task involves heavy spatial computations, the first compiler output is dispatched to an FPGA (not shown). With the compiler separating the tasks and a unified programming model (such as ONEAPI) dispatching them dynamically, all available hardware resources are fully utilized, thereby improving power savings and performance.
Thus, the illustrated computing system 150 is considered performance enhanced at least in the sense that it executes the first compiler output and the second compiler output in parallel via different threads. More specifically, the expected code is not forced to wait until the operations of the type check code (e.g., which may be relatively multi-branched) are complete. Instead, the expected code executes speculatively in a separate thread while the type check code executes. Indeed, the enhanced performance may translate into better responsiveness and/or smoothness of applications and an improved user experience. Moreover, because the type check code is isolated in a separate thread, execution of the program instructions 170 may achieve a level of locality and performance on par with systems having relatively large instruction caches, even where the computing system 150 has a relatively small instruction cache. Additionally, the use of transactional execution code may further enhance performance by causing the tasks of the expected code to be immediately aborted upon detection of a type check failure.
FIG. 9 illustrates a semiconductor device 172 (e.g., chip, die, package). The illustrated device 172 includes one or more substrates 174 (e.g., silicon, sapphire, gallium arsenide) and logic 176 (e.g., transistor arrays and other integrated circuit/IC components) coupled to the substrate(s) 174. In an example, the logic 176 implements one or more aspects of the method 60 (FIG. 3) and/or the method 70 (FIG. 4), already discussed. Thus, the logic 176 may generate a first compiler output based on the input code and generate a second compiler output based on the input code, wherein the second compiler output includes type check code to verify one or more type inferences associated with the first compiler output. Thus, the semiconductor device 172 may operate as a compiler apparatus (e.g., a JIT compiler engine).
Logic 176 may be at least partially implemented in configurable logic or fixed function hardware logic. In one example, logic 176 includes transistor channel regions positioned (e.g., embedded) within substrate(s) 174. Thus, the interface between the logic 176 and the substrate(s) 174 may not be an abrupt junction. Logic 176 may also be considered to include epitaxial layers grown on an initial wafer of substrate(s) 174.
Because the semiconductor device 172 executes the first compiler output and the second compiler output in parallel via different threads, the semiconductor device 172 is considered to be performance enhanced. More specifically, the expected code is not forced to wait until the operations of the type check code (e.g., which may be relatively multi-branched) are complete. Instead, the expected code executes speculatively in a separate thread while the type check code executes. Indeed, the enhanced performance may translate into better responsiveness and/or smoothness of applications and an improved user experience. Moreover, because the type check code is isolated in a separate thread, the logic 176 may achieve a level of locality and performance on par with systems having relatively large instruction caches, even where the semiconductor device 172 has a relatively small instruction cache. Additionally, the use of transactional execution code may further enhance performance by causing the tasks of the expected code to be immediately aborted upon detection of a type check failure.
FIG. 10 illustrates a processor core 200 according to one embodiment. The processor core 200 may be the core for any type of processor, such as a microprocessor, an embedded processor, a Digital Signal Processor (DSP), a network processor, or other device to execute code. Although only one processor core 200 is illustrated in FIG. 10, a processing element may alternatively include more than one of the processor core 200 illustrated in FIG. 10. The processor core 200 may be a single-threaded core or, for at least one embodiment, may be multithreaded in that it may include more than one hardware thread context (or "logical processor") per core.
FIG. 10 also illustrates a memory 270 coupled to the processor core 200. The memory 270 may be any of a wide variety of memories (including various layers of a memory hierarchy) known or otherwise available to those of skill in the art. The memory 270 may include one or more instructions of code 213 to be executed by the processor core 200, wherein the code 213 may implement the method 60 (FIG. 3) and/or the method 70 (FIG. 4), already discussed. The processor core 200 follows a program sequence of instructions indicated by the code 213. Each instruction may enter a front-end portion 210 and be processed by one or more decoders 220. The decoder 220 may generate as its output a micro-operation, such as a fixed-width micro-operation in a predefined format, or may generate other instructions, micro-instructions, or control signals that reflect the original code instruction. The illustrated front-end portion 210 also includes register renaming logic 225 and scheduling logic 230, which generally allocate resources and queue operations corresponding to the converted instructions for execution.
Processor core 200 is shown to include execution logic 250 having a set of execution units 255-1 through 255-N. Some embodiments may include several execution units dedicated to a particular function or set of functions. Other embodiments may include only one execution unit or one execution unit that may perform a particular function. The illustrated execution logic 250 performs the operations specified by the code instructions.
After completing execution of the operation specified by the code instruction, back-end logic 260 retires the instruction of code 213. In one embodiment, processor core 200 allows out-of-order execution but requires in-order retirement of instructions. Retirement logic 265 may take various forms (e.g., reorder buffers, etc.) as known to those skilled in the art. In this manner, processor core 200 is transformed during execution of code 213, at least in terms of the outputs generated by the decoder, the hardware registers and tables utilized by register renaming logic 225, and any registers (not shown) modified by execution logic 250.
Although not illustrated in fig. 10, the processing elements may include other elements on a chip with processor core 200. For example, a processing element may include memory control logic in conjunction with processor core 200. The processing element may include I/O control logic and/or may include I/O control logic integrated with memory control logic. The processing element may also include one or more caches.
Referring now to FIG. 11, shown is a block diagram of an embodiment of a computing system 1000 according to an embodiment. Shown in FIG. 11 is a multiprocessor system 1000 that includes a first processing element 1070 and a second processing element 1080. Although two processing elements 1070 and 1080 are shown, it is to be understood that an embodiment of the system 1000 may also include only one such processing element.
The system 1000 is illustrated as a point-to-point interconnect system, wherein the first processing element 1070 and the second processing element 1080 are coupled via a point-to-point interconnect 1050. It should be understood that any or all of the interconnects illustrated in FIG. 11 may be implemented as a multi-drop bus rather than point-to-point interconnects.
As shown in FIG. 11, each of the processing elements 1070 and 1080 may be a multicore processor, including first and second processor cores (i.e., processor cores 1074a and 1074b, and processor cores 1084a and 1084b). Such cores 1074a, 1074b, 1084a, 1084b may be configured to execute instruction code in a manner similar to that discussed above in connection with FIG. 10.
Each processing element 1070, 1080 may include at least one shared cache 1896a, 1896b. The shared caches 1896a, 1896b may store data (e.g., instructions) that are utilized by one or more components of the processors, such as the cores 1074a, 1074b and 1084a, 1084b, respectively. For example, the shared caches 1896a, 1896b may locally cache data stored in the memories 1032, 1034 for faster access by components of the processors. In one or more embodiments, the shared caches 1896a, 1896b may include one or more intermediate level caches (such as level 2 (L2), level 3 (L3), level 4 (L4), or other levels of cache), Last Level Caches (LLC), and/or combinations thereof.
Although shown with only two processing elements 1070, 1080, it is to be understood that the scope of the embodiments is not so limited. In other embodiments, one or more additional processing elements may be present in a given processor. Alternatively, one or more of the processing elements 1070, 1080 may be an element other than a processor, such as an accelerator or a field programmable gate array. For example, the additional processing element(s) may include additional processor(s) that are the same as first processor 1070, additional processor(s) that are heterogeneous or asymmetric to first processor 1070, accelerators (such as, for example, graphics accelerators or Digital Signal Processing (DSP) units), field programmable gate arrays, or any other processing element. There may be various differences between the processing elements 1070, 1080 in terms of a range of quality metrics including architectural, microarchitectural, thermal, power consumption characteristics, and so forth. These differences may manifest themselves effectively as asymmetries and heterogeneity amongst the processing elements 1070, 1080. For at least one embodiment, the processing elements 1070, 1080 may reside in the same die package.
The first processing element 1070 may further include memory controller logic (MC) 1072 and point-to-point (P-P) interfaces 1076 and 1078. Similarly, the second processing element 1080 may include an MC 1082 and P-P interfaces 1086 and 1088. As shown in FIG. 11, the MCs 1072 and 1082 couple the processors to respective memories, namely a memory 1032 and a memory 1034, which may be portions of main memory locally attached to the respective processors. Although the MC 1072 and the MC 1082 are illustrated as being integrated into the processing elements 1070, 1080, for alternative embodiments the MC logic may be discrete logic external to the processing elements 1070, 1080 rather than integrated therein.
First processing element 1070 and second processing element 1080 may be coupled to I/O subsystem 1090 via P-P interconnects 1076, 1086, respectively. As shown in FIG. 11, I/O subsystem 1090 includes P-P interfaces 1094 and 1098. In addition, the I/O subsystem 1090 includes an interface 1092 that couples the I/O subsystem 1090 with a high performance graphics engine 1038. In one embodiment, bus 1049 may be used to couple graphics engine 1038 to I/O subsystem 1090. Alternatively, a point-to-point interconnect may couple these components.
In turn, I/O subsystem 1090 may be coupled to a first bus 1016 via an interface 1096. In one embodiment, first bus 1016 may be a Peripheral Component Interconnect (PCI) bus, or may be a bus such as a PCI Express (PCI Express) bus or another third generation I/O interconnect bus, although the scope of the embodiments is not so limited.
As shown in FIG. 11, various I/O devices 1014 (e.g., biometric scanners, speakers, cameras, sensors) may be coupled to the first bus 1016, along with a bus bridge 1018, which may couple the first bus 1016 to a second bus 1020. In one embodiment, the second bus 1020 may be a Low Pin Count (LPC) bus. In one embodiment, various devices may be coupled to the second bus 1020 including, for example, a keyboard/mouse 1012, communication device(s) 1026, and a data storage unit 1019 (such as a disk drive or other mass storage device) which may include code 1030. The illustrated code 1030 may implement the method 60 (FIG. 3) and/or the method 70 (FIG. 4), already discussed, and may be similar to the code 213 (FIG. 10), already discussed. Further, an audio I/O 1024 may be coupled to the second bus 1020, and a battery 1010 may provide power to the computing system 1000.
Note that other embodiments are contemplated. For example, instead of the point-to-point architecture of FIG. 11, a system may implement a multi-drop bus or another such communication topology. Also, the elements of FIG. 11 may alternatively be partitioned using more or fewer integrated chips than shown in FIG. 11.
Additional description and examples:
example 1 includes a performance enhanced computing system comprising: a network controller for obtaining an input code including variable information of a dynamic type; a processor coupled to the network controller; and a memory coupled to the processor, the memory including a set of instructions that, when executed by the processor, cause the computing system to: the method includes generating a first compiler output based on the input code, generating a second compiler output based on the input code, wherein the second compiler output includes type check code to verify one or more type inferences associated with the first compiler output, and executing the first compiler output and the second compiler output in parallel via different threads.
Example 2 includes the computing system of example 1, wherein the instructions, when executed, further cause the computing system to synchronize communication between the first compiler output and the second compiler output via the one or more shared memory objects.
Example 3 includes the computing system of example 2, wherein to synchronize the communications, the instructions, when executed, cause the computing system to: notifying transactional execution code in the first compiler output via the first shared memory object whether verification of the one or more type inferences has been successfully performed by the second compiler output; and notifying, via the second shared memory object, spin lock code in the first compiler output whether verification of the one or more type inferences has been successfully performed by the second compiler output.
Example 4 includes the computing system of example 1, wherein the first compiler output is to include: transactional execution code to abort one or more tasks of the first compiler output if validation of the one or more type inferences fails; spin lock code to place the first compiler output in a wait state until verification of the one or more type inferences is confirmed by the second compiler output; and memory store code to store one or more results of the transactional execution code to the memory.
Example 5 includes the computing system of example 4, wherein the first compiler output further includes fallback code to transition the computing system to an execution state unrelated to the one or more type inferences if validation of the one or more type inferences fails.
Example 6 includes the computing system of any of examples 1-5, further comprising: a first processor core and a second processor core, wherein to execute the first compiler output and the second compiler output in parallel, the instructions, when executed, cause the computing system to: executing a first compiler output in a first thread running on a first processor core; and executing the second compiler output in a second thread running on the second processor core.
Example 7 includes a semiconductor device comprising: one or more substrates; and logic coupled to the one or more substrates, wherein the logic is at least partially implemented in one or more of configurable logic or fixed-function hardware logic, the logic coupled to the one or more substrates to: generate a first compiler output based on input code that includes variable information of a dynamic type; generate a second compiler output based on the input code, wherein the second compiler output includes type check code to verify one or more type inferences associated with the first compiler output; and execute the first compiler output and the second compiler output in parallel via different threads.
Example 8 includes the semiconductor device of example 7, wherein the logic coupled to the one or more substrates is to synchronize communication between the first compiler output and the second compiler output via the one or more shared memory objects.
Example 9 includes the semiconductor device of example 8, wherein the logic coupled to the one or more substrates is to: notifying transactional execution code in the first compiler output via the first shared memory object whether verification of the one or more type inferences has been successfully performed by the second compiler output; and notifying, via the second shared memory object, spin lock code in the first compiler output whether verification of the one or more type inferences has been successfully performed by the second compiler output.
Example 10 includes the semiconductor device of example 7, wherein the first compiler output is to include: transactional execution code to abort one or more tasks of the first compiler output if validation of the one or more type inferences fails; spin lock code to place the first compiler output in a wait state until verification of the one or more type inferences is confirmed by the second compiler output; and memory store code to store one or more results of the transactional execution code to the memory.
Example 11 includes the semiconductor device of example 10, wherein the first compiler output further includes fallback code to transition the semiconductor device to an execution state unrelated to the one or more type inferences if validation of the one or more type inferences fails.
Example 12 includes the semiconductor device of any one of examples 7 to 11, wherein the logic coupled to the one or more substrates is to: executing a first compiler output in a first thread running on a first processor core; and executing the second compiler output in a second thread running on the second processor core.
Example 13 includes at least one computer-readable storage medium comprising a set of instructions that, when executed by a computing system, cause the computing system to: generating a first compiler output based on an input code comprising variable information of a dynamic type; generating a second compiler output based on the input code, wherein the second compiler output includes type-checking code for verifying one or more type inferences associated with the first compiler output; and executing the first compiler output and the second compiler output in parallel via different threads.
Example 14 includes the at least one computer-readable storage medium of example 13, wherein the instructions, when executed, further cause the computing system to synchronize communication between the first compiler output and the second compiler output via the one or more shared memory objects.
Example 15 includes the at least one computer-readable storage medium of example 14, wherein to synchronize the communications, the instructions, when executed, cause the computing system to: notify transactional execution code in the first compiler output via a first shared memory object whether verification of the one or more type inferences has been successfully performed by the second compiler output; and notify spin lock code in the first compiler output via a second shared memory object whether verification of the one or more type inferences has been successfully performed by the second compiler output.
Example 16 includes the at least one computer-readable storage medium of example 13, wherein the first compiler output includes: transactional execution code to abort one or more tasks of the first compiler output if validation of the one or more type inferences fails; spin lock code to place the first compiler output in a wait state until verification of the one or more type inferences is confirmed by the second compiler output; and memory store code to store one or more results of the transactional execution code to the memory.
Example 17 includes the at least one computer-readable storage medium of example 16, wherein the first compiler output further includes fallback code to transition the computing system to an execution state unrelated to the one or more type inferences if validation of the one or more type inferences fails.
Example 18 includes the at least one computer-readable storage medium of any one of examples 13 to 17, wherein to execute the first compiler output and the second compiler output in parallel, the instructions, when executed, cause the computing system to: executing a first compiler output in a first thread running on a first processor core; and executing the second compiler output in a second thread running on the second processor core.
Example 19 includes a method of operating a just-in-time (JIT) compiler apparatus, the method comprising: generating a first compiler output based on an input code comprising variable information of a dynamic type; generating a second compiler output based on the input code, wherein the second compiler output includes type-checking code for verifying one or more type inferences associated with the first compiler output; and executing the first compiler output and the second compiler output in parallel via different threads.
Example 20 includes the method of example 19, further comprising: communication between the first compiler output and the second compiler output is synchronized via one or more shared memory objects.
Example 21 includes the method of example 20, wherein synchronizing the communications comprises: notifying transactional execution code in the first compiler output via a first shared memory object whether verification of the one or more type inferences has been successfully performed by the second compiler output; and notifying spin lock code in the first compiler output via a second shared memory object whether verification of the one or more type inferences has been successfully performed by the second compiler output.
Example 22 includes the method of example 19, wherein the first compiler output includes: transactional execution code to abort one or more tasks of the first compiler output if validation of the one or more type inferences fails; spin lock code to place the first compiler output in a wait state until verification of the one or more type inferences is confirmed by the second compiler output; and memory store code to store one or more results of the transactional execution code to the memory.
Example 23 includes the method of example 22, wherein the first compiler output further includes fallback code to transition the compiler apparatus to an execution state unrelated to the one or more type inferences if validation of the one or more type inferences fails.
Example 24 includes the method of any one of examples 19 to 23, wherein executing the first compiler output and the second compiler output in parallel includes: executing a first compiler output in a first thread running on a first processor core; and executing the second compiler output in a second thread running on the second processor core.
Thus, the techniques described herein improve the performance of JAVASCRIPT and other dynamically typed languages on multi-core platforms with transactional memory. The techniques also extend the advantage of, or bridge the performance gap with, systems having larger instruction caches, because moving multi-branch type checking into a separate task greatly reduces the instruction cache pressure of the expected task (e.g., benefiting the performance of platforms with smaller instruction caches). Moreover, the techniques may improve the user experience in terms of the responsiveness, smoothness, etc. of web applications and NODEJS-based applications.
Embodiments are applicable for use with all types of semiconductor integrated circuit ("IC") chips. Examples of such IC chips include, but are not limited to, processors, controllers, chipset components, Programmable Logic Arrays (PLAs), memory chips, network chips, system on chip (SoC), SSD/NAND controller ASICs, and the like. Additionally, in some of the figures, signal conductors are represented by lines. Some lines may be different to indicate more constituent signal paths, may have a number label to indicate the number of constituent signal paths, and/or may have arrows at one or more ends to indicate primary information flow direction. However, this should not be construed in a limiting manner. Rather, such added detail may be used in connection with one or more exemplary embodiments to facilitate easier understanding of a circuit. Any represented signal lines, whether or not having additional information, may actually comprise one or more signals that may travel in multiple directions and be implemented with any suitable type of signal scheme, such as digital or analog lines implemented with differential pairs, fiber optic lines, and/or single-ended lines.
Example sizes/models/values/ranges may have been given, but embodiments are not limited thereto. As manufacturing techniques (e.g., photolithography) mature over time, it is expected that devices of smaller size could be manufactured. In addition, well known power/ground connections to IC chips and other components may or may not be shown within the FIGS, for simplicity of illustration and discussion, and to avoid obscuring aspects of the embodiments. Moreover, to avoid obscuring embodiments, and also in view of the fact that specifics with respect to implementation of such block diagram arrangements are highly dependent upon the platform within which the embodiments are to be implemented (i.e., such specifics should be well within purview of one skilled in the art), the arrangements may be shown in rudimentary form. Where specific details (e.g., circuits) are set forth in order to describe example embodiments, it should be apparent to one skilled in the art that the embodiments can be practiced without, or with variation of, these specific details. The description is thus to be regarded as illustrative instead of limiting.
The term "coupled" may be used herein to refer to any type of direct or indirect relationship between the components in question, and may apply to electrical, mechanical, fluidic, optical, electromagnetic, electromechanical or other connections. In addition, the terms "first," "second," and the like may be used herein only for ease of discussion, and do not have a specific temporal or chronological meaning unless otherwise stated.
As used in this application and in the claims, a list of items joined by the term "one or more of" may mean any combination of the listed items. For example, the phrase "one or more of A, B or C" may mean A; B; C; A and B; A and C; B and C; or A, B and C.
Those skilled in the art will appreciate from the foregoing description that the broad techniques of the embodiments can be implemented in a variety of forms. Therefore, while the embodiments have been described in connection with particular examples thereof, the true scope of the embodiments should not be so limited since other modifications will become apparent to the skilled practitioner upon a study of the drawings, the specification and the following claims.

Claims (24)

1. A performance enhanced computing system, comprising:
a network controller for obtaining an input code including variable information of a dynamic type;
a processor coupled to the network controller; and
a memory coupled to the processor, the memory including a set of instructions that, when executed by the processor, cause the computing system to:
generating a first compiler output based on the input code,
generating a second compiler output based on the input code, wherein the second compiler output comprises type check code for verifying one or more type inferences associated with the first compiler output, and
executing the first compiler output and the second compiler output in parallel via different threads.
2. The computing system of claim 1, wherein the instructions, when executed, further cause the computing system to synchronize communication between the first compiler output and the second compiler output via one or more shared memory objects.
3. The computing system of claim 2, wherein to synchronize the communications, the instructions, when executed, cause the computing system to:
notifying transactional execution code in the first compiler output via a first shared memory object whether validation of the one or more type inferences has been successfully performed by the second compiler output; and
notifying spin lock code in the first compiler output via a second shared memory object whether verification of the one or more type inferences has been successfully performed by the second compiler output.
4. The computing system of claim 1, wherein the first compiler output is to comprise: transactional execution code to abort one or more tasks of the first compiler output if validation of the one or more type inferences fails; spin lock code to place the first compiler output in a wait state until verification of the one or more type inferences is confirmed by the second compiler output; and memory store code to store one or more results of the transactional execution code to the memory.
5. The computing system of claim 4, wherein the first compiler output further comprises fallback code to transition the computing system to an execution state unrelated to the one or more type inferences if validation of the one or more type inferences fails.
6. The computing system of any of claims 1 to 5, further comprising a first processor core and a second processor core, wherein to execute the first compiler output and the second compiler output in parallel, the instructions, when executed, cause the computing system to:
executing the first compiler output in a first thread running on the first processor core; and
executing the second compiler output in a second thread running on the second processor core.
7. A semiconductor device, comprising:
one or more substrates; and
logic coupled to the one or more substrates, wherein the logic is at least partially implemented in one or more of configurable logic or fixed function hardware logic, the logic coupled to the one or more substrates to:
generating a first compiler output based on input code that includes variable information of a dynamic type,
generating a second compiler output based on the input code, wherein the second compiler output comprises type check code for verifying one or more type inferences associated with the first compiler output, and
executing the first compiler output and the second compiler output in parallel via different threads.
8. The semiconductor device of claim 7, wherein the logic coupled to the one or more substrates is to synchronize communication between the first compiler output and the second compiler output via one or more shared memory objects.
9. The semiconductor device of claim 8, wherein the logic coupled to the one or more substrates is to:
notifying transactional execution code in the first compiler output via a first shared memory object whether validation of the one or more type inferences has been successfully performed by the second compiler output; and
notifying spin lock code in the first compiler output via a second shared memory object whether verification of the one or more type inferences has been successfully performed by the second compiler output.
10. The semiconductor device of claim 7, wherein the first compiler output comprises: transactional execution code to abort one or more tasks of the first compiler output if validation of the one or more type inferences fails; spin lock code to place the first compiler output in a wait state until verification of the one or more type inferences is confirmed by the second compiler output; and memory store code to store one or more results of the transactional execution code to memory.
11. The semiconductor device of claim 10, wherein the first compiler output further comprises fallback code to transition a computing system to an execution state unrelated to the one or more type inferences if validation of the one or more type inferences fails.
12. The semiconductor device of any of claims 7 to 11, wherein the logic coupled to the one or more substrates is to:
executing the first compiler output in a first thread running on a first processor core; and
executing the second compiler output in a second thread running on a second processor core.
13. At least one computer-readable storage medium comprising a set of instructions that, when executed by a computing system, cause the computing system to:
generating a first compiler output based on an input code comprising variable information of a dynamic type;
generating a second compiler output based on the input code, wherein the second compiler output includes type-checking code for verifying one or more type inferences associated with the first compiler output; and
executing the first compiler output and the second compiler output in parallel via different threads.
14. The at least one computer readable storage medium of claim 13, wherein the instructions, when executed, further cause the computing system to synchronize communication between the first compiler output and the second compiler output via one or more shared memory objects.
15. The at least one computer readable storage medium of claim 14, wherein to synchronize the communications, the instructions, when executed, cause the computing system to:
notifying transactional execution code in the first compiler output via a first shared memory object whether validation of the one or more type inferences has been successfully performed by the second compiler output; and
notifying spin lock code in the first compiler output via a second shared memory object whether verification of the one or more type inferences has been successfully performed by the second compiler output.
16. The at least one computer-readable storage medium of claim 13, wherein the first compiler output comprises: transactional execution code to abort one or more tasks of the first compiler output if validation of the one or more type inferences fails; spin lock code to place the first compiler output in a wait state until verification of the one or more type inferences is confirmed by the second compiler output; and memory store code to store one or more results of the transactional execution code to memory.
17. The at least one computer-readable storage medium of claim 16, wherein the first compiler output further includes fallback code to transition the computing system to an execution state unrelated to the one or more type inferences if validation of the one or more type inferences fails.
18. The at least one computer readable storage medium of any one of claims 13 to 17, wherein to execute the first compiler output and the second compiler output in parallel, the instructions, when executed, cause the computing system to:
executing the first compiler output in a first thread running on a first processor core; and
executing the second compiler output in a second thread running on a second processor core.
19. A method of operating a just-in-time (JIT) compiler apparatus, comprising:
generating a first compiler output based on an input code comprising variable information of a dynamic type;
generating a second compiler output based on the input code, wherein the second compiler output includes type-checking code for verifying one or more type inferences associated with the first compiler output; and
executing the first compiler output and the second compiler output in parallel via different threads.
20. The method of claim 19, further comprising: synchronizing communication between the first compiler output and the second compiler output via one or more shared memory objects.
21. The method of claim 20, wherein synchronizing the communications comprises:
notifying transactional execution code in the first compiler output via a first shared memory object whether validation of the one or more type inferences has been successfully performed by the second compiler output; and
notifying, via a second shared memory object, spin lock code in the first compiler output whether verification of the one or more type inferences has been successfully performed by the second compiler output.
22. The method of claim 19, wherein the first compiler output comprises: transactional execution code to abort one or more tasks of the first compiler output if validation of the one or more type inferences fails; spin lock code to place the first compiler output in a wait state until verification of the one or more type inferences is confirmed by the second compiler output; and memory store code to store one or more results of the transactional execution code to memory.
23. The method of claim 22, wherein the first compiler output further includes fallback code for transitioning the compiler apparatus to an execution state unrelated to the one or more type inferences if validation of the one or more type inferences fails.
24. The method of any of claims 19 to 23, wherein executing the first compiler output and the second compiler output in parallel comprises:
executing the first compiler output in a first thread running on a first processor core; and
executing the second compiler output in a second thread running on a second processor core.
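By way of illustration of the two-core arrangement recited in claims 6, 12, 18 and 24, the following C sketch pins each thread to its own core. It assumes the Linux-specific pthread_setaffinity_np API; the thread identifiers named in the usage comment are hypothetical and refer to the speculative and checker threads of the earlier sketch.

#define _GNU_SOURCE            /* exposes pthread_setaffinity_np */
#include <pthread.h>
#include <sched.h>

/* Pin a thread to one core so the two compiler outputs run on
   distinct processor cores (Linux-specific, best effort). */
static void pin_to_core(pthread_t thread, int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(thread, sizeof(set), &set);
}

/* Hypothetical usage, after creating the two threads:
     pin_to_core(speculative_tid, 0);   first compiler output on core 0
     pin_to_core(checker_tid, 1);       second compiler output on core 1 */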
CN201980100216.6A 2019-10-08 2019-10-08 Reducing compiler type checking cost through thread speculation and hardware transactional memory Pending CN114375441A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/109904 WO2021068102A1 (en) 2019-10-08 2019-10-08 Reducing compiler type check costs through thread speculation and hardware transactional memory

Publications (1)

Publication Number Publication Date
CN114375441A true CN114375441A (en) 2022-04-19

Family

ID=75436989

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201980100216.6A Pending CN114375441A (en) 2019-10-08 2019-10-08 Reducing compiler type checking cost through thread speculation and hardware transactional memory

Country Status (5)

Country Link
US (1) US11880669B2 (en)
EP (1) EP4042273A4 (en)
KR (1) KR20220070428A (en)
CN (1) CN114375441A (en)
WO (1) WO2021068102A1 (en)

Family Cites Families (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5179702A (en) * 1989-12-29 1993-01-12 Supercomputer Systems Limited Partnership System and method for controlling a highly parallel multiprocessor using an anarchy based scheduler for parallel execution thread scheduling
US20050198627A1 (en) * 2004-03-08 2005-09-08 Intel Corporation Loop transformation for speculative parallel threads
US8079023B2 (en) * 2007-03-22 2011-12-13 Microsoft Corporation Typed intermediate language support for existing compilers
CN101441569B (en) 2008-11-24 2012-05-30 中国人民解放军信息工程大学 Novel service flow-oriented compiling method based on heterogeneous reconfigurable architecture
US8387027B2 (en) 2010-01-15 2013-02-26 Oracle America, Inc. (formerly Sun Microsystems, Inc.) Method and system for compiling a dynamically-typed method invocation in a statically-typed programming language
US8819649B2 (en) * 2011-09-09 2014-08-26 Microsoft Corporation Profile guided just-in-time (JIT) compiler and byte code generation
US9128732B2 (en) 2012-02-03 2015-09-08 Apple Inc. Selective randomization for non-deterministically compiled code
US9652207B2 (en) * 2013-03-13 2017-05-16 Microsoft Technology Licensing, Llc. Static type checking across module universes
US20150220338A1 (en) 2013-06-18 2015-08-06 Lejun ZHU Software polling elision with restricted transactional memory
US9329849B2 (en) * 2013-08-26 2016-05-03 Facebook, Inc. Systems and methods for converting typed code
US9251071B2 (en) 2013-08-30 2016-02-02 Apple Inc. Concurrent inline cache optimization in accessing dynamically typed objects
US10216496B2 (en) 2016-09-27 2019-02-26 International Business Machines Corporation Dynamic alias checking with transactional memory
US10203940B2 (en) * 2016-12-15 2019-02-12 Microsoft Technology Licensing, Llc Compiler with type inference and target code generation
CN108920149B (en) 2017-03-29 2020-12-08 华为技术有限公司 Compiling method and compiling device

Also Published As

Publication number Publication date
US20220326921A1 (en) 2022-10-13
KR20220070428A (en) 2022-05-31
EP4042273A4 (en) 2023-06-21
US11880669B2 (en) 2024-01-23
WO2021068102A1 (en) 2021-04-15
EP4042273A1 (en) 2022-08-17

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination