CN111984325A - Apparatus and system for improving branch prediction throughput - Google Patents


Info

Publication number
CN111984325A
Authority
CN
China
Prior art keywords
branch
instruction
memory segment
memory
branch instruction
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010439722.2A
Other languages
Chinese (zh)
Inventor
M.S.S.戈文丹
邹浮舟
A.恩戈
W.T.昌瓦特斋
M.特卡奇克
G.D.祖拉斯基
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Samsung Electronics Co Ltd
Original Assignee
Samsung Electronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from US16/561,004 external-priority patent/US11182166B2/en
Application filed by Samsung Electronics Co Ltd filed Critical Samsung Electronics Co Ltd
Publication of CN111984325A publication Critical patent/CN111984325A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/38Concurrent instruction execution, e.g. pipeline or look ahead
    • G06F9/3802Instruction prefetching
    • G06F9/3804Instruction prefetching for branches, e.g. hedging, branch folding
    • G06F9/3806Instruction prefetching for branches, e.g. hedging, branch folding using address prediction, e.g. return stack, branch history buffer
    • G06F9/3814Implementation provisions of instruction buffers, e.g. prefetch buffer; banks

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Advance Control (AREA)

Abstract

According to one general aspect, an apparatus may include branch prediction circuitry configured to predict whether a branch instruction will be taken or not taken. The apparatus may include a branch target buffer circuit configured to store a memory segment empty flag indicating whether a memory segment following the target address includes at least one other branch instruction, wherein the memory segment empty flag was created during a commit phase of a prior occurrence of the branch instruction. The branch prediction circuit may be configured to skip a memory segment if the memory segment empty flag indicates the absence of other branch instructions.

Description

Apparatus and system for improving branch prediction throughput
Technical Field
The present disclosure relates to processor instruction flow, and more particularly, to improving branch prediction throughput by skipping cache lines that contain no branches.
Background
In computer architectures, a branch predictor or branch prediction unit is a digital circuit that attempts to guess which way a branch (e.g., if-then-else structure, jump instruction) will go before actually calculating and knowing the result. The purpose of branch predictors is generally to improve the flow in an instruction pipeline. In many modern pipelined microprocessor architectures, branch predictors play a crucial role in achieving high performance.
Bidirectional branching is typically implemented using conditional jump instructions. A conditional jump can either be "not taken" and continue execution with the first code segment immediately following the conditional jump, or be "taken" and jump to a different location in program memory where the second code segment is stored. It is often not certain whether a conditional jump will be taken or not taken until the condition has been computed and the conditional jump has passed through the execution stages of the instruction pipeline.
Without branch prediction, the processor would typically have to wait for the conditional jump instruction to pass through the execution stage before the next instruction can enter the fetch stage in the pipeline. Branch predictors attempt to avoid this waste of time by attempting to guess whether a conditional jump is most likely to be taken or not taken. The instructions that are guessed to be most likely to be taken at the destination of the branch are then fetched and speculatively executed. If the instruction execution stage detects that a speculative branch is incorrect, the speculatively or partially executed instruction is typically discarded and the pipeline resumes from the correct branch, causing a delay.
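The guessing mechanism described above can be sketched with a toy model. The following is an editor's illustrative sketch, not part of the patent's disclosure: a classic 2-bit saturating-counter predictor, one of the simplest dynamic prediction schemes, in which each branch's counter drifts toward its recent behavior so that a single anomalous outcome does not flip the prediction.

```python
# Illustrative sketch (not from the patent): a 2-bit saturating-counter
# branch predictor. Counter states 0-1 predict "not taken"; 2-3 predict
# "taken". Class and method names are the editor's assumptions.

class TwoBitPredictor:
    def __init__(self):
        self.counters = {}  # branch address -> saturating counter (0..3)

    def predict(self, addr):
        # Predict taken when the counter is in a "taken" state (2 or 3).
        return self.counters.get(addr, 1) >= 2

    def update(self, addr, taken):
        # Move the counter toward the observed outcome, saturating at 0 and 3.
        c = self.counters.get(addr, 1)
        self.counters[addr] = min(3, c + 1) if taken else max(0, c - 1)
```

A loop-closing branch that is taken repeatedly becomes predicted "taken" after two updates, and a single loop exit does not immediately flip the prediction back.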
Disclosure of Invention
According to one general aspect, an apparatus may include: branch prediction circuitry configured to predict whether a branch instruction is taken or not taken. The apparatus may include a branch target buffer circuit configured to store a memory segment empty flag indicating whether a memory segment following the target address includes at least one other branch instruction, wherein the memory segment empty flag was created during a commit phase of a prior occurrence of the branch instruction. The branch prediction circuit may be configured to skip a memory segment if the memory segment empty flag indicates the absence of other branch instruction(s).
According to another general aspect, an apparatus may include branch detection circuitry configured to detect a presence of at least one branch instruction stored within a portion of a memory segment during a commit phase of a current instruction. The apparatus may include a branch target buffer circuit configured to store: a branch instruction address; and a memory segment empty flag indicating whether a portion of the memory segment following the target address includes at least one other branch instruction.
According to another general aspect, a system may include branch detection circuitry configured to detect a presence of at least one branch instruction stored within a portion of a memory segment during a commit phase of a current instruction. The system may include a branch target buffer circuit configured to store: a branch instruction address; and a memory segment empty flag indicating whether a portion of the memory segment following the target address includes at least one other branch instruction. The system may include branch prediction circuitry configured to predict whether a branch instruction is taken, and wherein the branch prediction circuitry is configured to skip a memory segment if the associated memory segment empty flag indicates the absence of a branch instruction.
The details of one or more implementations are set forth in the accompanying drawings and the description below. Other features will be apparent from the description and drawings, and from the claims.
A system and/or method for processor instruction flow, and more particularly for improving branch prediction throughput by skipping cache lines without branches, substantially as shown in and/or described in connection with at least one of the figures, as set forth more completely in the claims.
Drawings
FIG. 1 is a block diagram of an example embodiment of a system in accordance with the disclosed subject matter.
FIG. 2 is a block diagram of an example embodiment of a data structure in accordance with the disclosed subject matter.
FIG. 3 is a diagram of an example embodiment of a data structure according to the disclosed subject matter.
FIG. 4 is a block diagram of an example embodiment of a system in accordance with the disclosed subject matter.
FIG. 5 is a flow chart of an example embodiment of a technique in accordance with the disclosed subject matter.
FIG. 6 is a flow chart of an example embodiment of a technique in accordance with the disclosed subject matter.
FIG. 7 is a schematic block diagram of an information handling system that may include devices formed in accordance with the principles of the disclosed subject matter.
Like reference symbols in the various drawings indicate like elements.
Detailed Description
Various example embodiments will be described more fully hereinafter with reference to the accompanying drawings, in which some example embodiments are shown. The subject matter of the present disclosure may, however, be embodied in many different forms and should not be construed as limited to the example embodiments set forth herein. Rather, these example embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosed subject matter to those skilled in the art. In the drawings, the size and relative sizes of layers and regions may be exaggerated for clarity.
It will be understood that when an element or layer is referred to as being "on," "connected to" or "coupled to" another element or layer, it can be directly on, connected or coupled to the other element or layer or intervening elements or layers may be present. In contrast, when an element is referred to as being "directly on," "directly connected to" or "directly coupled to" another element or layer, there are no intervening elements or layers present. Like numbers refer to like elements throughout. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
It will be understood that, although the terms first, second, third, etc. may be used herein to describe various elements, components, regions, layers and/or sections, these elements, components, regions, layers and/or sections should not be limited by these terms. These terms are only used to distinguish one element, component, region, layer or section from another region, layer or section. Thus, a first element, component, region, layer or section discussed below could be termed a second element, component, region, layer or section without departing from the teachings of the disclosed subject matter.
Spatially relative terms, such as "beneath," "below," "lower," "above," "upper," and the like, may be used herein for ease of description to describe one element or feature's relationship to another element(s) or feature(s) as illustrated in the figures. It will be understood that the spatially relative terms are intended to encompass different orientations of the device in use or operation in addition to the orientation depicted in the figures. For example, if the device in the figures is turned over, elements described as "below" or "beneath" other elements or features would then be oriented "above" the other elements or features. Thus, the exemplary term "below" can encompass both an orientation of above and below. The device may be otherwise oriented (rotated 90 degrees or at other orientations) and the spatially relative descriptors used herein interpreted accordingly.
Likewise, electrically relative terms, such as "high," "low," "pull-up," "pull-down," "1," "0," etc., may be used herein for ease of description to describe a voltage level or current relative to other voltage levels or relative to another element(s) or feature(s), as shown. It will be understood that the electrically relative terms are intended to encompass different reference voltages of the device in use or operation in addition to the voltages or currents depicted in the figures. For example, if a device or signal in the figures is inverted or uses another reference voltage, current, or charge, then an element described as "high" or "pull-up" would be "low" or "pull-down" compared to the new reference voltage or current. Thus, the exemplary term "high" may encompass a relatively low or high voltage or current. The device may otherwise be based on a different electrical frame of reference, and the electrically relative descriptors used herein interpreted accordingly.
The terminology used herein is for the purpose of describing particular example embodiments only and is not intended to be limiting of the subject matter of the present disclosure. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
Example embodiments are described herein with reference to cross-sectional views, which are schematic illustrations of idealized example embodiments (and intermediate structures). As such, variations from the shapes of the illustrations as a result, for example, of manufacturing techniques and/or tolerances, are to be expected. Thus, example embodiments should not be construed as limited to the particular shapes of regions illustrated herein but are to include deviations in shapes that result, for example, from manufacturing. For example, an implanted region illustrated as a rectangle will typically have rounded or curved features and/or an implant concentration gradient at its edges, rather than a binary change from implanted to non-implanted regions. Also, a buried region formed by implantation may result in some implantation in the region between the buried region and the surface through which implantation occurs. Thus, the regions illustrated in the figures are schematic in nature and their shapes are not intended to illustrate the actual shape of a region of a device and are not intended to limit the scope of the disclosed subject matter.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which the presently disclosed subject matter belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
Hereinafter, example embodiments will be explained in detail with reference to the accompanying drawings.
Fig. 1 is a block diagram of an example embodiment of a system 100 in accordance with the disclosed subject matter. In various embodiments, system 100 may comprise a computer, a plurality of discrete integrated circuits, or a system on a chip (SoC). As described below, system 100 may include many other components that are not shown in this figure so as not to obscure the disclosed subject matter.
In the illustrated embodiment, system 100 includes system memory 104. In various embodiments, system memory 104 may be comprised of Dynamic Random Access Memory (DRAM). It should be understood that the above is merely one illustrative example and that the disclosed subject matter is not so limited. In such embodiments, system memory 104 may comprise on-module memory (e.g., a dual in-line memory module (DIMM)), may be an integrated chip soldered or otherwise fixedly integrated with system 100, or may even be incorporated as part of an integrated chip (e.g., SoC) comprising system 100. It is to be understood that the above are merely some illustrative examples and that the disclosed subject matter is not so limited.
In the illustrated embodiment, the system memory 104 may be configured to store data segments or information. These data segments may include instructions that cause processor 102 to perform various operations. In general, system memory 104 may be part of a larger memory hierarchy that includes multiple caches. In various embodiments, the operations described herein may be performed by another level or tier of the memory hierarchy (e.g., a level 2 (L2) cache). It should be understood by those skilled in the art that although operations are described with reference to system memory 104, the disclosed subject matter is not limited to this illustrative example.
In the illustrated embodiment, the system 100 also includes a processor 102. The processor 102 may be configured to perform a number of operations as indicated by various instructions. These instructions may be executed by various execution units (most not shown), such as an Arithmetic Logic Unit (ALU), a Floating Point Unit (FPU), a load/store unit (LSU), an instruction fetch unit (IFU) 116, and so forth. It is understood that a unit is merely a collection of circuits combined to perform a portion of the functionality of the processor 102. Generally, a unit performs one or more operations in the pipeline architecture of processor 102.
In the illustrated embodiment, processor 102 may include a Branch Prediction Unit (BPU) 112. As described above, when the processor 102 is executing an instruction stream, the instruction(s) may be branch instructions. A branch instruction is an instruction that causes an instruction stream to branch or diverge between two or more paths. A typical example of a branch instruction is an if-then structure, where (then) a first set of instructions will be executed if (if) satisfies a certain condition (e.g. the user clicks the "OK" button), and a second set of instructions will be executed if (if) does not satisfy a certain condition (e.g. the user clicks the "Cancel" button). As described above, this is a problem in pipelined processor architectures because new instructions must enter the pipeline of processor 102 before the outcome of the branch, jump, or if-then structures is known (because the pipeline stage that resolves the branch instruction is located deep in the pipeline). Thus, new instructions must be prevented from entering the pipeline until the branch instructions are resolved (thereby negating the main advantages of the pipeline architecture), or the processor 102 must guess in which way the instruction stream will branch and speculatively place those instructions into the pipeline. The BPU 112 may be configured to predict how the instruction stream will branch. In the illustrated embodiment, the BPU 112 may be configured to output predicted instructions, or more precisely, memory addresses that store predicted instructions.
In the illustrated embodiment, the processor 102 includes a Branch Prediction Address Queue (BPAQ) 114. BPAQ 114 may include a memory structure configured to store a plurality of addresses for predicted instructions that have been predicted by BPU 112. BPAQ 114 may store the addresses of these predicted instructions in a first-in-first-out (FIFO) order, such that the instruction addresses are output from BPAQ 114 in the same order in which BPU 112 predicted them.
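The FIFO ordering of the BPAQ can be modeled in a few lines. This is an editor's hypothetical sketch, not structure taken from the patent: predicted-instruction addresses leave the queue in exactly the order the BPU produced them.

```python
from collections import deque

# Hypothetical sketch of the BPAQ described above. The BPU pushes the
# addresses of predicted instructions; the IFU later pops them in the
# same first-in-first-out order.
class BranchPredictionAddressQueue:
    def __init__(self):
        self._q = deque()

    def push(self, addr):
        # BPU enqueues a predicted instruction address.
        self._q.append(addr)

    def pop_next(self):
        # IFU dequeues the oldest (next) predicted address.
        return self._q.popleft()
```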
In the illustrated embodiment, processor 102 includes an Instruction Fetch Unit (IFU) 116 configured to fetch instructions from a memory hierarchy and place them into the pipeline of processor 102. In such embodiments, IFU 116 may be configured to take the memory address associated with the next (i.e., oldest) instruction from BPAQ 114 and request the actual instruction from the memory hierarchy. Ideally, instructions will be provided quickly from the memory hierarchy and placed into the pipeline of the processor 102.
Ideally, instructions may be fetched from the level 1 (L1) instruction cache 118 (via one or more memory accesses). In such embodiments, the L1 instruction cache 118, being at or near the top of the memory hierarchy, may be relatively fast and cause little or no delay in the pipeline. Occasionally, however, the L1 instruction cache 118 may not include the desired instruction. This results in a cache miss, and the instruction will have to be fetched or loaded from a lower, slower level of the memory hierarchy (e.g., system memory 104). Such cache misses may cause delays in the pipeline of processor 102, as instructions will not enter the pipeline at the rate of one per cycle (or the maximum rate of the processor architecture).
In the illustrated embodiment, processor 102 includes an instruction prefetch unit (IPFU) 120. IPFU 120 is configured to prefetch instructions into L1 instruction cache 118 before IFU 116 performs the actual fetch operation, thereby reducing the cache misses experienced by IFU 116. IPFU 120 may do this by requesting a predicted instruction from L1 instruction cache 118 before IFU 116 does. In such an embodiment, if a cache miss occurs, L1 instruction cache 118 will begin the process of requesting the missing instruction from system memory 104, so that the instruction may already have been received and stored in the L1 instruction cache 118 by the time IFU 116 requests it.
Returning to the BPU 112, the processor 102 may include a Branch Target Buffer (BTB) circuit 122. In various embodiments, BTB 122 may include a memory that maps branch addresses to previously predicted target addresses (to which the branch is to jump). In such embodiments, the BTB 122 may indicate which address the previous iteration of the branch instruction last jumped to or predicted to jump to. This makes the work of the BPU 112 easier and faster, because the BPU 112 may simply request a predicted branch target address from the BTB 122, rather than perform a complete address prediction calculation.
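The BTB lookup just described is, at its core, a map from branch address to previously observed target. The following sketch is the editor's illustration under that assumption; the names and the miss-handling convention are not taken from the patent.

```python
# Minimal illustrative sketch of a branch target buffer: a map from
# branch-instruction address to the last recorded/predicted target
# address for that branch.
class BranchTargetBuffer:
    def __init__(self):
        self._entries = {}  # branch address -> predicted target address

    def record(self, branch_addr, target_addr):
        # Store (or update) the predicted target for this branch.
        self._entries[branch_addr] = target_addr

    def lookup(self, branch_addr):
        # Return the cached target, or None on a BTB miss (in a real
        # pipeline, a miss forces a full prediction/decode).
        return self._entries.get(branch_addr)
```

The fast path is exactly the one the text describes: on a hit, the BPU can use the stored target directly instead of performing a complete address prediction calculation.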
Likewise, the processor 102 may include a Return Address Stack (RAS) circuit 124. In various embodiments, RAS 124 may be a memory or data structure that stores a memory address to return to once a current branch operation or instruction (typically a return instruction) has been completed. For example, when the branch is a subroutine call, once completed, the subroutine will return to the next instruction after the calling memory address. In various embodiments, the RAS calculation circuit 126 may perform this return address calculation.
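The call/return pairing handled by the RAS can likewise be sketched as a last-in-first-out stack. This is an editor's illustration, with an assumed fixed 4-byte instruction size that is not specified by the patent.

```python
# Illustrative sketch of a return address stack. A call pushes the
# address of the instruction after the call (the fall-through address);
# a return pops it, last-in-first-out. The 4-byte instruction size is
# an assumption for illustration only.
class ReturnAddressStack:
    def __init__(self):
        self._stack = []

    def on_call(self, call_addr, instr_size=4):
        # The return target is the next instruction after the call.
        self._stack.append(call_addr + instr_size)

    def on_return(self):
        # Pop the most recent return address (LIFO).
        return self._stack.pop()
```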
Having shown the basic structure of the processor 102, FIG. 2 illustrates operations performed by the processor 102.
FIG. 2 is a block diagram of an example embodiment of a data structure 200 in accordance with the disclosed subject matter. In various embodiments, data structure 200 may represent a memory store of various instructions to be fetched and processed by processor 102 of FIG. 1.
In this context, the general term for a block or portion of memory is a "memory segment". For purposes of example, a memory segment may be a cache line, although in particular embodiments a memory segment may be larger or smaller than a cache line. In this context, a cache line may be a unit of data transfer between the L1 instruction cache 118 and main memory (e.g., system memory 104). In various embodiments, the disclosed subject matter may apply to a memory segment of a plurality of cache lines, a portion of a cache line, or a memory size that is not measured in cache lines at all. It is to be understood that the above is merely one illustrative example, and that the disclosed subject matter is not so limited.
In the illustrated embodiment, data structure 200 includes cache lines 204 and 206 that occur in sequence. In such embodiments, as described above, the processor 102 typically fetches and processes instructions from the beginning (e.g., left side) of the cache lines 204 and 206 to the end (e.g., right side) of the cache lines 204 and 206.
The cache lines include branch instructions A211, B212, C213, D214, E215, F216, and G217. In various embodiments, the BPU 112 of FIG. 1 may be configured to process each branch instruction (which, for simplicity, is considered a subroutine call) and continue to process the cache lines in order as each branch returns to that point.
The BPU 112 may be configured to stop processing at a memory segment or cache line boundary (for that clock cycle or cycles). For example, in processing cache line 204, the BPU 112 may process A211 in a first cycle, B212 in a second cycle, C213 in a third cycle, and D214 in a fourth cycle, then check portion 224 in a fifth cycle, stopping at the end of cache line 204, before moving on to E215 of cache line 206 in a sixth cycle.
Since there are no branches to process in portion 224 (as opposed to portion 222), the time taken to check it is a wasted cycle (or many wasted cycles, if processing portion 224 takes multiple cycles). In various embodiments, portion 224 may comprise a complete cache line. The disclosed subject matter can eliminate or reduce such pipeline bubbles (periods of no operation during one or more cycles).
In the disclosed subject matter, BTB 122 and/or RAS 124 may include an indication of whether portion 224, or more generally the portion following the target of any given branch instruction, is free of (empty or void) branch instructions. In such embodiments, "empty" does not mean that no instructions are stored there, only that no branch instructions are stored in the memory segment. It is expected (but not required) that many non-branch instructions will fill this portion 224.
For example, branch 202 (e.g., the return branch from call D214) may return the Program Counter (PC) to the end of portion 222. After this return, the BPU 112 may examine the RAS 124 and determine that there are no more branch instructions after D214 (portion 224). The BPU 112 may then begin processing the next cache line 206; thus, the wasted computation time involved for checking the branches of portion 224 is saved.
Similarly, the BTB 122 may include a flag that indicates whether the memory segment following the target address of the branch has no additional branch instructions. In such embodiments, if branch 202 is not a return (from a call) but is another type of branch instruction (e.g., a call, an unconditional jump, a jump, etc.), BTB 122 may include the target address (e.g., the address of the beginning of portion 224) and whether the portion from the target address to the end of the cache line (i.e., portion 224) is free of additional branch instructions.
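The saving described in the two paragraphs above can be illustrated with a crude cycle-count model. This is an editor's sketch under simplifying assumptions (one cycle per scanned segment, zero cost for a flagged skip); it is not a timing claim from the patent.

```python
# Illustrative model of the throughput benefit: the BPU spends one cycle
# scanning each memory segment for branches; with an empty flag it can
# skip segments known to contain no branch instructions.

def cycles_to_scan(segments, use_empty_flag):
    """segments: list of booleans, True if that segment holds a branch."""
    cycles = 0
    for has_branch in segments:
        if use_empty_flag and not has_branch:
            continue  # flag says "no branches here": skip without a cycle
        cycles += 1
    return cycles
```

For the sequence in FIG. 2, a branch-free portion such as 224 contributes a cycle only when no empty flag is available.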
FIG. 3 is a diagram of an example embodiment of data structures 300 and 301 according to the disclosed subject matter. In such embodiments, the data structure 300 may be stored by a branch target buffer (e.g., BTB 122 of fig. 1). In various embodiments, the data structure 301 may be stored by a return address stack (e.g., RAS 124 of FIG. 1). It is to be understood that the above are merely some illustrative examples and that the disclosed subject matter is not so limited.
In the illustrated embodiment, data structure 300 may illustrate a representative embodiment of the state of a BTB. In such embodiments, the BTB may include at least three columns or fields (although more columns or fields may be used in various embodiments). The first field 302 includes the address (or other identifier) of the branch instruction. The second field 304 may include the predicted target address of the branch (i.e., the address to which the branch may jump). In a conventional BTB, these two fields 302 and 304 may be the only columns or fields, aside from a valid flag (not shown) noting whether a row, line, or entry is in use.
In such an embodiment, when the BPU encounters a branch instruction, the BPU looks it up by its memory address (first field 302) and determines where in memory to find the next instruction (via second field 304). As described above, in such embodiments, once that target address is reached, the BPU may waste one or more cycles looking for branch instructions that are not there (i.e., when the memory segment past the target address is empty of, or free of, branch instructions).
However, in the illustrated embodiment, the BPU may be configured to check the third field, or empty flag 306. In such embodiments, the empty flag 306 may indicate whether the memory segment past the target address is empty of, or free of, branch instructions. In various embodiments, the value of the empty flag 306 may be calculated the first time the branch instruction is encountered. In some embodiments, this may be done at the commit stage, the pipeline stage at which the correctness (or lack thereof) of the branch prediction is fully resolved.
In various embodiments, the empty flag 306 of the memory segment may comprise a single bit or true/false value. In such embodiments, the empty flag 306 may refer only to the immediate memory segment that includes the target address. In another embodiment, the empty flag 306 may indicate how many memory segments should be skipped. For example, the last line of data structure 300 has a value of 3, indicating that the current memory segment plus the next two memory segments have no branch instructions.
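The skip-count variant just described can be sketched as simple address arithmetic. This is an editor's illustration; the 64-byte segment size and the convention that a count of 0 means "unknown, scan here" are assumptions, not details from the patent.

```python
# Hypothetical sketch of the skip-count variant of the empty flag: the
# flag stores how many consecutive memory segments (starting with the
# current one) hold no branch instructions, so the BPU can advance
# several segments at once.

SEGMENT_SIZE = 64  # assumed segment (cache-line) size in bytes

def next_scan_address(current_addr, skip_count):
    # skip_count == 0: flag invalid/unknown, scan the current segment.
    # skip_count == n: the current segment plus (n - 1) following
    # segments are branch-free; resume scanning n segments later.
    if skip_count == 0:
        return current_addr
    segment_base = current_addr - (current_addr % SEGMENT_SIZE)
    return segment_base + skip_count * SEGMENT_SIZE
```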
In another embodiment, the empty flag 306 may include a valid flag. In yet another embodiment, the valid flag of the empty flag may be stored as a separate field (not shown). In such embodiments, the valid flag may indicate whether the empty flag 306 has been computed and may be relied upon. For example, an entry may be placed in the BTB during the instruction fetch pipeline stage, but the empty flag 306 may not be computed until the commit stage. Alternatively, in another example, the empty flag 306 may only be valid for branches predicted to be "taken" and not valid for branches predicted to be "not taken" (or vice versa). In yet another embodiment, the empty flag 306 may only be valid for certain types of branches (e.g., calls and returns). It is to be understood that the above are merely a few illustrative examples and that the disclosed subject matter is not so limited.
In such an embodiment, the empty flag 306 may be extended by one bit. In such an embodiment, a valid and true (or set) empty flag may be "0x11," while a valid but false (or cleared) empty flag may be "0x10," where the first bit is a valid bit and the second bit is the empty state. It is to be understood that the above is merely one illustrative example, and that the disclosed subject matter is not so limited.
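That two-bit encoding can be made concrete with a small helper. This is an editor's sketch, reading the values above as binary patterns (high bit = valid, low bit = empty), which is an interpretation, not wording from the patent.

```python
# Illustrative encoding of the two-bit empty flag: bit 1 is the valid
# bit, bit 0 is the empty state. 0b11 -> valid and empty (skippable);
# 0b10 -> valid but not empty; 0b0x -> not yet computed, must be ignored.

VALID_BIT = 0b10
EMPTY_BIT = 0b01

def encode_empty_flag(valid, empty):
    return (VALID_BIT if valid else 0) | (EMPTY_BIT if empty else 0)

def can_skip(flag):
    # Skip the segment only when the flag is both valid and set.
    return bool(flag & VALID_BIT) and bool(flag & EMPTY_BIT)
```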
In the illustrated embodiment, the data structure 301 may show a representative embodiment of the state of the RAS. In such embodiments, the RAS may include at least two columns or fields (although more may be used in various embodiments). Field 312 includes the return address (or other identifier) to which a call branch instruction will return. In a conventional RAS, field 312 may be the only column or field, aside from a valid flag (not shown) noting whether a row, line, or entry may be used. Conventionally, return addresses are pushed onto the top of the data structure 301 and then popped from the top in a last-in-first-out (LIFO) manner.
In the illustrated embodiment, the BPU may be configured to check the second field, the empty flag 316. In such embodiments, the empty flag 316 may indicate whether the memory segment past the target address (field 312) of the return instruction is free of branch instructions, as described above. In various embodiments, the value of the empty flag 316 may be computed the first time a call branch instruction is encountered. In various embodiments, the empty flag 316 may be similar to the flags described above. In various embodiments, the empty flag 306 of the BTB and the empty flag 316 of the RAS may differ in format or information.
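The augmented RAS of data structure 301 can be sketched as a LIFO stack whose entries pair a return address (field 312) with the empty flag (field 316) computed when the call branch was encountered. Class and method names are illustrative, not the patent's.

```python
class ReturnAddressStack:
    """Illustrative RAS: each entry is (return_address, empty_flag)."""

    def __init__(self):
        self._entries = []

    def push(self, return_address: int, empty_flag: bool) -> None:
        # On a call branch: record where the return will land, plus
        # whether the segment past that address is free of branches.
        self._entries.append((return_address, empty_flag))

    def pop(self) -> tuple:
        # On a return branch: one LIFO pop yields both the predicted
        # target and its precomputed empty flag.
        return self._entries.pop()

ras = ReturnAddressStack()
ras.push(0x1000, empty_flag=False)
ras.push(0x2000, empty_flag=True)
print(ras.pop())  # → (8192, True), i.e. the 0x2000 entry, LIFO order
```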
Fig. 4 is a block diagram of an example embodiment of a system 400 in accordance with the disclosed subject matter. In various embodiments, system 400 may comprise a computer, a plurality of discrete integrated circuits, or a system on a chip (SoC). As described below, system 400 may include many other components that are not shown in this figure so as not to obscure the disclosed subject matter.
In the illustrated embodiment, system 400 includes system memory 104. In various embodiments, system memory 104 may be comprised of Dynamic Random Access Memory (DRAM). It should be understood, however, that the foregoing is merely an illustrative example and that the disclosed subject matter is not so limited. In such embodiments, system memory 104 may comprise on-module memory (e.g., a dual in-line memory module (DIMM)), may be an integrated chip soldered or otherwise fixedly integrated with system 400, or may even be incorporated as part of an integrated chip (e.g., SoC) comprising system 400. It is to be understood that the above are merely some illustrative examples and that the disclosed subject matter is not so limited.
In the illustrated embodiment, the system memory 104 may be configured to store data segments or information. These data segments may include instructions that cause processor 102 to perform various operations. In general, system memory 104 may be part of a larger memory hierarchy that includes multiple caches. In various embodiments, the operations described herein may be performed by another level or tier of the memory hierarchy (e.g., a level 2(L2) cache). It should be understood by those skilled in the art that although operations are described with reference to system memory 104, the disclosed subject matter is not limited to this illustrative example.
In the illustrated embodiment, the system 400 also includes a processor 102. The processor 102 may be configured to perform a number of operations as indicated by various instructions. These instructions may be executed by various execution units (most not shown), such as an Arithmetic Logic Unit (ALU), a Floating Point Unit (FPU), a load/store unit (LSU), an instruction fetch unit (IFU) 116, and so on. It is understood that a unit is merely a collection of circuits combined to perform a portion of the functionality of the processor 102. Generally, a unit performs one or more operations in the pipeline architecture of the processor 102.
In various embodiments, the processor 102 may operate in various pipeline stages. In computing, a pipeline, also referred to as a data pipeline, is a set of data processing elements connected in rough series, where the output of one element is the input of the next. The elements of a pipeline are often executed in parallel or in a time-sliced fashion, and some amount of buffer storage is usually inserted between elements.
In a classical Reduced Instruction Set Computer (RISC) pipeline, the stages include: instruction fetch (most of which is shown in fig. 1), instruction decode, execute, memory access, and write back. In modern out-of-order and speculative execution processors, the processor 102 may execute instructions that turn out not to be needed. The pipeline stage in which it is determined whether an instruction (or its result) is needed is called the commit stage. If the commit stage is forced into the Procrustean bed of a classical RISC pipeline, it may be placed within the write-back stage. In various embodiments or architectures, the commit stage may be a separate pipeline stage.
In the illustrated embodiment, the processor 102 may include an execution unit 402 as described above. In the illustrated embodiment, processor 102 may include a commit queue 404 in which completed instructions are placed in chronological order.
In the illustrated embodiment, processor 102 may include a register file 406. In such embodiments, when instructions are committed (rather than discarded), the results of those instructions may be placed or committed into the register file 406. In modern computers with rename registers, the commit action may include verifying or marking as correct values that have already been stored in the register file 406. In various embodiments, the processor may include a cache 418 (e.g., a data cache) to which register file data is eventually moved, and from there to the system memory 104, as described above.
Further, in the illustrated embodiment, the processor 102 may include a branch detection circuit 420. In such embodiments, the branch detection circuit 420 may be configured to detect the presence of at least one branch instruction stored within a portion of a memory segment (e.g., a cache line) during the commit stage of the current instruction.
In such embodiments, once the branch detection circuit 420 has determined whether the memory segment portion is free of any branch instructions, it may create or update a memory segment empty flag in the BTB 122, as described above. In various embodiments, this may include setting or clearing the empty flag associated with the branch instruction.
In some embodiments, the processor 102 or branch detection circuit 420 may include a last branch memory 422 to store the last or current branch instruction encountered from the commit queue 404. In such embodiments, the last branch memory 422 may indicate the branch instruction associated with the empty flag currently being computed. In various embodiments, this last branch memory 422 may be valid (a branch's empty flag is actively being computed) or invalid (no branch's empty flag is actively being computed).
In various embodiments, BTB 122 may be graph-based. In such embodiments, branches may be stored as nodes, and edges may represent control flow of a program or set of instructions. In various embodiments, the disclosed subject matter may be limited to a first level BTB of a multi-level or hierarchical BTB structure. It is to be understood that the above is merely one illustrative example, and that the disclosed subject matter is not so limited.
In various embodiments, certain designs define instruction blocks and instruction sequences that end with branches. In such embodiments, BTB 122 may look up or index branches based on the starting address of the block rather than the actual address of the branch instruction. In such embodiments, the disclosed subject matter is modified accordingly. Further, the BTB metadata may be enhanced to store how many empty cache lines or memory segments may be skipped before the next branch instruction is encountered. It is to be understood that the above are merely some illustrative examples and that the disclosed subject matter is not so limited.
In various embodiments, a Branch Target Buffer (BTB) may be configured to store metadata associated with branch instructions, e.g., an empty flag. A Branch Prediction Pipeline (BPP) may be configured to detect branch instructions whose target cache lines are partially or completely empty, and to skip branch prediction for any empty target cache lines. In various embodiments, the BPP may achieve this by training on committed instruction cache lines. The BPP may mark a taken branch instruction whose target cache line is empty by setting the taken-target cache line empty flag. The BPP may mark a not-taken branch instruction by setting the not-taken-target cache line empty flag to true in the BTB entry of the branch instruction. The BPP may examine the BTB entry or Return Address Stack (RAS) entry of a branch instruction to determine whether the target cache line empty flag is set. If the target cache line empty flag is set, the BPP may skip branch prediction for one or more instruction cache lines, starting with the target cache line, until a cache line that includes a branch instruction is reached.
Fig. 5 is a flow chart of an example embodiment of a technique in accordance with the disclosed subject matter. In various embodiments, the technique 500 may be used or produced by a system such as that of fig. 4 or fig. 7. It is understood that the above are merely some illustrative examples and that the disclosed subject matter is not so limited. It is to be understood that the disclosed subject matter is not limited to the order or number of actions shown in technique 500.
In various embodiments, as described above, technique 500 may illustrate an embodiment of a technique employed by a processor or branch detection unit to determine the correct state of a memory segment empty flag. In the illustrated embodiment, the technique 500 shown may be specific to taken branches. In another embodiment, a technique may be employed for not-taken branches. In yet another embodiment, techniques may be employed for both taken and not-taken branches and/or various types of branch instructions (e.g., call, return, unconditional jump, conditional jump, jump on zero or another value, etc.). It is to be understood that the above is merely one illustrative example, and that the disclosed subject matter is not so limited.
Block 502 illustrates that, in one embodiment, a committed instruction may be examined to determine whether it is a branch instruction. As described above, committed instructions may be provided by or stored in a commit queue, which holds branch and non-branch instructions in chronological order. In such embodiments, non-branch instructions may be grouped by the memory segment from which they come.
Block 504 illustrates that, in one embodiment, if the commit instruction is a branch instruction, the branch instruction (or address thereof) may be stored in the last branch memory, as described above. In various embodiments, the last branch memory may be marked as valid or as storing an address for empty flag determination.
Block 506 illustrates that, in one embodiment, if the commit instruction is not a branch instruction, a check may be made to determine whether the last branch memory is valid or active.
Block 508 illustrates that, in one embodiment, if the commit instruction is not a branch instruction and the last branch memory value is valid, then the empty flag associated with the branch stored in the last branch memory may be set to a value indicating that the remainder of the memory segment does not contain a branch instruction. As described above, an empty flag may be stored in the BTB.
Block 510 illustrates that, in one embodiment, if the commit instruction is not a branch instruction, the last branch memory value may be invalidated or marked inactive. In various embodiments, block 510 may be skipped if the result of block 506 indicates that the last branch memory value has been invalid.
Block 599 shows a stopping point. However, it should be understood that technique 500 may be repeated for each commit instruction.
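Taken together, blocks 502 through 510 can be summarized as a small training loop run once per committed item. This is a behavioral sketch under stated assumptions, not the circuit: the names are illustrative, the commit queue is modeled as a stream whose non-branch instructions arrive already grouped by memory segment, and the BTB is modeled as a plain dict.

```python
def train_empty_flags(commit_stream, btb):
    """Behavioral sketch of blocks 502-510 (taken-branch variant).

    commit_stream yields ('branch', address) for a committed branch
    instruction, or ('segment', segment_id) for a segment-grouped run
    of committed non-branch instructions.  btb maps a branch address
    to its memory segment empty flag.  All names are illustrative.
    """
    last_branch = None                      # "last branch memory" (block 504)
    for kind, value in commit_stream:
        if kind == 'branch':                # blocks 502/504: remember it
            last_branch = value
        else:                               # blocks 506/508/510
            if last_branch is not None:
                # The commits following the branch held no branch for
                # the rest of the segment: set its empty flag.
                btb[last_branch] = True
            last_branch = None              # block 510: invalidate

btb = {}
train_empty_flags([('branch', 0x4000), ('segment', 12),
                   ('branch', 0x5000), ('branch', 0x5004)], btb)
print(btb)  # → {16384: True}: only the branch at 0x4000 earned a flag
```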
Fig. 6 is a flow chart of an example embodiment of a technique in accordance with the disclosed subject matter. In various embodiments, the technique 600 may be used or produced by a system such as that of fig. 1 or fig. 7. It is understood that the above are merely some illustrative examples and that the disclosed subject matter is not so limited. It is to be understood that the disclosed subject matter is not limited to the order or number of actions shown in technique 600.
In various embodiments, as described above, technique 600 may illustrate an embodiment of a technique employed by a processor or branch prediction unit to determine whether to skip or pass over a portion of a memory segment or cache line. In the illustrated embodiment, the technique 600 shown may be specific to taken branches. In another embodiment, a technique may be employed for not-taken branches. In yet another embodiment, a technique may be employed for both taken and not-taken branches and/or various types of branch instructions (e.g., call, return, unconditional jump, conditional jump, jump on zero or another value, etc.). It is to be understood that the above is merely one illustrative example, and that the disclosed subject matter is not so limited.
Block 602 illustrates that, in one embodiment, a determination may be made as to whether a branch instruction is predicted taken. If not, the technique 600 may stop 699. However, it should be understood that the above is merely one illustrative example and that the disclosed subject matter is not so limited.
Block 604 illustrates that, in one embodiment, a determination may be made as to which type of branch instruction has been encountered. In the illustrated embodiment, the determination may be whether the branch is a call, a return, or neither. It is to be understood that the above is merely one illustrative example, and that the disclosed subject matter is not so limited.
Block 606 illustrates that, in one embodiment, if the branch instruction is neither a call nor a return, the memory segment empty flag (associated with the branch instruction) may be read from the BTB, as described above.
Block 608 illustrates that, in one embodiment, if the branch instruction is a call branch instruction, the target of the corresponding return branch instruction may be determined. It may then be determined whether the target memory segment of the return, or the rest of the cache line, is free of other branch instructions. Once this determination is made and the memory segment empty flag is created, the memory segment empty flag may be pushed onto the RAS along with the return target address, as described above. In such an embodiment, once the empty flag in the RAS has been prepared for the call's eventual return, the BPU may perform block 606 for the call instruction itself.
Block 610 illustrates that, in one embodiment, if the branch instruction is a return branch instruction, the empty flag may be read from the branch's RAS entry (prepared by block 608), as described above.
Block 612 illustrates that, in one embodiment, the value of the empty flag (read from the BTB or the RAS, as determined by the branch type) may be examined, as described above. If the empty flag is not set, is cleared, or otherwise indicates that the rest of the memory segment is not free of branch instructions, the technique 600 may stop 699 and branch prediction may proceed normally.
Block 614 illustrates that, in one embodiment, it may be determined whether a virtual-to-physical (V2P) address translation is available for the cache line containing the target address and for the next sequential cache line after it. In various embodiments, such translations may be stored in a translation look-aside buffer (TLB). If the V2P address translation for either cache line is not available, an indication to move to the next memory segment may be made so that additional work, such as a TLB fill, may be done. Technique 600 may then stop at block 699.
Block 616 illustrates that, in one embodiment, it may be determined whether the target cache line and the cache line following it are both available in the cache (e.g., the instruction cache) and/or the BTB (i.e., a cache hit rather than a miss). If not, the technique may not skip the empty memory, but instead move to block 699.
Block 618 illustrates that, in one embodiment, if the empty flag is set (or indicates that the remainder of the target memory segment may be skipped) and both the target cache line and the cache line following the target cache line are available in the cache, the BPU may skip or pass through the remainder of the current memory segment, as described above.
Block 699 shows a stopping point. It is understood, however, that the BPU may continue further processing of branch prediction as described above, and that technique 600 may be part of a larger branch prediction technique. Further, it should be appreciated that technique 600 may be repeated for each branch instruction.
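The decision chain of blocks 602 through 618 can be condensed into a single predicate. This is a behavioral sketch under stated assumptions, not the hardware: the structure names are illustrative, and the two helper booleans stand in for the TLB and cache checks described above.

```python
def may_skip_segment(branch, btb, ras, tlb_ok, cached_ok):
    """Behavioral sketch of blocks 602-618 (taken-branch variant).

    branch: dict with 'predicted_taken', 'kind' ('call', 'return', or
    'other'), and 'address'; btb: dict of address -> empty flag; ras:
    list of (return_address, empty_flag) tuples.  tlb_ok and cached_ok
    stand in for blocks 614 and 616 (V2P translation availability and
    cache/BTB presence of the target line and the line after it).  The
    RAS push performed for call branches (block 608) is omitted here.
    """
    if not branch['predicted_taken']:          # block 602
        return False
    if branch['kind'] == 'return':             # block 610: read the RAS
        _, empty = ras.pop()
    else:                                      # blocks 604/606: read the BTB
        empty = btb.get(branch['address'], False)
    if not empty:                              # block 612: flag not set
        return False
    return tlb_ok and cached_ok                # blocks 614/616/618

b = {'predicted_taken': True, 'kind': 'other', 'address': 0x4000}
print(may_skip_segment(b, {0x4000: True}, [], tlb_ok=True, cached_ok=True))
# → True: empty flag set, and both lines are translatable and cached
```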
Fig. 7 is a schematic block diagram of an information handling system 700, which information handling system 700 may include a semiconductor device formed in accordance with the principles of the disclosed subject matter.
Referring to FIG. 7, an information handling system 700 may include one or more devices constructed in accordance with the principles of the disclosed subject matter. In another embodiment, information handling system 700 may employ or perform one or more techniques in accordance with the principles of the disclosed subject matter.
In various embodiments, the information handling system 700 may include computing devices such as laptop computers, desktop computers, workstations, servers, blade servers, personal digital assistants, smart phones, tablets, and other suitable computers or virtual machines or virtual computing devices thereof. In various embodiments, information handling system 700 may be used by a user (not shown).
Information handling system 700 according to the disclosed subject matter may also include a Central Processing Unit (CPU), logic, or processor 710. In some embodiments, the processor 710 may include one or more Functional Unit Blocks (FUBs) or Combinational Logic Blocks (CLBs) 715. In such embodiments, a combinational logic block may include various Boolean logic operations (e.g., NAND, NOR, NOT, XOR), stabilizing logic devices (e.g., flip-flops, latches), other logic devices, or a combination thereof. These combinational logic operations may be configured in a simple or complex fashion to process input signals to achieve a desired result. It should be understood that while some illustrative examples of synchronous combinational logic operations are described, the disclosed subject matter is not so limited and may include asynchronous operations or a mixture thereof. In one embodiment, the combinational logic operations may comprise a plurality of Complementary Metal Oxide Semiconductor (CMOS) transistors. In various embodiments, these CMOS transistors may be arranged into gates that perform the logical operations; it should be understood, however, that other technologies may be used and are within the scope of the disclosed subject matter.
The information handling system 700 according to the disclosed subject matter may also include volatile memory 720 (e.g., Random Access Memory (RAM)). Information handling system 700 according to the disclosed subject matter may also include non-volatile memory 730 (e.g., hard disk drive, optical memory, NAND, or flash memory). In some embodiments, volatile memory 720, non-volatile memory 730, or combinations or portions thereof, may be referred to as "storage media". In various embodiments, the volatile memory 720 and/or the nonvolatile memory 730 may be configured to store data in a semi-permanent or substantially permanent form.
In various embodiments, the information handling system 700 may include one or more network interfaces 740 configured to allow the information handling system 700 to become part of and communicate via a communication network. Examples of Wi-Fi protocols may include, but are not limited to, Institute of Electrical and Electronics Engineers (IEEE) 802.11g and IEEE 802.11n. Examples of cellular protocols may include, but are not limited to: IEEE 802.16m (also known as WirelessMAN-Advanced), Long Term Evolution (LTE) Advanced, Enhanced Data rates for GSM (Global System for Mobile Communications) Evolution (EDGE), and Evolved High-Speed Packet Access (HSPA+). Examples of wired protocols may include, but are not limited to, IEEE 802.3 (also known as Ethernet), Fibre Channel, and power line communication (e.g., HomePlug, IEEE 1901). It is to be understood that the above are merely some illustrative examples and that the disclosed subject matter is not so limited.
The information processing system 700 according to the disclosed subject matter may also include a user interface unit 750 (e.g., a display adapter, a haptic interface, a human interface device). In various embodiments, the user interface unit 750 may be configured to receive input from a user and/or provide output to a user. Other kinds of devices may also be used to provide for interaction with the user; for example, feedback provided to the user can be any form of sensory feedback, such as visual feedback, auditory feedback, or tactile feedback; and input from the user may be received in any form, including acoustic, speech, or tactile input.
In various embodiments, the information handling system 700 may include one or more other devices or hardware components 760 (e.g., a display or monitor, a keyboard, a mouse, a camera, a fingerprint reader, a video processor). It is to be understood that the above are merely some illustrative examples and that the disclosed subject matter is not so limited.
The information handling system 700 according to the disclosed subject matter may also include one or more system buses 705. In such embodiments, the system bus 705 may be configured to communicatively couple the processor 710, the volatile memory 720, the non-volatile memory 730, the network interface 740, the user interface unit 750, and the one or more hardware components 760. Data processed by the processor 710 or data input from outside the non-volatile memory 730 may be stored in the non-volatile memory 730 or the volatile memory 720.
In various embodiments, information handling system 700 may include or execute one or more software components 770. In some embodiments, software components 770 may include an Operating System (OS) and/or applications. In some embodiments, the OS may be configured to provide one or more services to applications and manage or act as an intermediary between the applications and various hardware components of the information processing system 700 (e.g., the processor 710, the network interface 740). In such embodiments, the information handling system 700 may include one or more native applications that may be installed locally (e.g., within the non-volatile memory 730) and configured to be executed directly by the processor 710 and to interact directly with the OS. In such embodiments, the native application may comprise pre-compiled machine executable code. In some embodiments, the native application may include a script interpreter (e.g., C shell (csh), AppleScript, AutoHotkey) or a virtual execution machine (VM) (e.g., Java virtual machine, Microsoft common language runtime) configured to convert source or object code into executable code that may then be executed by the processor 710.
The above semiconductor devices may be packaged using various packaging techniques. For example, Package on Package (POP) technology, Ball Grid Array (BGA) technology, Chip Scale Package (CSP) technology, Plastic Leaded Chip Carrier (PLCC) technology, Plastic Dual In-line Package (PDIP) technology, die-in-waffle-pack technology, die-in-wafer-form technology, Chip On Board (COB) technology, Ceramic Dual In-line Package (CERDIP) technology, Plastic Metric Quad Flat Pack (PMQFP) technology, Plastic Quad Flat Pack (PQFP) technology, Small Outline IC (SOIC) technology, Shrink Small Outline Package (SSOP) technology, Thin Small Outline Package (TSOP) technology, Thin Quad Flat Pack (TQFP) technology, System In Package (SIP) technology, Multi-Chip Package (MCP) technology, Wafer-level Fabricated Package (WFP) technology, Wafer-level Processed Stack Package (WSP) technology, or any other technique known to those skilled in the art may be used to package a semiconductor device constructed in accordance with the principles of the disclosed subject matter.
Method steps can be performed by one or more programmable processors executing a computer program to perform functions by operating on input data and generating output. Method steps can also be performed by, and apparatus can be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).
In various embodiments, a computer-readable medium may include instructions that, when executed, cause an apparatus to perform at least a portion of the method steps. In some embodiments, the computer readable medium may be included in magnetic media, optical media, other media, or a combination thereof (e.g., CD-ROM, hard drive, read-only memory, flash drive). In such embodiments, the computer-readable medium may be an article of manufacture that is embodied, both tangible and non-transitory.
While the principles of the disclosed subject matter have been described with reference to example embodiments, it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the disclosed concepts. Accordingly, it should be understood that the above embodiments are not limiting, but merely illustrative. Accordingly, the scope of the disclosed concept is to be determined by the broadest permissible interpretation of the following claims and their equivalents, and shall not be restricted or limited by the foregoing description. It is, therefore, to be understood that the appended claims are intended to cover all such modifications and changes as fall within the scope of the embodiments.

Claims (20)

1. An apparatus, comprising:
branch prediction circuitry configured to predict whether a branch instruction will be taken or not taken;
a branch target buffer circuit configured to store a memory segment empty flag indicating whether a memory segment following a target address includes at least one other branch instruction, wherein the memory segment empty flag is created during a commit phase of a prior occurrence of the branch instruction;
wherein the branch prediction circuit is configured to skip the memory segment if the memory segment empty flag indicates the absence of other branch instructions.
2. The apparatus of claim 1, wherein the branch prediction circuit is configured to:
determining whether a next memory segment is stored in the instruction cache and the branch target buffer circuit; and
if it is determined that the next memory segment is stored in the instruction cache and branch target buffer circuitry, the memory segment is skipped if the memory segment empty flag indicates the absence of a branch instruction.
3. The apparatus of claim 1, wherein the branch prediction circuit is configured to move to a next instruction within the memory segment if the memory segment includes at least one other branch instruction after the target address.
4. The apparatus of claim 1, wherein a memory segment is a cache line.
5. The apparatus of claim 1, wherein the branch prediction circuit is configured to determine whether the branch instruction is one of a call instruction or a return instruction.
6. The apparatus of claim 5, further comprising: return address stack circuitry configured to store a memory segment empty flag for each return address; and
wherein, if the branch instruction is a call instruction, the apparatus is configured to:
determining whether a memory segment following the associated return instruction includes at least one other branch instruction, and
the determination result is stored as a memory segment empty flag within the return address stack circuit.
7. The apparatus of claim 5, further comprising: return address stack circuitry configured to store a memory segment empty flag for each return address; and
wherein, if the branch instruction is a return instruction, the apparatus is configured to:
the memory segment empty flag is read from the return address stack circuit.
8. The apparatus of claim 1, wherein the branch prediction circuit is configured to:
determining whether physical address translation of a next memory segment and a subsequent sequential memory segment is available; and
if it is determined that physical address translation of the next memory segment and subsequent sequential memory segments is available, skipping a memory segment is performed if the memory segment empty flag indicates the absence of a branch instruction.
9. An apparatus, comprising:
a branch detection circuit configured to detect the presence of at least one branch instruction stored within a portion of a memory segment during a commit phase of a current instruction; and
a BTB circuit configured to store:
the branch instruction address, and
a segment empty flag indicating whether a portion of the segment of memory following the target address includes at least one other branch instruction.
10. The apparatus of claim 9, wherein a memory segment is a cache line.
11. The apparatus of claim 9, wherein the apparatus comprises a commit queue circuit;
wherein the commit queue circuitry is configured to store current commit instructions in chronological order.
12. The apparatus of claim 9, wherein the apparatus comprises a last-committed branch store configured to store previously committed branch instructions.
13. The apparatus of claim 12, wherein the branch detection circuit is configured to:
determining whether the current instruction is a branch instruction; and
if the current instruction is a branch instruction, the current instruction is stored in a last commit branch memory.
14. The apparatus of claim 9, wherein the branch detection circuit is configured to: if the current instruction is not a branch instruction, a determination is made as to whether the last committed branch instruction previously stored is still valid.
15. The apparatus of claim 14, wherein the branch detection circuit is configured to: if the current instruction is not a branch instruction and the previously stored last-committed branch instruction is still valid, a memory segment empty flag associated with the previously stored last-committed branch instruction is set in the branch target buffer circuitry.
16. The apparatus of claim 14, wherein the branch detection circuit is configured to: if the current instruction is not a branch instruction, the previously stored last-committed branch instruction is marked as invalid.
17. The apparatus of claim 9, wherein the BTB circuit comprises a graph-based BTB circuit.
18. The apparatus of claim 9, wherein the memory segment empty flag indicates that a plurality of memory segments or portions thereof following the target address do not include at least one other branch instruction.
19. A system, comprising:
a branch detection circuit configured to detect the presence of at least one branch instruction stored within a portion of the memory segment during a commit phase of a current commit instruction;
a BTB circuit configured to store:
the branch instruction address, and
a memory segment empty flag indicating whether a portion of the memory segment following the target address includes at least one other branch instruction; and
a branch prediction circuit configured to predict whether a branch instruction will be taken or not taken, and wherein the branch prediction circuit is configured to skip a memory segment if the associated memory segment empty flag indicates the absence of a branch instruction.
20. The system of claim 19, wherein the memory segment empty flag is valid only for taken branch instructions.
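The system of claims 19 and 20 can be illustrated with a small, hypothetical Python sketch of the prediction-side skip: when a taken branch's BTB entry carries the memory segment empty flag, the predictor can jump past the branch-free segment instead of scanning it. The 64-byte segment size and all names here are assumptions for illustration; the claims do not fix them:

```python
from collections import namedtuple

# Illustrative BTB entry: predicted target plus the memory segment empty flag.
BTBEntry = namedtuple("BTBEntry", ["target_addr", "segment_empty"])

SEGMENT_SIZE = 64  # assumed segment (e.g., cacheline) size in bytes

def next_prediction_address(btb, branch_addr, taken):
    """Return the next address the predictor examines after branch_addr.

    Per claim 20 the flag only matters for taken branches; per claim 19 a
    segment whose flag says it holds no branch can be skipped entirely."""
    entry = btb.get(branch_addr)
    if entry is None or not taken:
        return branch_addr + 4           # fall through to the next instruction
    if entry.segment_empty:
        # Skip past the branch-free segment that follows the target address.
        return (entry.target_addr // SEGMENT_SIZE + 1) * SEGMENT_SIZE
    return entry.target_addr
```

The payoff suggested by the patent's title is throughput: segments known to contain no branches need not occupy a branch-prediction slot at all.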
CN202010439722.2A 2019-05-23 2020-05-22 Apparatus and system for improving branch prediction throughput Pending CN111984325A (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
US201962852286P 2019-05-23 2019-05-23
US62/852,286 2019-05-23
US16/561,004 US11182166B2 (en) 2019-05-23 2019-09-04 Branch prediction throughput by skipping over cachelines without branches
US16/561,004 2019-09-04

Publications (1)

Publication Number Publication Date
CN111984325A true CN111984325A (en) 2020-11-24

Family

ID=73442211

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010439722.2A Pending CN111984325A (en) 2019-05-23 2020-05-22 Apparatus and system for improving branch prediction throughput

Country Status (1)

Country Link
CN (1) CN111984325A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112579176A (en) * 2020-12-17 2021-03-30 海光信息技术股份有限公司 Apparatus and method for recording address history

Similar Documents

Publication Publication Date Title
US10782977B2 (en) Fault detecting and fault tolerant multi-threaded processors
US10768930B2 (en) Processor supporting arithmetic instructions with branch on overflow and methods
US10296463B2 (en) Instruction prefetcher dynamically controlled by readily available prefetcher accuracy
TWI810450B (en) Apparatus and system for improving branch prediction throughput by skipping over cachelines without branches
WO2006089194A2 (en) Unaligned memory access prediction
KR20140113444A (en) Processors, methods, and systems to relax synchronization of accesses to shared memory
US9274970B2 (en) Method and apparatus for handling processor read-after-write hazards with cache misses
US20150268961A1 (en) Decoupling l2 btb from l2 cache to accelerate search for miss after miss
US11113063B2 (en) Method and apparatus to control the use of hierarchical branch predictors based on the effectiveness of their results
US20140281387A1 (en) Converting conditional short forward branches to computationally equivalent predicated instructions
US20190102197A1 (en) System and method for merging divide and multiply-subtract operations
KR20190092245A (en) System and method of reducing computer processor power consumption using micro-btb verified edge feature
CN111381869B (en) Micro-operation cache using predictive allocation
US20150227371A1 (en) Processors with Support for Compact Branch Instructions & Methods
CN108572812B (en) Memory load and Arithmetic Load Unit (ALU) fusion
CN111984325A (en) Apparatus and system for improving branch prediction throughput
US10372902B2 (en) Control flow integrity
CN112130905A (en) Computing device and computing system
US11385873B2 (en) Control speculation in dataflow graphs
CN115576607A (en) Count to NULL for micro-architectural return predictor security
US20170277539A1 (en) Exception handling in processor using branch delay slot instruction set architecture

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination