CN111475203B - Instruction reading method for processor and corresponding processor - Google Patents

Instruction reading method for processor and corresponding processor

Info

Publication number
CN111475203B
Authority
CN
China
Prior art keywords
instruction
instructions
cache
memory
buffer
Prior art date
Legal status
Active
Application number
CN202010258353.7A
Other languages
Chinese (zh)
Other versions
CN111475203A (en)
Inventor
倪永良
Current Assignee
Xiaohua Semiconductor Co ltd
Original Assignee
Xiaohua Semiconductor Co ltd
Priority date
Filing date
Publication date
Application filed by Xiaohua Semiconductor Co ltd
Priority to CN202010258353.7A
Publication of CN111475203A
Application granted
Publication of CN111475203B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00: Arrangements for program control, e.g. control units
    • G06F9/06: Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30: Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/30003: Arrangements for executing specific machine instructions
    • G06F9/3004: Arrangements for executing specific machine instructions to perform operations on memory
    • G06F9/30047: Prefetch instructions; cache control instructions
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F12/00: Accessing, addressing or allocating within memory systems or architectures
    • G06F12/02: Addressing or allocation; Relocation
    • G06F12/08: Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems
    • G06F12/0802: Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches
    • G06F12/0875: Addressing of a memory level in which the access to the desired data or data block requires associative addressing means, e.g. caches, with dedicated cache, e.g. instruction or stack
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Memory System Of A Hierarchy Structure (AREA)
  • Advance Control (AREA)

Abstract

The invention relates to an instruction reading method for a processor, and further to a corresponding processor. In this method, instructions that the prefetch unit prefetches and hits are placed in a buffer rather than in the cache, which reduces cache occupancy and improves cache utilization. At the same time, because the cache stores the instructions fetched from the ROM on the first read and the remaining sequential instructions can be prefetched by the prefetch unit, when a program is called a second time the instruction-fetch speed is essentially the same as with a conventional cache and is unaffected by wait cycles.

Description

Instruction reading method for processor and corresponding processor
Technical Field
The present invention relates generally to the field of processors, and more particularly, to an instruction fetching method for a processor. The invention further relates to a processor.
Background
With the progress of semiconductor processes, the processing speed of processors such as general-purpose processors and micro-controller units (MCUs) has increased greatly, their frequencies rising from a few MHz in the past to several GHz. At the same time, however, the instruction access speed of memory devices such as computer hard disks and read-only memories (ROMs) has not kept pace with the instruction execution speed of processors: a hard disk or ROM is read at only tens to a little over a hundred MB/s, far below the instruction processing rate of the processor, which at GHz frequencies reaches several billion instructions per second (thousands of MIPS). Moreover, the speed gap between processors and memory tends to widen.
In order to bridge the speed gap between processor and memory and so reduce processor waiting time, various solutions have been proposed in the prior art. These schemes are described below taking as an example a CPU reading from a ROM that requires 2 wait cycles:
1. When no optimization is performed, the CPU waits 2 cycles on every instruction fetch; that is, it needs an average of 3 cycles to fetch one instruction.
2. Increase the bit width of the ROM and add a buffer. When the ROM's bit width is increased to 4 instructions, 4 instructions can be read per access, and the buffer stores the 4 instructions read most recently. When the CPU fetches instructions sequentially, it waits 2 cycles only when fetching the first of the 4 instructions; the remaining 3 instructions are provided directly from the buffer without waiting. On average, then, 4 instructions are fetched in 6 cycles. Increasing the ROM bit width and adding a buffer thus improves the CPU's instruction-fetch efficiency.
3. Add a prefetch unit (Prefetch Unit). The prefetch unit sits between the buffer and the ROM and performs prefetching, i.e. reading and saving several instructions in advance, so as to reduce the CPU's waiting time. When an instruction prefetched by the prefetch unit is exactly the instruction the CPU wants next, this is called a prefetch hit; otherwise it is a miss. On a miss, the CPU must read the instruction from the ROM. With a prefetch unit, only the first instruction of a sequentially executed program incurs wait cycles; thereafter up to an average of one instruction per cycle can be executed. However, since the prefetch unit assumes sequential execution, 2 wait cycles must be inserted whenever a jump occurs.
4. Add a cache (Cache). The cache sits between the CPU and the ROM and stores all instructions the CPU has executed. When the CPU executes these instructions again, they are provided directly from the cache and need not be read from the slow ROM. Once a program is stored entirely in the cache, repeated execution of it, whether sequential or with jumps, leaves CPU performance unaffected by ROM wait cycles. To improve performance on first execution, a cache is typically combined with a prefetch unit.
As can be appreciated from the above, a cache greatly reduces CPU waiting time. However, while the cache's access speed is very high, so is its cost. It is therefore desirable to achieve the lowest possible processor latency with the smallest possible cache.
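The cycle counts quoted in schemes 1 and 2 above can be reproduced with a small model (an illustrative sketch with hypothetical function names, not part of the patent): a ROM access costs 1 cycle plus a fixed number of wait cycles, and a wide ROM amortizes that cost over a whole instruction group.

```python
# Sketch of the wait-cycle arithmetic from the background section.
# Assumption: a ROM access costs 1 cycle plus WAIT wait cycles.
WAIT = 2  # wait cycles per ROM access, as in the example

def avg_cycles_no_optimization():
    # Scheme 1: every fetch is a fresh ROM read -> 1 + WAIT cycles each.
    return 1 + WAIT

def avg_cycles_wide_rom_with_buffer(width=4):
    # Scheme 2: one ROM read fills the buffer with `width` instructions;
    # the remaining width - 1 fetches come from the buffer at 1 cycle each.
    return ((1 + WAIT) + (width - 1)) / width

print(avg_cycles_no_optimization())        # 3 cycles per instruction
print(avg_cycles_wide_rom_with_buffer())   # 1.5 cycles per instruction
```

This matches the figures in the text: 3 cycles per instruction without optimization, and 6 cycles per 4 instructions (1.5 on average) with a 4-instruction-wide ROM and a buffer.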
Disclosure of Invention
Starting from the prior art, the object of the present invention is to provide an instruction fetch method for a processor by which processor latency can be reduced as much as possible while cache occupancy is also reduced, thereby lowering the required cache capacity or raising the cache's utilization efficiency.
According to the invention, this task is solved by an instruction fetch method for a processor, comprising the following steps:
providing, by an arithmetic unit, an instruction address of a first instruction to be read;
providing, by the buffer, the first instruction if the first instruction is present in the buffer; otherwise:
if the first instruction is present in the cache, providing, by the cache, the first instruction and storing, by the buffer, a group of instructions cached in the cache that includes the first instruction; otherwise:
if the first instruction is present in the prefetch unit, providing, by the prefetch unit, the first instruction and storing, by the buffer, an instruction group including the first instruction; otherwise:
an instruction group including the first instruction is read by the memory according to an instruction address of the first instruction and provided to the arithmetic unit, and the instruction group is cached by the cache and stored by the buffer.
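A minimal sketch of the lookup order these steps describe (function and variable names are hypothetical; the sketch additionally assumes instruction groups are aligned blocks of GROUP consecutive instructions, which the patent does not mandate):

```python
# Lookup order of the claimed method: buffer -> cache -> prefetch -> memory.
# Each level maps a group base address to a list of GROUP instructions.
GROUP = 4

def group_base(addr):
    # Base address of the aligned group containing addr (sketch assumption).
    return addr - (addr % GROUP)

def fetch(addr, buffer, cache, prefetch, rom):
    base = group_base(addr)
    if base in buffer:                    # buffer hit: provide directly
        return buffer[base][addr - base], "buffer"
    if base in cache:                     # cache hit: group is copied to buffer
        buffer[base] = cache[base]
        return buffer[base][addr - base], "cache"
    if base in prefetch:                  # prefetch hit: group goes to the
        buffer[base] = prefetch[base]     # buffer only, NOT to the cache
        return buffer[base][addr - base], "prefetch"
    group = rom[base]                     # miss everywhere: read from memory;
    cache[base] = group                   # the group is cached AND buffered
    buffer[base] = group
    return group[addr - base], "memory"
```

Note how the prefetch-hit branch writes only to the buffer; this is the distinguishing feature of the method.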
In the present invention, "arithmetic unit" refers to a unit in a processor for processing or executing instructions. A "processor" should be broadly interpreted as a device that executes instructions, such as a general purpose processor, a special purpose processor, an MCU, and so forth.
In one embodiment of the invention, provision is made for:
storing, by the buffer, a set of instructions including a first instruction includes: storing, by a buffer, the group of instructions and an instruction address of the group of instructions in association; and/or
Caching, by a cache, a set of instructions including a first instruction includes: the instruction group and the instruction address of the instruction group are stored in association with each other by the cache.
In a further embodiment of the invention, provision is made for:
the instruction group is a plurality of instructions having consecutive storage locations.
For example, the instruction group consists of the instructions at 4 consecutive addresses starting from the target instruction. Other instruction-fetch, prefetch, and cache-fetch schemes are also conceivable.
In a further embodiment of the invention, provision is made for:
the bit width of the memory is n times the instruction length and the storage capacity of the buffer is n times the instruction length, where n =2 k K is an integer and k is not less than 0; and/or
The instruction group comprises n instructions, wherein n =2 k K is an integer and k is not less than 0; and/or
The prefetch unit prefetches n instructions at a time, where n =2 k K is an integer and k is not less than 0.
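The condition n = 2^k with integer k ≥ 0 is the usual power-of-two constraint; a sketch of validating a configured group size (the helper name is hypothetical):

```python
def is_valid_group_size(n):
    # n must be a positive power of two: 1, 2, 4, 8, ...
    # For such n, n & (n - 1) clears the single set bit, giving 0.
    return n >= 1 and (n & (n - 1)) == 0

print([m for m in range(1, 9) if is_valid_group_size(m)])  # [1, 2, 4, 8]
```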
In a further embodiment of the invention, it is provided that the processor is a microcontroller unit MCU and the memory is a read-only memory ROM.
Furthermore, the invention relates to a processor configured to perform the method according to the invention.
In addition, the invention also provides a micro control unit, wherein the memory is a read only memory ROM, and the micro control unit is configured to execute the method according to the invention.
In a second aspect of the invention, the aforementioned task is solved by a processor comprising:
the arithmetic unit is configured to send a target instruction address and receive a target instruction corresponding to the target instruction address for execution;
a buffer configured to store instructions read from the memory, and to store instructions prefetched by the prefetch unit when the target instruction is present in neither the buffer nor the cache;
a cache configured to store instructions each time they are read from memory;
a prefetch unit configured to prefetch instructions from a memory at a set timing; and a memory configured to output, according to the target instruction address, a plurality of stored instructions including the target instruction.
In one embodiment of the invention, it is provided that the memory comprises one or more of the following: SDRAM, DRAM, and read only memory.
The invention has at least the following beneficial effects: (1) by using a prefetch unit together with a buffer, the processor's waiting time is effectively reduced; (2) instructions that the prefetch unit prefetches and hits are placed in the buffer rather than in the cache, which reduces cache occupancy and improves cache utilization; at the same time, because the cache stores the instructions read from the ROM on the first pass and the remaining sequential instructions can be prefetched by the prefetch unit, when the program is called a second time the instruction-fetch speed is essentially the same as with a conventional cache and is unaffected by wait cycles.
Drawings
The invention is further elucidated with reference to the drawings in conjunction with the detailed description.
FIG. 1 illustrates an architecture of a processor according to the present invention;
FIG. 2 illustrates an embodiment according to the present invention; and
fig. 3 shows a flow of the method according to the invention.
Detailed Description
It should be noted that the components in the figures may be exaggerated and not necessarily to scale for illustrative purposes. In the figures, identical or functionally identical components are provided with the same reference symbols.
In the present invention, "disposed on …", "disposed above …", and "disposed over …" do not exclude the presence of an intermediate element in between, unless otherwise specified. Furthermore, "disposed on or above …" merely indicates the relative positional relationship between two components; in certain cases, for example after reversing the product orientation, it can switch to "disposed under or below …", and vice versa.
In the present invention, the embodiments are only intended to illustrate the aspects of the present invention, and should not be construed as limiting.
In the present invention, the terms "a" and "an" do not exclude the presence of a plurality of elements, unless otherwise specified.
It is further noted herein that in embodiments of the present invention, only a portion of the components or assemblies may be shown for clarity and simplicity, but those of ordinary skill in the art will appreciate that, given the teachings of the present invention, required components or assemblies may be added as needed in a particular scenario.
It is also noted herein that, within the scope of the present invention, the terms "same", "equal", and the like do not mean that two values are absolutely equal but allow some reasonable error; that is, these terms also encompass "substantially the same" and "substantially equal". By analogy, in the present invention the directional terms "perpendicular", "parallel", and the like also encompass "substantially perpendicular" and "substantially parallel".
The numbering of the steps of the methods of the present invention does not limit the order in which the method steps are performed. Unless specifically stated, the method steps may be performed in a different order. In particular, in the present application, some acts of the components of the processor may be performed in parallel, e.g., the prefetch instruction act may be performed in parallel with other acts. Thus, the order in which the method steps of the present application are sequenced does not necessarily imply that the associated steps can only be performed in that order.
Furthermore, in the present invention, the term "instruction set" or "set of instructions" may comprise one or more instructions, e.g. depending on parameter limitations such as bandwidth of the memory.
Fig. 1 shows the architecture of a processor 100 according to the invention.
The processor 100 includes, for example, an arithmetic unit (AU) 101, a buffer 102, a cache 103, a prefetch unit 104, and a memory, here a read-only memory ROM 105. These components are in data communication with one another via data lines such as an address bus and a data bus. In the present embodiment, the buffer 102 and the cache 103 read instructions faster than the ROM 105 does, while their storage capacities increase in that order. The bit width of the ROM 105 is, for example, 4 instructions; the capacity of the buffer 102 is likewise, for example, 4 instructions; and the capacity of the cache 103 is, for example, 1 kB, i.e. 1024 bytes. In other embodiments, other bit widths and capacities may be set.
The following components of the processor 100 are described separately.
The operator 101 is configured, for example, to send instruction addresses, such as ROM instruction addresses, to the other components of the processor (the buffer 102, the cache 103, the prefetch unit 104, and the ROM 105), for example via an address bus (not shown), and to receive from those components, for example via a data bus (not shown), the instructions stored at those addresses for execution. The operator 101 may itself be composed of components such as an arithmetic logic unit (ALU), an accumulator, a status register, and general registers. It is configured to execute instructions of one or more instruction sets, such as addition, multiplication, and shift operations.
The buffer 102 is configured to store instructions read from the cache 103 or the prefetch unit 104 or the ROM 105, for example.
The cache (Cache) 103 is configured, for example, to store the instructions each time they are read from the ROM 105. For example, when the target instruction is not present in the cache and the instructions prefetched by the prefetch unit also miss the target instruction (i.e. do not contain it), the instruction must be fetched from the ROM 105; the cache 103 then stores the instructions fetched from the ROM 105.
The prefetch unit 104 is configured, for example, to prefetch instructions from the ROM 105 every instruction cycle or at some other interval (e.g. every two instruction cycles; other timings are also conceivable), and, when the target instruction is not present in the cache, to send prefetch-hit instructions to the buffer 102 for storage. Prefetching is preferably performed when the ROM is idle; it can then proceed in parallel with the actions of other components, such as a processor read. The exact prefetch timing may be determined and optimized for the usage scenario and actual demand.
The ROM 105 stores the instructions available to the processor 100 and outputs the instructions stored at a given address according to the corresponding instruction address on the address bus. In the present embodiment, the bit width of the ROM 105 is, for example, 4 instructions, i.e. 4 instructions can be read per access.
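The 4-instruction bit width means that one ROM access returns the whole group containing the requested address. A sketch of the address-to-group mapping (it assumes aligned groups, which the embodiment does not state explicitly):

```python
WIDTH = 4  # ROM bit width in instructions, as in the embodiment

def rom_read(rom_contents, addr):
    # One access returns the aligned group of WIDTH instructions
    # containing the instruction at `addr`.
    base = (addr // WIDTH) * WIDTH
    return rom_contents[base:base + WIDTH]

program = ["insn%d" % i for i in range(8)]
print(rom_read(program, 5))  # ['insn4', 'insn5', 'insn6', 'insn7']
```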
The method of operation of the processor according to the invention is briefly described below.
1. The arithmetic unit AU supplies the instruction address of the next instruction to be read (hereinafter "this instruction") and issues a read request to the buffer BUF.
2. If the buffer BUF holds this instruction, the buffer BUF provides it directly and the fetch is complete.
3. If the buffer BUF does not hold this instruction, the buffer BUF sends a read request to the cache CACHE; the instruction read back is provided to the arithmetic unit AU, and the instruction group containing it is stored in the buffer BUF.
4. If the cache CACHE holds this instruction, the cache CACHE directly provides the instruction group containing it, and the fetch is complete.
5. If the cache CACHE does not hold this instruction, the cache CACHE sends a read request to the prefetch unit PF; the instruction group containing this instruction that is read back is provided to the buffer BUF, and, if the prefetch unit reports a miss, the group is also stored in the cache CACHE.
6. If the prefetch unit PF holds this instruction, the prefetch unit PF directly provides the instruction group containing it, reports a prefetch hit, and the fetch is complete.
7. If the prefetch unit PF does not hold this instruction, the prefetch unit PF sends a read request to the memory ROM, provides the instruction group containing this instruction that is read back to the cache CACHE, reports a prefetch miss, and the fetch is complete.
8. When, for example, the memory is idle and the next instruction group predicted from the instruction addresses issued by the AU is not yet stored in the prefetch unit, the prefetch unit actively issues a read request to the memory ROM and stores the instruction group read back into itself.
In short, each instruction fetch performs one of the following actions A-D:
A. If the buffer holds the instruction, actions 1 and 2 are performed.
B. If the buffer does not hold the instruction but the cache does, actions 1, 3 and 4 are performed.
C. If neither the buffer nor the cache holds the instruction but the prefetch unit does, actions 1, 3, 5 and 6 are performed.
D. If none of the buffer, cache and prefetch unit holds the instruction, actions 1, 3, 5 and 7 are performed.
Action 8 may be performed concurrently with actions A-C above.
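The complete sequence, actions A-D plus the background prefetch of action 8, can be sketched as a small simulation (the class and all names are hypothetical; prefetch is modeled simply as "fetch the next sequential group"):

```python
GROUP = 4

class Machine:
    """Toy model of the buffer/cache/prefetch/ROM interplay above."""

    def __init__(self, rom):
        self.rom = rom            # {group base address: [instructions]}
        self.buf, self.cache, self.pf = {}, {}, {}

    def fetch(self, addr):
        base = addr - addr % GROUP
        if base in self.buf:                      # action A
            action = "A"
        elif base in self.cache:                  # action B
            self.buf[base] = self.cache[base]
            action = "B"
        elif base in self.pf:                     # action C: prefetch hit goes
            self.buf[base] = self.pf[base]        # to the buffer, not the cache
            action = "C"
        else:                                     # action D: ROM read fills
            self.cache[base] = self.rom[base]     # both cache and buffer
            self.buf[base] = self.rom[base]
            action = "D"
        nxt = base + GROUP                        # action 8: while the ROM is
        if nxt in self.rom:                       # idle, prefetch the next
            self.pf[nxt] = self.rom[nxt]          # sequential group
        return self.buf[base][addr - base], action
```

In a run over two sequential groups, the second group reaches the buffer via the prefetch unit (action C) and never occupies the cache, which is the cache-saving behaviour the scheme aims at.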
Fig. 2 shows an embodiment according to the invention. In this embodiment, the bit width of the ROM 105 is 4 times the instruction length, and the ROM requires 2 read wait cycles.
When subroutine A is called for the first time, the instructions read from the ROM 105 are placed in the cache 103, while the instructions provided on hits of the prefetch unit 104 are placed not in the cache 103 but in the buffer 102. The calls of subroutine A and program B require 4 instruction groups (4 x 4 instructions) to be stored.
When subroutine A is called a second time, the 4 instruction groups stored in the cache 103, combined with the corresponding instruction-fetch actions of the prefetch unit 104, allow the performance of the processor 100 (or CPU) to be unaffected by the ROM 105 access latency.
With this scheme, therefore, the cache capacity that a given subprogram must occupy to keep CPU performance unaffected by ROM access waits is reduced, which effectively increases the usable capacity of the cache.
Fig. 3 shows a flow of a method 300 according to the invention.
In step 302, an instruction address of a first instruction to be fetched is provided by an arithmetic unit.
At step 304, if the first instruction is present in the buffer, providing, by the buffer, the first instruction; otherwise the method proceeds to step 306.
At step 306, if the first instruction is present in the cache, providing, by the cache, the first instruction and storing, by the buffer, a group of instructions cached in the cache including the first instruction; otherwise, the method proceeds to step 308.
At step 308, if the first instruction is present in the prefetch unit, the first instruction is provided by the prefetch unit and the instruction group containing the first instruction stored in the prefetch unit is stored by the buffer, and the prefetch unit outputs "hit" information to the cache; otherwise, if the first instruction is not present in the prefetch unit, the prefetch unit outputs a "miss" to the cache and the method proceeds to step 310.
At step 310, an instruction group including a first instruction is fetched by the memory according to an instruction address of the first instruction and provided to the arithmetic unit, and the instruction group is cached by the cache and stored by the buffer.
Although some embodiments of the present invention have been described herein, those skilled in the art will appreciate that they have been presented by way of example only. Numerous variations, substitutions and modifications will occur to those skilled in the art in light of the teachings of the present invention without departing from the scope thereof. It is intended that the following claims define the scope of the invention and that methods and structures within the scope of these claims and their equivalents be covered thereby.

Claims (10)

1. An instruction fetch method for a processor, comprising the steps of:
providing, by an arithmetic unit, an instruction address of a first instruction to be read;
providing, by the buffer, the first instruction if the first instruction is present in the buffer; otherwise:
if the first instruction is present in the cache, providing, by the cache, the first instruction and storing, by the buffer, a group of instructions cached in the cache including the first instruction; otherwise:
if the first instruction is present in the prefetch unit, providing, by the prefetch unit, the first instruction and storing, by the buffer, an instruction group including the first instruction, without the instruction group being stored by the cache; otherwise:
an instruction group including the first instruction is read by the memory according to an instruction address of the first instruction and provided to the arithmetic unit, and the instruction group is cached by the cache and stored by the buffer.
2. The method of claim 1, wherein:
storing, by the buffer, a set of instructions including a first instruction includes: storing, by a buffer, the group of instructions and the instruction addresses of the group of instructions in association; and/or
Caching, by a cache, a set of instructions including a first instruction includes: the group of instructions and the instruction addresses of the group of instructions are stored in association with each other by a cache.
3. The method of claim 1, wherein: the instruction group is a plurality of instructions whose storage locations are consecutive.
4. The method of claim 1, wherein:
the bit width of the memory is n times the instruction length and the storage capacity of the buffer is n times the instruction length, where n = 2^k, k is an integer and k ≥ 0; and/or
the instruction group comprises n instructions, where n = 2^k, k is an integer and k ≥ 0; and/or
the prefetch unit prefetches n instructions at a time, where n = 2^k, k is an integer and k ≥ 0.
5. The method of claim 1, wherein the processor is a Micro Control Unit (MCU) and the memory is a Read Only Memory (ROM).
6. The method of claim 1, further comprising the steps of:
the second set of instructions is prefetched by the prefetch unit based on the instruction address of the first instruction and the prediction algorithm when the memory is idle.
7. A processor configured to perform the method of one of claims 1 to 6.
8. A micro control unit, wherein the memory is a read only memory ROM, the micro control unit being configured to perform the method according to one of claims 1 to 6.
9. A processor, comprising:
the arithmetic unit is configured to send a target instruction address and receive a target instruction corresponding to the target instruction address for execution;
a buffer configured to store instructions read from the memory, and to store instructions prefetched by the prefetch unit when the target instruction is present in neither the buffer nor the cache;
a cache configured to store instructions each time they are read from memory;
a prefetch unit configured to prefetch instructions from the memory at a set time, wherein instructions provided on a prefetch unit hit are not placed in the cache but are placed in the buffer; and
a memory configured to output, according to the target instruction address, a plurality of stored instructions including the target instruction.
10. The processor of claim 9, wherein the memory comprises one or more of: SDRAM, DRAM, and read only memory.
CN202010258353.7A 2020-04-03 2020-04-03 Instruction reading method for processor and corresponding processor Active CN111475203B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010258353.7A CN111475203B (en) 2020-04-03 2020-04-03 Instruction reading method for processor and corresponding processor


Publications (2)

Publication Number Publication Date
CN111475203A (en) 2020-07-31
CN111475203B (en) 2023-03-14

Family

ID=71749804

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010258353.7A Active CN111475203B (en) 2020-04-03 2020-04-03 Instruction reading method for processor and corresponding processor

Country Status (1)

Country Link
CN (1) CN111475203B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1462388A (en) * 2001-02-20 2003-12-17 皇家菲利浦电子有限公司 Cyclic prefetching of sequential memory
CN1484157A (en) * 2002-09-20 2004-03-24 联发科技股份有限公司 Embedded system and instruction prefetching device and method thereof
CN101228507A (en) * 2005-06-10 2008-07-23 高通股份有限公司 Method and apparatus for managing instruction flushing in a microprocessor's instruction pipeline
CN101526895A (en) * 2009-01-22 2009-09-09 杭州中天微系统有限公司 High-performance low-power-consumption embedded processor based on instruction dual-issue
CN102169428A (en) * 2010-06-22 2011-08-31 上海盈方微电子有限公司 Dynamically configurable instruction access accelerator
CN104049954A (en) * 2013-03-14 2014-09-17 英特尔公司 Multiple Data Element-To-Multiple Data Element Comparison Processors, Methods, Systems, and Instructions
CN107479860A (en) * 2016-06-07 2017-12-15 华为技术有限公司 Processor chip and instruction cache prefetching method
CN110442382A (en) * 2019-07-31 2019-11-12 西安芯海微电子科技有限公司 Prefetch buffer control method, device, chip and computer readable storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7032097B2 (en) * 2003-04-24 2006-04-18 International Business Machines Corporation Zero cycle penalty in selecting instructions in prefetch buffer in the event of a miss in the instruction cache
JP2009230374A (en) * 2008-03-21 2009-10-08 Fujitsu Ltd Information processor, program, and instruction sequence generation method
US8533437B2 (en) * 2009-06-01 2013-09-10 Via Technologies, Inc. Guaranteed prefetch instruction
US10719321B2 (en) * 2015-09-19 2020-07-21 Microsoft Technology Licensing, Llc Prefetching instruction blocks
GB201701841D0 (en) * 2017-02-03 2017-03-22 Univ Edinburgh Branch target buffer for a data processing apparatus

Also Published As

Publication number Publication date
CN111475203A (en) 2020-07-31

Similar Documents

Publication Publication Date Title
CN107479860B (en) Processor chip and instruction cache prefetching method
US9141553B2 (en) High-performance cache system and method
US6131155A (en) Programmer-visible uncached load/store unit having burst capability
US6665749B1 (en) Bus protocol for efficiently transferring vector data
US6564313B1 (en) System and method for efficient instruction prefetching based on loop periods
US6513107B1 (en) Vector transfer system generating address error exception when vector to be transferred does not start and end on same memory page
US9396117B2 (en) Instruction cache power reduction
US6643755B2 (en) Cyclically sequential memory prefetch
US7152170B2 (en) Simultaneous multi-threading processor circuits and computer program products configured to operate at different performance levels based on a number of operating threads and methods of operating
US20090177842A1 (en) Data processing system and method for prefetching data and/or instructions
JP5625809B2 (en) Arithmetic processing apparatus, information processing apparatus and control method
US20190079771A1 (en) Lookahead out-of-order instruction fetch apparatus for microprocessors
US20040230780A1 (en) Dynamically adaptive associativity of a branch target buffer (BTB)
US20140019690A1 (en) Processor, information processing apparatus, and control method of processor
US8266379B2 (en) Multithreaded processor with multiple caches
CN111475203B (en) Instruction reading method for processor and corresponding processor
US20150193348A1 (en) High-performance data cache system and method
CN112148366A (en) FLASH acceleration method for reducing chip power consumption and improving performance
JP4354001B1 (en) Memory control circuit and integrated circuit
CN112559389A (en) Storage control device, processing device, computer system, and storage control method
US20170147498A1 (en) System and method for updating an instruction cache following a branch instruction in a semiconductor device
CN112395000B (en) Data preloading method and instruction processing device
JP4413663B2 (en) Instruction cache system
CN111399913B (en) Processor accelerated instruction fetching method based on prefetching
CN111124494B (en) Method and circuit for accelerating unconditional jump in CPU

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20220729

Address after: 201210 floor 10, block a, building 1, No. 1867, Zhongke Road, China (Shanghai) pilot Free Trade Zone, Pudong New Area, Shanghai

Applicant after: Xiaohua Semiconductor Co.,Ltd.

Address before: 201210 8th floor, block a, 1867 Zhongke Road, Pudong New Area, Shanghai

Applicant before: HUADA SEMICONDUCTOR Co.,Ltd.

GR01 Patent grant