WO2016188392A1 - Generation system and method of data address - Google Patents


Info

Publication number
WO2016188392A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
address
instruction
step size
branch
Prior art date
Application number
PCT/CN2016/083018
Other languages
French (fr)
Chinese (zh)
Inventor
林正浩 (Lin Zhenghao)
Original Assignee
上海芯豪微电子有限公司 (Shanghai Xinhao Microelectronics Co., Ltd.)
Priority date
Filing date
Publication date
Application filed by 上海芯豪微电子有限公司 (Shanghai Xinhao Microelectronics Co., Ltd.)
Publication of WO2016188392A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F 12/02 Addressing or allocation; Relocation
    • G06F 12/08 Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems

Definitions

  • The invention relates to the fields of computers, communications, and integrated circuits.
  • The role of the cache in a processor system is to hold a copy of a portion of the memory contents, so that those contents can be accessed quickly by the processor core, ensuring continuous operation of the pipeline.
  • A cache is generally divided into an instruction cache and a data cache.
  • Instruction addresses exhibit better locality, so the hit rate of the instruction cache is usually higher; the data addresses generated when data access instructions are executed have poorer locality, resulting in a low data cache hit rate.
  • However, the data addresses of data access instructions located in loop code follow a certain pattern.
  • Specifically, the difference between consecutive data addresses of such an instruction is a constant, which can be positive, negative, or zero.
  • This constant is the data step size (stride) of the corresponding data access instruction.
  • By recording the data addresses of two successive executions of the same data access instruction, the stride is obtained by subtracting the earlier data address from the later one; adding the stride to the later data address yields the predicted data address for the next execution of the instruction. In this way, the corresponding data can be prefetched from external memory into the data cache ahead of time according to the predicted data address.
  • When the instruction is actually executed, if the actual data address equals the predicted one, the data cache must hit; if the two are not equal, whether the data cache hits is determined by the actual data address.
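The prediction rule described in the bullets above can be sketched in a few lines (a minimal illustration, not from the patent text; the function name is mine):

```python
def predict_next(prev_addr, last_addr):
    """Classic stride prediction: the stride is the difference between the
    two most recent data addresses of the same data access instruction,
    and the next address is predicted by adding it to the latest one."""
    stride = last_addr - prev_addr   # may be positive, negative, or zero
    return last_addr + stride
```

For a load walking a 4-byte array, `predict_next(0x1000, 0x1004)` yields `0x1008`, and prefetching that line before the next iteration turns a likely miss into a hit.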
  • Although the stride is fixed in most cases, it can still change in some cases.
  • For example, for a data access instruction inside nested loops, moving from an inner loop to an outer loop can add an extra increment to the data address, so that the difference between the current data address and the previous one is no longer the previous stride.
  • When execution returns to the inner loop, the increment of the data address returns to its previous value. In such cases, if only a single stride is recorded, an incorrect predicted data address is generated whenever the loop level changes, limiting the improvement in the data cache hit rate.
  • The method and system proposed by the present invention directly address one or more of the above or other difficulties.
  • The present invention proposes a data address generation system and method that can generate a data address ahead of the processor core, access the data memory, and read the data for processing by the processor core.
  • The data address generation system and method learn the address of each data access instruction executed by the processor core and the corresponding data address its execution generates, as well as the increment between the data addresses produced by two successive executions of the same data access instruction, and store them in a step size table.
  • The step size table combines a one-dimensional and a two-dimensional data structure.
  • The one-dimensional structure is addressed by the instruction address of the data access instruction; its content is the data address.
  • One dimension of the two-dimensional structure is addressed by the instruction address of the data access instruction, and the other dimension is addressed by the instruction address of a branch instruction; its content is the data address increment (step size).
  • Such a step size table maps the instruction address of a data access instruction to the corresponding data address. The mapping is not fixed, but dynamic, varying with the number of executions of the data access instruction and its loop path.
  • The system and method take the instruction address of a taken backward branch instruction as an upper limit and its branch target instruction address as a lower limit, and, according to the loop state indicated by the current branch, automatically increment the data addresses corresponding to the instruction addresses between the lower and upper limits, access the data memory with the updated data addresses, and read the corresponding data for the processor core.
  • To this end, the present invention provides a data address generation system, wherein:
  • the data address generation system learns and records, indexed by the data access instruction address, the data address generated when the processor core executes the data access instruction, and stores in the step size table the data address increment between two executions of that instruction;
  • the data address generation system addresses the step size table contents with the data access instruction address to generate a new data address, accesses the data memory, and obtains the data for use by the processor core.
  • The content of an entry in the step size table is the data address; the step size table is addressed by the data access instruction address.
  • The content of an entry in the step size table is the data address increment; one dimension of the step size table is addressed by the data access instruction address, and the other dimension is addressed by the instruction address of the taken backward branch.
  • The data address generation system generates an address as follows:
  • the data address generation system adds the data address and the data address increment stored in the step size table to generate a new data address; the new data address is stored back into the step size table.
  • The data address generation system looks up the step size of a data access instruction whose instruction address lies between the address of a taken backward branch instruction and its branch target address; the system generates a new data memory address from the step size table contents, accesses the data memory with that address, fetches the data for processing by the processor core, and stores the new data address back into the step size table.
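The step size table just claimed can be modelled in software roughly as follows (a behavioural sketch under my own naming, not the patent's hardware):

```python
class StepTable:
    """One-dimensional part: load PC -> last data address.
    Two-dimensional part: (load PC, taken backward-branch PC) -> stride."""

    def __init__(self):
        self.addr = {}   # content of the 1-D structure: data addresses
        self.step = {}   # content of the 2-D structure: address increments

    def learn(self, load_pc, branch_pc, data_addr):
        """Record the increment between two executions of the same load."""
        if load_pc in self.addr:
            self.step[(load_pc, branch_pc)] = data_addr - self.addr[load_pc]
        self.addr[load_pc] = data_addr

    def predict(self, load_pc, branch_pc):
        """Stored address + stride -> new address, stored back to the table."""
        key = (load_pc, branch_pc)
        if load_pc not in self.addr or key not in self.step:
            return None  # nothing learned yet for this load on this loop path
        new_addr = self.addr[load_pc] + self.step[key]
        self.addr[load_pc] = new_addr
        return new_addr
```

Because the stride is keyed by the branch PC as well as the load PC, the same load can carry a different stride for each loop level, which is the dynamic mapping the text describes.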
  • The present invention further provides a data address generation method, comprising the following steps:
  • the step size table contents are addressed by the data access instruction address to generate a new data address with which the data memory is accessed, and the data is obtained for use by the processor core.
  • The content of an entry in the step size table is the data address; the step size table is addressed by the data access instruction address.
  • The content of an entry in the step size table is the data address increment; one dimension of the step size table is addressed by the data access instruction address, and the other dimension is addressed by the instruction address of the taken backward branch.
  • The data address generation system generates an address as follows:
  • the data address generation system adds the data address and the data address increment stored in the step size table to generate a new data address; the new data address is stored back into the step size table.
  • The data address generation system looks up the step size of a data access instruction whose instruction address lies between the address of a taken backward branch instruction and its branch target address; the system generates a new data memory address from the step size table contents, accesses the data memory with that address, fetches the data for processing by the processor core, and stores the new data address back into the step size table.
  • The system and method of the present invention can generate a data address in advance, fetch the data from the data memory before the processor core executes the data read instruction, and send it to the processor core; the processor core can then access the data directly when it needs it, which masks the latency of accessing the data memory and hides data cache misses.
  • The system and method of the present invention can automatically learn and record the data addresses generated by the processor core and their increments under different instruction loops, and automatically adjust the increment used to generate the data address according to the loop level, making the generated data addresses more accurate.
  • Figure 1 is a schematic diagram of data access instructions in loops according to the present invention.
  • Figure 5 is a block diagram of a processor system using the data address generation system of the present invention.
  • A data access instruction may lie in a multi-level instruction loop; each time the same loop level is executed the corresponding stride is the same, but when different loop levels are executed the corresponding strides differ.
  • For example, for a data access instruction inside a two-level loop, each inner-loop iteration increases the data address by '4' (the stride is '4'), but each outer-loop iteration increases the data address by '20' (the stride is '20').
  • Whether '4' or '20' is used as the instruction's stride, a certain number of data address mispredictions result.
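The misprediction count can be made concrete with a small simulation (the trace assumes four inner iterations per outer pass, an illustrative choice not fixed by the text):

```python
def trace(outer=3, inner=4, inner_stride=4, outer_stride=20, base=0):
    """Address trace of a load in a two-level loop: +4 per inner
    iteration, +20 across each outer-loop boundary."""
    addrs, a = [], base
    for _ in range(outer):
        for _ in range(inner):
            addrs.append(a)
            a += inner_stride
        a += outer_stride - inner_stride  # boundary step is +20, not +4
    return addrs

def mispredicts(addrs):
    """Wrong guesses of a predictor that remembers only the last stride."""
    wrong, stride = 0, None
    for prev, cur in zip(addrs, addrs[1:]):
        if stride is not None and prev + stride != cur:
            wrong += 1
        stride = cur - prev
    return wrong
```

With a single recorded stride the predictor is wrong at every loop-level change; here `mispredicts(trace())` is 4, i.e. two wrong guesses per outer-loop boundary (the boundary itself plus the recovery iteration).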
  • According to the relationship between branch instructions and the data access instruction, different strides can be assigned to the same data access instruction at different loop levels, making data address prediction more accurate.
  • FIG. 1 is a schematic diagram of data access instructions in loops according to the present invention.
  • In FIG. 1 the instructions are arranged from left to right in address order; instructions 11, 12, and 13 are all data access instructions, and instructions 21, 22, and 23 are all backward-jumping branch instructions, so the instructions between each of the three branch instructions and its branch target form a loop.
  • Together they form a three-level nested loop, in which the loop corresponding to branch instruction 21 is the innermost and the loop corresponding to branch instruction 23 is the outermost.
  • Each data access instruction in this code can be given a dedicated loop step storage module to provide different strides when different loop levels are executed.
  • FIG. 2 is an embodiment of a loop step storage module according to the present invention.
  • This embodiment describes how strides are provided by the loop step storage module; how the strides are stored into the module is further explained in the embodiment of FIG. 3.
  • In this example the loop step storage module corresponds to a three-level loop and consists of registers 31, 32, and 33 and selectors 41, 42, and 43.
  • Register 31 and selector 41 correspond to the first-level (innermost) loop;
  • register 32 and selector 42 correspond to the second-level loop;
  • register 33 and selector 43 correspond to the third-level (outermost) loop.
  • Each data access instruction whose data address is predicted corresponds to one loop step storage module as shown in FIG. 2.
  • Register 31 stores the stride and valid bit corresponding to execution of the first-level loop (i.e., the loop of branch instruction 21); register 32 stores the stride and valid bit corresponding to execution of the second-level loop (the loop of branch instruction 22); register 33 stores the stride and valid bit corresponding to execution of the third-level loop (the loop of branch instruction 23).
  • The registers in each loop step storage module can store different values, so each data access instruction can have a different stride when it is located in loops of different levels.
  • The initial value of the valid bits is '0'.
  • The occurrence of the three loop levels is prioritized. Once the first-level loop occurs, the stride corresponding to it (the value of register 31) is output regardless of whether the second- and third-level loops occur. The second-level loop is considered only when the first-level loop does not occur; once it occurs, the stride in register 32 is output regardless of the third-level loop. Similarly, the third-level loop is considered only when neither the first- nor the second-level loop occurs; once it occurs, the stride in register 33 is output.
  • If none of the loops occurs, the default stride carried on bus 35 is output.
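The priority rule described above amounts to a priority encoder; a sketch (signal names are mine, the register/branch numbering follows FIG. 1 and FIG. 2):

```python
def select_stride(registers, taken, default):
    """registers: (valid, stride) pairs per loop level, innermost first,
    mirroring registers 31/32/33; taken: branch-taken flags for the
    corresponding branches 21/22/23; default: the bus-35 value."""
    for (valid, stride), branch_taken in zip(registers, taken):
        if branch_taken:                    # innermost occurring loop wins
            return stride if valid else default
    return default                          # no loop occurred: default stride
```

For example, `select_stride([(True, 4), (True, 20), (False, 0)], [True, False, False], 0)` returns 4: the first-level loop occurred, so the second- and third-level registers are ignored.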
  • Each selector is controlled by the branch decision signals, output by the processor core, that indicate whether branch instructions 21, 22, and 23 take their branches; these signals select which stride is output.
  • Initially, branch instructions 21, 22, and 23 have not yet been executed, so none of the corresponding branches has been taken; the selection signals of selectors 41, 42, and 43 in each loop step storage module are all '0', and each module outputs the default stride from bus 35.
  • When branch instruction 21 first takes its branch, the first-level loop is entered and data access instructions 12 and 13 are executed again, i.e., for the second time. At this point the valid bit read from register 31 is '0', so the stride stored in register 31 is invalid. The stride is therefore computed now and stored into register 31 under control of the branch decision signal of branch instruction 22, and the valid bit in register 31 is set to '1'.
  • When branch instruction 21 takes its branch again, the selection signal of selector 41 in the loop step storage modules corresponding to data access instructions 12 and 13 is '1', and those modules output the valid bit and stride from their respective registers 31. Since the valid bit is now '1', the stride can be used to compute the corresponding predicted data address. At the same time, the stride is recomputed as described above and stored into register 31 under control of the branch decision signal of branch instruction 22, the valid bit remaining '1'. Thus, if the stride between two executions of the same data access instruction is unchanged, the value in register 31 does not change; if the stride has changed, the value in register 31 is updated to the new stride.
  • Similarly, when the second-level loop is executed, the loop step storage modules corresponding to instructions 11, 12, and 13 each output the stride in register 32 to compute the corresponding predicted data addresses.
  • In this way, when different loop levels are executed, different register values in the loop step storage module are used as the stride.
  • Note that when branch instruction 21 takes its branch, the selection signal of selector 41 is '1' and the stride in register 31 is output; but since data access instruction 11 is not inside the first-level loop, that stride is ignored for instruction 11.
  • When a higher-level loop is executed, the stride in register 32 or 33 is output as described above, or the default stride from bus 35 is output, thereby providing different strides for different loops.
  • Each loop step storage module that provides strides for data access instructions in loops corresponds to one data access instruction. The module can be extended by increasing the number of registers and selectors (each register paired with a selector for one loop level) to support loops with more levels; a loop step storage module is then provided for each (or some) of the data access instructions in those deeper loops, providing more accurate strides for all (or some of) the data access instructions according to the loop execution situation.
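The extension to more loop levels can be sketched as a module parameterized on the level count (a behavioural model of my own, not RTL; the `default` argument stands in for the bus-35 value):

```python
class LoopStepModule:
    """One module per data access instruction; one (stride, valid) register
    pair plus one selector per loop level, as in FIG. 2 but for n levels."""

    def __init__(self, n_levels=3, default=0):
        self.valid = [False] * n_levels
        self.stride = [0] * n_levels
        self.default = default          # stands in for the bus-35 value

    def store(self, level, stride):
        """Write the stride for one level (branch-decision controlled)."""
        self.stride[level] = stride
        self.valid[level] = True

    def read(self, taken):
        """taken[i]: was the level-i backward branch just taken?
        The innermost taken level wins; invalid or no level -> default."""
        for i, branch_taken in enumerate(taken):
            if branch_taken:
                return self.stride[i] if self.valid[i] else self.default
        return self.default
```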
  • FIG. 3 is an embodiment of a data address generation system according to the present invention.
  • It includes a memory array 52 whose rows correspond to a plurality of data access instructions, a column decoder 50 that decodes according to branch decision results, a row decoder 54 that decodes according to the instruction addresses of data access instructions, and an address generator 60.
  • The address generator 60 is composed of a subtractor 61, an adder 62, a selector 63, and a comparator 64.
  • Each row of memory array 52 corresponds to one data access instruction.
  • For each row of 52, row decoder 54 provides a register that stores the instruction address of the data access instruction corresponding to that row, and a comparator that compares the register contents with the instruction address on data access instruction address bus 53.
  • Array 52 has two read/write ports. One port pair (37, 38) is dedicated to the rightmost column 58 of 52; access to this column is controlled only by row decoder 54, and it stores the corresponding data address 67 and valid bit 68 of the data access instruction.
  • The other port pair (34, 36) is shared by the columns of 52 other than column 58 (such as column 56), and through it the entry at the row selected by row decoder 54 and the column selected by column decoder 50 is accessed.
  • The number of columns of array 52 other than column 58 corresponds to the maximum number of loop levels the step size storage array can support.
  • For each of these columns, a comparator and an address register are provided in column decoder 50, the register storing the instruction address of a branch instruction.
  • When the processor core makes a 'branch taken' decision, the instruction address of the corresponding branch instruction is sent on bus 51 to 50 for matching; the column of array 52 corresponding to the register the match hits (column 58 excepted) can then be accessed through read/write ports 34 and 36.
  • Assume the processor core executes instructions in order, starting from the leftmost instruction in Figure 1 and proceeding to the right.
  • When the processor core decodes instruction 11, it finds it to be a data load instruction, and the instruction address of instruction 11 is sent to row decoder 54 via bus 53 for matching.
  • The match misses, so the row replacement logic in 54 allocates a row for instruction 11 (the top row 11 of array 52 in FIG. 3) and stores the instruction address of instruction 11 into the corresponding register in 54.
  • The valid bits 66 and 68 of every column in that row are set to '0'.
  • Since the valid bit 68 in column 58 is '0', the data address generation system lets the processor core execute data load instruction 11, generate a data address, and access the data memory (such as a data cache); the read data is used by the processor core. At the same time, the system stores that data address, via bus 57, selector 63, and write port 38, into the 67 field of row 11, column 58 of array 52, and sets the valid bit 68 of the entry to '1'. Thereafter, as the processor core executes subsequent instructions, the row replacement logic of 54 likewise allocates one row for each of data load instructions 12 and 13 (the middle and bottom rows 12 and 13 of array 52 in FIG. 3); the data addresses generated when the processor core executes instructions 12 and 13 are stored into the entries at rows 12 and 13 of column 58, respectively, and the valid bits 68 of those entries are set to '1'.
  • Next, the processor core executes branch instruction 21 and its branch decision is 'taken'. The processor core therefore jumps backward to the branch target instruction located between instruction 11 and instruction 12 and executes from that instruction in order.
  • At the same time, the instruction address of branch instruction 21 is sent via bus 51 to column decoder 50 for matching. The match misses, so the column replacement logic of 50 allocates a column of array 52 (column 21 in the figure) and stores the instruction address of branch instruction 21 into the corresponding register in 50. The address of the most recent taken branch is always held on bus 51; the address now on 51 matches the address in the register corresponding to column 21 in 50, so column decoder 50 selects column 21.
  • The processor core continues in program order and executes instruction 12 again; decoding instruction 12 finds it to be a data load instruction, so its instruction address is sent via bus 53 to row decoder 54 for matching. This time the match hits: the valid bit 68 read from row 12, column 58 of 52 through read port 37 is '1', and the valid bit 66 read from row 12, column 21 of 52 through read port 36 is '0'. Based on the two valid bits '10', the system lets the processor core execute data load instruction 12, generate a data address, and access the data memory; the read data is used by the processor core. At the same time, the system sends that data address to address generator 60 via bus 57.
  • The subtractor 61 subtracts, from the address on 57, the previous data address 67 of row 12, column 58 read out through read port 37 of array 52, and the difference, the address increment (step size), is written into array 52 through write port 34.
  • Since the address on bus 53 is that of data load instruction 12 and the address on bus 51 is that of branch instruction 21, the step size is stored into the step size field 65 of the entry at row 12, column 21 of array 52.
  • The system also sets the valid bit 66 of row 12, column 21 to '1'. In the same way, the system stores the step size of data load instruction 13 into field 65 of row 13, column 21 of 52 and sets the valid bit 66 of row 13, column 21 to '1'.
  • Then the processor core executes branch instruction 21 again, and its branch decision is again 'taken'. The processor core therefore jumps backward to the branch target instruction between instruction 11 and instruction 12 and executes from that instruction in order.
  • At the same time, the instruction address of branch instruction 21 is sent via bus 51 to column decoder 50 for matching. This match hits, so column decoder 50 selects column 21.
  • The processor core executes in program order, reaches instruction 12 again, and finds it to be a data load instruction when decoding it. Its instruction address is sent via bus 53 to row decoder 54 for matching. This time the match hits: the valid bit 68 read from row 12, column 58 of 52 through read port 37 is '1', and the valid bit 66 read from row 12, column 21 through read port 36 is '1'. According to the two valid bits '11', the system uses adder 62 to add the data address 67 of row 12, column 58 read through read port 37 and the step size 65 of row 12, column 21 read through read port 36.
  • The sum, on 38, accesses the data memory as the new data address, and the read data is sent to the processor core for processing.
  • Comparator 64 compares the data address on 57, generated by the processor core executing instruction 12, with the data address on 38. If they are equal, the processor core continues executing subsequent instructions in program order; the sum from adder 62 is also written back via write port 38 into row 12, column 58 and stored in 67, and the valid bit 68 of row 12, column 58 remains '1'.
  • If they are not equal, the system makes the processor core discard the intermediate execution results based on the data obtained from the address on 38, execute the load of instruction 12 with the data obtained from the address on 57, and then execute subsequent instructions in order.
  • At the same time, the system controls selector 63 to select the address on 57 and store it via write port 38 into the entry at row 12, column 58, keeping the valid bit 68 of that entry at '1'; but the valid bit 66 of row 12, column 21 is set to '0', recording that the step size in that entry is invalid and must be re-learned.
  • The system also processes instruction 13 in the same way.
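The check-and-recover step just described can be condensed into a sketch (the `row` dictionary is my own representation of one array-52 row: `addr` holds fields 67/68 of column 58, and `cols` maps a branch PC to fields 65/66):

```python
def verify(row, branch_pc, predicted, actual):
    """Compare the predicted address (bus 38) with the one the core
    generated (bus 57); on mismatch, store the real address back and
    clear valid bit 66 so the stride is re-learned."""
    if predicted == actual:
        row["addr"] = (predicted, 1)         # 67/68 keep the confirmed value
        return True
    row["addr"] = (actual, 1)                # store the real address into 67
    stride, _ = row["cols"].get(branch_pc, (0, 0))
    row["cols"][branch_pc] = (stride, 0)     # 66 -> '0': stride invalid
    return False
```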
  • Then the processor core executes branch instruction 21 again, and this time the branch decision is 'not taken', so the processor core continues in instruction order. Next it executes branch instruction 22, whose branch decision is 'taken'; the processor core jumps backward to the branch target instruction before instruction 11 and executes from that instruction in order.
  • At the same time, the instruction address of branch instruction 22 is sent via bus 51 to column decoder 50 for matching. The match misses, so the column replacement logic of 50 allocates a column of array 52 (column 22 in the figure) and stores the instruction address of branch instruction 22 into the corresponding register in 50. The address now held on bus 51 matches the address in the register corresponding to column 22 in 50, so column decoder 50 selects column 22.
  • The processor core executes in program order, reaches instruction 11 again, and finds it to be a data load instruction when decoding it. The instruction address of instruction 11 is sent via bus 53 to 54 for matching, and this match hits: the valid bit 68 read from row 11, column 58 of array 52 through read port 37 is '1', and the valid bit 66 read from row 11, column 22 of 52 through read port 36 is '0'. The system therefore operates for the case where the data address 67 is valid but the step size 65 is invalid: the data address generated by the processor core reads data from the data memory for the processor core to process, and from the data address sent via bus 57
  • the data address 67 of row 11, column 58 read out through read port 37 is subtracted; the difference is stored as a step size into field 65 of row 11, column 22 through the write port, and 66 is set to '1'.
  • At the same time, the system stores the data address on 57 into the 67 field of the entry at row 11, column 58, keeping the valid bit 68 of that entry at '1'.
  • The system processes instructions 12 and 13 in the same way. Thereafter, operation repeats as described above.
  • In summary, a row of array 52 stores the data memory address of one data access instruction along with its step sizes and the corresponding valid bits; a column of 52 stores the step size and valid bit of each data access instruction associated with one taken branch instruction.
  • The special column 58 stores the data memory addresses, and its reads and writes are not affected by the state of any branch instruction.
  • The system selects a row of array 52 via bus 53 using the instruction address of the data access instruction and reads the valid bit 68 of column 58; it selects a column via bus 51 using the branch instruction address of the most recent taken branch and reads the valid bit 66 there. According to the states of valid bits 68 and 66 in a row, the system has the following three modes of operation.
  • When valid bit 68 is '0', the system stores the data address 57 generated by the processor core into field 67 of column 58 of the row and sets the state to '10',
  • meaning the data address is valid but the step size is invalid.
  • When the state is '10', the system computes the difference between the data address 57 generated by the processor core and the data address stored in column 58 of the row, saves it into field 65 of the currently selected column, and sets the state to '11',
  • meaning both the data address and the step size are valid.
  • When the state is '11', the system uses the data address in field 67 of column 58 and the step size in field 65 of the column selected by the branch decision
  • to generate a data address 38, access the data memory, and read the data for processing by the processor core.
  • At the same time, the system compares the data address 57 generated by the processor core with data address 38, takes corrective action as needed according to the comparison result, and updates the state of 66.
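The three modes can be pulled together in one dispatcher keyed on the valid-bit pair (a self-contained sketch; `row["addr"]` models fields 67/68 of column 58 and `row["cols"]` the per-branch fields 65/66):

```python
def step_table_access(row, branch_pc, core_addr):
    """Returns the predicted address in state '11', else None (the core's
    own address is used while the table is still learning)."""
    addr, v68 = row["addr"]
    stride, v66 = row["cols"].get(branch_pc, (0, 0))
    if not v68:                         # 68 = '0': first sight of this load
        row["addr"] = (core_addr, 1)    # learn the address, state -> '10'
        return None
    if not v66:                         # state '10': learn the stride
        row["cols"][branch_pc] = (core_addr - addr, 1)   # state -> '11'
        row["addr"] = (core_addr, 1)
        return None
    return addr + stride                # state '11': adder 62's prediction
```

The returned prediction would then be checked against the core's address (comparator 64) and the state corrected on a mismatch, as the text describes.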
  • The above system, which fetches data in advance based on the step size, can be used in several ways.
  • The first way: when the data read according to the data address 38 generated from a row of array 52 is accepted by the processor core (the address on bus 38 equals the address on bus 57), the system adds the address on 38 to the step size read through port 36
  • and sends the sum to the data cache for matching as a guessed data address. If it does not match, a fill of the corresponding data from the lower memory level into the cache can be started.
  • The processor core still generates data addresses via bus 57 to read data from the data cache. This approach can partially or completely hide data cache misses in most cases.
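This first usage mode amounts to a next-iteration prefetch probe; a sketch with an explicitly assumed toy cache model (a dict of line tags — the patent does not fix the cache organization):

```python
def prefetch_on_confirm(cache, lower_level, confirmed_addr, stride, line=64):
    """After a prediction is confirmed, probe the cache with
    address + stride as the guessed address and start a fill from
    the next memory level on a miss."""
    guess = confirmed_addr + stride
    tag = guess // line                     # tag of the line holding the guess
    if tag not in cache:                    # guessed address misses the cache
        cache[tag] = lower_level.get(tag)   # begin the fill early
    return guess
```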
• Note that the guessed data address uses a step size selected based on the result of the last successful branch and is issued before the next successful branch, whereas the processor core executes the same data load instruction and reads the data only after the next successful branch.
• The last two successful branches are not necessarily the same branch instruction, so the step size used to generate the guessed data address does not necessarily coincide with the increment of the data address actually generated by the processor core.
• The second way: after a successful branch, for the rows of previously executed data load instructions, the system reads from array 52 the entries in the column selected by the successful branch and in column 58. If the entry state 68, 66 is '11', a data address 38 is generated, and the data is read from the data memory and stored in a data read buffer with a shorter read latency for the processor core to fetch.
  • FIG. 4 is an embodiment of a step size table decoder according to the present invention.
• FIG. 4 shows the decode logic for one row of array 52 in row decoder 54, where 72 and 74 are registers, 76 is a selector, and 78 is a comparator.
• The output of register 72 is coupled to one input of comparator 78, and the output of selector 76 is coupled to the other input of comparator 78.
• Comparator 78 supports three comparison modes in total: greater-than-or-equal, less-than, and equal.
• The comparison mode of comparator 78 is linked to the selection made by selector 76.
• When selector 76 selects bus 73 as input, comparator 78 checks whether the address stored in register 72 is greater than or equal to the address on bus 73; when selector 76 selects bus 51 as input, 78 checks whether the address stored in register 72 is less than the address on bus 51; when selector 76 selects bus 53 as input, 78 checks whether the address stored in register 72 is equal to the address on bus 53. Signal 75 is the comparison result output by comparator 78.
• Bus 53 carries the data load instruction address output by the processor core.
• This address is compared for equality, by comparator 78 of each row, with the contents of register 72 in all rows of row decoder 54. If there is no match, the replacement logic of 54 allocates a row for the data load instruction and stores the upper part of the address on 53 into register 72 of that row. The same address will thereafter match the address in register 72 of that row, enabling the word line of the row; this operation has been detailed in the embodiment of FIG.
• On a successful branch, the system places the instruction address of the branch instruction on bus 51, and places the instruction address of that instruction's branch target on bus 73.
• The system compares the contents of registers 72 in all rows of 54 against the address on bus 73 (greater-than-or-equal) and the address on bus 51 (less-than) in turn, and stores the comparison result 75 into register 74.
• If the data load instruction corresponding to a row lies between the branch target instruction (inclusive) and the branch instruction (not included), register 74 of that row is written with the comparison result '1'.
• A row whose comparison result does not satisfy the above condition is not between the branch target instruction and the branch instruction. For example, instruction 11 in FIG. 1 is not in the loop of branch instruction 21, so register 74 of its row is written with the comparison result '0'.
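The range test that decides the value written into register 74 can be stated compactly. In this sketch the function name and arguments are invented; the bounds follow the text, with the branch target instruction included and the branch instruction excluded:

```python
def in_loop(load_addr, target_addr, branch_addr):
    """True when a data load instruction lies inside the loop body.

    For a backward branch, target_addr < branch_addr, and the loop body is
    the half-open address range [target_addr, branch_addr).
    """
    return target_addr <= load_addr < branch_addr
```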
• The system then sequentially enables the word lines of all rows whose register 74 is '1', and for each such row reads from array 52 the contents of the entries in column 58 and in the column selected via column decoder 50 by the branch instruction address on bus 51 at this time.
• The system adds the step size in field 65 to the previous iteration's data address in field 67 to obtain a new data address, addresses the data memory with it via bus 38, and reads the data for use by the processor core.
• The new data address is also written back to field 67.
• The system does not operate on rows whose register 74 content is '0'.
• Thereafter, the instruction address of a data load instruction sent via bus 53 is compared for equality with the contents of each register 72 in row decoder 54.
• The system checks the contents of register 74 in the matching row. If the content is '0', the system reads states 68 and 66 in the row and operates according to the state as described above. If the content of register 74 is '1', the system reads the data address in field 67 of the row and the data address sent by the processor core via bus 57. If the two addresses are equal, the system sets register 74 of the row to '0' with no further operation; if the two addresses are not equal, the system likewise sets register 74 to '0' and performs the operations for the case in which the address in field 67 does not equal the address on bus 57.
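The check on register 74 described above can be sketched as follows. The names (`flag74`, `addr67`) stand for register 74 and field 67, and the two callbacks stand for the normal learn/predict path and the corrective mismatch path; all of them are invented for illustration:

```python
def on_load_executed(row, core_addr, handle_normal, handle_mismatch):
    """Process a data load whose address matched this row's register 72.

    A flag of '1' means the row's data address was already issued ahead of
    time; a matching field-67 address then needs no further action.
    """
    if not row['flag74']:
        handle_normal(row, core_addr)        # ordinary state-machine path
    else:
        row['flag74'] = False                # consume the preissue flag
        if row['addr67'] != core_addr:
            handle_mismatch(row, core_addr)  # the early address was wrong
```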
  • FIG. 5 is a block diagram of a processor system using the data address generation system of the present invention.
• 50 is the column decoder of the step size table;
• 52 is the step size table array;
• 54 is the row decoder of the step size table;
• 60 is the address generator;
• 80 is the processor core;
• 82 is the data read buffer;
• 84 is the data memory.
  • 51 is the branch instruction address bus of the successful branch, which is output from the processor core 80 to the column decoder 50 and the row decoder 54.
  • 73 is the branch target instruction address bus, which is output from the processor core 80 to 54.
• 53 is the data access instruction address bus, which is output from processor core 80 to 54.
  • 57 is a data address bus, which is output by processor core 80 to address generator 60 and data memory 84.
  • 38 is a data address bus that is output by address generator 60 to data memory.
• Bus 85 outputs data from data memory 84 to data read buffer 82 for temporary storage, and bus 87 outputs data from 82 to processor core 80.
  • the number of columns in the step table (50, 52, 54) determines the number of loop layers or loops that it can handle.
  • the number of rows in the step table determines the number of data access instructions it can process.
• The step size table is a one-dimensional data structure (column 58) plus a two-dimensional data structure.
  • the one-dimensional data structure is addressed by the instruction address of the data access instruction, and the content of the data structure is the data address.
  • One dimension of the two-dimensional data structure is addressed by the instruction address of the data access instruction, and the other dimension is addressed by the instruction address of the branch instruction, and the content of the data structure is the data address increment (step size).
• Such a step size table maps the instruction address of a data access instruction to the corresponding data address. The mapping is not fixed, but varies dynamically with the number of executions of the data access instruction and its loop path.
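This dynamic mapping can be modeled in software as two dictionaries: one keyed by the load instruction address alone (the last data address), one keyed by the pair of load and branch instruction addresses (the step size). This is an illustrative sketch of the data structure only, not the hardware organization:

```python
class StepTable:
    """Software model of the one-dimensional plus two-dimensional step table."""

    def __init__(self):
        self.addr = {}    # load PC -> last data address (column 58)
        self.step = {}    # (load PC, branch PC) -> step size (other columns)

    def update(self, load_pc, branch_pc, data_addr):
        """Record a data address; derive the per-branch step on re-execution."""
        if load_pc in self.addr:
            self.step[(load_pc, branch_pc)] = data_addr - self.addr[load_pc]
        self.addr[load_pc] = data_addr

    def next_addr(self, load_pc, branch_pc):
        """Predicted next data address for this load along this branch path."""
        s = self.step.get((load_pc, branch_pc))
        return None if s is None else self.addr[load_pc] + s
```

Because the step size is keyed by the branch as well, the same load instruction carries one stride per loop level.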
• The system allocates a one-dimensional storage resource (a row) in the step size table via bus 53 using the data access instruction address provided by processor core 80, and fills the allocated row with the corresponding data address provided on bus 57 as its initial content.
• A resource in the other dimension (a column) is allocated using the branch instruction address provided by 80 on bus 51, and the difference between the data address subsequently provided again on bus 57 and the initial content already in the step size table is stored in the corresponding entry.
• The system further transmits, via bus 51, the instruction address of the backward branch instruction of the successful branch as the upper limit, and the corresponding branch target instruction address via bus 73 as the lower limit, so that the step size table and address generator 60 automatically update the data addresses of the data access instructions whose instruction addresses lie between the lower and upper limits, each according to its own step size in the column selected by the state of the current branch loop. The data memory 84 is accessed via bus 38 with each updated data address, and the corresponding data is read and stored into data read buffer 82 before processor core 80 outputs the corresponding data address on bus 57.
• The data read buffer 82 may take the form of an address-matched read, in which each row of 82 has an entry for storing data and an entry for storing the corresponding data address.
• The processor core 80 sends the data address via bus 57 to 82, and the data in the data entry whose address entry matches the address on 57 is sent to processor core 80 via bus 97.
  • the row permutation logic can use a replacement method such as LRU (least recently used).
  • the column replacement method can also use a form such as LRU.
  • Another form of data read buffer 82 may be a first in first out (FIFO).
• In this form, the row allocation logic in row decoder 54 strictly allocates row resources in order of increasing row number.
• For the code of FIG. 1, rows are allocated sequentially in the order of the addresses of instructions 11, 12, and 13.
• When the processor core 80 provides the lower limit and the upper limit to row decoder 54, the word lines are sequentially enabled from the lower limit to the upper limit in instruction-address order, so that the step size table and address generator 60 provide data addresses in instruction order.
• These data addresses access data memory 84, so that the read data is stored in the first-in first-out data read buffer 82 in instruction order.
• Each time a read request is provided to the first-in first-out buffer 82, 82 outputs one datum to 80.
• 82 is then a data queue arranged in instruction-address order.
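The FIFO form of data read buffer 82 behaves like a simple queue filled in instruction order, so no address matching is needed; a minimal sketch with assumed names:

```python
from collections import deque

class FifoReadBuffer:
    """FIFO form of the data read buffer: in-order fill, in-order drain."""

    def __init__(self):
        self.q = deque()

    def push(self, data):
        """Prefetched data arrives in instruction-address order."""
        self.q.append(data)

    def read(self):
        """One datum is handed to the core per read request."""
        return self.q.popleft()
```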
• This form of row replacement logic treats the step size table as a circular buffer: when the last row of the step size table has been allocated, the first row is allocated next, then the second row, and so on.
• The addresses stored in registers 72 allow the aforementioned mechanism to remain effective across this wrap-around. For example, in the embodiment of FIG. 1, suppose there are other data access instructions before the data access instructions 11, 12, and 13.
• Instruction 11 then obtains the lowermost row in the allocation of FIG. 3, and instructions 12 and 13 are sequentially assigned the uppermost and middle rows.
• The lower limit is then at the bottom row (instruction 11) and the upper limit at the middle row (instruction 13).
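The circular row allocation described above can be sketched as follows (class name assumed); after the last row is handed out, allocation wraps back to the first row:

```python
class CircularRowAllocator:
    """Allocate step-table rows in strictly increasing order, wrapping around."""

    def __init__(self, num_rows):
        self.num_rows = num_rows
        self.next = 0

    def allocate(self):
        row = self.next
        self.next = (self.next + 1) % self.num_rows   # wrap to row 0
        return row
```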
• A third way to use the above step-size-based data prefetching system is to combine the first and second ways.
• As in the second way, the data addresses of the data access instructions between the corresponding lower limit and upper limit are updated, the data memory is accessed with these updated addresses, the read data is stored in data read buffer 82 for use by processor core 80, and the updated data addresses are stored back into column 58 of array 52.
• As in the first way, a guessed data address is also generated by adding the step size selected by the current branch judgment to the data address and sent to the data cache; if the corresponding data is not already in the cache, it is prefetched into the cache to cover the cache miss, but the guessed data address is not stored into column 58 of array 52. The guessed data address is used to address the data memory so that the data required by the next execution of the same data load instruction is loaded into the data cache in advance, before the next branch judgment is produced.
• Data memory 84 can be implemented with a cache.
• Data caches typically have a tag unit in which all or part of the memory address is stored.
• The memory address is sent to the tag unit for matching, and a cache address is generated on a match (for example, in a multi-way set-associative cache, the way number combined with the index part and the block-offset part of the memory address), which is used to address the data store (data RAM) in the cache.
• The step size table disclosed in the present invention can directly store this cache address in field 67 of its column 58, so that the address sent via bus 38 can directly address the data store in the cache without going through the tag unit mapping.
• This requires the addresses of rows in the data store to be contiguous within a certain interval, so that address generator 60 can automatically compute the next data address in step-size increments; for example, several rows with consecutive addresses in the data store may be placed in the same way of a multi-way cache.
• When the cache address spans a non-contiguous address space, the cache address must be adjusted, for example by changing the way number.
• Alternatively, the memory address and its corresponding cache address can be stored simultaneously in the above field 67, with both addresses updated by the same step size.
• The cache address directly addresses the data store in the cache, while the memory address is compared with the memory address output by processor core 80 via bus 57 to confirm that the address generated by address generator 60 is correct.
• When the updated address leaves the contiguous region, the memory address in 67 is matched in the tag unit of the data cache to obtain a new cache address; within a contiguous address space, the cache address is updated incrementally to address the data store in the cache.
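A sketch of updating a paired (memory address, cache address) by the same stride, remapping through the tag unit only when the increment leaves the contiguous region. The 64-byte block size, the `tag_lookup` callback, and treating a single cache block as the contiguous unit are all assumptions for illustration:

```python
BLOCK_SIZE = 64   # assumed cache line size in bytes

def step_cache_addr(mem_addr, way, offset, stride, tag_lookup):
    """Advance a (memory address, way, block offset) pair by one stride.

    While the new offset stays inside the same block, the cache address can be
    updated by pure increment; otherwise the new memory address must be
    remapped through the tag unit (tag_lookup returns (way, offset)).
    """
    new_mem = mem_addr + stride
    new_off = offset + stride
    if 0 <= new_off < BLOCK_SIZE:            # still inside the same block/way
        return new_mem, way, new_off
    return (new_mem,) + tag_lookup(new_mem)  # remap via the tag unit
```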
• The above embodiments all take data load instructions as examples.
• The method and system of the present invention can also be applied to data store instructions. For example, the first way above can be applied to a write-back cache with a write-allocate policy: the address generator generates the memory address to be stored to in advance and reads the corresponding data from memory into the data cache, so that the processor core can store its data into the data cache without a cache miss.
• In essence, a step size table is a two-dimensional data structure in which one dimension is addressed with one parameter and the other dimension with another parameter to access an entry in the data structure. It could be addressed directly with the parameters; however, since the parameters are not contiguous, they can be compressed. As shown in FIG. 3, each register 72 in row decoder 54 in the embodiment of FIG. 4 functions like a tag unit in a fully associative cache, compressing the address space and the holes in array 52 (data access instructions account for about one-third of the total number of instructions, and their address space is not contiguous). From this point of view, the step size table is actually a structure similar to a fully associative cache.
• Each address register in column decoder 50 performs the same compression function (branch instructions account for about one-sixth of the total number of instructions, and their address space is not contiguous). The step size table can therefore be regarded as a two-dimensional fully associative compressed structure.
• The system and method proposed by the present invention can be used in various processor-related applications, including general-purpose processors, microcontrollers, multi-lane processors, artificial intelligence processors, big data processors, digital signal processors, graphics processors, etc., and can improve the efficiency of the processor.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Advance Control (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

An automatic learning, recording and generation method and system for data addresses. When the present invention is applied to the field of processors, the data required by an instruction is transmitted to the processor core in advance, ready for use before the processor core executes a data read instruction. In addition, the data address for the next execution of the instruction is predicted, and the corresponding data is loaded into a data cache to reduce cache misses.

Description

Data address generation system and method

The invention relates to the fields of computers, communications and integrated circuits.

The role of the cache in a processor system is to hold a copy of a portion of the contents of memory, so that this content can be quickly accessed by the processor core in a short time to ensure the continuous operation of the pipeline. Caches are generally divided into instruction caches and data caches. Instruction addresses have good locality, so the hit rate of the instruction cache is usually high; however, the data addresses generated when executing data access instructions have poorer locality, so the data cache hit rate is usually not high.

However, the data addresses of data access instructions located in loop code follow a certain pattern. Usually, each time a given data access instruction in a loop is executed, its corresponding data address is incremented by a constant (which can be positive, negative or zero). This constant is the data stride corresponding to that data access instruction. Obviously, for the data addresses of two consecutive executions of the same data access instruction, subtracting the earlier data address from the later one yields the data stride, and adding the data stride to the later data address yields the predicted data address for the next execution of that data access instruction. The corresponding data can thus be prefetched in advance from external memory into the data cache according to the predicted data address. When the data access instruction is next actually executed, if the actually generated data address equals the predicted data address, the data cache is certain to hit; if the two differ, whether the corresponding data hits in the data cache is determined according to the actual data address.
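The stride arithmetic described above reduces to two operations; a minimal sketch:

```python
def predict_next(prev_addr, curr_addr):
    """Stride = current address minus previous; predicted = current + stride."""
    stride = curr_addr - prev_addr
    return curr_addr + stride
```

The stride may be positive, negative or zero, and the prediction holds exactly as long as the instruction keeps stepping by the same constant.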
Although the data stride is fixed in most cases, it still changes in some situations. For example, with two nested loop levels, when the inner loop code is executed repeatedly, the executed code is the same, so the data address increment (the data stride) is usually also the same. But once the outer loop code is executed, the data address gains an extra increment, so that the difference between the data address and the previous data address is no longer the earlier data stride. When the inner loop is executed again, however, the data address increment returns to its previous value. In this case, if only one data stride is recorded, a wrong predicted data address is produced whenever the loop level changes, limiting the improvement of the data cache hit rate.

The method and system apparatus proposed by the present invention can directly address one or more of the above or other difficulties.

The present invention proposes a data address generation system and method that can generate a data address ahead of the processor core, access the data memory with it, and read the data for processing by the processor core. The data address generation system and method learns the addresses of the data access instructions executed by the processor core, the corresponding data addresses generated when the processor core executes those instructions, and the address increments between the data addresses generated by two executions of the same data access instruction, and stores them in a step size table.

The step size table is a one-dimensional plus a two-dimensional data structure. The one-dimensional data structure is addressed by the instruction address of the data access instruction, and its content is the data address. In the two-dimensional data structure, one dimension is addressed by the instruction address of the data access instruction, the other dimension is addressed by the instruction address of the branch instruction, and the content is the data address increment (step size). Such a step size table maps the instruction address of a data access instruction to the corresponding data address; the mapping is not fixed, but varies dynamically with the number of executions of the data access instruction and its loop path. The system and method further take the instruction address of the backward branch instruction of a successful branch as the upper limit and the corresponding branch target instruction address as the lower limit, automatically update the data addresses of the data access instructions whose instruction addresses lie between the lower and upper limits, incrementing each according to the state of the current branch loop, access the data memory with the updated data addresses, and read the corresponding data for the processor core.

It is an object of the present invention to provide a data address generation system and method that generate data addresses in advance to mask the latency of accessing the data memory and data cache misses.
To this end, the present invention provides a data address generation system, comprising:

a step size table for storing data addresses and address increments; the data address generation system learns and records, according to the data access instruction address, the data address generated by the processor core executing a data access instruction and the data address increment between two executions of that instruction, and stores them in the step size table;

the data address generation system addresses the contents of the step size table with the data access instruction address to generate a new data access address, accesses the data memory with it, and obtains data for use by the processor core.

Optionally, the content of an entry in the step size table is the data address; the step size table is addressed by the data access instruction address.

Optionally, the content of an entry in the step size table is the data address increment; one dimension of the step size table is addressed by the data access instruction address, and the other dimension is addressed by the address of the backward branch instruction of a successful branch.

Optionally, the data address generation system generates an address as follows:

the data address generation system adds the data address and the data address increment stored in the step size table to generate a new data address; the new data address is stored back into the step size table.

Optionally, the system is further configured such that:

the data address generation system accesses the step size table contents for data access instruction addresses lying between the backward branch instruction address of a successful branch and its branch target instruction address; the data address generation system generates a new data storage address according to the step size table contents; the data address generation system accesses the data memory with the new data storage address and obtains data for processing by the processor core; the new data address is stored back into the step size table.

The present invention also provides a data address generation method, comprising the following steps:

learning, according to the data access instruction address, the data address generated by the processor core executing a data access instruction;

learning, according to the data access instruction address, the data address increment between two executions of the same data access instruction by the processor core;

recording the above data address and data increment in a step size table;

addressing the step size table contents with the data access instruction address to generate a new data access address, accessing the data memory with it, and obtaining data for use by the processor core.

Optionally, the content of an entry in the step size table is the data address; the step size table is addressed by the data access instruction address.

Optionally, the content of an entry in the step size table is the data address increment; one dimension of the step size table is addressed by the data access instruction address, and the other dimension is addressed by the address of the backward branch instruction of a successful branch.

Optionally, the data address generation system generates an address as follows:

the data address generation system adds the data address and the data address increment stored in the step size table to generate a new data address; the new data address is stored back into the step size table.

Optionally, the method further comprises:

the data address generation system accesses the step size table contents for data access instruction addresses lying between the backward branch instruction address of a successful branch and its branch target instruction address; the data address generation system generates a new data storage address according to the step size table contents; the data address generation system accesses the data memory with the new data storage address and obtains data for processing by the processor core; the new data address is stored back into the step size table.
Those skilled in the art can also understand and appreciate other aspects of the present invention in light of the description, claims and drawings of the present invention.

Moreover, the system and method of the present invention can, before the processor core is about to execute a data read instruction, generate the data address in advance, read the data from the data memory, and send it to the processor core for its use, so that the processor core can take the data directly when it needs to read it, masking the latency of accessing the data memory and masking data cache misses.

Further, the system and method of the present invention can automatically learn and record the data addresses generated by the processor core and their increments under different instruction loops, and can automatically adjust, according to the level of the instruction loop, the increment used to generate data addresses, making the generated data addresses more accurate.

Other advantages and applications of the present invention will be apparent to those skilled in the art.
FIG. 1 is a schematic diagram of data access instructions in a loop according to the present invention;

FIG. 2 is an embodiment of the loop step size storage module according to the present invention;

FIG. 3 is an embodiment of the data address generation system according to the present invention;

FIG. 4 is an embodiment of the row decoder of the step size table according to the present invention;

FIG. 5 is a block diagram of a processor system using the data address generation system of the present invention.
本发明的最佳实施方式是附图3 。 The preferred embodiment of the invention is shown in Figure 3.
以下结合附图和具体实施例对本发明提出的数据缓存系统和方法作进一步详细说明。根据下面说明和权利要求书,本发明的优点和特征将更清楚。需说明的是,附图均采用非常简化的形式且均使用非精准的比例,仅用以方便、明晰地辅助说明本发明实施例的目的。The data cache system and method proposed by the present invention are further described in detail below with reference to the accompanying drawings and specific embodiments. Advantages and features of the present invention will be apparent from the description and appended claims. It should be noted that the drawings are in a very simplified form and all use non-precise proportions, and are only for convenience and clarity to assist the purpose of the embodiments of the present invention.
需说明的是,为了清楚地说明本发明的内容,本发明特举多个实施例以进一步阐释本发明的不同实现方式,其中,该多个实施例是列举式并非穷举式。此外,为了说明的简洁,前实施例中已提及的内容往往在后实施例中予以省略,因此,后实施例中未提及的内容可相应参考前实施例。It should be noted that the various embodiments of the present invention are further described to illustrate the various embodiments of the present invention in order to clearly illustrate the present invention. Further, for the sake of brevity of explanation, the contents already mentioned in the foregoing embodiment are often omitted in the latter embodiment, and therefore, contents not mentioned in the latter embodiment can be referred to the previous embodiment accordingly.
虽然该发明可以以多种形式的修改和替换来扩展,说明书中也列出了一些具体的实施图例并进行详细阐述。应当理解的是,发明者的出发点不是将该发明限于所阐述的特定实施例,正相反,发明者的出发点在于保护所有基于由本权利声明定义的精神或范围内进行的改进、等效转换和修改。同样的元器件号码可能被用于所有附图以代表相同的或类似的部分。Although the invention may be modified in various forms of modifications and substitutions, some specific embodiments of the invention are set forth in the specification and detailed. It should be understood that the inventor's point of departure is not to limit the invention to the particular embodiments set forth, but the inventor's point of departure is to protect all improvements, equivalent transformations and modifications based on the spirit or scope defined by the claims. . The same component numbers may be used in all figures to represent the same or similar parts.
Typically, a data access instruction may be located in a multi-level instruction loop. Each time the same loop level is executed, the corresponding data step size is the same; but when different loop levels are executed, the corresponding step sizes differ. For example, for a data access instruction located in a two-level loop, each iteration of the inner loop increases the data address by '4', i.e. the step size is '4'; but each iteration of the outer loop increases the data address by '20', i.e. the step size is '20'. In this case, taking either '4' or '20' as the single step size of the instruction causes a certain number of data address mispredictions. According to the technical solution of the present invention, based on the relationship between branch instructions and data access instructions, the same data access instruction can be given different step sizes for the different loop levels it lies in, making data address prediction more accurate.
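The two-level example above can be sketched in software as follows. This is an illustrative model only, not part of the claimed hardware; the function name, base address, and iteration counts are assumptions chosen to reproduce the '4'/'20' address pattern described.

```python
# Hypothetical sketch: data addresses touched by one load inside a two-level
# loop. Inner-loop iterations advance the address by 4; moving to the next
# outer iteration advances it by 20 relative to the last inner access.
def access_addresses(outer_n, inner_n, base=0, inner_stride=4, outer_stride=20):
    addrs = []
    addr = base
    for i in range(outer_n):
        for j in range(inner_n):
            addrs.append(addr)
            if j < inner_n - 1:
                addr += inner_stride   # step when the inner branch is taken
        addr += outer_stride           # step when only the outer branch is taken
    return addrs
```

For `access_addresses(2, 3)` the successive address differences are 4, 4, 20, 4, 4, so no single constant stride predicts every access correctly, which is the misprediction problem the per-level step sizes address.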
Please refer to FIG. 1, which is a schematic diagram of data access instructions inside loops according to the present invention. In FIG. 1, instructions are arranged from left to right in address order; instructions 11, 12, and 13 are all data access instructions, while instructions 21, 22, and 23 are all backward-jumping branch instructions. The instructions between each of these three branch instructions and its branch target instruction therefore form a loop. As shown in FIG. 1, three nested loop levels are formed, in which the loop corresponding to branch instruction 21 is the innermost loop and the loop corresponding to branch instruction 23 is the outermost loop. Each data access instruction in this code segment can then be given a dedicated loop step storage module, which provides a different data step size when a different loop level is executed.
Please refer to FIG. 2, which shows an embodiment of the loop step storage module of the present invention. For ease of description, this embodiment explains how step sizes are provided by the loop step storage module; how those step sizes come to be stored in the module is further explained in the embodiment of FIG. 3. In this embodiment, the loop step storage module corresponds to three loop levels and is composed of registers 31, 32, and 33 and selectors 41, 42, and 43. Register 31 and selector 41 correspond to the first loop level (the innermost loop), register 32 and selector 42 to the second level, and register 33 and selector 43 to the third level (the outermost loop).
In the present invention, every data access instruction whose data address is predicted corresponds to one loop step storage module as in FIG. 2. Taking the module for data access instruction 12 as an example (this instruction lies inside three loop levels): register 31 stores the step size and valid bit for execution of the first-level loop (the loop of branch instruction 21); register 32 stores the step size and valid bit for the second-level loop (the loop of branch instruction 22); and register 33 stores the step size and valid bit for the third-level loop (the loop of branch instruction 23). Since each data access instruction has its own loop step storage module, and the corresponding registers in each module can store different values, each data access instruction can use a different step size at each loop level it lies in. The initial value of every valid bit is '0'.
In this embodiment, the three loop levels have a priority relationship. Once the first-level loop is taken, the step size of the first level (the value of register 31) is output regardless of the second and third levels. Conversely, the second-level loop is entered only when the first-level loop is not taken; once the second-level loop is taken, the step size in register 32 is output regardless of the third level. Similarly, the third-level loop is entered only when neither the first nor the second level is taken, and once it is taken the step size in register 33 is output. If none of the three loops is taken (meaning this code segment is being executed for the first time, or executed again because an even more outer loop was taken), a default step size sourced from bus 35 is output. Thus, based on the loop step storage module of the FIG. 2 embodiment, the branch decision signals output by the processor core, which indicate whether the branches of branch instructions 21, 22, and 23 are taken, control the corresponding selectors, and the appropriate step size is output.
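The priority selection among the selectors can be modeled as follows. This is a software sketch under stated assumptions, not the claimed circuit: `taken[i]` stands for the branch decision signal of loop level i (innermost first), `regs[i]` for that level's register as a (valid, stride) pair, and `default` for the value on bus 35.

```python
# Hypothetical sketch of the priority chain in FIG. 2: the innermost taken
# loop level wins; if no level's branch was taken, the default stride is used.
def select_stride(taken, regs, default):
    for level_taken, (valid, stride) in zip(taken, regs):
        if level_taken:
            return (valid, stride)      # this level's register drives the output
        # branch of this level not taken: fall through to the next outer level
    return (True, default)              # no loop taken: default stride from bus 35
```

The returned valid flag mirrors the stored valid bit: a selected stride whose valid bit is '0' must not be used for prediction.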
For example, in the code of FIG. 1, the first time data access instructions 11, 12, and 13 are executed, branch instructions 21, 22, and 23 have not yet been executed, so none of the corresponding branches has been taken. The selection signals of selectors 41, 42, and 43 in the corresponding loop step storage modules are all '0', and each module outputs the default step size sourced from bus 35.
When branch instruction 21 takes its branch for the first time, the first-level loop is entered and data access instructions 12 and 13 are executed again, i.e. for the second time. At this point the valid bit read from register 31 is '0', so the step size stored in register 31 is invalid. The step size is therefore computed now and, under control of the branch decision signal of branch instruction 22, stored into register 31, and the valid bit in register 31 is set to '1'.
Assuming the branch of branch instruction 21 is taken every time thereafter (i.e. the first-level loop keeps executing), the selection signal of selector 41 in the corresponding loop step storage module is '1', and the modules for data access instructions 12 and 13 each output the valid bit and step size of their register 31. Since the valid bit is now '1', this step size can be used to compute the corresponding predicted data address. Meanwhile, the step size is recomputed as described above and stored into register 31 under control of the branch decision signal of branch instruction 22, with the valid bit remaining '1'. In this way, if the step size has not changed between two consecutive executions of the same data access instruction, the value in register 31 is unchanged; if the step size has changed, the value in register 31 is updated to the new step size.
When branch instruction 21 is executed again and its branch is not taken, but branch instruction 22 is then executed and its branch is taken, the selection signals of selectors 41 and 42 in the corresponding loop step storage modules are '0' and '1' respectively, and the modules for data access instructions 11, 12, and 13 each output the step size in register 32 to compute the corresponding predicted data address. Thus, at different loop levels, different register values in the loop step storage module are used as the step size.
Moreover, for data access instruction 11, if the branch of branch instruction 21 is taken, the selection signal of selector 41 is '1'. In this case the step size in register 31 is output, but since the first-level loop does not contain data access instruction 11, this step size is ignored. In the other cases, the step size in register 32 or 33, or the default step size sourced from bus 35, is output as described above, so that different loops are provided with different step sizes.
Each loop step storage module of the present invention, which provides step sizes for data access instructions inside loops, corresponds to one data access instruction. By extending the loop step storage module with more registers and selectors (each group of one register and its corresponding selector serving one loop level), more loop levels can be supported; by then providing one such module for every (or some) data access instruction in those deeper loops, all (or some) of the data access instructions can each be given a more accurate step size according to which loop level is executing.
Please refer to FIG. 3, which is an embodiment of the data address generation system of the present invention. It comprises a memory array 52, collectively called the step size table, corresponding to a plurality of data access instructions; a column decoder 50 that decodes according to branch decision results; a row decoder 54 that decodes according to the instruction addresses of data access instructions; and an address generator 60. Address generator 60 is composed of a subtractor 61, an adder 62, a selector 63, and a comparator 64. Each row of memory array 52 corresponds to one data access instruction. For each row of 52, the row decoder 54 contains a register storing the instruction address of the data access instruction corresponding to that row, and a comparator comparing the register contents with the instruction address on the data access instruction address bus 53. When the register contents of a row in 54 equal the address on 53, 54 enables the word line of that row, so that the row can be read or written. Array 52 has two read/write ports. One read/write port 37, 38 is dedicated to the rightmost column 58 of 52; access to this column is controlled only by the row decoder 54, and its entry format stores the corresponding data address 67 of a data access instruction together with a valid bit 68. The other read/write port 34, 36 is shared by all columns of 52 other than column 58 (such as column 56); through it, the entry at the row selected by row decoder 54 and the column selected by column decoder 50 can be accessed. The entry format of these columns stores the step size 65 and valid bit 66 of each of the plurality of data access instructions for a particular loop level. The number of these columns of array 52 (all columns except 58) corresponds to the maximum number of loop levels this step size table can support. For each of these columns, 50 contains a comparator and an address register storing the instruction address of a branch instruction. When the processor core makes a 'branch taken' decision, the instruction address of the corresponding branch instruction is sent via bus 51 to 50 for matching; the column of array 52 (other than column 58) corresponding to the matching register can then be accessed through read/write port 34, 36.
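The organization of the step size table can be sketched as the following software model. This is an illustrative abstraction, not the claimed circuit: rows are keyed by the load instruction's address (bus 53), columns by the taken branch's address (bus 51), and the class name and method names are assumptions.

```python
# Hypothetical model of array 52: one row per data access instruction, one
# column per taken branch, plus the dedicated "column 58" holding each row's
# last data address (field 67) and valid bit (field 68).
class StepTable:
    def __init__(self):
        self.last_addr = {}   # load_pc -> last data address (column 58)
        self.strides = {}     # (load_pc, branch_pc) -> step size (fields 65/66)

    def lookup(self, load_pc, branch_pc):
        """Return (address valid, stride valid), like bits 68 and 66."""
        return (load_pc in self.last_addr,
                (load_pc, branch_pc) in self.strides)

    def record_addr(self, load_pc, addr):
        self.last_addr[load_pc] = addr          # write via port 38 into column 58

    def learn_stride(self, load_pc, branch_pc, addr):
        # subtractor 61: new address minus the previously recorded address
        self.strides[(load_pc, branch_pc)] = addr - self.last_addr[load_pc]

    def predict(self, load_pc, branch_pc):
        # adder 62: last address plus the stride of the selected column
        return self.last_addr[load_pc] + self.strides[(load_pc, branch_pc)]
```

A dictionary stands in for the CAM-style matching performed by the row and column decoders; a missing key corresponds to a valid bit of '0'.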
The processor core executes instructions in order, starting from the leftmost instruction in FIG. 1 and proceeding left to right. When the processor core decodes instruction 11, it finds it to be a data load instruction, and the instruction address of instruction 11 is sent via bus 53 to the row decoder 54 for matching. The match misses, so the row replacement logic in 54 allocates a row for instruction 11, namely the top row 11 of array 52 in FIG. 3, and stores the instruction address of instruction 11 into the register of 54 corresponding to that row. The valid bits 66 and 68 of every column in that row are set to '0'. Based on the '0' valid bit 68 in column 58, the data address generation system has the processor core execute data load instruction 11 to generate a data address, access data memory (for example, a data cache), and read the data for the processor core to execute. At the same time, the system also stores this data address, via bus 57, selector 63, and write port 38, into field 67 of the entry at row 11, column 58 of array 52, and sets the valid bit 68 of that entry to '1'. The processor core then executes subsequent instructions, and the row replacement logic of 54 likewise allocates one row each for data load instructions 12 and 13, namely the middle and bottom rows 12 and 13 of array 52 in FIG. 3. As above, the data addresses generated when the processor core executes instructions 12 and 13 are stored in field 67 of the entries at rows 12 and 13, column 58, respectively, and the valid bits 68 of those entries are set to '1'.
Thereafter the processor core executes branch instruction 21 and decides its branch is 'taken'. The processor core therefore jumps backward to the branch target instruction located between instruction 11 and instruction 12 and executes in instruction order from that instruction. At the same time, the instruction address of branch instruction 21 is sent via bus 51 to the column decoder 50 for matching. Since there is no match, the column replacement logic of 50 allocates it a column of array 52 (column 21 in the figure) and stores the instruction address of branch instruction 21 into the register of 50 corresponding to that column. Bus 51 always holds the instruction address of the last taken branch; the address on 51 now matches the address in the register of 50 corresponding to column 21, so the column decoder 50 selects column 21.
The processor core continues executing instructions in program order and executes instruction 12 again; decoding instruction 12 finds it to be a data load instruction. The instruction address of instruction 12 is therefore sent via bus 53 to the row decoder 54 for matching. This time the match hits: from row 12, column 58 of 52, valid bit 68 is read out through read port 37 as '1', and from row 12, column 21 of 52, valid bit 66 is read out through read port 36 as '0'. Based on these two valid bits being '10', the system has the processor core execute data load instruction 12 to generate a data address, access data memory, and read the data for the processor core to execute. At the same time, the system also sends this data address via bus 57 to the address generator 60, where subtractor 61 subtracts from it the previous data address 67 read out at this time through read port 37 from row 12, column 58 of array 52; the difference, as the address increment (step size), is written into array 52 through write port 34. Bus 53 now carries the address of data load instruction 12 and bus 51 carries the address of branch instruction 21, so this step size is stored in the step size field 65 of the entry at row 12, column 21 of array 52, and the system sets the valid bit 66 of row 12, column 21 to '1'. In the same manner, the system stores the step size of data load instruction 13 into field 65 of row 13, column 21 of 52 and sets valid bit 66 of row 13, column 21 to '1'.
Thereafter the processor core executes branch instruction 21 again, and its branch decision is again 'taken'. The processor core therefore jumps backward to the branch target instruction between instruction 11 and instruction 12 and executes in instruction order from that instruction. At the same time, the instruction address of branch instruction 21 is sent via bus 51 to the column decoder 50 for matching. This time the match hits, so the column decoder 50 selects column 21.
The processor core executes instructions in program order and executes instruction 12 again; decoding instruction 12 finds it to be a data load instruction. The instruction address of instruction 12 is therefore sent via bus 53 to the row decoder 54 for matching. This time the match hits: from row 12, column 58 of 52, valid bit 68 is read out through read port 37 as '1', and from row 12, column 21 of 52, valid bit 66 is read out through read port 36 as '1'. Based on these two valid bits being '11', the system uses adder 62 to add the data address 67 read out at this time from row 12, column 58 through read port 37 to the step size 65 read from row 12, column 21 through read port 36. The sum, 38, is used as a new data address to access data memory, and the data read out is sent to the processor core for processing. The data address 57 generated by the processor core executing instruction 12 is compared by comparator 64 with the data address on 38. If the two are the same, the processor core continues executing subsequent instructions in program order; the sum from adder 62 is also written back through write port 38 into field 67 of row 12, column 58, and valid bit 68 of row 12, column 58 remains '1'. If the two differ, the system has the processor core discard the intermediate execution results based on the data fetched from the data address on 38, execute the load of instruction 12 with the data fetched from data address 57, and then execute subsequent instructions in order. The system also controls selector 63 to select the address on 57 and store it through write port 38 into field 67 of the entry at row 12, column 58, keeping valid bit 68 of that entry at '1'; but it sets valid bit 66 of row 12, column 21 to '0', thereby recording that the step size in that entry is invalid and must be re-learned. The system processes instruction 13 in the same manner.
Thereafter the processor core executes branch instruction 21 again, and this time the branch decision is 'not taken'. The processor core therefore continues executing in instruction order. The processor core then executes branch instruction 22, whose branch decision is 'taken'. The processor core therefore jumps backward to the branch target instruction located before instruction 11 and executes in instruction order from that instruction. At the same time, the instruction address of branch instruction 22 is sent via bus 51 to the column decoder 50 for matching. Since there is no match, the column replacement logic of 50 allocates it a column of array 52 (column 22 in the figure) and stores the instruction address of branch instruction 22 into the register of 50 corresponding to that column. The address now held on bus 51 matches the address in the register of 50 corresponding to column 22, so the column decoder 50 selects column 22.
The processor core executes instructions in program order and executes instruction 11 again; decoding instruction 11 finds it to be a data load instruction. The instruction address of instruction 11 is therefore sent via bus 53 to 54 for matching. This time the match hits: from row 11, column 58 of array 52, valid bit 68 is read out through read port 37 as '1', and from row 11, column 22 of 52, valid bit 66 is read out through read port 36 as '0'. The system therefore operates as in the previous case where the data address 67 is valid but the step size 65 is invalid: it reads data from data memory at the data address generated by the processor core for the processor core to process, subtracts the column 58 data address 67 read through read port 37 from the data address sent via bus 57, stores the difference as the step size through the write port into field 65 of row 11, column 22, and sets 66 to '1'. The system also stores the data address on 57 into field 67 of the entry at row 11, column 58, keeping the valid bit 68 of that entry at '1'. The system processes instructions 12 and 13 in the same way. Thereafter, operation repeats in this manner.
In summary, a row of array 52 stores the data memory address, step sizes, and corresponding valid bits of one data access instruction; a column of 52 stores, for one branch instruction whose branch was taken, the step size and valid bit of each data access instruction. The special column 58 stores the data memory addresses, and its reads and writes are unaffected by the state of branch instructions. The system selects a row of array 52 via bus 53 using the instruction address of a data access instruction and reads the valid bit 68 of its column 58; it selects a column via bus 51 using the instruction address of the last taken branch and reads the valid bit 66 there. Depending on the states of valid bits 68 and 66 in a row, the system has the following three operating modes. When the 68 and 66 read out are '00', both the data address and the step size are invalid; the system then stores the data address 57 generated by the processor core into field 67 of column 58 of that row and sets the state to '10'. When the 68 and 66 read out are '10', the data address is valid but the step size is invalid; the system then computes the difference between the data address 57 generated by the processor core and the data address stored in field 67 of column 58 of that row, stores it into field 65 of the column currently selected by the branch decision, and sets the state to '11'. When the 68 and 66 read out are '11', both the data address and the step size are valid; the system then adds the data address in field 67 of column 58 to the step size in field 65 of the other column selected by the branch decision to produce data address 38, accesses data memory, and reads the data for the processor core to process. In this state the system also compares the data address 57 generated by the processor core with data address 38, takes corrective action as needed according to the comparison result, and updates the state of 66.
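The three operating modes can be sketched as the following state machine. This is an illustrative software model, not the claimed circuit: `entry` stands for the pair of table entries selected by the current load and taken branch (fields 67/68 and 65/66), and `core_addr` for the address on bus 57.

```python
# Hypothetical sketch of the three modes keyed by the (68, 66) valid bits.
def handle_load(entry, core_addr):
    """Returns the predicted address (38), or None when no prediction is made."""
    if not entry['addr_valid']:                       # state '00': learn address
        entry.update(addr=core_addr, addr_valid=True)
        return None
    if not entry['stride_valid']:                     # state '10': learn stride
        entry.update(stride=core_addr - entry['addr'],
                     stride_valid=True, addr=core_addr)
        return None
    predicted = entry['addr'] + entry['stride']       # state '11': adder 62
    if predicted == core_addr:                        # comparator 64 agrees
        entry['addr'] = predicted                     # write sum back to field 67
    else:                                             # misprediction: correct and
        entry.update(addr=core_addr, stride_valid=False)  # mark stride for relearn
    return predicted
```

Note that on a misprediction the address field is corrected from bus 57 while only the stride's valid bit is cleared, matching the corrective action described above.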
The above system, which fetches data ahead of time based on step sizes, can be used in several ways. In the first way, when the data read using a data address 38 produced from a row of array 52 is accepted by the processor core (the address on bus 38 equals the address on bus 57), the system immediately sends the address on 38 plus the step size on read port 36 to the data cache for matching as a speculative data address. On a miss, fetching the data from a lower memory level into the higher-level cache can begin. The processor core still generates a data address via bus 57 to read the data from the data cache. This way can, in most cases, partially or completely hide data cache misses. However, the speculative data address uses a step size selected based on the result of the last taken branch and is issued before the next taken branch, whereas the processor core executes the same data load instruction and reads the data after the next taken branch. Two consecutive taken branches are not necessarily the same branch instruction, so the step size used to produce the speculative data address does not necessarily match the increment of the data address generated by the processor core. In the second way, after a taken branch, the entries in the column selected by that branch, together with the column 58 entries, are read from array 52 for each previously executed data load instruction address; if the entry states 68, 66 are '11', data address 38 is generated and the data is read from data memory into a data read buffer with shorter read latency, ready for the processor core to fetch. The following is an embodiment of this second way.
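The first way described above can be sketched as follows. This is a software illustration under assumptions: `cache` is modeled as a set of cached addresses, and `fill` as a callback that starts a fill from the lower memory level; both names are hypothetical.

```python
# Hypothetical sketch of the first usage (speculative prefetch): once a
# predicted address is confirmed by the core, probe the cache at
# predicted + stride and start a lower-level fill on a miss.
def speculative_prefetch(cache, predicted_addr, stride, fill):
    guess = predicted_addr + stride       # address on 38 plus step on port 36
    if guess not in cache:                # cache tag match fails
        fill(guess)                       # begin fetch from the lower level
    return guess
```

Because the guess runs ahead of the next branch decision, it may use the wrong loop level's stride, which is exactly the limitation the second way (prefetching into a data read buffer only after the branch is resolved) avoids.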
Please refer to FIG. 4, which is an embodiment of the row decoder of the step size table of the present invention. FIG. 4 shows the decode logic in row decoder 54 corresponding to one row of array 52, in which 72 and 74 are registers, 76 is a selector, and 78 is a comparator. The output of register 72 is connected to one input of comparator 78, and the output of selector 76 is connected to the other input of comparator 78. Besides equality comparison, comparator 78 can also perform greater-than and less-than comparisons, for a total of three comparison modes. The comparison mode of comparator 78 is linked to the selection of selector 76. When selector 76 selects bus 73 as input, comparator 78 tests whether the address stored in register 72 is greater than or equal to the address on 73; when selector 76 selects bus 51 as input, 78 tests whether the address stored in register 72 is less than the address on 51; when selector 76 selects bus 53 as input, 78 tests whether the address stored in register 72 equals the address on 53. 75 is the comparison result output by comparator 78.
Bus 53 carries the instruction address of the load instruction output by the processor core. This address is compared for equality, by the comparator 78 of each row, against the contents of register 72 in every row of row decoder 54. If no row matches, the row replacement logic in 54 allocates a row for that load instruction and stores the address on bus 53 into register 72 of that row. Thereafter the same address on bus 53 matches the address in register 72 of that row and enables the row's word line; this operation was detailed in the embodiment of FIG. 3. Now, when a backward-jumping branch instruction is successfully taken, the system places the instruction address of that branch on bus 51 and the instruction address of its branch target on bus 73. The system compares the contents of register 72 in every row of 54 against the address on bus 73 (greater-than-or-equal) and against the address on bus 51 (less-than) as described above, and the comparison result 75 is stored in register 74. When the address in register 72 of a row is greater than or equal to the address on bus 73 but less than the address on bus 51, the load instruction corresponding to that row lies between the branch instruction (exclusive) and the branch target instruction (inclusive). For example, in FIG. 1 instructions 12 and 13 are inside the loop closed by branch instruction 21, so register 74 of those rows is written with the comparison result '1'. Rows whose comparison does not satisfy the above condition are not between the branch target instruction and the branch instruction; for example, instruction 11 in FIG. 1 is not in the loop of branch instruction 21, so register 74 of its row is written with '0'.
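The range check a row performs after a taken backward branch can be stated compactly. A minimal sketch (the function name is ours, not part of the disclosure):

```python
# Sketch of one row's decode logic (FIG. 4): register 72 holds a load
# instruction's address; the row is "in the loop" exactly when
# branch_target (bus 73) <= reg72 < branch_addr (bus 51).

def in_loop(reg72, branch_target, branch_addr):
    """Comparison result written into register 74 after a taken backward branch."""
    return int(branch_target <= reg72 < branch_addr)
```

With the FIG. 1 example (branch 21 jumping back to target 12), loads at 12 and 13 yield '1' while the load at 11 yields '0', matching the text above.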
The system then enables, one after another, the word lines of all rows whose register 74 holds '1', reading from array 52 each such row's entry in column 58 together with its entry in the column selected by column decoder 50 according to the branch instruction address on bus 51. If the entry state 68, 66 is '11', the system adds the step size in field 65 to the previous iteration's data address in field 67, obtaining a new data address that addresses the data memory over bus 38 and reads data for the processor core's later use. The new data address is also written back to field 67. Rows whose register 74 holds '0' are left untouched. Later, when the processor core executes the load instruction, the instruction address sent over bus 53 is compared for equality against the contents of register 72 in each row of row decoder 54. For the matching row the system inspects register 74: if it holds '0', the system reads the state 68, 66 of that row and proceeds as described earlier; if it holds '1', the system reads the data address in field 67 of that row and compares it with the data address sent by the processor core over bus 57. If the two addresses are equal, the system clears register 74 of that row to '0' and takes no further action. If the two addresses differ, the system clears register 74 to '0' and proceeds as in the earlier case where the address in field 67 differs from the address on bus 57.
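The update pass over the flagged rows can be modeled as follows; this is an illustrative sketch in which the row layout (`flag`, `state`, `step`, `addr`) is a hypothetical simplification of registers 74, 68/66 and fields 65, 67.

```python
# Model of the update pass: for every row flagged '1' in register 74 whose
# entry state is '11', add the row's step (field 65) to its last data
# address (field 67), fetch from the data memory, and write the new
# address back into field 67.

def update_flagged_rows(rows, data_memory, read_buffer):
    """rows: list of dicts with keys flag, state, step, addr (hypothetical layout)."""
    for row in rows:
        if row["flag"] == 1 and row["state"] == "11":
            row["addr"] += row["step"]                    # new data address on bus 38
            read_buffer.append(data_memory[row["addr"]])  # read data for the core
    return read_buffer

rows = [
    {"flag": 1, "state": "11", "step": 4, "addr": 100},  # in-loop load: updated
    {"flag": 0, "state": "11", "step": 4, "addr": 200},  # out-of-loop load: untouched
]
memory = {104: "x", 204: "y"}
buf = update_flagged_rows(rows, memory, [])
```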
Referring to FIG. 5, a block diagram of a processor system using the data address generation system of the present invention is shown. Here 50 is the column decoder of the step size table, 52 the step size table array, 54 the row decoder of the step size table, 60 the address generator, 80 the processor core, 82 the data read buffer, and 84 the data memory. Bus 51 is the branch instruction address bus for taken branches, output by processor core 80 to column decoder 50 and row decoder 54. Bus 73 is the branch target instruction address bus, output by 80 to 54. Bus 53 is the data access instruction address bus, output by 80 to 54. Bus 57 is a data address bus, output by 80 to address generator 60 and data memory 84. Bus 38 is a data address bus, output by address generator 60 to data memory 84. Bus 85 carries data from data memory 84 into data read buffer 82 for temporary storage, and bus 87 carries data from 82 to processor core 80. The number of columns of the step size table (50, 52, 54) determines how many loop levels or loops it can handle; the number of rows determines how many data access instructions it can handle.
The step size table is a one-dimensional data structure (column 58) plus a two-dimensional one. The one-dimensional structure is addressed by the instruction address of a data access instruction, and its content is a data address. In the two-dimensional structure, one dimension is addressed by the instruction address of a data access instruction and the other by the instruction address of a branch instruction, and its content is a data address increment (step size). Such a step size table maps the instruction address of a data access instruction to its corresponding data address. The mapping is not fixed; it is a dynamic mapping that changes with the number of times the data access instruction has executed and with its loop path. The system allocates a one-dimensional storage resource (a row) in the step size table using the data access instruction address supplied by processor core 80 over bus 53, and initializes that row with the corresponding data address supplied by 80 over bus 57. It allocates the other dimension (a column) using the branch instruction address supplied by 80 over bus 51, and stores into that column the difference between the data address supplied again by 80 over bus 57 and the initial content in the step size table. Afterwards, the system uses the instruction address of a taken backward-jumping branch, carried on bus 51, as the upper bound, and the corresponding branch target instruction address, carried on bus 73, as the lower bound, so that the step size table and address generator 60 automatically update the data addresses of the data access instructions whose instruction addresses lie between the bounds, each by its own step size in the column selected by the current state of the branch loop, access data memory 84 over bus 38 with the updated data addresses, and read the corresponding data into data read buffer 82 before processor core 80 outputs the corresponding data address on bus 57.
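The learn-then-advance behavior of this dynamic mapping can be sketched as a small class. This is an illustrative model: `StepTable`, `learn` and `advance` are our names, and column allocation is simplified to a dictionary keyed by (load PC, branch PC).

```python
# Sketch of the step size table: a 1-D structure (column 58, last data
# address per load instruction) plus a 2-D structure (step size per
# load-instruction x branch-instruction pair).

class StepTable:
    def __init__(self):
        self.addr = {}   # column 58: load instruction address -> last data address
        self.step = {}   # (load PC, branch PC) -> data address increment

    def learn(self, load_pc, branch_pc, data_addr):
        """Record the core's data address; derive the step on a repeat execution."""
        if load_pc in self.addr:
            self.step[(load_pc, branch_pc)] = data_addr - self.addr[load_pc]
        self.addr[load_pc] = data_addr

    def advance(self, load_pc, branch_pc):
        """Produce the next data address from the step selected by the branch."""
        self.addr[load_pc] += self.step[(load_pc, branch_pc)]
        return self.addr[load_pc]
```

After two observed executions, `advance` predicts the third address without waiting for the core.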
The data read buffer 82 may take an address-matched read form. In this form, every row of 82 has an entry storing data and an entry storing the corresponding data address. Processor core 80 sends a data address over bus 57, and the data in the entry whose stored address matches the address on bus 57 is delivered to processor core 80 over bus 87. In this form, the row replacement logic may use a policy such as LRU (least recently used), and columns may likewise be replaced by LRU or a similar policy. Note that in this form, when performing the data reads from the lower bound to the upper bound, a mechanism should keep them as close to instruction address order as possible, because the data for the access instructions nearest the lower bound are the first the processor core will use.
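The address-matched form with LRU replacement can be sketched as below. This is an illustrative model; it exploits Python's dict insertion-order preservation as a shortcut for LRU bookkeeping, and all names are hypothetical.

```python
# Sketch of the address-matched form of data read buffer 82: each row pairs
# a data address with its data; the core's address on bus 57 selects a row.
# Dict insertion order stands in for LRU age (oldest entry first).

class AddrMatchedBuffer:
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.rows = {}                       # data address -> data, LRU-ordered

    def fill(self, addr, data):
        """Store prefetched data, evicting the least recently used row if full."""
        if len(self.rows) >= self.capacity and addr not in self.rows:
            self.rows.pop(next(iter(self.rows)))   # evict the LRU row
        self.rows[addr] = data

    def read(self, addr):
        """Match the core's address (bus 57) and refresh the row's LRU position."""
        data = self.rows.pop(addr)
        self.rows[addr] = data               # re-insert as most recently used
        return data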
Another form of data read buffer 82 is first-in first-out (FIFO). In this case the row allocation logic in row decoder 54 allocates row resources strictly in increasing row-number order; in the embodiment of FIG. 3, rows are allocated in the address order of instructions 11, 12, 13. Thus, when processor core 80 supplies the lower and upper bounds to row decoder 54, 54 enables the word lines in instruction address order from the lower bound to the upper bound, making the step size table and address generator 60 supply the corresponding data addresses to data memory 84 in instruction order, so that the data read out are stored into the first-in first-out data read buffer 82 in instruction order. Each time processor core 80 executes a load instruction, it issues a read request to FIFO 82, and 82 delivers one datum to 80. At that point 82 is a data queue ordered by instruction address.
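The FIFO form is the simplest to model: prefetched data enter in instruction address order and each executed load pops exactly one datum. A minimal sketch (names are ours):

```python
# Sketch of the FIFO form of data read buffer 82: data prefetched from the
# lower bound to the upper bound are pushed in instruction address order;
# each executed load instruction pops exactly one datum.

from collections import deque

def prefetch_fifo(addrs_in_instr_order, data_memory):
    """Fill the FIFO with the data at the given addresses, in instruction order."""
    return deque(data_memory[a] for a in addrs_in_instr_order)

memory = {100: "a", 200: "b", 300: "c"}
fifo = prefetch_fifo([100, 200, 300], memory)   # bounds walked in instruction order
first = fifo.popleft()                          # the load nearest the lower bound
```

Because the fill order equals the execution order, no address matching is needed on the read side.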
This form of row replacement logic treats the step size table as a circular buffer: once the last row of the table has been allocated, the next allocation takes the first row, then the second, and so on. The addresses stored in registers 72 keep the mechanisms described above working across the wrap. For example, suppose in the embodiment of FIG. 1 there are other data access instructions before instructions 11, 12, 13; instruction 11 then obtains the bottom row in FIG. 3, and instructions 12 and 13 are allocated the top and middle rows in turn. When branch instruction 22 or 23 is taken, the lower bound is at the bottom row (instruction 11) and the upper bound is at the middle row (instruction 13). The data reads from the lower bound to the upper bound start at the bottom row (instruction 11), pass through the top row (instruction 12), and finish at the middle row (instruction 13). Columns may be organized as a circular buffer in the same way (after the last column is allocated, the first column is allocated next), or a replacement policy such as LRU may be used.
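The wrap-around allocation and the role of register 72 can be sketched as follows; `CircularAllocator` and its methods are hypothetical names for an illustrative model.

```python
# Sketch of circular-buffer row allocation: after the last row is handed
# out, allocation wraps back to row 0. Register 72 (the stored instruction
# address) is what lets lookups keep working across the wrap.

class CircularAllocator:
    def __init__(self, num_rows):
        self.num_rows = num_rows
        self.next_row = 0
        self.reg72 = [None] * num_rows   # instruction address held by each row

    def allocate(self, instr_addr):
        """Hand out the next row in strictly increasing order, wrapping at the end."""
        row = self.next_row
        self.reg72[row] = instr_addr
        self.next_row = (self.next_row + 1) % self.num_rows
        return row

    def lookup(self, instr_addr):
        """Find the row whose register 72 matches, regardless of wrap position."""
        return self.reg72.index(instr_addr)
```

With a 3-row table, allocating for an earlier instruction and then for instructions 11, 12, 13 makes 13 wrap onto row 0, yet lookups by instruction address still resolve correctly.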
The third way to use the above step-size-based early data fetching system combines the first and second ways. After a branch decision is made, the data addresses of the data access instructions between the corresponding lower and upper bounds are updated; the data memory is accessed with these updated addresses, the data read out are stored into data read buffer 82 for processor core 80 to use, and the updated data addresses are stored into column 58 of array 52 (that is, the second way). At the same time, the step size selected by the current branch decision is added to the data address to produce a guessed data address that is sent to the data cache; if the corresponding data is not yet in the cache, it is prefetched into the cache to hide the cache miss, but the guessed data address is not stored into column 58 of array 52. Generating the guessed data address, and addressing the data memory with it so that data which the next execution of the same load instruction may need is filled into the data memory in advance, both happen before the new branch decision is made; this part is therefore the first way.
The data memory 84 may be implemented as a cache. A data cache generally has a tag unit storing all or part of each memory address. A memory address sent to the tag unit is matched, and on a hit a cache address is produced (for example, in a set-associative cache, the way number combined with the index portion and the intra-block offset portion of the memory address) which addresses the data RAM of the cache. The step size table disclosed by the present invention can store cache addresses directly in field 67 of column 58, so that the address sent over bus 38 directly addresses the data RAM of the cache without passing through the tag unit mapping. This requires the row addresses in the data memory to be contiguous within a certain range, so that address generator 60 can automatically compute the next data address by step increments; for example, several rows with contiguous addresses in the data memory may be stored in the same way of a set-associative cache. When the cache address crosses into a non-contiguous address region, some method must adjust the cache address, such as changing the way number. Further, the memory address and its corresponding cache address may both be stored in field 67 and updated with the same step size. The cache address directly addresses the data RAM of the cache, while the memory address is compared with the memory address output by processor core 80 over bus 57 to confirm that the address produced by address generator 60 is correct. In this format, when the cache address crosses into a non-contiguous address region, the memory address in field 67 is matched in the tag unit of the data cache to obtain a new cache address; within a contiguous region, the cache address is simply updated incrementally to address the data RAM of the cache.
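The dual-address update in field 67 can be sketched as below. This is an illustrative model under stated assumptions: the contiguous-region size and the toy `tag_lookup` mapping are ours, not part of the disclosure.

```python
# Sketch of field 67 holding both the memory address and its cache address,
# both advanced by the same step; when the stride leaves the contiguous
# region, the memory address is re-mapped through the tag unit instead.

def advance_entry(entry, step, tag_lookup, region_size=64):
    """entry: {'mem': memory address, 'cache': cache address} (hypothetical layout)."""
    new_mem = entry["mem"] + step
    if new_mem // region_size == entry["mem"] // region_size:
        new_cache = entry["cache"] + step   # still contiguous: just increment
    else:
        new_cache = tag_lookup(new_mem)     # crossed a region: consult the tag unit
    return {"mem": new_mem, "cache": new_cache}
```

Within a region no tag lookup is needed at all, which is the point of caching the cache address in field 67.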
The above embodiments all take load instructions as examples, but the method and system of the present invention also apply to store instructions. For example, the first way can be applied to a write-back cache with a write-allocate policy: the address generator produces the memory address about to be stored to and reads the corresponding data from memory into the data cache, so that the processor core avoids a cache miss when it stores data into the data cache.
The step size table is a two-dimensional data structure in which one dimension is addressed by one parameter and the other dimension by another parameter to access one entry. The parameters could address the table directly, but if the parameter values are not contiguous they can be compressed. In the embodiments of FIG. 3 and FIG. 4, the registers 72 in row decoder 54 play a role similar to the tag unit of a fully associative cache, compressing the address space and the holes in array 52 (data access instructions make up roughly one third of all instructions, and their addresses are not contiguous). Seen this way, the step size table is in fact a structure resembling a fully associative cache. The address registers in column decoder 50 perform the same compression (branch instructions make up roughly one sixth of all instructions, and their addresses are likewise not contiguous). The step size table can therefore be viewed as a two-dimensional fully associative compressed structure.
Although the embodiments of the present invention describe only structural features and/or method steps of the invention, it should be understood that the claims of the invention are not limited to those features and steps. On the contrary, the described features and steps are merely examples of implementing the claims of the invention.
It should be understood that the components listed in the above embodiments are given for ease of description only; other components may be included, or some components may be combined or omitted. The components may be distributed across multiple systems, may be physical or virtual, and may be implemented in hardware (such as integrated circuits), in software, or in a combination of hardware and software.
Obviously, in light of the description of the preferred embodiments above, however fast the technology of this field develops and whatever presently unforeseeable progress may be made, a person of ordinary skill in the art may, following the principles of the present invention, make corresponding substitutions, adjustments and improvements to the relevant parameters and configurations; all such substitutions, adjustments and improvements fall within the scope of protection of the claims appended to the present invention.
The system and method proposed by the present invention can be used in a variety of processor-related applications, including general-purpose processors, microcontrollers, multi-lane processors, artificial-intelligence processors, big-data processors, digital signal processors, and graphics processors, and can improve processor efficiency.

Claims (10)

  1. A data address generation system, characterized in that it comprises:
    a step size table for storing data addresses and address increments;
    wherein the data address generation system learns and records, by data access instruction address, the data addresses produced when the processor core executes data access instructions, and the data address increments between two executions of the same instruction, and stores them into the step size table;
    and the data address generation system addresses the contents of the step size table by data access instruction address to produce new data access addresses, accesses the data memory, and obtains data for use by the processor core.
  2. The data address generation system of claim 1, characterized in that the content of an entry in the step size table is said data address;
    and the step size table is addressed by said data access instruction address.
  3. The data address generation system of claim 1, characterized in that the content of an entry in the step size table is said data address increment;
    one dimension of the step size table is addressed by said data access instruction address;
    and the other dimension of the step size table is addressed by the instruction address of the backward branch instruction of a taken branch.
  4. The data address generation system of claim 1, characterized in that the data address generation system produces addresses as follows:
    the data address generation system adds the data address and the data address increment stored in the step size table to produce a new data address;
    and the new data address is stored back into the step size table.
  5. The data address generation system of claim 1, characterized in that:
    the data address generation system accesses the step size table contents of the data access instructions whose addresses lie between the instruction address of the backward branch instruction of a taken branch and its branch target instruction address;
    the data address generation system produces new data memory addresses from those step size table contents;
    the data address generation system accesses the data memory with the new data memory addresses and obtains data for the processor to process;
    and the new data addresses are stored back into the step size table.
  6. A data address generation method, characterized in that it comprises the following steps:
    learning, by data access instruction address, the data addresses produced when the processor core executes data access instructions;
    learning, by data access instruction address, the data address increments between two executions of the same data access instruction by the processor core;
    recording the above data addresses and data increments in a step size table;
    and addressing the contents of the step size table by data access instruction address to produce new data access addresses, accessing the data memory, and obtaining data for use by the processor core.
  7. The data address generation method of claim 6, characterized in that the content of an entry in the step size table is said data address;
    and the step size table is addressed by said data access instruction address.
  8. The data address generation method of claim 6, characterized in that the content of an entry in the step size table is said data address increment;
    one dimension of the step size table is addressed by said data access instruction address;
    and the other dimension of the step size table is addressed by the instruction address of the backward branch instruction of a taken branch.
  9. The data address generation method of claim 6, characterized in that addresses are produced as follows:
    the data address and the data address increment stored in the step size table are added to produce a new data address;
    and the new data address is stored back into the step size table.
  10. The data address generation method of claim 6, characterized in that:
    the step size table contents of the data access instructions whose addresses lie between the instruction address of the backward branch instruction of a taken branch and its branch target instruction address are accessed;
    new data memory addresses are produced from those step size table contents;
    the data memory is accessed with the new data memory addresses, and data is obtained for the processor to process;
    and the new data addresses are stored back into the step size table.
PCT/CN2016/083018 2015-05-23 2016-05-23 Generation system and method of data address WO2016188392A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510271803.5 2015-05-23
CN201510271803.5A CN106293624A (en) 2015-05-23 2015-05-23 Data address generation system and method

Publications (1)

Publication Number Publication Date
WO2016188392A1 true WO2016188392A1 (en) 2016-12-01
