WO2016188392A1 - Generation system and method of data address - Google Patents


Info

Publication number
WO2016188392A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
address
instruction
step size
branch
Prior art date
Application number
PCT/CN2016/083018
Other languages
French (fr)
Chinese (zh)
Inventor
林正浩 (Lin Zhenghao)
Original Assignee
上海芯豪微电子有限公司 (Shanghai Xinhao Microelectronics Co., Ltd.)
Priority date
Filing date
Publication date
Application filed by 上海芯豪微电子有限公司 (Shanghai Xinhao Microelectronics Co., Ltd.)
Publication of WO2016188392A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/30 Arrangements for executing machine instructions, e.g. instruction decode
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 12/00 Accessing, addressing or allocating within memory systems or architectures
    • G06F 12/02 Addressing or allocation; Relocation
    • G06F 12/08 Addressing or allocation; Relocation in hierarchically structured memory systems, e.g. virtual memory systems

Definitions

  • The invention relates to the fields of computers, communications, and integrated circuits.
  • The role of the cache in a processor system is to hold a copy of a portion of the memory contents, so that those contents can be accessed quickly by the processor core, ensuring continuous operation of the pipeline.
  • A cache is generally divided into an instruction cache and a data cache.
  • Instruction addresses exhibit better locality, so the hit rate of the instruction cache is usually higher; the data addresses generated when data access instructions are executed have poorer locality, resulting in a low data cache hit rate.
  • However, the data addresses of data access instructions located in loop code follow a certain pattern.
  • Specifically, the difference between consecutive data addresses of such an instruction is a constant, which can be positive, negative, or zero.
  • This constant is the data step size (stride) of the corresponding data access instruction.
  • By recording the data addresses of two successive executions of the same data access instruction, the stride is obtained by subtracting the earlier data address from the later one; adding the stride to the later data address yields the predicted data address for the next execution of the instruction. In this way, the corresponding data can be prefetched from external memory into the data cache ahead of time according to the predicted data address.
  • When the instruction is actually executed, if the actual data address equals the predicted one, the data cache must hit; if the two are not equal, whether the data cache hits is determined by the actual data address.
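The prediction rule described in the bullets above can be sketched in a few lines (a minimal illustration, not from the patent text; the function name is mine):

```python
def predict_next(prev_addr, last_addr):
    """Classic stride prediction: the stride is the difference between the
    two most recent data addresses of the same data access instruction,
    and the next address is predicted by adding it to the latest one."""
    stride = last_addr - prev_addr   # may be positive, negative, or zero
    return last_addr + stride
```

For a load walking a 4-byte array, `predict_next(0x1000, 0x1004)` yields `0x1008`, and prefetching that line before the next iteration turns a likely miss into a hit.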
  • Although the stride is fixed in most cases, it can still change in some cases.
  • For example, for a data access instruction inside nested loops, moving from an inner loop to an outer loop can add an extra increment to the data address, so that the difference between the current data address and the previous one is no longer the previous stride.
  • When execution returns to the inner loop, the increment of the data address returns to its previous value. In such cases, if only a single stride is recorded, an incorrect predicted data address is generated whenever the loop level changes, limiting the improvement in the data cache hit rate.
  • The method and system proposed by the present invention directly address one or more of the above or other difficulties.
  • The present invention proposes a data address generation system and method that can generate a data address ahead of the processor core, access the data memory, and read the data for processing by the processor core.
  • The data address generation system and method learn the address of each data access instruction executed by the processor core and the corresponding data address its execution generates, as well as the increment between the data addresses produced by two successive executions of the same data access instruction, and store them in a step size table.
  • The step size table combines a one-dimensional and a two-dimensional data structure.
  • The one-dimensional structure is addressed by the instruction address of the data access instruction; its content is the data address.
  • One dimension of the two-dimensional structure is addressed by the instruction address of the data access instruction, and the other dimension is addressed by the instruction address of a branch instruction; its content is the data address increment (step size).
  • Such a step size table maps the instruction address of a data access instruction to the corresponding data address. The mapping is not fixed, but dynamic, varying with the number of executions of the data access instruction and its loop path.
  • The system and method take the instruction address of a taken backward branch instruction as an upper limit and its branch target instruction address as a lower limit, and, according to the loop state indicated by the current branch, automatically increment the data addresses corresponding to the instruction addresses between the lower and upper limits, access the data memory with the updated data addresses, and read the corresponding data for the processor core.
  • To this end, the present invention provides a data address generation system, wherein:
  • the data address generation system learns and records, indexed by the data access instruction address, the data address generated when the processor core executes the data access instruction, and stores in the step size table the data address increment between two executions of that instruction;
  • the data address generation system addresses the step size table contents with the data access instruction address to generate a new data address, accesses the data memory, and obtains the data for use by the processor core.
  • The content of an entry in the step size table is the data address; the step size table is addressed by the data access instruction address.
  • The content of an entry in the step size table is the data address increment; one dimension of the step size table is addressed by the data access instruction address, and the other dimension is addressed by the instruction address of the taken backward branch.
  • The data address generation system generates an address as follows:
  • the data address generation system adds the data address and the data address increment stored in the step size table to generate a new data address; the new data address is stored back into the step size table.
  • The data address generation system looks up the step size of a data access instruction whose instruction address lies between the address of a taken backward branch instruction and its branch target address; the system generates a new data memory address from the step size table contents, accesses the data memory with that address, fetches the data for processing by the processor core, and stores the new data address back into the step size table.
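The step size table just claimed can be modelled in software roughly as follows (a behavioural sketch under my own naming, not the patent's hardware):

```python
class StepTable:
    """One-dimensional part: load PC -> last data address.
    Two-dimensional part: (load PC, taken backward-branch PC) -> stride."""

    def __init__(self):
        self.addr = {}   # content of the 1-D structure: data addresses
        self.step = {}   # content of the 2-D structure: address increments

    def learn(self, load_pc, branch_pc, data_addr):
        """Record the increment between two executions of the same load."""
        if load_pc in self.addr:
            self.step[(load_pc, branch_pc)] = data_addr - self.addr[load_pc]
        self.addr[load_pc] = data_addr

    def predict(self, load_pc, branch_pc):
        """Stored address + stride -> new address, stored back to the table."""
        key = (load_pc, branch_pc)
        if load_pc not in self.addr or key not in self.step:
            return None  # nothing learned yet for this load on this loop path
        new_addr = self.addr[load_pc] + self.step[key]
        self.addr[load_pc] = new_addr
        return new_addr
```

Because the stride is keyed by the branch PC as well as the load PC, the same load can carry a different stride for each loop level, which is the dynamic mapping the text describes.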
  • The present invention further provides a data address generation method, comprising the following steps:
  • the step size table contents are addressed by the data access instruction address to generate a new data address with which the data memory is accessed, and the data is obtained for use by the processor core.
  • The content of an entry in the step size table is the data address; the step size table is addressed by the data access instruction address.
  • The content of an entry in the step size table is the data address increment; one dimension of the step size table is addressed by the data access instruction address, and the other dimension is addressed by the instruction address of the taken backward branch.
  • The data address generation system generates an address as follows:
  • the data address generation system adds the data address and the data address increment stored in the step size table to generate a new data address; the new data address is stored back into the step size table.
  • The data address generation system looks up the step size of a data access instruction whose instruction address lies between the address of a taken backward branch instruction and its branch target address; the system generates a new data memory address from the step size table contents, accesses the data memory with that address, fetches the data for processing by the processor core, and stores the new data address back into the step size table.
  • The system and method of the present invention can generate a data address in advance, fetch the data from the data memory before the processor core executes the data read instruction, and send it to the processor core; the processor core can then access the data directly when it needs it, which masks the latency of accessing the data memory and hides data cache misses.
  • The system and method of the present invention can automatically learn and record the data addresses generated by the processor core and their increments under different instruction loops, and automatically adjust the increment used to generate the data address according to the loop level, making the generated data addresses more accurate.
  • Figure 1 is a schematic diagram of data access instructions in loops according to the present invention.
  • Figure 5 is a block diagram of a processor system using the data address generation system of the present invention.
  • A data access instruction may lie in a multi-level instruction loop; each time the same loop level is executed the corresponding stride is the same, but when different loop levels are executed the corresponding strides differ.
  • For example, for a data access instruction inside a two-level loop, each inner-loop iteration increases the data address by '4' (the stride is '4'), but each outer-loop iteration increases the data address by '20' (the stride is '20').
  • Whether '4' or '20' is used as the instruction's stride, a certain number of data address mispredictions result.
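The misprediction count can be made concrete with a small simulation (the trace assumes four inner iterations per outer pass, an illustrative choice not fixed by the text):

```python
def trace(outer=3, inner=4, inner_stride=4, outer_stride=20, base=0):
    """Address trace of a load in a two-level loop: +4 per inner
    iteration, +20 across each outer-loop boundary."""
    addrs, a = [], base
    for _ in range(outer):
        for _ in range(inner):
            addrs.append(a)
            a += inner_stride
        a += outer_stride - inner_stride  # boundary step is +20, not +4
    return addrs

def mispredicts(addrs):
    """Wrong guesses of a predictor that remembers only the last stride."""
    wrong, stride = 0, None
    for prev, cur in zip(addrs, addrs[1:]):
        if stride is not None and prev + stride != cur:
            wrong += 1
        stride = cur - prev
    return wrong
```

With a single recorded stride the predictor is wrong at every loop-level change; here `mispredicts(trace())` is 4, i.e. two wrong guesses per outer-loop boundary (the boundary itself plus the recovery iteration).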
  • According to the relationship between branch instructions and the data access instruction, different strides can be assigned to the same data access instruction at different loop levels, making data address prediction more accurate.
  • FIG. 1 is a schematic diagram of data access instructions in loops according to the present invention.
  • In FIG. 1 the instructions are arranged from left to right in address order; instructions 11, 12, and 13 are all data access instructions, and instructions 21, 22, and 23 are all backward-jumping branch instructions, so the instructions between each of the three branch instructions and its branch target form a loop.
  • Together they form a three-level nested loop, in which the loop corresponding to branch instruction 21 is the innermost and the loop corresponding to branch instruction 23 is the outermost.
  • Each data access instruction in this code can be given a dedicated loop step storage module to provide different strides when different loop levels are executed.
  • FIG. 2 is an embodiment of a loop step storage module according to the present invention.
  • This embodiment describes how strides are provided by the loop step storage module; how the strides are stored into the module is further explained in the embodiment of FIG. 3.
  • In this example the loop step storage module corresponds to a three-level loop and consists of registers 31, 32, and 33 and selectors 41, 42, and 43.
  • Register 31 and selector 41 correspond to the first-level (innermost) loop;
  • register 32 and selector 42 correspond to the second-level loop;
  • register 33 and selector 43 correspond to the third-level (outermost) loop.
  • Each data access instruction whose data address is predicted corresponds to one loop step storage module as shown in FIG. 2.
  • Register 31 stores the stride and valid bit corresponding to execution of the first-level loop (i.e., the loop of branch instruction 21); register 32 stores the stride and valid bit corresponding to execution of the second-level loop (the loop of branch instruction 22); register 33 stores the stride and valid bit corresponding to execution of the third-level loop (the loop of branch instruction 23).
  • The registers in each loop step storage module can store different values, so each data access instruction can have a different stride when it is located in loops of different levels.
  • The initial value of the valid bits is '0'.
  • The occurrence of the three loop levels is prioritized. Once the first-level loop occurs, the stride corresponding to it (the value of register 31) is output regardless of whether the second- and third-level loops occur. The second-level loop is considered only when the first-level loop does not occur; once it occurs, the stride in register 32 is output regardless of the third-level loop. Similarly, the third-level loop is considered only when neither the first- nor the second-level loop occurs; once it occurs, the stride in register 33 is output.
  • If none of the loops occurs, the default stride carried on bus 35 is output.
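The priority rule described above amounts to a priority encoder; a sketch (signal names are mine, the register/branch numbering follows FIG. 1 and FIG. 2):

```python
def select_stride(registers, taken, default):
    """registers: (valid, stride) pairs per loop level, innermost first,
    mirroring registers 31/32/33; taken: branch-taken flags for the
    corresponding branches 21/22/23; default: the bus-35 value."""
    for (valid, stride), branch_taken in zip(registers, taken):
        if branch_taken:                    # innermost occurring loop wins
            return stride if valid else default
    return default                          # no loop occurred: default stride
```

For example, `select_stride([(True, 4), (True, 20), (False, 0)], [True, False, False], 0)` returns 4: the first-level loop occurred, so the second- and third-level registers are ignored.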
  • Each selector is controlled by the branch decision signals, output by the processor core, that indicate whether branch instructions 21, 22, and 23 take their branches; these signals select which stride is output.
  • Initially, branch instructions 21, 22, and 23 have not yet been executed, so none of the corresponding branches has been taken; the selection signals of selectors 41, 42, and 43 in each loop step storage module are all '0', and each module outputs the default stride from bus 35.
  • When branch instruction 21 first takes its branch, the first-level loop is entered and data access instructions 12 and 13 are executed again, i.e., for the second time. At this point the valid bit read from register 31 is '0', so the stride stored in register 31 is invalid. The stride is therefore computed now and stored into register 31 under control of the branch decision signal of branch instruction 22, and the valid bit in register 31 is set to '1'.
  • When branch instruction 21 takes its branch again, the selection signal of selector 41 in the loop step storage modules corresponding to data access instructions 12 and 13 is '1', and those modules output the valid bit and stride from their respective registers 31. Since the valid bit is now '1', the stride can be used to compute the corresponding predicted data address. At the same time, the stride is recomputed as described above and stored into register 31 under control of the branch decision signal of branch instruction 22, the valid bit remaining '1'. Thus, if the stride between two executions of the same data access instruction is unchanged, the value in register 31 does not change; if the stride has changed, the value in register 31 is updated to the new stride.
  • Similarly, when the second-level loop is executed, the loop step storage modules corresponding to instructions 11, 12, and 13 each output the stride in register 32 to compute the corresponding predicted data addresses.
  • In this way, when different loop levels are executed, different register values in the loop step storage module are used as the stride.
  • Note that when branch instruction 21 takes its branch, the selection signal of selector 41 is '1' and the stride in register 31 is output; but since data access instruction 11 is not inside the first-level loop, that stride is ignored for instruction 11.
  • When a higher-level loop is executed, the stride in register 32 or 33 is output as described above, or the default stride from bus 35 is output, thereby providing different strides for different loops.
  • Each loop step storage module that provides strides for data access instructions in loops corresponds to one data access instruction. The module can be extended by increasing the number of registers and selectors (each register paired with a selector for one loop level) to support loops with more levels; a loop step storage module is then provided for each (or some) of the data access instructions in those deeper loops, providing more accurate strides for all (or some of) the data access instructions according to the loop execution situation.
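The extension to more loop levels can be sketched as a module parameterized on the level count (a behavioural model of my own, not RTL; the `default` argument stands in for the bus-35 value):

```python
class LoopStepModule:
    """One module per data access instruction; one (stride, valid) register
    pair plus one selector per loop level, as in FIG. 2 but for n levels."""

    def __init__(self, n_levels=3, default=0):
        self.valid = [False] * n_levels
        self.stride = [0] * n_levels
        self.default = default          # stands in for the bus-35 value

    def store(self, level, stride):
        """Write the stride for one level (branch-decision controlled)."""
        self.stride[level] = stride
        self.valid[level] = True

    def read(self, taken):
        """taken[i]: was the level-i backward branch just taken?
        The innermost taken level wins; invalid or no level -> default."""
        for i, branch_taken in enumerate(taken):
            if branch_taken:
                return self.stride[i] if self.valid[i] else self.default
        return self.default
```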
  • FIG. 3 is an embodiment of a data address generation system according to the present invention.
  • It includes a memory array 52 whose rows correspond to a plurality of data access instructions, a column decoder 50 that decodes according to branch decision results, a row decoder 54 that decodes according to the instruction addresses of data access instructions, and an address generator 60.
  • The address generator 60 is composed of a subtractor 61, an adder 62, a selector 63, and a comparator 64.
  • Each row of memory array 52 corresponds to one data access instruction.
  • For each row of 52, row decoder 54 provides a register that stores the instruction address of the data access instruction corresponding to that row, and a comparator that compares the register contents with the instruction address on data access instruction address bus 53.
  • Array 52 has two read/write ports. One port pair (37, 38) is dedicated to the rightmost column 58 of 52; access to this column is controlled only by row decoder 54, and it stores the corresponding data address 67 and valid bit 68 of the data access instruction.
  • The other port pair (34, 36) is shared by the columns of 52 other than column 58 (such as column 56), and through it the entry at the row selected by row decoder 54 and the column selected by column decoder 50 is accessed.
  • The number of columns of array 52 other than column 58 corresponds to the maximum number of loop levels the step size storage array can support.
  • For each of these columns, a comparator and an address register are provided in column decoder 50, the register storing the instruction address of a branch instruction.
  • When the processor core makes a 'branch taken' decision, the instruction address of the corresponding branch instruction is sent on bus 51 to 50 for matching; the column of array 52 corresponding to the register the match hits (column 58 excepted) can then be accessed through read/write ports 34 and 36.
  • Assume the processor core executes instructions in order, starting from the leftmost instruction in Figure 1 and proceeding to the right.
  • When the processor core decodes instruction 11, it finds it to be a data load instruction, and the instruction address of instruction 11 is sent to row decoder 54 via bus 53 for matching.
  • The match misses, so the row replacement logic in 54 allocates a row for instruction 11 (the top row 11 of array 52 in FIG. 3) and stores the instruction address of instruction 11 into the corresponding register in 54.
  • The valid bits 66 and 68 of every column in that row are set to '0'.
  • Since the valid bit 68 in column 58 is '0', the data address generation system lets the processor core execute data load instruction 11, generate a data address, and access the data memory (such as a data cache); the read data is used by the processor core. At the same time, the system stores that data address, via bus 57, selector 63, and write port 38, into the 67 field of row 11, column 58 of array 52, and sets the valid bit 68 of the entry to '1'. Thereafter, as the processor core executes subsequent instructions, the row replacement logic of 54 likewise allocates one row for each of data load instructions 12 and 13 (the middle and bottom rows 12 and 13 of array 52 in FIG. 3); the data addresses generated when the processor core executes instructions 12 and 13 are stored into the entries at rows 12 and 13 of column 58, respectively, and the valid bits 68 of those entries are set to '1'.
  • Next, the processor core executes branch instruction 21 and its branch decision is 'taken'. The processor core therefore jumps backward to the branch target instruction located between instruction 11 and instruction 12 and executes from that instruction in order.
  • At the same time, the instruction address of branch instruction 21 is sent via bus 51 to column decoder 50 for matching. The match misses, so the column replacement logic of 50 allocates a column of array 52 (column 21 in the figure) and stores the instruction address of branch instruction 21 into the corresponding register in 50. The address of the most recent taken branch is always held on bus 51; the address now on 51 matches the address in the register corresponding to column 21 in 50, so column decoder 50 selects column 21.
  • The processor core continues in program order and executes instruction 12 again; decoding instruction 12 finds it to be a data load instruction, so its instruction address is sent via bus 53 to row decoder 54 for matching. This time the match hits: the valid bit 68 read from row 12, column 58 of 52 through read port 37 is '1', and the valid bit 66 read from row 12, column 21 of 52 through read port 36 is '0'. Based on the two valid bits '10', the system lets the processor core execute data load instruction 12, generate a data address, and access the data memory; the read data is used by the processor core. At the same time, the system sends that data address to address generator 60 via bus 57.
  • The subtractor 61 subtracts, from the address on 57, the previous data address 67 of row 12, column 58 read out through read port 37 of array 52, and the difference, the address increment (step size), is written into array 52 through write port 34.
  • Since the address on bus 53 is that of data load instruction 12 and the address on bus 51 is that of branch instruction 21, the step size is stored into the step size field 65 of the entry at row 12, column 21 of array 52.
  • The system also sets the valid bit 66 of row 12, column 21 to '1'. In the same way, the system stores the step size of data load instruction 13 into field 65 of row 13, column 21 of 52 and sets the valid bit 66 of row 13, column 21 to '1'.
  • Then the processor core executes branch instruction 21 again, and its branch decision is again 'taken'. The processor core therefore jumps backward to the branch target instruction between instruction 11 and instruction 12 and executes from that instruction in order.
  • At the same time, the instruction address of branch instruction 21 is sent via bus 51 to column decoder 50 for matching. This match hits, so column decoder 50 selects column 21.
  • The processor core executes in program order, reaches instruction 12 again, and finds it to be a data load instruction when decoding it. Its instruction address is sent via bus 53 to row decoder 54 for matching. This time the match hits: the valid bit 68 read from row 12, column 58 of 52 through read port 37 is '1', and the valid bit 66 read from row 12, column 21 through read port 36 is '1'. According to the two valid bits '11', the system uses adder 62 to add the data address 67 of row 12, column 58 read through read port 37 and the step size 65 of row 12, column 21 read through read port 36.
  • The sum, on 38, accesses the data memory as the new data address, and the read data is sent to the processor core for processing.
  • Comparator 64 compares the data address on 57, generated by the processor core executing instruction 12, with the data address on 38. If they are equal, the processor core continues executing subsequent instructions in program order; the sum from adder 62 is also written back via write port 38 into row 12, column 58 and stored in 67, and the valid bit 68 of row 12, column 58 remains '1'.
  • If they are not equal, the system makes the processor core discard the intermediate execution results based on the data obtained from the address on 38, execute the load of instruction 12 with the data obtained from the address on 57, and then execute subsequent instructions in order.
  • At the same time, the system controls selector 63 to select the address on 57 and store it via write port 38 into the entry at row 12, column 58, keeping the valid bit 68 of that entry at '1'; but the valid bit 66 of row 12, column 21 is set to '0', recording that the step size in that entry is invalid and must be re-learned.
  • The system also processes instruction 13 in the same way.
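The check-and-recover step just described can be condensed into a sketch (the `row` dictionary is my own representation of one array-52 row: `addr` holds fields 67/68 of column 58, and `cols` maps a branch PC to fields 65/66):

```python
def verify(row, branch_pc, predicted, actual):
    """Compare the predicted address (bus 38) with the one the core
    generated (bus 57); on mismatch, store the real address back and
    clear valid bit 66 so the stride is re-learned."""
    if predicted == actual:
        row["addr"] = (predicted, 1)         # 67/68 keep the confirmed value
        return True
    row["addr"] = (actual, 1)                # store the real address into 67
    stride, _ = row["cols"].get(branch_pc, (0, 0))
    row["cols"][branch_pc] = (stride, 0)     # 66 -> '0': stride invalid
    return False
```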
  • Then the processor core executes branch instruction 21 again, and this time the branch decision is 'not taken', so the processor core continues in instruction order. Next it executes branch instruction 22, whose branch decision is 'taken'; the processor core jumps backward to the branch target instruction before instruction 11 and executes from that instruction in order.
  • At the same time, the instruction address of branch instruction 22 is sent via bus 51 to column decoder 50 for matching. The match misses, so the column replacement logic of 50 allocates a column of array 52 (column 22 in the figure) and stores the instruction address of branch instruction 22 into the corresponding register in 50. The address now held on bus 51 matches the address in the register corresponding to column 22 in 50, so column decoder 50 selects column 22.
  • The processor core executes in program order, reaches instruction 11 again, and finds it to be a data load instruction when decoding it. The instruction address of instruction 11 is sent via bus 53 to 54 for matching, and this match hits: the valid bit 68 read from row 11, column 58 of array 52 through read port 37 is '1', and the valid bit 66 read from row 11, column 22 of 52 through read port 36 is '0'. The system therefore operates for the case where the data address 67 is valid but the step size 65 is invalid: the data address generated by the processor core reads data from the data memory for the processor core to process, and from the data address sent via bus 57
  • the data address 67 of row 11, column 58 read out through read port 37 is subtracted; the difference is stored as a step size into field 65 of row 11, column 22 through the write port, and 66 is set to '1'.
  • At the same time, the system stores the data address on 57 into the 67 field of the entry at row 11, column 58, keeping the valid bit 68 of that entry at '1'.
  • The system processes instructions 12 and 13 in the same way. Thereafter, operation repeats as described above.
  • In summary, a row of array 52 stores the data memory address of one data access instruction along with its step sizes and the corresponding valid bits; a column of 52 stores the step size and valid bit of each data access instruction associated with one taken branch instruction.
  • The special column 58 stores the data memory addresses, and its reads and writes are not affected by the state of any branch instruction.
  • The system selects a row of array 52 via bus 53 using the instruction address of the data access instruction and reads the valid bit 68 of column 58; it selects a column via bus 51 using the branch instruction address of the most recent taken branch and reads the valid bit 66 there. According to the states of valid bits 68 and 66 in a row, the system has the following three modes of operation.
  • When valid bit 68 is '0', the system stores the data address 57 generated by the processor core into field 67 of column 58 of the row and sets the state to '10',
  • meaning the data address is valid but the step size is invalid.
  • When the state is '10', the system computes the difference between the data address 57 generated by the processor core and the data address stored in column 58 of the row, saves it into field 65 of the currently selected column, and sets the state to '11',
  • meaning both the data address and the step size are valid.
  • When the state is '11', the system uses the data address in field 67 of column 58 and the step size in field 65 of the column selected by the branch decision
  • to generate a data address 38, access the data memory, and read the data for processing by the processor core.
  • At the same time, the system compares the data address 57 generated by the processor core with data address 38, takes corrective action as needed according to the comparison result, and updates the state of 66.
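The three modes can be pulled together in one dispatcher keyed on the valid-bit pair (a self-contained sketch; `row["addr"]` models fields 67/68 of column 58 and `row["cols"]` the per-branch fields 65/66):

```python
def step_table_access(row, branch_pc, core_addr):
    """Returns the predicted address in state '11', else None (the core's
    own address is used while the table is still learning)."""
    addr, v68 = row["addr"]
    stride, v66 = row["cols"].get(branch_pc, (0, 0))
    if not v68:                         # 68 = '0': first sight of this load
        row["addr"] = (core_addr, 1)    # learn the address, state -> '10'
        return None
    if not v66:                         # state '10': learn the stride
        row["cols"][branch_pc] = (core_addr - addr, 1)   # state -> '11'
        row["addr"] = (core_addr, 1)
        return None
    return addr + stride                # state '11': adder 62's prediction
```

The returned prediction would then be checked against the core's address (comparator 64) and the state corrected on a mismatch, as the text describes.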
  • The above system, which fetches data in advance based on the step size, can be used in several ways.
  • The first way: when the data read according to the data address 38 generated from a row of array 52 is accepted by the processor core (the address on bus 38 equals the address on bus 57), the system adds the address on 38 to the step size read through port 36
  • and sends the sum to the data cache for matching as a guessed data address. If it does not match, a fill of the corresponding data from the lower memory level into the cache can be started.
  • The processor core still generates data addresses via bus 57 to read data from the data cache. This approach can partially or completely hide data cache misses in most cases.
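This first usage mode amounts to a next-iteration prefetch probe; a sketch with an explicitly assumed toy cache model (a dict of line tags — the patent does not fix the cache organization):

```python
def prefetch_on_confirm(cache, lower_level, confirmed_addr, stride, line=64):
    """After a prediction is confirmed, probe the cache with
    address + stride as the guessed address and start a fill from
    the next memory level on a miss."""
    guess = confirmed_addr + stride
    tag = guess // line                     # tag of the line holding the guess
    if tag not in cache:                    # guessed address misses the cache
        cache[tag] = lower_level.get(tag)   # begin the fill early
    return guess
```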
• Note that the guessed data address uses a step size selected based on the result of the last successful branch and is issued before the next successful branch, whereas the processor core executes the same data load instruction and reads the data only after the next successful branch.
• The last two successful branches are not necessarily the same branch instruction, so the step size used to generate the guessed data address does not necessarily coincide with the increment of the data address actually generated by the processor core.
• The second way: after a successful branch, for the rows of previously executed data load instructions, the system reads from array 52 the entries in the column selected by the successful branch and in column 58. If the entry state 68, 66 is '11', a data address 38 is generated, and the data is read from the data memory and stored in a data read buffer with a shorter read latency for the processor core to fetch.
  • FIG. 4 is an embodiment of a step size table decoder according to the present invention.
• FIG. 4 shows the decode logic for one row of array 52 in row decoder 54, where 72 and 74 are registers, 76 is a selector, and 78 is a comparator.
• The output of register 72 is coupled to one input of comparator 78, and the output of selector 76 is coupled to the other input of comparator 78.
• Comparator 78 supports three comparison modes in total: greater-than-or-equal, less-than, and equal.
• The comparison mode of comparator 78 is linked to the selection made by selector 76.
• When selector 76 selects bus 73 as input, comparator 78 checks whether the address stored in register 72 is greater than or equal to the address on bus 73; when selector 76 selects bus 51 as input, 78 checks whether the address stored in register 72 is less than the address on bus 51; when selector 76 selects bus 53 as input, 78 checks whether the address stored in register 72 is equal to the address on bus 53. Signal 75 is the comparison result output by comparator 78.
• Bus 53 carries the data load instruction address output by the processor core.
• This address is compared for equality, by comparator 78 of each row, with the contents of register 72 in all rows of row decoder 54. If there is no match, the replacement logic of 54 allocates a row for the data load instruction and stores the upper part of the address on 53 into register 72 of that row. The same address will thereafter match the address in register 72 of that row, enabling the word line of the row; this operation has been detailed in the embodiment of FIG.
• On a successful branch, the system places the instruction address of the branch instruction on bus 51, and places the instruction address of that instruction's branch target on bus 73.
• The system compares the contents of registers 72 in all rows of 54 against the address on bus 73 (greater-than-or-equal) and the address on bus 51 (less-than) in turn, and stores the comparison result 75 into register 74.
• If the data load instruction corresponding to a row lies between the branch target instruction (inclusive) and the branch instruction (not included), register 74 of that row is written with the comparison result '1'.
• A row whose comparison result does not satisfy the above condition is not between the branch target instruction and the branch instruction. For example, instruction 11 in FIG. 1 is not in the loop of branch instruction 21, so register 74 of its row is written with the comparison result '0'.
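The range test that decides the value written into register 74 can be stated compactly. In this sketch the function name and arguments are invented; the bounds follow the text, with the branch target instruction included and the branch instruction excluded:

```python
def in_loop(load_addr, target_addr, branch_addr):
    """True when a data load instruction lies inside the loop body.

    For a backward branch, target_addr < branch_addr, and the loop body is
    the half-open address range [target_addr, branch_addr).
    """
    return target_addr <= load_addr < branch_addr
```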
• The system then sequentially enables the word lines of all rows whose register 74 is '1', and for each such row reads from array 52 the contents of the entries in column 58 and in the column selected via column decoder 50 by the branch instruction address on bus 51 at this time.
• The system adds the step size in field 65 to the previous iteration's data address in field 67 to obtain a new data address, addresses the data memory with it via bus 38, and reads the data for use by the processor core.
• The new data address is also written back to field 67.
• The system does not operate on rows whose register 74 content is '0'.
• Thereafter, the instruction address of a data load instruction sent via bus 53 is compared for equality with the contents of each register 72 in row decoder 54.
• The system checks the contents of register 74 in the matching row. If the content is '0', the system reads states 68 and 66 in the row and operates according to the state as described above. If the content of register 74 is '1', the system reads the data address in field 67 of the row and the data address sent by the processor core via bus 57. If the two addresses are equal, the system sets register 74 of the row to '0' with no further operation; if the two addresses are not equal, the system likewise sets register 74 to '0' and performs the operations for the case in which the address in field 67 does not equal the address on bus 57.
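The check on register 74 described above can be sketched as follows. The names (`flag74`, `addr67`) stand for register 74 and field 67, and the two callbacks stand for the normal learn/predict path and the corrective mismatch path; all of them are invented for illustration:

```python
def on_load_executed(row, core_addr, handle_normal, handle_mismatch):
    """Process a data load whose address matched this row's register 72.

    A flag of '1' means the row's data address was already issued ahead of
    time; a matching field-67 address then needs no further action.
    """
    if not row['flag74']:
        handle_normal(row, core_addr)        # ordinary state-machine path
    else:
        row['flag74'] = False                # consume the preissue flag
        if row['addr67'] != core_addr:
            handle_mismatch(row, core_addr)  # the early address was wrong
```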
  • FIG. 5 is a block diagram of a processor system using the data address generation system of the present invention.
• 50 is the column decoder of the step size table;
• 52 is the step size table array;
• 54 is the row decoder of the step size table;
• 60 is the address generator;
• 80 is the processor core;
• 82 is the data read buffer;
• 84 is the data memory.
  • 51 is the branch instruction address bus of the successful branch, which is output from the processor core 80 to the column decoder 50 and the row decoder 54.
  • 73 is the branch target instruction address bus, which is output from the processor core 80 to 54.
• 53 is the data access instruction address bus, which is output from processor core 80 to 54.
  • 57 is a data address bus, which is output by processor core 80 to address generator 60 and data memory 84.
  • 38 is a data address bus that is output by address generator 60 to data memory.
• Bus 85 outputs data from data memory 84 to data read buffer 82 for temporary storage, and bus 87 outputs data from 82 to processor core 80.
  • the number of columns in the step table (50, 52, 54) determines the number of loop layers or loops that it can handle.
  • the number of rows in the step table determines the number of data access instructions it can process.
• The step size table is a one-dimensional data structure (column 58) plus a two-dimensional data structure.
  • the one-dimensional data structure is addressed by the instruction address of the data access instruction, and the content of the data structure is the data address.
  • One dimension of the two-dimensional data structure is addressed by the instruction address of the data access instruction, and the other dimension is addressed by the instruction address of the branch instruction, and the content of the data structure is the data address increment (step size).
• Such a step size table maps the instruction address of a data access instruction to the corresponding data address. The mapping is not fixed, but varies dynamically with the number of executions of the data access instruction and its loop path.
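This dynamic mapping can be modeled in software as two dictionaries: one keyed by the load instruction address alone (the last data address), one keyed by the pair of load and branch instruction addresses (the step size). This is an illustrative sketch of the data structure only, not the hardware organization:

```python
class StepTable:
    """Software model of the one-dimensional plus two-dimensional step table."""

    def __init__(self):
        self.addr = {}    # load PC -> last data address (column 58)
        self.step = {}    # (load PC, branch PC) -> step size (other columns)

    def update(self, load_pc, branch_pc, data_addr):
        """Record a data address; derive the per-branch step on re-execution."""
        if load_pc in self.addr:
            self.step[(load_pc, branch_pc)] = data_addr - self.addr[load_pc]
        self.addr[load_pc] = data_addr

    def next_addr(self, load_pc, branch_pc):
        """Predicted next data address for this load along this branch path."""
        s = self.step.get((load_pc, branch_pc))
        return None if s is None else self.addr[load_pc] + s
```

Because the step size is keyed by the branch as well, the same load instruction carries one stride per loop level.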
• The system allocates a one-dimensional storage resource (a row) in the step size table via bus 53 using the data access instruction address provided by processor core 80, and fills the allocated row with the corresponding data address provided on bus 57 as its initial content.
• A resource in the other dimension (a column) is allocated using the branch instruction address provided by 80 on bus 51, and the difference between the data address subsequently provided again on bus 57 and the initial content already in the step size table is stored in the corresponding entry.
• The system further transmits, via bus 51, the instruction address of the backward branch instruction of the successful branch as the upper limit, and the corresponding branch target instruction address via bus 73 as the lower limit, so that the step size table and address generator 60 automatically update the data addresses of the data access instructions whose instruction addresses lie between the lower and upper limits, each according to its own step size in the column selected by the state of the current branch loop. The data memory 84 is accessed via bus 38 with each updated data address, and the corresponding data is read and stored into data read buffer 82 before processor core 80 outputs the corresponding data address on bus 57.
• The data read buffer 82 may take the form of an address-matched read, in which each row of 82 has an entry for storing data and an entry for storing the corresponding data address.
• The processor core 80 sends the data address via bus 57 to 82, and the data in the data entry whose address entry matches the address on 57 is sent to processor core 80 via bus 97.
  • the row permutation logic can use a replacement method such as LRU (least recently used).
  • the column replacement method can also use a form such as LRU.
  • Another form of data read buffer 82 may be a first in first out (FIFO).
• In this form, the row allocation logic in row decoder 54 strictly allocates row resources in order of increasing row number.
• For the code of FIG. 1, rows are allocated sequentially in the order of the addresses of instructions 11, 12, and 13.
• When the processor core 80 provides the lower limit and the upper limit to row decoder 54, the word lines are sequentially enabled from the lower limit to the upper limit in instruction-address order, so that the step size table and address generator 60 provide data addresses in instruction order.
• These data addresses access data memory 84, so that the read data is stored in the first-in first-out data read buffer 82 in instruction order.
• Each time a read request is provided to the first-in first-out buffer 82, 82 outputs one datum to 80.
• 82 is then a data queue arranged in instruction-address order.
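The FIFO form of data read buffer 82 behaves like a simple queue filled in instruction order, so no address matching is needed; a minimal sketch with assumed names:

```python
from collections import deque

class FifoReadBuffer:
    """FIFO form of the data read buffer: in-order fill, in-order drain."""

    def __init__(self):
        self.q = deque()

    def push(self, data):
        """Prefetched data arrives in instruction-address order."""
        self.q.append(data)

    def read(self):
        """One datum is handed to the core per read request."""
        return self.q.popleft()
```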
• This form of row replacement logic treats the step size table as a circular buffer: when the last row of the step size table has been allocated, the first row is allocated next, then the second row, and so on.
• The addresses stored in registers 72 allow the aforementioned mechanism to remain effective across this wrap-around. For example, in the embodiment of FIG. 1, suppose there are other data access instructions before the data access instructions 11, 12, and 13.
• Instruction 11 then obtains the lowermost row in the allocation of FIG. 3, and instructions 12 and 13 are sequentially assigned the uppermost and middle rows.
• The lower limit is then at the bottom row (instruction 11) and the upper limit at the middle row (instruction 13).
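The circular row allocation described above can be sketched as follows (class name assumed); after the last row is handed out, allocation wraps back to the first row:

```python
class CircularRowAllocator:
    """Allocate step-table rows in strictly increasing order, wrapping around."""

    def __init__(self, num_rows):
        self.num_rows = num_rows
        self.next = 0

    def allocate(self):
        row = self.next
        self.next = (self.next + 1) % self.num_rows   # wrap to row 0
        return row
```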
• A third way to use the above step-size-based data prefetching system is to combine the first and second ways.
• As in the second way, the data addresses of the data access instructions between the corresponding lower limit and upper limit are updated, the data memory is accessed with these updated addresses, the read data is stored in data read buffer 82 for use by processor core 80, and the updated data addresses are stored back into column 58 of array 52.
• As in the first way, a guessed data address is also generated by adding the step size selected by the current branch judgment to the data address and sent to the data cache; if the corresponding data is not already in the cache, it is prefetched into the cache to cover the cache miss, but the guessed data address is not stored into column 58 of array 52. The guessed data address is used to address the data memory so that the data required by the next execution of the same data load instruction is loaded into the data cache in advance, before the next branch judgment is produced.
• Data memory 84 can be implemented with a cache.
• Data caches typically have a tag unit in which all or part of the memory address is stored.
• The memory address is sent to the tag unit for matching, and a cache address is generated on a match (for example, in a multi-way set-associative cache, the way number combined with the index part and the block-offset part of the memory address), which is used to address the data store (data RAM) in the cache.
• The step size table disclosed in the present invention can directly store this cache address in field 67 of its column 58, so that the address sent via bus 38 can directly address the data store in the cache without going through the tag unit mapping.
• This requires the addresses of rows in the data store to be contiguous within a certain interval, so that address generator 60 can automatically compute the next data address in step-size increments; for example, several rows with consecutive addresses in the data store may be placed in the same way of a multi-way cache.
• When the cache address spans a non-contiguous address space, the cache address must be adjusted, for example by changing the way number.
• Alternatively, the memory address and its corresponding cache address can be stored simultaneously in the above field 67, with both addresses updated by the same step size.
• The cache address directly addresses the data store in the cache, while the memory address is compared with the memory address output by processor core 80 via bus 57 to confirm that the address generated by address generator 60 is correct.
• When the updated address leaves the contiguous region, the memory address in 67 is matched in the tag unit of the data cache to obtain a new cache address; within a contiguous address space, the cache address is updated incrementally to address the data store in the cache.
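A sketch of updating a paired (memory address, cache address) by the same stride, remapping through the tag unit only when the increment leaves the contiguous region. The 64-byte block size, the `tag_lookup` callback, and treating a single cache block as the contiguous unit are all assumptions for illustration:

```python
BLOCK_SIZE = 64   # assumed cache line size in bytes

def step_cache_addr(mem_addr, way, offset, stride, tag_lookup):
    """Advance a (memory address, way, block offset) pair by one stride.

    While the new offset stays inside the same block, the cache address can be
    updated by pure increment; otherwise the new memory address must be
    remapped through the tag unit (tag_lookup returns (way, offset)).
    """
    new_mem = mem_addr + stride
    new_off = offset + stride
    if 0 <= new_off < BLOCK_SIZE:            # still inside the same block/way
        return new_mem, way, new_off
    return (new_mem,) + tag_lookup(new_mem)  # remap via the tag unit
```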
• The above embodiments all take data load instructions as examples.
• The method and system of the present invention can also be applied to data store instructions. For example, the first way above can be applied to a write-back cache with a write-allocate policy: the address generator generates the memory address to be stored to in advance and reads the corresponding data from memory into the data cache, so that the processor core can store its data into the data cache without a cache miss.
• In essence, a step size table is a two-dimensional data structure in which one dimension is addressed with one parameter and the other dimension with another parameter to access an entry in the data structure. It could be addressed directly with the parameters; however, since the parameters are not contiguous, they can be compressed. As shown in FIG. 3, each register 72 in row decoder 54 in the embodiment of FIG. 4 functions like a tag unit in a fully associative cache, compressing the address space and the holes in array 52 (data access instructions account for about one-third of the total number of instructions, and their address space is not contiguous). From this point of view, the step size table is actually a structure similar to a fully associative cache.
• Each address register in column decoder 50 performs the same compression function (branch instructions account for about one-sixth of the total number of instructions, and their address space is not contiguous). The step size table can therefore be regarded as a two-dimensional fully associative compressed structure.
• The system and method proposed by the present invention can be used in various processor-related applications, including general-purpose processors, microcontrollers, multi-lane processors, artificial intelligence processors, big data processors, digital signal processors, graphics processors, etc., and can improve the efficiency of the processor.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Advance Control (AREA)
  • Memory System Of A Hierarchy Structure (AREA)

Abstract

An automatic learning, recording and generation method and system for data addresses. When the present invention is applied to the field of processors, the data required by an instruction is transmitted to the processor core in advance, ready for use before the processor core executes a data read instruction. In addition, the data address for the next execution of the instruction is predicted, and the corresponding data is loaded into a data cache to reduce cache misses.

Description

Data address generation system and method

The invention relates to the fields of computers, communications and integrated circuits.

The role of the cache in a processor system is to hold a copy of a portion of the contents of memory, so that this content can be quickly accessed by the processor core in a short time to ensure the continuous operation of the pipeline. Caches are generally divided into instruction caches and data caches. Instruction addresses have good locality, so the hit rate of the instruction cache is usually high; however, the data addresses generated when executing data access instructions have poorer locality, so the data cache hit rate is usually not high.

However, the data addresses of data access instructions located in loop code follow a certain pattern. Usually, each time a given data access instruction in a loop is executed, its corresponding data address is incremented by a constant (which can be positive, negative or zero). This constant is the data stride corresponding to that data access instruction. Obviously, for the data addresses of two consecutive executions of the same data access instruction, subtracting the earlier data address from the later one yields the data stride, and adding the data stride to the later data address yields the predicted data address for the next execution of that data access instruction. The corresponding data can thus be prefetched in advance from external memory into the data cache according to the predicted data address. When the data access instruction is next actually executed, if the actually generated data address equals the predicted data address, the data cache is certain to hit; if the two differ, whether the corresponding data hits in the data cache is determined according to the actual data address.
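The stride arithmetic described above reduces to two operations; a minimal sketch:

```python
def predict_next(prev_addr, curr_addr):
    """Stride = current address minus previous; predicted = current + stride."""
    stride = curr_addr - prev_addr
    return curr_addr + stride
```

The stride may be positive, negative or zero, and the prediction holds exactly as long as the instruction keeps stepping by the same constant.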
Although the data stride is fixed in most cases, it still changes in some situations. For example, with two nested loop levels, when the inner loop code is executed repeatedly, the executed code is the same, so the data address increment (the data stride) is usually also the same. But once the outer loop code is executed, the data address gains an extra increment, so that the difference between the data address and the previous data address is no longer the earlier data stride. When the inner loop is executed again, however, the data address increment returns to its previous value. In this case, if only one data stride is recorded, a wrong predicted data address is produced whenever the loop level changes, limiting the improvement of the data cache hit rate.

The method and system apparatus proposed by the present invention can directly address one or more of the above or other difficulties.

The present invention proposes a data address generation system and method that can generate a data address ahead of the processor core, access the data memory with it, and read the data for processing by the processor core. The data address generation system and method learns the addresses of the data access instructions executed by the processor core, the corresponding data addresses generated when the processor core executes those instructions, and the address increments between the data addresses generated by two executions of the same data access instruction, and stores them in a step size table.

The step size table is a one-dimensional plus a two-dimensional data structure. The one-dimensional data structure is addressed by the instruction address of the data access instruction, and its content is the data address. In the two-dimensional data structure, one dimension is addressed by the instruction address of the data access instruction, the other dimension is addressed by the instruction address of the branch instruction, and the content is the data address increment (step size). Such a step size table maps the instruction address of a data access instruction to the corresponding data address; the mapping is not fixed, but varies dynamically with the number of executions of the data access instruction and its loop path. The system and method further take the instruction address of the backward branch instruction of a successful branch as the upper limit and the corresponding branch target instruction address as the lower limit, automatically update the data addresses of the data access instructions whose instruction addresses lie between the lower and upper limits, incrementing each according to the state of the current branch loop, access the data memory with the updated data addresses, and read the corresponding data for the processor core.

It is an object of the present invention to provide a data address generation system and method that generate data addresses in advance to mask the latency of accessing the data memory and data cache misses.
To this end, the present invention provides a data address generation system, comprising:

a step size table for storing data addresses and address increments; the data address generation system learns and records, according to the data access instruction address, the data address generated by the processor core executing a data access instruction and the data address increment between two executions of that instruction, and stores them in the step size table;

the data address generation system addresses the contents of the step size table with the data access instruction address to generate a new data access address, accesses the data memory with it, and obtains data for use by the processor core.

Optionally, the content of an entry in the step size table is the data address; the step size table is addressed by the data access instruction address.

Optionally, the content of an entry in the step size table is the data address increment; one dimension of the step size table is addressed by the data access instruction address, and the other dimension is addressed by the address of the backward branch instruction of a successful branch.

Optionally, the data address generation system generates an address as follows:

the data address generation system adds the data address and the data address increment stored in the step size table to generate a new data address; the new data address is stored back into the step size table.

Optionally, the system is further configured such that:

the data address generation system accesses the step size table contents for data access instruction addresses lying between the backward branch instruction address of a successful branch and its branch target instruction address; the data address generation system generates a new data storage address according to the step size table contents; the data address generation system accesses the data memory with the new data storage address and obtains data for processing by the processor core; the new data address is stored back into the step size table.

The present invention also provides a data address generation method, comprising the following steps:

learning, according to the data access instruction address, the data address generated by the processor core executing a data access instruction;

learning, according to the data access instruction address, the data address increment between two executions of the same data access instruction by the processor core;

recording the above data address and data increment in a step size table;

addressing the step size table contents with the data access instruction address to generate a new data access address, accessing the data memory with it, and obtaining data for use by the processor core.

Optionally, the content of an entry in the step size table is the data address; the step size table is addressed by the data access instruction address.

Optionally, the content of an entry in the step size table is the data address increment; one dimension of the step size table is addressed by the data access instruction address, and the other dimension is addressed by the address of the backward branch instruction of a successful branch.

Optionally, the data address generation system generates an address as follows:

the data address generation system adds the data address and the data address increment stored in the step size table to generate a new data address; the new data address is stored back into the step size table.

Optionally, the method further comprises:

the data address generation system accesses the step size table contents for data access instruction addresses lying between the backward branch instruction address of a successful branch and its branch target instruction address; the data address generation system generates a new data storage address according to the step size table contents; the data address generation system accesses the data memory with the new data storage address and obtains data for processing by the processor core; the new data address is stored back into the step size table.
Those skilled in the art can also understand and appreciate other aspects of the present invention in light of the description, claims and drawings of the present invention.

Moreover, the system and method of the present invention can, before the processor core is about to execute a data read instruction, generate the data address in advance, read the data from the data memory, and send it to the processor core for its use, so that the processor core can take the data directly when it needs to read it, masking the latency of accessing the data memory and masking data cache misses.

Further, the system and method of the present invention can automatically learn and record the data addresses generated by the processor core and their increments under different instruction loops, and can automatically adjust, according to the level of the instruction loop, the increment used to generate data addresses, making the generated data addresses more accurate.

Other advantages and applications of the present invention will be apparent to those skilled in the art.
FIG. 1 is a schematic diagram of data access instructions in a loop according to the present invention;

FIG. 2 is an embodiment of the loop step size storage module according to the present invention;

FIG. 3 is an embodiment of the data address generation system according to the present invention;

FIG. 4 is an embodiment of the row decoder of the step size table according to the present invention;

FIG. 5 is a block diagram of a processor system using the data address generation system of the present invention.
本发明的最佳实施方式是附图3 。 The preferred embodiment of the invention is shown in Figure 3.
以下结合附图和具体实施例对本发明提出的数据缓存系统和方法作进一步详细说明。根据下面说明和权利要求书,本发明的优点和特征将更清楚。需说明的是,附图均采用非常简化的形式且均使用非精准的比例,仅用以方便、明晰地辅助说明本发明实施例的目的。The data cache system and method proposed by the present invention are further described in detail below with reference to the accompanying drawings and specific embodiments. Advantages and features of the present invention will be apparent from the description and appended claims. It should be noted that the drawings are in a very simplified form and all use non-precise proportions, and are only for convenience and clarity to assist the purpose of the embodiments of the present invention.
需说明的是,为了清楚地说明本发明的内容,本发明特举多个实施例以进一步阐释本发明的不同实现方式,其中,该多个实施例是列举式并非穷举式。此外,为了说明的简洁,前实施例中已提及的内容往往在后实施例中予以省略,因此,后实施例中未提及的内容可相应参考前实施例。It should be noted that the various embodiments of the present invention are further described to illustrate the various embodiments of the present invention in order to clearly illustrate the present invention. Further, for the sake of brevity of explanation, the contents already mentioned in the foregoing embodiment are often omitted in the latter embodiment, and therefore, contents not mentioned in the latter embodiment can be referred to the previous embodiment accordingly.
虽然该发明可以以多种形式的修改和替换来扩展,说明书中也列出了一些具体的实施图例并进行详细阐述。应当理解的是,发明者的出发点不是将该发明限于所阐述的特定实施例,正相反,发明者的出发点在于保护所有基于由本权利声明定义的精神或范围内进行的改进、等效转换和修改。同样的元器件号码可能被用于所有附图以代表相同的或类似的部分。Although the invention may be modified in various forms of modifications and substitutions, some specific embodiments of the invention are set forth in the specification and detailed. It should be understood that the inventor's point of departure is not to limit the invention to the particular embodiments set forth, but the inventor's point of departure is to protect all improvements, equivalent transformations and modifications based on the spirit or scope defined by the claims. . The same component numbers may be used in all figures to represent the same or similar parts.
Typically, a data access instruction may be located in a multi-level instruction loop. Each time the same loop level is executed, the corresponding data step size is the same; but when different loop levels are executed, the corresponding step sizes differ. For example, for a data access instruction located in a two-level loop, each iteration of the inner loop increases the data address by '4', i.e. the step size is '4'; but each iteration of the outer loop increases the data address by '20', i.e. the step size is '20'. In this case, taking either '4' or '20' as the single step size of the instruction causes a certain number of data address mispredictions. According to the technical solution of the present invention, based on the relationship between branch instructions and data access instructions, the same data access instruction can be given different step sizes for the different loop levels it lies in, making data address prediction more accurate.
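The two-level example above can be sketched in software as follows. This is an illustrative model only, not part of the claimed hardware; the function name, base address, and iteration counts are assumptions chosen to reproduce the '4'/'20' address pattern described.

```python
# Hypothetical sketch: data addresses touched by one load inside a two-level
# loop. Inner-loop iterations advance the address by 4; moving to the next
# outer iteration advances it by 20 relative to the last inner access.
def access_addresses(outer_n, inner_n, base=0, inner_stride=4, outer_stride=20):
    addrs = []
    addr = base
    for i in range(outer_n):
        for j in range(inner_n):
            addrs.append(addr)
            if j < inner_n - 1:
                addr += inner_stride   # step when the inner branch is taken
        addr += outer_stride           # step when only the outer branch is taken
    return addrs
```

For `access_addresses(2, 3)` the successive address differences are 4, 4, 20, 4, 4, so no single constant stride predicts every access correctly, which is the misprediction problem the per-level step sizes address.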
Please refer to FIG. 1, which is a schematic diagram of data access instructions inside loops according to the present invention. In FIG. 1, instructions are arranged from left to right in address order; instructions 11, 12, and 13 are all data access instructions, while instructions 21, 22, and 23 are all backward-jumping branch instructions. The instructions between each of these three branch instructions and its branch target instruction therefore form a loop. As shown in FIG. 1, three nested loop levels are formed, in which the loop corresponding to branch instruction 21 is the innermost loop and the loop corresponding to branch instruction 23 is the outermost loop. Each data access instruction in this code segment can then be given a dedicated loop step storage module, which provides a different data step size when a different loop level is executed.
Please refer to FIG. 2, which shows an embodiment of the loop step storage module of the present invention. For ease of description, this embodiment explains how step sizes are provided by the loop step storage module; how those step sizes come to be stored in the module is further explained in the embodiment of FIG. 3. In this embodiment, the loop step storage module corresponds to three loop levels and is composed of registers 31, 32, and 33 and selectors 41, 42, and 43. Register 31 and selector 41 correspond to the first loop level (the innermost loop), register 32 and selector 42 to the second level, and register 33 and selector 43 to the third level (the outermost loop).
In the present invention, every data access instruction whose data address is predicted corresponds to one loop step storage module as in FIG. 2. Taking the module for data access instruction 12 as an example (this instruction lies inside three loop levels): register 31 stores the step size and valid bit for execution of the first-level loop (the loop of branch instruction 21); register 32 stores the step size and valid bit for the second-level loop (the loop of branch instruction 22); and register 33 stores the step size and valid bit for the third-level loop (the loop of branch instruction 23). Since each data access instruction has its own loop step storage module, and the corresponding registers in each module can store different values, each data access instruction can use a different step size at each loop level it lies in. The initial value of every valid bit is '0'.
In this embodiment, the three loop levels have a priority relationship. Once the first-level loop is taken, the step size of the first level (the value of register 31) is output regardless of the second and third levels. Conversely, the second-level loop is entered only when the first-level loop is not taken; once the second-level loop is taken, the step size in register 32 is output regardless of the third level. Similarly, the third-level loop is entered only when neither the first nor the second level is taken, and once it is taken the step size in register 33 is output. If none of the three loops is taken (meaning this code segment is being executed for the first time, or executed again because an even more outer loop was taken), a default step size sourced from bus 35 is output. Thus, based on the loop step storage module of the FIG. 2 embodiment, the branch decision signals output by the processor core, which indicate whether the branches of branch instructions 21, 22, and 23 are taken, control the corresponding selectors, and the appropriate step size is output.
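The priority selection among the selectors can be modeled as follows. This is a software sketch under stated assumptions, not the claimed circuit: `taken[i]` stands for the branch decision signal of loop level i (innermost first), `regs[i]` for that level's register as a (valid, stride) pair, and `default` for the value on bus 35.

```python
# Hypothetical sketch of the priority chain in FIG. 2: the innermost taken
# loop level wins; if no level's branch was taken, the default stride is used.
def select_stride(taken, regs, default):
    for level_taken, (valid, stride) in zip(taken, regs):
        if level_taken:
            return (valid, stride)      # this level's register drives the output
        # branch of this level not taken: fall through to the next outer level
    return (True, default)              # no loop taken: default stride from bus 35
```

The returned valid flag mirrors the stored valid bit: a selected stride whose valid bit is '0' must not be used for prediction.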
For example, in the code of FIG. 1, the first time data access instructions 11, 12, and 13 are executed, branch instructions 21, 22, and 23 have not yet been executed, so none of the corresponding branches has been taken. The selection signals of selectors 41, 42, and 43 in the corresponding loop step storage modules are all '0', and each module outputs the default step size sourced from bus 35.
When branch instruction 21 takes its branch for the first time, the first-level loop is entered and data access instructions 12 and 13 are executed again, i.e. for the second time. At this point the valid bit read from register 31 is '0', so the step size stored in register 31 is invalid. The step size is therefore computed now and, under control of the branch decision signal of branch instruction 22, stored into register 31, and the valid bit in register 31 is set to '1'.
Assuming the branch of branch instruction 21 is taken every time thereafter (i.e. the first-level loop keeps executing), the selection signal of selector 41 in the corresponding loop step storage module is '1', and the modules for data access instructions 12 and 13 each output the valid bit and step size of their register 31. Since the valid bit is now '1', this step size can be used to compute the corresponding predicted data address. Meanwhile, the step size is recomputed as described above and stored into register 31 under control of the branch decision signal of branch instruction 22, with the valid bit remaining '1'. In this way, if the step size has not changed between two consecutive executions of the same data access instruction, the value in register 31 is unchanged; if the step size has changed, the value in register 31 is updated to the new step size.
When branch instruction 21 is executed again and its branch is not taken, but branch instruction 22 is then executed and its branch is taken, the selection signals of selectors 41 and 42 in the corresponding loop step storage modules are '0' and '1' respectively, and the modules for data access instructions 11, 12, and 13 each output the step size in register 32 to compute the corresponding predicted data address. Thus, at different loop levels, different register values in the loop step storage module are used as the step size.
Moreover, for data access instruction 11, if the branch of branch instruction 21 is taken, the selection signal of selector 41 is '1'. In this case the step size in register 31 is output, but since the first-level loop does not contain data access instruction 11, this step size is ignored. In the other cases, the step size in register 32 or 33, or the default step size sourced from bus 35, is output as described above, so that different loops are provided with different step sizes.
Each loop step storage module of the present invention, which provides step sizes for data access instructions inside loops, corresponds to one data access instruction. By extending the loop step storage module with more registers and selectors (each group of one register and its corresponding selector serving one loop level), more loop levels can be supported; by then providing one such module for every (or some) data access instruction in those deeper loops, all (or some) of the data access instructions can each be given a more accurate step size according to which loop level is executing.
Please refer to FIG. 3, which is an embodiment of the data address generation system of the present invention. It comprises a memory array 52, collectively called the step size table, corresponding to a plurality of data access instructions; a column decoder 50 that decodes according to branch decision results; a row decoder 54 that decodes according to the instruction addresses of data access instructions; and an address generator 60. Address generator 60 is composed of a subtractor 61, an adder 62, a selector 63, and a comparator 64. Each row of memory array 52 corresponds to one data access instruction. For each row of 52, the row decoder 54 contains a register storing the instruction address of the data access instruction corresponding to that row, and a comparator comparing the register contents with the instruction address on the data access instruction address bus 53. When the register contents of a row in 54 equal the address on 53, 54 enables the word line of that row, so that the row can be read or written. Array 52 has two read/write ports. One read/write port 37, 38 is dedicated to the rightmost column 58 of 52; access to this column is controlled only by the row decoder 54, and its entry format stores the corresponding data address 67 of a data access instruction together with a valid bit 68. The other read/write port 34, 36 is shared by all columns of 52 other than column 58 (such as column 56); through it, the entry at the row selected by row decoder 54 and the column selected by column decoder 50 can be accessed. The entry format of these columns stores the step size 65 and valid bit 66 of each of the plurality of data access instructions for a particular loop level. The number of these columns of array 52 (all columns except 58) corresponds to the maximum number of loop levels this step size table can support. For each of these columns, 50 contains a comparator and an address register storing the instruction address of a branch instruction. When the processor core makes a 'branch taken' decision, the instruction address of the corresponding branch instruction is sent via bus 51 to 50 for matching; the column of array 52 (other than column 58) corresponding to the matching register can then be accessed through read/write port 34, 36.
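The organization of the step size table can be sketched as the following software model. This is an illustrative abstraction, not the claimed circuit: rows are keyed by the load instruction's address (bus 53), columns by the taken branch's address (bus 51), and the class name and method names are assumptions.

```python
# Hypothetical model of array 52: one row per data access instruction, one
# column per taken branch, plus the dedicated "column 58" holding each row's
# last data address (field 67) and valid bit (field 68).
class StepTable:
    def __init__(self):
        self.last_addr = {}   # load_pc -> last data address (column 58)
        self.strides = {}     # (load_pc, branch_pc) -> step size (fields 65/66)

    def lookup(self, load_pc, branch_pc):
        """Return (address valid, stride valid), like bits 68 and 66."""
        return (load_pc in self.last_addr,
                (load_pc, branch_pc) in self.strides)

    def record_addr(self, load_pc, addr):
        self.last_addr[load_pc] = addr          # write via port 38 into column 58

    def learn_stride(self, load_pc, branch_pc, addr):
        # subtractor 61: new address minus the previously recorded address
        self.strides[(load_pc, branch_pc)] = addr - self.last_addr[load_pc]

    def predict(self, load_pc, branch_pc):
        # adder 62: last address plus the stride of the selected column
        return self.last_addr[load_pc] + self.strides[(load_pc, branch_pc)]
```

A dictionary stands in for the CAM-style matching performed by the row and column decoders; a missing key corresponds to a valid bit of '0'.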
The processor core executes instructions in order, starting from the leftmost instruction in FIG. 1 and proceeding left to right. When the processor core decodes instruction 11, it finds it to be a data load instruction, and the instruction address of instruction 11 is sent via bus 53 to the row decoder 54 for matching. The match misses, so the row replacement logic in 54 allocates a row for instruction 11, namely the top row 11 of array 52 in FIG. 3, and stores the instruction address of instruction 11 into the register of 54 corresponding to that row. The valid bits 66 and 68 of every column in that row are set to '0'. Based on the '0' valid bit 68 in column 58, the data address generation system has the processor core execute data load instruction 11 to generate a data address, access data memory (for example, a data cache), and read the data for the processor core to execute. At the same time, the system also stores this data address, via bus 57, selector 63, and write port 38, into field 67 of the entry at row 11, column 58 of array 52, and sets the valid bit 68 of that entry to '1'. The processor core then executes subsequent instructions, and the row replacement logic of 54 likewise allocates one row each for data load instructions 12 and 13, namely the middle and bottom rows 12 and 13 of array 52 in FIG. 3. As above, the data addresses generated when the processor core executes instructions 12 and 13 are stored in field 67 of the entries at rows 12 and 13, column 58, respectively, and the valid bits 68 of those entries are set to '1'.
Thereafter the processor core executes branch instruction 21 and decides its branch is 'taken'. The processor core therefore jumps backward to the branch target instruction located between instruction 11 and instruction 12 and executes in instruction order from that instruction. At the same time, the instruction address of branch instruction 21 is sent via bus 51 to the column decoder 50 for matching. Since there is no match, the column replacement logic of 50 allocates it a column of array 52 (column 21 in the figure) and stores the instruction address of branch instruction 21 into the register of 50 corresponding to that column. Bus 51 always holds the instruction address of the last taken branch; the address on 51 now matches the address in the register of 50 corresponding to column 21, so the column decoder 50 selects column 21.
The processor core continues executing instructions in program order and executes instruction 12 again; decoding instruction 12 finds it to be a data load instruction. The instruction address of instruction 12 is therefore sent via bus 53 to the row decoder 54 for matching. This time the match hits: from row 12, column 58 of 52, valid bit 68 is read out through read port 37 as '1', and from row 12, column 21 of 52, valid bit 66 is read out through read port 36 as '0'. Based on these two valid bits being '10', the system has the processor core execute data load instruction 12 to generate a data address, access data memory, and read the data for the processor core to execute. At the same time, the system also sends this data address via bus 57 to the address generator 60, where subtractor 61 subtracts from it the previous data address 67 read out at this time through read port 37 from row 12, column 58 of array 52; the difference, as the address increment (step size), is written into array 52 through write port 34. Bus 53 now carries the address of data load instruction 12 and bus 51 carries the address of branch instruction 21, so this step size is stored in the step size field 65 of the entry at row 12, column 21 of array 52, and the system sets the valid bit 66 of row 12, column 21 to '1'. In the same manner, the system stores the step size of data load instruction 13 into field 65 of row 13, column 21 of 52 and sets valid bit 66 of row 13, column 21 to '1'.
Thereafter the processor core executes branch instruction 21 again, and its branch decision is again 'taken'. The processor core therefore jumps backward to the branch target instruction between instruction 11 and instruction 12 and executes in instruction order from that instruction. At the same time, the instruction address of branch instruction 21 is sent via bus 51 to the column decoder 50 for matching. This time the match hits, so the column decoder 50 selects column 21.
The processor core executes instructions in program order and executes instruction 12 again; decoding instruction 12 finds it to be a data load instruction. The instruction address of instruction 12 is therefore sent via bus 53 to the row decoder 54 for matching. This time the match hits: from row 12, column 58 of 52, valid bit 68 is read out through read port 37 as '1', and from row 12, column 21 of 52, valid bit 66 is read out through read port 36 as '1'. Based on these two valid bits being '11', the system uses adder 62 to add the data address 67 read out at this time from row 12, column 58 through read port 37 to the step size 65 read from row 12, column 21 through read port 36. The sum, 38, is used as a new data address to access data memory, and the data read out is sent to the processor core for processing. The data address 57 generated by the processor core executing instruction 12 is compared by comparator 64 with the data address on 38. If the two are the same, the processor core continues executing subsequent instructions in program order; the sum from adder 62 is also written back through write port 38 into field 67 of row 12, column 58, and valid bit 68 of row 12, column 58 remains '1'. If the two differ, the system has the processor core discard the intermediate execution results based on the data fetched from the data address on 38, execute the load of instruction 12 with the data fetched from data address 57, and then execute subsequent instructions in order. The system also controls selector 63 to select the address on 57 and store it through write port 38 into field 67 of the entry at row 12, column 58, keeping valid bit 68 of that entry at '1'; but it sets valid bit 66 of row 12, column 21 to '0', thereby recording that the step size in that entry is invalid and must be re-learned. The system processes instruction 13 in the same manner.
Thereafter the processor core executes branch instruction 21 again, and this time the branch decision is 'not taken'. The processor core therefore continues executing in instruction order. The processor core then executes branch instruction 22, whose branch decision is 'taken'. The processor core therefore jumps backward to the branch target instruction located before instruction 11 and executes in instruction order from that instruction. At the same time, the instruction address of branch instruction 22 is sent via bus 51 to the column decoder 50 for matching. Since there is no match, the column replacement logic of 50 allocates it a column of array 52 (column 22 in the figure) and stores the instruction address of branch instruction 22 into the register of 50 corresponding to that column. The address now held on bus 51 matches the address in the register of 50 corresponding to column 22, so the column decoder 50 selects column 22.
The processor core executes instructions in program order and executes instruction 11 again; decoding instruction 11 finds it to be a data load instruction. The instruction address of instruction 11 is therefore sent via bus 53 to 54 for matching. This time the match hits: from row 11, column 58 of array 52, valid bit 68 is read out through read port 37 as '1', and from row 11, column 22 of 52, valid bit 66 is read out through read port 36 as '0'. The system therefore operates as in the previous case where the data address 67 is valid but the step size 65 is invalid: it reads data from data memory at the data address generated by the processor core for the processor core to process, subtracts the column 58 data address 67 read through read port 37 from the data address sent via bus 57, stores the difference as the step size through the write port into field 65 of row 11, column 22, and sets 66 to '1'. The system also stores the data address on 57 into field 67 of the entry at row 11, column 58, keeping the valid bit 68 of that entry at '1'. The system processes instructions 12 and 13 in the same way. Thereafter, operation repeats in this manner.
In summary, a row of array 52 stores the data memory address, step sizes, and corresponding valid bits of one data access instruction; a column of 52 stores, for one branch instruction whose branch was taken, the step size and valid bit of each data access instruction. The special column 58 stores the data memory addresses, and its reads and writes are unaffected by the state of branch instructions. The system selects a row of array 52 via bus 53 using the instruction address of a data access instruction and reads the valid bit 68 of its column 58; it selects a column via bus 51 using the instruction address of the last taken branch and reads the valid bit 66 there. Depending on the states of valid bits 68 and 66 in a row, the system has the following three operating modes. When the 68 and 66 read out are '00', both the data address and the step size are invalid; the system then stores the data address 57 generated by the processor core into field 67 of column 58 of that row and sets the state to '10'. When the 68 and 66 read out are '10', the data address is valid but the step size is invalid; the system then computes the difference between the data address 57 generated by the processor core and the data address stored in field 67 of column 58 of that row, stores it into field 65 of the column currently selected by the branch decision, and sets the state to '11'. When the 68 and 66 read out are '11', both the data address and the step size are valid; the system then adds the data address in field 67 of column 58 to the step size in field 65 of the other column selected by the branch decision to produce data address 38, accesses data memory, and reads the data for the processor core to process. In this state the system also compares the data address 57 generated by the processor core with data address 38, takes corrective action as needed according to the comparison result, and updates the state of 66.
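The three operating modes can be sketched as the following state machine. This is an illustrative software model, not the claimed circuit: `entry` stands for the pair of table entries selected by the current load and taken branch (fields 67/68 and 65/66), and `core_addr` for the address on bus 57.

```python
# Hypothetical sketch of the three modes keyed by the (68, 66) valid bits.
def handle_load(entry, core_addr):
    """Returns the predicted address (38), or None when no prediction is made."""
    if not entry['addr_valid']:                       # state '00': learn address
        entry.update(addr=core_addr, addr_valid=True)
        return None
    if not entry['stride_valid']:                     # state '10': learn stride
        entry.update(stride=core_addr - entry['addr'],
                     stride_valid=True, addr=core_addr)
        return None
    predicted = entry['addr'] + entry['stride']       # state '11': adder 62
    if predicted == core_addr:                        # comparator 64 agrees
        entry['addr'] = predicted                     # write sum back to field 67
    else:                                             # misprediction: correct and
        entry.update(addr=core_addr, stride_valid=False)  # mark stride for relearn
    return predicted
```

Note that on a misprediction the address field is corrected from bus 57 while only the stride's valid bit is cleared, matching the corrective action described above.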
The above system, which fetches data ahead of time based on step sizes, can be used in several ways. In the first way, when the data read using a data address 38 produced from a row of array 52 is accepted by the processor core (the address on bus 38 equals the address on bus 57), the system immediately sends the address on 38 plus the step size on read port 36 to the data cache for matching as a speculative data address. On a miss, fetching the data from a lower memory level into the higher-level cache can begin. The processor core still generates a data address via bus 57 to read the data from the data cache. This way can, in most cases, partially or completely hide data cache misses. However, the speculative data address uses a step size selected based on the result of the last taken branch and is issued before the next taken branch, whereas the processor core executes the same data load instruction and reads the data after the next taken branch. Two consecutive taken branches are not necessarily the same branch instruction, so the step size used to produce the speculative data address does not necessarily match the increment of the data address generated by the processor core. In the second way, after a taken branch, the entries in the column selected by that branch, together with the column 58 entries, are read from array 52 for each previously executed data load instruction address; if the entry states 68, 66 are '11', data address 38 is generated and the data is read from data memory into a data read buffer with shorter read latency, ready for the processor core to fetch. The following is an embodiment of this second way.
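The first way described above can be sketched as follows. This is a software illustration under assumptions: `cache` is modeled as a set of cached addresses, and `fill` as a callback that starts a fill from the lower memory level; both names are hypothetical.

```python
# Hypothetical sketch of the first usage (speculative prefetch): once a
# predicted address is confirmed by the core, probe the cache at
# predicted + stride and start a lower-level fill on a miss.
def speculative_prefetch(cache, predicted_addr, stride, fill):
    guess = predicted_addr + stride       # address on 38 plus step on port 36
    if guess not in cache:                # cache tag match fails
        fill(guess)                       # begin fetch from the lower level
    return guess
```

Because the guess runs ahead of the next branch decision, it may use the wrong loop level's stride, which is exactly the limitation the second way (prefetching into a data read buffer only after the branch is resolved) avoids.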
Please refer to FIG. 4, which is an embodiment of the row decoder of the step size table of the present invention. FIG. 4 shows the decode logic in row decoder 54 corresponding to one row of array 52, in which 72 and 74 are registers, 76 is a selector, and 78 is a comparator. The output of register 72 is connected to one input of comparator 78, and the output of selector 76 is connected to the other input of comparator 78. Besides equality comparison, comparator 78 can also perform greater-than and less-than comparisons, for a total of three comparison modes. The comparison mode of comparator 78 is linked to the selection of selector 76. When selector 76 selects bus 73 as input, comparator 78 tests whether the address stored in register 72 is greater than or equal to the address on 73; when selector 76 selects bus 51 as input, 78 tests whether the address stored in register 72 is less than the address on 51; when selector 76 selects bus 53 as input, 78 tests whether the address stored in register 72 equals the address on 53. 75 is the comparison result output by comparator 78.
Bus 53 carries the instruction address of the load instruction output by the processor core. This address is compared for equality, by the comparator 78 of each row, against the contents of register 72 in every row of row decoder 54. If no row matches, the row replacement logic in 54 allocates a row for that load instruction and stores the address on bus 53 into register 72 of that row. Thereafter the same address on bus 53 matches the address in register 72 of that row and enables the row's word line; this operation was detailed in the embodiment of FIG. 3. Now, when a backward-jumping branch instruction is successfully taken, the system places the instruction address of that branch on bus 51 and the instruction address of its branch target on bus 73. The system compares the contents of register 72 in every row of 54 against the address on bus 73 (greater-than-or-equal) and against the address on bus 51 (less-than) as described above, and the comparison result 75 is stored in register 74. When the address in register 72 of a row is greater than or equal to the address on bus 73 but less than the address on bus 51, the load instruction corresponding to that row lies between the branch instruction (exclusive) and the branch target instruction (inclusive). For example, in FIG. 1 instructions 12 and 13 are inside the loop closed by branch instruction 21, so register 74 of those rows is written with the comparison result '1'. Rows whose comparison does not satisfy the above condition are not between the branch target instruction and the branch instruction; for example, instruction 11 in FIG. 1 is not in the loop of branch instruction 21, so register 74 of its row is written with '0'.
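The range check a row performs after a taken backward branch can be stated compactly. A minimal sketch (the function name is ours, not part of the disclosure):

```python
# Sketch of one row's decode logic (FIG. 4): register 72 holds a load
# instruction's address; the row is "in the loop" exactly when
# branch_target (bus 73) <= reg72 < branch_addr (bus 51).

def in_loop(reg72, branch_target, branch_addr):
    """Comparison result written into register 74 after a taken backward branch."""
    return int(branch_target <= reg72 < branch_addr)
```

With the FIG. 1 example (branch 21 jumping back to target 12), loads at 12 and 13 yield '1' while the load at 11 yields '0', matching the text above.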
The system then enables, one after another, the word lines of all rows whose register 74 holds '1', reading from array 52 each such row's entry in column 58 together with its entry in the column selected by column decoder 50 according to the branch instruction address on bus 51. If the entry state 68, 66 is '11', the system adds the step size in field 65 to the previous iteration's data address in field 67, obtaining a new data address that addresses the data memory over bus 38 and reads data for the processor core's later use. The new data address is also written back to field 67. Rows whose register 74 holds '0' are left untouched. Later, when the processor core executes the load instruction, the instruction address sent over bus 53 is compared for equality against the contents of register 72 in each row of row decoder 54. For the matching row the system inspects register 74: if it holds '0', the system reads the state 68, 66 of that row and proceeds as described earlier; if it holds '1', the system reads the data address in field 67 of that row and compares it with the data address sent by the processor core over bus 57. If the two addresses are equal, the system clears register 74 of that row to '0' and takes no further action. If the two addresses differ, the system clears register 74 to '0' and proceeds as in the earlier case where the address in field 67 differs from the address on bus 57.
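The update pass over the flagged rows can be modeled as follows; this is an illustrative sketch in which the row layout (`flag`, `state`, `step`, `addr`) is a hypothetical simplification of registers 74, 68/66 and fields 65, 67.

```python
# Model of the update pass: for every row flagged '1' in register 74 whose
# entry state is '11', add the row's step (field 65) to its last data
# address (field 67), fetch from the data memory, and write the new
# address back into field 67.

def update_flagged_rows(rows, data_memory, read_buffer):
    """rows: list of dicts with keys flag, state, step, addr (hypothetical layout)."""
    for row in rows:
        if row["flag"] == 1 and row["state"] == "11":
            row["addr"] += row["step"]                    # new data address on bus 38
            read_buffer.append(data_memory[row["addr"]])  # read data for the core
    return read_buffer

rows = [
    {"flag": 1, "state": "11", "step": 4, "addr": 100},  # in-loop load: updated
    {"flag": 0, "state": "11", "step": 4, "addr": 200},  # out-of-loop load: untouched
]
memory = {104: "x", 204: "y"}
buf = update_flagged_rows(rows, memory, [])
```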
Referring to FIG. 5, a block diagram of a processor system using the data address generation system of the present invention is shown. Here 50 is the column decoder of the step size table, 52 the step size table array, 54 the row decoder of the step size table, 60 the address generator, 80 the processor core, 82 the data read buffer, and 84 the data memory. Bus 51 is the branch instruction address bus for taken branches, output by processor core 80 to column decoder 50 and row decoder 54. Bus 73 is the branch target instruction address bus, output by 80 to 54. Bus 53 is the data access instruction address bus, output by 80 to 54. Bus 57 is a data address bus, output by 80 to address generator 60 and data memory 84. Bus 38 is a data address bus, output by address generator 60 to data memory 84. Bus 85 carries data from data memory 84 into data read buffer 82 for temporary storage, and bus 87 carries data from 82 to processor core 80. The number of columns of the step size table (50, 52, 54) determines how many loop levels or loops it can handle; the number of rows determines how many data access instructions it can handle.
The step size table is a one-dimensional data structure (column 58) plus a two-dimensional one. The one-dimensional structure is addressed by the instruction address of a data access instruction, and its content is a data address. In the two-dimensional structure, one dimension is addressed by the instruction address of a data access instruction and the other by the instruction address of a branch instruction, and its content is a data address increment (step size). Such a step size table maps the instruction address of a data access instruction to its corresponding data address. The mapping is not fixed; it is a dynamic mapping that changes with the number of times the data access instruction has executed and with its loop path. The system allocates a one-dimensional storage resource (a row) in the step size table using the data access instruction address supplied by processor core 80 over bus 53, and initializes that row with the corresponding data address supplied by 80 over bus 57. It allocates the other dimension (a column) using the branch instruction address supplied by 80 over bus 51, and stores into that column the difference between the data address supplied again by 80 over bus 57 and the initial content in the step size table. Afterwards, the system uses the instruction address of a taken backward-jumping branch, carried on bus 51, as the upper bound, and the corresponding branch target instruction address, carried on bus 73, as the lower bound, so that the step size table and address generator 60 automatically update the data addresses of the data access instructions whose instruction addresses lie between the bounds, each by its own step size in the column selected by the current state of the branch loop, access data memory 84 over bus 38 with the updated data addresses, and read the corresponding data into data read buffer 82 before processor core 80 outputs the corresponding data address on bus 57.
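The learn-then-advance behavior of this dynamic mapping can be sketched as a small class. This is an illustrative model: `StepTable`, `learn` and `advance` are our names, and column allocation is simplified to a dictionary keyed by (load PC, branch PC).

```python
# Sketch of the step size table: a 1-D structure (column 58, last data
# address per load instruction) plus a 2-D structure (step size per
# load-instruction x branch-instruction pair).

class StepTable:
    def __init__(self):
        self.addr = {}   # column 58: load instruction address -> last data address
        self.step = {}   # (load PC, branch PC) -> data address increment

    def learn(self, load_pc, branch_pc, data_addr):
        """Record the core's data address; derive the step on a repeat execution."""
        if load_pc in self.addr:
            self.step[(load_pc, branch_pc)] = data_addr - self.addr[load_pc]
        self.addr[load_pc] = data_addr

    def advance(self, load_pc, branch_pc):
        """Produce the next data address from the step selected by the branch."""
        self.addr[load_pc] += self.step[(load_pc, branch_pc)]
        return self.addr[load_pc]
```

After two observed executions, `advance` predicts the third address without waiting for the core.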
The data read buffer 82 may take an address-matched read form. In this form, every row of 82 has an entry storing data and an entry storing the corresponding data address. Processor core 80 sends a data address over bus 57, and the data in the entry whose stored address matches the address on bus 57 is delivered to processor core 80 over bus 87. In this form, the row replacement logic may use a policy such as LRU (least recently used), and columns may likewise be replaced by LRU or a similar policy. Note that in this form, when performing the data reads from the lower bound to the upper bound, a mechanism should keep them as close to instruction address order as possible, because the data for the access instructions nearest the lower bound are the first the processor core will use.
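The address-matched form with LRU replacement can be sketched as below. This is an illustrative model; it exploits Python's dict insertion-order preservation as a shortcut for LRU bookkeeping, and all names are hypothetical.

```python
# Sketch of the address-matched form of data read buffer 82: each row pairs
# a data address with its data; the core's address on bus 57 selects a row.
# Dict insertion order stands in for LRU age (oldest entry first).

class AddrMatchedBuffer:
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.rows = {}                       # data address -> data, LRU-ordered

    def fill(self, addr, data):
        """Store prefetched data, evicting the least recently used row if full."""
        if len(self.rows) >= self.capacity and addr not in self.rows:
            self.rows.pop(next(iter(self.rows)))   # evict the LRU row
        self.rows[addr] = data

    def read(self, addr):
        """Match the core's address (bus 57) and refresh the row's LRU position."""
        data = self.rows.pop(addr)
        self.rows[addr] = data               # re-insert as most recently used
        return data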
Another form of data read buffer 82 is first-in first-out (FIFO). In this case the row allocation logic in row decoder 54 allocates row resources strictly in increasing row-number order; in the embodiment of FIG. 3, rows are allocated in the address order of instructions 11, 12, 13. Thus, when processor core 80 supplies the lower and upper bounds to row decoder 54, 54 enables the word lines in instruction address order from the lower bound to the upper bound, making the step size table and address generator 60 supply the corresponding data addresses to data memory 84 in instruction order, so that the data read out are stored into the first-in first-out data read buffer 82 in instruction order. Each time processor core 80 executes a load instruction, it issues a read request to FIFO 82, and 82 delivers one datum to 80. At that point 82 is a data queue ordered by instruction address.
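The FIFO form is the simplest to model: prefetched data enter in instruction address order and each executed load pops exactly one datum. A minimal sketch (names are ours):

```python
# Sketch of the FIFO form of data read buffer 82: data prefetched from the
# lower bound to the upper bound are pushed in instruction address order;
# each executed load instruction pops exactly one datum.

from collections import deque

def prefetch_fifo(addrs_in_instr_order, data_memory):
    """Fill the FIFO with the data at the given addresses, in instruction order."""
    return deque(data_memory[a] for a in addrs_in_instr_order)

memory = {100: "a", 200: "b", 300: "c"}
fifo = prefetch_fifo([100, 200, 300], memory)   # bounds walked in instruction order
first = fifo.popleft()                          # the load nearest the lower bound
```

Because the fill order equals the execution order, no address matching is needed on the read side.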
This form of row replacement logic treats the step size table as a circular buffer: once the last row of the table has been allocated, the next allocation takes the first row, then the second, and so on. The addresses stored in registers 72 keep the mechanisms described above working across the wrap. For example, suppose in the embodiment of FIG. 1 there are other data access instructions before instructions 11, 12, 13; instruction 11 then obtains the bottom row in FIG. 3, and instructions 12 and 13 are allocated the top and middle rows in turn. When branch instruction 22 or 23 is taken, the lower bound is at the bottom row (instruction 11) and the upper bound is at the middle row (instruction 13). The data reads from the lower bound to the upper bound start at the bottom row (instruction 11), pass through the top row (instruction 12), and finish at the middle row (instruction 13). Columns may be organized as a circular buffer in the same way (after the last column is allocated, the first column is allocated next), or a replacement policy such as LRU may be used.
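The wrap-around allocation and the role of register 72 can be sketched as follows; `CircularAllocator` and its methods are hypothetical names for an illustrative model.

```python
# Sketch of circular-buffer row allocation: after the last row is handed
# out, allocation wraps back to row 0. Register 72 (the stored instruction
# address) is what lets lookups keep working across the wrap.

class CircularAllocator:
    def __init__(self, num_rows):
        self.num_rows = num_rows
        self.next_row = 0
        self.reg72 = [None] * num_rows   # instruction address held by each row

    def allocate(self, instr_addr):
        """Hand out the next row in strictly increasing order, wrapping at the end."""
        row = self.next_row
        self.reg72[row] = instr_addr
        self.next_row = (self.next_row + 1) % self.num_rows
        return row

    def lookup(self, instr_addr):
        """Find the row whose register 72 matches, regardless of wrap position."""
        return self.reg72.index(instr_addr)
```

With a 3-row table, allocating for an earlier instruction and then for instructions 11, 12, 13 makes 13 wrap onto row 0, yet lookups by instruction address still resolve correctly.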
The third way to use the above step-size-based early data fetching system combines the first and second ways. After a branch decision is made, the data addresses of the data access instructions between the corresponding lower and upper bounds are updated; the data memory is accessed with these updated addresses, the data read out are stored into data read buffer 82 for processor core 80 to use, and the updated data addresses are stored into column 58 of array 52 (that is, the second way). At the same time, the step size selected by the current branch decision is added to the data address to produce a guessed data address that is sent to the data cache; if the corresponding data is not yet in the cache, it is prefetched into the cache to hide the cache miss, but the guessed data address is not stored into column 58 of array 52. Generating the guessed data address, and addressing the data memory with it so that data which the next execution of the same load instruction may need is filled into the data memory in advance, both happen before the new branch decision is made; this part is therefore the first way.
The data memory 84 may be implemented as a cache. A data cache generally has a tag unit storing all or part of each memory address. A memory address sent to the tag unit is matched, and on a hit a cache address is produced (for example, in a set-associative cache, the way number combined with the index portion and the intra-block offset portion of the memory address) which addresses the data RAM of the cache. The step size table disclosed by the present invention can store cache addresses directly in field 67 of column 58, so that the address sent over bus 38 directly addresses the data RAM of the cache without passing through the tag unit mapping. This requires the row addresses in the data memory to be contiguous within a certain range, so that address generator 60 can automatically compute the next data address by step increments; for example, several rows with contiguous addresses in the data memory may be stored in the same way of a set-associative cache. When the cache address crosses into a non-contiguous address region, some method must adjust the cache address, such as changing the way number. Further, the memory address and its corresponding cache address may both be stored in field 67 and updated with the same step size. The cache address directly addresses the data RAM of the cache, while the memory address is compared with the memory address output by processor core 80 over bus 57 to confirm that the address produced by address generator 60 is correct. In this format, when the cache address crosses into a non-contiguous address region, the memory address in field 67 is matched in the tag unit of the data cache to obtain a new cache address; within a contiguous region, the cache address is simply updated incrementally to address the data RAM of the cache.
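The dual-address update in field 67 can be sketched as below. This is an illustrative model under stated assumptions: the contiguous-region size and the toy `tag_lookup` mapping are ours, not part of the disclosure.

```python
# Sketch of field 67 holding both the memory address and its cache address,
# both advanced by the same step; when the stride leaves the contiguous
# region, the memory address is re-mapped through the tag unit instead.

def advance_entry(entry, step, tag_lookup, region_size=64):
    """entry: {'mem': memory address, 'cache': cache address} (hypothetical layout)."""
    new_mem = entry["mem"] + step
    if new_mem // region_size == entry["mem"] // region_size:
        new_cache = entry["cache"] + step   # still contiguous: just increment
    else:
        new_cache = tag_lookup(new_mem)     # crossed a region: consult the tag unit
    return {"mem": new_mem, "cache": new_cache}
```

Within a region no tag lookup is needed at all, which is the point of caching the cache address in field 67.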
The above embodiments all take load instructions as examples, but the method and system of the present invention also apply to store instructions. For example, the first way can be applied to a write-back cache with a write-allocate policy: the address generator produces the memory address about to be stored to and reads the corresponding data from memory into the data cache, so that the processor core avoids a cache miss when it stores data into the data cache.
The step size table is a two-dimensional data structure in which one dimension is addressed by one parameter and the other dimension by another parameter to access one entry. The parameters could address the table directly, but if the parameter values are not contiguous they can be compressed. In the embodiments of FIG. 3 and FIG. 4, the registers 72 in row decoder 54 play a role similar to the tag unit of a fully associative cache, compressing the address space and the holes in array 52 (data access instructions make up roughly one third of all instructions, and their addresses are not contiguous). Seen this way, the step size table is in fact a structure resembling a fully associative cache. The address registers in column decoder 50 perform the same compression (branch instructions make up roughly one sixth of all instructions, and their addresses are likewise not contiguous). The step size table can therefore be viewed as a two-dimensional fully associative compressed structure.
Although the embodiments of the present invention describe only structural features and/or method steps of the invention, it should be understood that the claims of the invention are not limited to those features and steps. On the contrary, the described features and steps are merely examples of implementing the claims of the invention.
It should be understood that the components listed in the above embodiments are given for ease of description only; other components may be included, or some components may be combined or omitted. The components may be distributed across multiple systems, may be physical or virtual, and may be implemented in hardware (such as integrated circuits), in software, or in a combination of hardware and software.
Obviously, in light of the description of the preferred embodiments above, however fast the technology of this field develops and whatever presently unforeseeable progress may be made, a person of ordinary skill in the art may, following the principles of the present invention, make corresponding substitutions, adjustments and improvements to the relevant parameters and configurations; all such substitutions, adjustments and improvements fall within the scope of protection of the claims appended to the present invention.
The system and method proposed by the present invention can be used in a variety of processor-related applications, including general-purpose processors, microcontrollers, multi-lane processors, artificial-intelligence processors, big-data processors, digital signal processors, and graphics processors, and can improve processor efficiency.

Claims (10)

  1. A data address generation system, characterized in that it comprises:
    a step size table for storing data addresses and address increments;
    wherein the data address generation system learns and records, by data access instruction address, the data addresses produced when the processor core executes data access instructions, and the data address increments between two executions of the same instruction, and stores them into the step size table;
    and the data address generation system addresses the contents of the step size table by data access instruction address to produce new data access addresses, accesses the data memory, and obtains data for use by the processor core.
  2. The data address generation system of claim 1, characterized in that the content of an entry in the step size table is said data address;
    and the step size table is addressed by said data access instruction address.
  3. The data address generation system of claim 1, characterized in that the content of an entry in the step size table is said data address increment;
    one dimension of the step size table is addressed by said data access instruction address;
    and the other dimension of the step size table is addressed by the instruction address of the backward branch instruction of a taken branch.
  4. The data address generation system of claim 1, characterized in that the data address generation system produces addresses as follows:
    the data address generation system adds the data address and the data address increment stored in the step size table to produce a new data address;
    and the new data address is stored back into the step size table.
  5. The data address generation system of claim 1, characterized in that:
    the data address generation system accesses the step size table contents of the data access instructions whose addresses lie between the instruction address of the backward branch instruction of a taken branch and its branch target instruction address;
    the data address generation system produces new data memory addresses from those step size table contents;
    the data address generation system accesses the data memory with the new data memory addresses and obtains data for the processor to process;
    and the new data addresses are stored back into the step size table.
  6. A data address generation method, characterized in that it comprises the following steps:
    learning, by data access instruction address, the data addresses produced when the processor core executes data access instructions;
    learning, by data access instruction address, the data address increments between two executions of the same data access instruction by the processor core;
    recording the above data addresses and data increments in a step size table;
    and addressing the contents of the step size table by data access instruction address to produce new data access addresses, accessing the data memory, and obtaining data for use by the processor core.
  7. The data address generation method of claim 6, characterized in that the content of an entry in the step size table is said data address;
    and the step size table is addressed by said data access instruction address.
  8. The data address generation method of claim 6, characterized in that the content of an entry in the step size table is said data address increment;
    one dimension of the step size table is addressed by said data access instruction address;
    and the other dimension of the step size table is addressed by the instruction address of the backward branch instruction of a taken branch.
  9. The data address generation method of claim 6, characterized in that addresses are produced as follows:
    the data address and the data address increment stored in the step size table are added to produce a new data address;
    and the new data address is stored back into the step size table.
  10. The data address generation method of claim 6, characterized in that:
    the step size table contents of the data access instructions whose addresses lie between the instruction address of the backward branch instruction of a taken branch and its branch target instruction address are accessed;
    new data memory addresses are produced from those step size table contents;
    the data memory is accessed with the new data memory addresses, and data is obtained for the processor to process;
    and the new data addresses are stored back into the step size table.
PCT/CN2016/083018 2015-05-23 2016-05-23 Generation system and method of data address WO2016188392A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510271803.5 2015-05-23
CN201510271803.5A CN106293624A (en) 2015-05-23 2015-05-23 Data address generation system and method

Publications (1)

Publication Number Publication Date
WO2016188392A1 true WO2016188392A1 (en) 2016-12-01
