A hardware optimization method for a CPU
Technical field
The present invention relates to the field of processor hardware design, and more particularly to the optimized hardware design of data transfer in a CPU (central processing unit).
Background art
The data transfer unit has always been an important component of the CPU, and optimizing it has always been one of the priorities of performance optimization in processor design. In domestic processor designs, the data transfer unit is optimized mainly by improving the execution efficiency of the cache, resolving read/write dependencies, and adding components such as a DMAC (direct memory access controller); optimization of the execution flow of the data transfer instructions themselves is seldom addressed. Taking the data transfer instructions of the 32-bit x86 instruction set as an example, the execution flow follows the traditional one-at-a-time approach entirely: data is passed from the source address to the destination address sequentially, one byte (Byte), word (Word), or doubleword (Double Word) at a time. Instructions of this kind occupy the bus heavily, which stalls the CPU pipeline and increases the bus bandwidth load.
Taking 32-bit x86 instructions as an example, Fig. 1 and Fig. 2 are the flow charts of REP MOVS (string move instruction), a typical memory-to-memory data transfer instruction, and POPA (pop-all instruction), a typical memory-to-register transfer instruction, respectively. For REP MOVS, assuming the value of ECX is 100, the instruction must repeat steps 1 to 5 one hundred times before it completes. Instructions of this kind therefore execute inefficiently and place a high load on bus bandwidth.
Through a detailed analysis of the data transfer instructions of the x86 instruction set, the present invention proposes a method of hardware optimization for the continuous data transfer instructions among them. The method reduces the execution cycles of data transfer instructions and the number of times the CPU accesses the bus (write accesses in particular), effectively increasing the instruction execution speed of the CPU.
Summary of the invention
The present invention provides a hardware optimization method for data transfer instructions, intended to improve the execution efficiency of data transfer instructions and thereby raise processor performance. Its address parameters are simple to set: they can be set just once at system initialization (for example in the BIOS setup), or changed while the system is running according to customer needs. The performance gain is significant: optimized data transfer instructions generally more than double in efficiency.
The hardware optimization method for a CPU of the present invention includes:
(1) Design the system hardware. The system consists of modules such as a master controller, a memory controller, and an external bus interface. The master controller is the main part of the system: it is a CPU that can perform memory operations in different modes for different access addresses. Specifically, the burst-capable address regions of this CPU can be set through the BIOS program or by setting internal register values. Memory accesses to burst-capable regions are executed according to the optimized instruction scheme, and accesses to non-burst regions are executed according to the original instruction scheme. The memory controller reads and writes memory according to the data and control information it receives from the master controller, and supports both burst and non-burst reads and writes. The external bus interface handles data communication between the inside and the outside of the CPU. The data packet after hardware optimization includes: the source address information of the data transfer, the destination address information, the data length information, the control information, and the data itself.
(2) Set the burst regions. With the burst address editor, the burst-capable regions and non-burst regions of memory can be set easily. Both kinds of region are configurable and are determined by the user's memory allocation scheme. Neither kind of region has to be fully contiguous: the system's memory addresses can be divided into multiple burst-capable regions and multiple non-burst regions. Whether burst mode is used can be configured in two ways, by hardware and by software. Externally, burst mode can be turned on or off with a single switch; internally, burst mode can be enabled or disabled by setting a specific register value. Besides using the burst editor, the user can also set the burst-capable regions with a dedicated program. The regions are usually set once through the BIOS at system initialization, but they can also be reset as needed while a program is running. When both the source address and the destination address of the data transfer instruction being executed lie in a burst region, the optimized instruction scheme is enabled.
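The region check described in step (2) can be sketched in software as follows. This is only an illustrative Python model under stated assumptions, not the hardware logic of the invention: the names `BurstRegion`, `in_burst_region`, and `use_optimized_flow` are hypothetical, regions are assumed to be inclusive address ranges, and the hardware/software enable is collapsed into a single `burst_enabled` flag.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BurstRegion:
    """Hypothetical model of one configured burst-capable region."""
    low: int   # lowest address in the region (assumed inclusive)
    high: int  # highest address in the region (assumed inclusive)


def in_burst_region(addr, regions):
    """Return True if addr falls inside any configured burst region."""
    return any(r.low <= addr <= r.high for r in regions)


def use_optimized_flow(src, dst, regions, burst_enabled=True):
    """Select the optimized flow only when burst mode is on and both the
    source and the destination address lie in burst-capable regions."""
    return (burst_enabled
            and in_burst_region(src, regions)
            and in_burst_region(dst, regions))


# Two non-contiguous burst-capable regions, as the text allows.
regions = [BurstRegion(0x1000, 0x1FFF), BurstRegion(0x8000, 0x8FFF)]
```

With these regions, a transfer from 0x1000 to 0x8000 would take the optimized flow, while a transfer whose destination is 0x3000 would fall back to the original flow.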
(3) Use burst mode. Burst mode can be enabled when the data length exceeds 32 bits. Burst mode does not depend on whether the cache is enabled, and it supports memory reads and writes of at most one cache line of data. This limit of one cache line is not fixed. The present invention takes the size of one cache line as the maximum so that data can be exchanged conveniently with the cache module when the cache is enabled; there is no hard requirement on the maximum number of bytes a burst may actually carry. For data transfers between the inside and the outside of the CPU: when reading memory, one cache line of data can be read at once into the burst register of the data transfer unit, after which operations such as register assignment are carried out; when writing memory, the data to be written is first staged in the burst register of the data transfer unit and then written to memory in a single operation. Enabling the cache is of great help to burst reads and writes and can effectively improve CPU execution efficiency.
Burst reads and writes reduce the number of CPU bus accesses and shorten the traditional instruction execution flow. For specific instructions (such as REP MOVSB), the optimized instruction flow can be reduced to 1/16 of the traditional flow (depending on the number of bytes in a cache line).
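The 1/16 figure can be checked with simple arithmetic. The Python sketch below is illustrative only: the function name `transfer_rounds` is hypothetical, and the 16-byte cache line is an assumption, chosen because it is the line size consistent with the 1/16 ratio quoted for REP MOVSB.

```python
import math


def transfer_rounds(count, operand_size, line_bytes=16):
    """Compare loop iterations of the traditional and the burst flow.

    count        -- number of elements to move (the ECX value)
    operand_size -- bytes per element (1 for MOVSB)
    line_bytes   -- bytes moved per burst (assumed 16 here)
    """
    traditional = count                         # one element per iteration
    per_burst = line_bytes // operand_size      # elements moved per burst
    optimized = math.ceil(count / per_burst)    # last burst may be partial
    return traditional, optimized


print(transfer_rounds(1600, 1))  # (1600, 100): exactly 1/16
print(transfer_rounds(100, 1))   # (100, 7): six full bursts plus one partial
```

When the element count is not a multiple of the burst size, the ratio is slightly above 1/16 because of the final partial burst.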
This optimization method is not limited to the instruction set of the x86 architecture; it also applies to other CISC or RISC instruction sets. Taking a RISC instruction set as an example, the widely used ARM instruction set has an instruction, LDMIA, that transfers the values of multiple registers, up to 16 general-purpose registers. ARM's approach is to perform the load/store operations one at a time, incrementing the memory address by the word length after each one. With the burst read/write and assignment scheme of this patent, the values can be read all at once and then assigned to all 16 general-purpose registers in a single operation, giving a clear performance gain.
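As a rough software illustration of the LDMIA case, the Python sketch below reads the whole register block in one step and then distributes it. It is a model under assumptions, not ARM hardware behavior or the patent's circuit: the function name `ldmia_burst` is hypothetical, memory is a `bytearray`, words are 32-bit little-endian, and the entire block is assumed to fit in a single burst.

```python
import struct


def ldmia_burst(mem, base, reg_list):
    """Model LDMIA with one block read instead of per-register loads.

    Registers are filled in ascending register-number order from
    ascending addresses, as LDMIA does. Returns (registers, new_base).
    """
    regs = sorted(set(reg_list))
    size = 4 * len(regs)                        # 32-bit words
    block = bytes(mem[base:base + size])        # the single read burst
    values = struct.unpack(f"<{len(regs)}I", block)
    return dict(zip(regs, values)), base + size


# Sixteen words in memory, loaded into r0..r15 in one modeled burst.
mem = bytearray(struct.pack("<16I", *range(100, 116)))
loaded, new_base = ldmia_burst(mem, 0, range(16))
```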
The advantages of the hardware optimization method for a CPU of the present invention are as follows:
It is fully compatible with the x86 instruction set, so operating systems and application software need no changes;
The burst read/write regions of the present invention are fully configurable: users can divide the burst read/write regions according to their own needs and optimize data reads and writes segment by segment;
In terms of hardware design, no peripheral circuitry is added and the number of additional CPU logic gates is small, so system cost is not affected;
Further, through an in-depth analysis of data transfer instructions, the present invention optimizes the execution flow of some of the x86 data transfer instructions, substantially shortening instruction execution time and reducing the number of CPU bus accesses. For specific instructions (such as REP MOVSB), the number of bus accesses can be reduced to 1/16 (depending on the number of bytes in a cache line). Because bus accesses are more concentrated, the bandwidth load on the bus is also substantially reduced;
The method can be extended to CPU designs based on other CISC or RISC instruction sets, so its range of application is wide.
Brief description of the drawings:
Fig. 1 is the traditional execution flow chart of REP MOVS, a typical memory-to-memory data transfer instruction.
Fig. 2 is the traditional execution flow chart of POPA, a typical memory-to-register transfer instruction.
Fig. 3 compares the general execution flow of a data transfer instruction before and after optimization.
Fig. 4 is the execution flow chart of REP MOVS after optimization.
Fig. 5 is the execution flow chart of POPA after optimization.
Detailed description of the embodiments:
As shown in Fig. 1 to Fig. 5, the steps of the hardware optimization method for a CPU are as follows:
1. Configure the burst regions with the burst address editor
Assign values to two of the CPU's general-purpose registers so that they hold the lower and upper addresses of the burst region. Then set a reserved bit of EFLAGS to 1, which assigns these two addresses to the CPU's burst configuration register group. Then clear that reserved bit of EFLAGS to 0, ready for the next burst region configuration. Repeat this step to configure multiple burst regions.
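The configuration handshake in step 1 can be modeled in a few lines of Python. This is a behavioral sketch under assumptions only: the source does not say which general-purpose registers or which reserved EFLAGS bit are used, so `reg_low`, `reg_high`, and `config_bit` are hypothetical stand-ins, and the latch is assumed to happen on the 0-to-1 transition of the bit.

```python
class BurstConfigModel:
    """Hypothetical behavioral model of the burst region configuration."""

    def __init__(self):
        self.reg_low = 0      # stand-in for the register holding the lower address
        self.reg_high = 0     # stand-in for the register holding the upper address
        self.config_bit = 0   # stand-in for the reserved EFLAGS bit
        self.regions = []     # stand-in for the burst configuration register group

    def set_bit(self):
        # Assumed: the address pair is latched when the bit goes from 0 to 1.
        if self.config_bit == 0:
            self.regions.append((self.reg_low, self.reg_high))
        self.config_bit = 1

    def clear_bit(self):
        # Clearing the bit re-arms the mechanism for the next region.
        self.config_bit = 0

    def configure(self, low, high):
        """Run one full configuration round for a single region."""
        self.reg_low, self.reg_high = low, high
        self.set_bit()
        self.clear_bit()


model = BurstConfigModel()
model.configure(0x1000, 0x1FFF)   # first burst region
model.configure(0x8000, 0x8FFF)   # repeat the step for a second region
```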
2. Accelerate memory-to-memory transfer instructions through hardware optimization
Taking the REP MOVS instruction as an example, the hardware first determines whether the addresses DS:ESI and ES:EDI lie in a burst region. If they do, the optimized logic is executed; otherwise the original logic is executed. In the optimized logic, the value of ECX is compared with burst_num (depending on the operand size M, burst_num can be 4, 8, 16, and so on). If ECX is greater than burst_num, N bytes of data (N being the number of bytes in a cache line) are fetched from DS:ESI with one read burst and staged in the internal burst register, then written to ES:EDI with one write burst; N is subtracted from or added to ESI/EDI (whether it is added or subtracted depends on DF), and burst_num is subtracted from ECX. If ECX is less than burst_num, (ECX*M) bytes of data are fetched from DS:ESI with one read burst and staged in the internal burst register, then written to ES:EDI with one write burst; ECX*M is subtracted from or added to ESI/EDI, and ECX is set to 0.
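The optimized REP MOVS logic of step 2 can be sketched as a small Python simulation. It is a simplified model under stated assumptions, not the hardware implementation: `rep_movs_burst` is a hypothetical name, memory is a flat `bytearray` (segmentation and 32-bit wraparound are ignored), source and destination are assumed not to overlap, and burst_num is taken to be N / M, which matches the values 4, 8, and 16 for a 16-byte line. The case where ECX equals burst_num is handled as a full burst, since both branches in the text then move the same N bytes.

```python
def rep_movs_burst(mem, esi, edi, ecx, m, n=16, df=0):
    """Simulate the optimized REP MOVS flow.

    m  -- operand size M in bytes (1, 2 or 4)
    n  -- bytes per burst, assumed to be one 16-byte cache line
    df -- direction flag: 0 increments the addresses, 1 decrements them
    Returns (esi, edi, ecx, bursts), where bursts counts read/write pairs.
    """
    burst_num = n // m                 # elements per full burst (assumed N / M)
    bursts = 0
    while ecx > 0:
        elems = burst_num if ecx >= burst_num else ecx
        size = elems * m               # N for a full burst, ECX*M otherwise
        if df == 0:
            src, dst = esi, edi
        else:
            # With DF=1 the block ends at the current element.
            src = esi - (elems - 1) * m
            dst = edi - (elems - 1) * m
        burst_register = bytes(mem[src:src + size])   # one read burst
        mem[dst:dst + size] = burst_register          # one write burst
        step = -size if df else size
        esi, edi = esi + step, edi + step
        ecx -= elems
        bursts += 1
    return esi, edi, ecx, bursts


# Copy 100 bytes (ECX = 100, MOVSB) from address 0 to address 100.
mem = bytearray(range(100)) + bytearray(100)
result = rep_movs_burst(mem, esi=0, edi=100, ecx=100, m=1)
```

For the ECX = 100 example from the background section, this model finishes in 7 read/write burst pairs (six full bursts and one 4-byte remainder) instead of 100 iterations.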
3. Accelerate transfer instructions between registers and memory through hardware optimization
Taking the POPA/PUSHA instructions as an example, the hardware first determines whether SS:ESP lies in a burst region. If it does, the optimized logic is executed; otherwise the original logic is executed. In the optimized logic, for POPA: N bytes of stack data are read and staged in the internal burst register, then assigned in turn to the CPU's general-purpose registers such as DI, SI, and BP. For PUSHA: the values of the CPU's general-purpose registers such as DI, SI, and BP are combined and assigned to the internal burst register, then written out in a single operation with one write burst.
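A software sketch of the optimized POPA/PUSHA flow is given below, using the 32-bit forms of the instructions. It is an illustrative model under assumptions rather than the patented hardware: `popad_burst` and `pushad_burst` are hypothetical names, memory is a flat `bytearray`, and the whole 32-byte register frame is assumed to move in a single burst. The frame layout follows the standard x86 order, in which POPA discards the saved ESP slot.

```python
import struct

# Order of the frame in memory, from the lowest stack address upward.
FRAME_ORDER = ("edi", "esi", "ebp", "esp_slot", "ebx", "edx", "ecx", "eax")
FRAME_BYTES = 32  # eight 32-bit values


def pushad_burst(mem, regs):
    """Pack all general-purpose registers and write them with one burst.
    Returns the new ESP."""
    new_esp = regs["esp"] - FRAME_BYTES
    burst_register = struct.pack(
        "<8I", regs["edi"], regs["esi"], regs["ebp"], regs["esp"],
        regs["ebx"], regs["edx"], regs["ecx"], regs["eax"])
    mem[new_esp:new_esp + FRAME_BYTES] = burst_register   # one write burst
    return new_esp


def popad_burst(mem, esp):
    """Read the frame with one burst, then assign the registers in turn."""
    burst_register = bytes(mem[esp:esp + FRAME_BYTES])    # one read burst
    values = struct.unpack("<8I", burst_register)
    regs = {name: value for name, value in zip(FRAME_ORDER, values)
            if name != "esp_slot"}                        # saved ESP is skipped
    regs["esp"] = esp + FRAME_BYTES
    return regs


stack = bytearray(256)
before = {"eax": 1, "ecx": 2, "edx": 3, "ebx": 4,
          "esp": 128, "ebp": 5, "esi": 6, "edi": 7}
esp_after_push = pushad_burst(stack, before)
after = popad_burst(stack, esp_after_push)
```

In this model a push followed by a pop restores every register, including ESP, using one write burst and one read burst in place of eight individual stack accesses each.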
In addition to the embodiments above, the method of the present invention can be applied to other types of data transfer to produce many further embodiments, which are not described one by one here.