CN1838090A - Increasing data locality of recently accessed resource - Google Patents


Info

Publication number
CN1838090A
CN1838090A (application CNA2005101040168A / CN200510104016A)
Authority
CN
China
Prior art keywords
heap
accessed
program
bit
page
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CNA2005101040168A
Other languages
Chinese (zh)
Inventor
S. Bhansali
W.-K. Chen
X. Gao
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Microsoft Corp
Original Assignee
Microsoft Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Microsoft Corp filed Critical Microsoft Corp
Publication of CN1838090A publication Critical patent/CN1838090A/en
Pending legal-status Critical Current

Landscapes

  • Memory System Of A Hierarchy Structure (AREA)

Abstract

Applications written in modern garbage-collected languages like C# tend to have large dynamic working sets and poor data locality, and are therefore likely to spend excess time managing data movement between levels of the memory hierarchy. Instead, a low-overhead dynamic technique improves the data locality of applications. The technique monitors objects while the program runs and places recently accessed objects on the same page(s) of the heap. The resulting increase in page density is an effective way to reduce DTLB and/or data cache misses.

Description

Increasing the Data Locality of Recently Accessed Resources
Related Applications
This application claims the benefit of priority of U.S. Provisional Application No. 60/608,734 (applicant's attorney docket number 305135.02), filed September 10, 2004, the contents of which are incorporated herein by reference.
Technical field
The technical field relates generally to managing memory to increase the efficiency of data access and, more specifically, to monitoring and rearranging recently accessed objects on an automatically managed heap in order to improve data locality.
Copyright authorization
A portion of the disclosure of this patent document contains material that is subject to copyright protection. The copyright owner has no objection to facsimile reproduction by anyone of the patent document or the patent disclosure as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
Background
The growing disparity between processor and memory speeds is well known. Many applications are written in languages that execute in environments providing memory management techniques such as garbage collection. Such languages include, but are not limited to, C# and Java. Applications written in these languages often have large dynamic working sets of memory pages and poor data locality. Poor data locality can cause an application to perform badly and to fail to improve in step with increases in processor speed.
Larger and multi-level caches help hide memory latency to some extent. But cache memory is expensive and, because of that cost, on-chip caches (e.g., the L1 cache, the ITLB, and the DTLB) are unlikely to grow at the same rate as the working loads of modern applications. In addition, hardware prefetching can sometimes reduce memory latency, but prefetching irregular data accesses is difficult when serial dependences (e.g., pointer indirection) prevent prefetch addresses from being materialized in time.
There has therefore been ongoing interest in improving the data locality of applications through software techniques. Both static and dynamic techniques have been studied and reported in the recent literature. Static techniques rely on ahead-of-time program analysis, usually with profile data, to co-locate objects based on reference locality, or inject prefetch instructions at compile time to hide memory latency. The main advantage of these approaches is that they incur no run-time overhead; however, they may suffer from the usual limitations of static methods (e.g., difficulty handling dynamically loaded assemblies and classes, and the cost to a just-in-time compiler of performing whole-program analysis). Some garbage collection (GC) based systems use a copying mechanism to reorganize allocated objects at run time, regardless of whether those objects were recently accessed. But such GC is used mainly to reclaim memory, and better spatial locality is obtained only passively, as a side effect of compacting and reorganizing the heap for the primary purpose of reclaiming space.
Other GC-based approaches also use instrumentation to collect profile information at run time, but the profiling cost of these techniques is too high.
Summary of the Invention
The techniques described herein provide methods and systems that augment memory management, such as garbage collection, to increase data locality. The problems noted above are addressed, at least in part, by the systems and methods disclosed herein. In one example, a low-overhead technique collects heap access information and then uses it to guide heap reorganization, so that applications on a garbage-collected (GC) system obtain better data access locality. The profiling and heap reorganization concentrate on increasing page density, yielding a practical implementation that is low-cost yet still effective at reducing page faults and cache misses.
In one example, GC is used primarily and proactively to improve memory locality, rather than purely as a passive mechanism for reclaiming free memory as in the past. In such an example, a GC for locality can be invoked or triggered as soon as certain program behaviors or performance indicators are detected, even if space for new allocations still remains and a GC for space would therefore not be triggered. In such an example, triggering GCs for locality can substantially increase the number of garbage collections (for example, 50% more collections than would otherwise be required to reclaim freed memory), yet still achieve an overall speedup because of the improved locality.
In one example, the method is implemented in the Common Language Runtime (CLR) of Microsoft's .NET Framework. The CLR uses a just-in-time (JIT) compiler to translate MSIL (Microsoft Intermediate Language) binaries into native code, and uses a generational garbage collector to manage the heap. An illustrative method of improving data locality via garbage collection was evaluated with several exemplary applications written in C#; however, the method applies to applications written in any language that targets a copying GC-based system. Moreover, the described techniques do not require garbage collection.
In another example, a method monitors objects as they are accessed on the heap. In such an example, one or more bits are set (or counted) to indicate that an object has been accessed. The one or more bits can be within or near the object itself, or can be located elsewhere in memory. In another example, not all accesses are counted; instead, the method monitors accessed objects periodically, according to a sampling period. In such an example, the method also monitors program behavior to determine when to reorganize the heap. When the monitored program behavior so indicates, the method reorganizes the heap. In one example, the reorganization mechanism clusters recently accessed objects into the same region of the heap. In another example, the reorganization mechanism clusters objects accessed during a sampling period into the same region of the heap. In one example, a cluster of recently accessed objects is placed on one or more pages of the heap. In another example, the method clears the bits indicating that objects were accessed and returns to monitoring object accesses and program behavior.
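The sample-mark-reorganize loop just summarized can be sketched in Python. All names, the per-object access bit, and the tiny page size are illustrative assumptions; the patent does not prescribe an implementation.

```python
# Hypothetical sketch: mark objects accessed during a sampling period,
# then cluster the marked ("hot") objects onto the leading heap page(s).

PAGE_SIZE = 4  # objects per "page", kept tiny for illustration


class Obj:
    def __init__(self, name):
        self.name = name
        self.accessed = False  # the one-bit "recently accessed" record


def read(obj, sampling_on):
    """Instrumented read barrier: mark the object only while sampling is on."""
    if sampling_on:
        obj.accessed = True
    return obj.name


def reorganize(heap):
    """Cluster recently accessed objects at the front of the heap, clear bits."""
    hot = [o for o in heap if o.accessed]
    cold = [o for o in heap if not o.accessed]
    for o in hot:
        o.accessed = False  # reset for the next sampling period
    new_heap = hot + cold   # hot objects now share the first page(s)
    # The leading page(s) that contain the hot objects (a page may also hold
    # the first few cold objects if the hot set does not fill it).
    hot_pages = [new_heap[i:i + PAGE_SIZE] for i in range(0, len(hot), PAGE_SIZE)]
    return new_heap, hot_pages


heap = [Obj(f"o{i}") for i in range(8)]
for i in (1, 5, 6):  # during the sampling period, three objects are touched
    read(heap[i], sampling_on=True)
heap, hot_pages = reorganize(heap)
print([o.name for o in heap[:3]])  # → ['o1', 'o5', 'o6']
```

A real collector would move object bodies and fix up references; the sketch only shows the ordering decision.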
Additional features and advantages will be made apparent from the following detailed description, which proceeds with reference to the accompanying drawings.
Brief Description of the Drawings
Fig. 1 is a flowchart of an exemplary method of optimizing the data locality of a heap.
Fig. 2 is a block diagram of an exemplary system for optimizing the data locality of a heap.
Fig. 3 is a block diagram of an exemplary system with hierarchically managed memory providing varying access speeds across multiple levels.
Fig. 4 is a diagram of an exemplary method of creating executable code that performs the optimization.
Fig. 5 is an exemplary diagram of objects distributed across pages in memory, to be optimized during execution.
Fig. 6 is a diagram of an exemplary generational garbage collection method and system.
Fig. 7 is a block diagram of an exemplary system that uses garbage collection to optimize the data locality of a heap.
Fig. 8 is a block diagram of a distributed computer system implementing the described techniques.
Detailed Description
Exemplary Method of Optimizing Data Locality
Fig. 1 is a flowchart of an exemplary method 100 of optimizing the data locality of a heap. As shown, method 100 monitors accessed objects, monitors metrics that trigger optimization, and reorganizes accessed objects onto one or more heap pages.
At 102, the method monitors objects accessed on the heap. For example, the JIT compiler instruments operations that read or write data objects in the heap, and a record is kept indicating which objects have been accessed. In one example, the record is a bit in the object itself, set to indicate that the object was recently accessed. In another example, the record is a bit in a separate bit vector. In yet another example, the processor provides a mechanism or means for recording the addresses of all recently accessed objects (e.g., recording all addresses currently in the cache) or the addresses that caused page faults.
At 104, the method monitors metrics to determine when to perform the data locality optimization. For example, a performance metric can be an object allocation rate, a DTLB miss rate, a cache miss rate, a performance counter, an object access counter, or a timer. When a monitored metric indicates that it is time to reorganize the heap for locality, the method performs step 106.
At 106, the method first identifies recently accessed objects and then gathers them onto a group of contiguous pages in the heap. For example, the instrumented code sets a bit in an accessed object, or sets a bit in a bit table entry for the accessed object, and the method then clusters those objects together on the heap. In another example, an instruction provided by the processor indicates which object addresses on the heap were recently accessed. In another example, an operation provided by the processor indicates which address requests caused DTLB misses. In one example, the access bit or counter corresponding to an accessed object is reset immediately. In another example, instead of being reset at once, the access bit or counter is decayed gradually over time.
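The two reset policies mentioned in step 106 can be contrasted with a short sketch. The halving decay step and the function names are assumptions for illustration only.

```python
# Hypothetical sketch: immediate reset vs. gradual decay of per-object
# access counters between reorganizations.

def reset_immediately(counters):
    """Clear every counter right after the heap is reorganized."""
    return {obj: 0 for obj in counters}


def decay(counters, shift=1):
    """Halve every counter; objects not re-accessed fade out over a few intervals."""
    return {obj: c >> shift for obj, c in counters.items()}


counters = {"a": 7, "b": 2, "c": 0}
counters = decay(counters)
print(counters)  # → {'a': 3, 'b': 1, 'c': 0}
counters = decay(counters)
counters = decay(counters)
print(counters)  # → {'a': 0, 'b': 0, 'c': 0}
```

Decay keeps some memory of past hotness across intervals, while immediate reset treats every sampling period independently.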
At 108, once the heap has been optimized for data locality, the method returns to steps 102 and 104.
Exemplary System for Optimizing Data Locality
Fig. 2 is a block diagram of an exemplary system for optimizing the data locality of a heap.
Computer system 200 includes one or more processors 202; an on-chip cache 204; one or more running programs 208 that are monitored and optimized for data locality; a module 206 for instrumenting a program to improve its data locality; random-access memory (RAM) 210 containing a heap 216 of operational data pages (e.g., pages, segments, etc.) used by the running program 208; an off-chip cache 212; a disk drive or other storage 214; and a network connection 218. The processors execute programs 208, 206, and 220, which comprise instructions, data, and/or state. While the monitored program 208 executes, its data pages are brought into the heap 216 from storage 214 and/or the network 218 as needed.
In one example, the monitored program 208 is intermediate language (IL) code that is further compiled into native code before execution. In such an example, compiler 206 compiles and instruments the program. In another example, the program is already in native binary form, and that native binary is instrumented 206. In another example, the instrumentation 206 adds processor-supported instructions that identify accessed objects or addresses during execution.
The instrumented program 208 records which objects in heap 216 are accessed as program 208 executes. The program is also instrumented so that it monitors metrics and triggers the optimization module 220. Various metrics, such as TLB, DTLB, or cache misses, can be used to trigger optimization. Other possible metrics for triggering optimization are the memory allocation rate, object access counts, and other metrics discussed below. As the instrumented program runs and optimization is triggered, the optimization module 220 reorganizes at least one memory page (or segment) of heap 216. For example, the optimization mechanism places all accessed objects (e.g., hot objects) onto a single page (or group of pages) of the heap. The page (or group of pages) of hot objects is called the hot page of the heap.
Thus, the system instruments a program to monitor objects accessed during execution (access monitoring); monitors performance indicators that trigger heap optimization 220 (performance monitoring); and reorganizes accessed objects into a cluster in memory (e.g., reorganizing one or more sets of mutually close objects onto a single page of the heap, or onto a group of contiguous pages, etc.), thereby improving program performance through enhanced data locality (data locality optimization). Furthermore, once the data has been optimized for locality, the system resumes monitoring the optimized program. The system is therefore dynamic and ongoing.
Exemplary Memory Configuration
Fig. 3 is a block diagram of an exemplary system with hierarchically managed memory providing varying access speeds across multiple levels.
A modern computer 300 includes one or more central processing units (CPUs) 302 comprising one or more processors 304 and multiple levels of memory, including but not limited to on-chip caches 306, 308, 310; off-chip caches 312, 314; random-access memory (RAM) 316; disk storage 318; and many other forms of memory. The computer executes programs that have been processed into executable files. The processor fetches instructions from memory, decodes them, and executes the decoded instructions to perform various functions. To improve performance and speed, the computer uses multiple levels of memory to increase the likelihood that, when the next instruction or data item is needed, it is already available. For example, rather than going to RAM every time a resource is needed, the processor checks whether the data or instruction (resource) is in a cache. Obtaining a required resource from a cache is faster than obtaining it from RAM or disk, because it completes in fewer clock cycles. The described methods and systems improve computational performance by reducing the time required to retrieve resources.
CPU performance has consistently improved faster than memory performance. As a result, program speed has not kept pace with increases in processor speed. One solution is to build larger caches closer to the chip, but this is difficult because cache memory is expensive. The effective speed of processing is therefore dominated more by memory speed than by the speed of decoding and executing instructions. For example, a Pentium IV may be 3 or 4 times faster, yet applications fail to run 3 times faster, because the processor stalls waiting for data or instructions to arrive from memory. Time is also spent translating virtual addresses into physical memory addresses, which is done by a translation lookaside buffer (TLB) 306, 308. Many systems have an instruction translation lookaside buffer (ITLB) 306 and a data translation lookaside buffer (DTLB) 308. Because a program can have a virtual address space larger than the free space in RAM, parts of the program's execution state (e.g., code/data pages) are transferred between RAM 316 and storage 318 as needed.
The TLB translates the currently available contents of the virtual address space into their locations in physical memory. TLB caches are expensive hardware, so designers prefer to keep them small. When the program requests an address and the TLB determines that the address is not mapped in RAM 316, a page fault occurs. If the page containing the address resides in memory, the fault is called a soft page fault, and it takes hundreds of clock cycles to update the TLB entry for the address. If the page containing the address does not reside in memory, it must be brought into memory from storage. In this case the fault is called a hard page fault, and paging in from disk can take millions of instructions. When an address requested by the processor is available in both the TLB and the caches 310, 312, 314, translation is very fast. Memory management is therefore concerned with managing which pages (blocks, segments, etc.) are available, and with increasing the likelihood that a required resource is available to the processor within the fewest clock cycles.
Purely for relative timing comparison, it is instructive to consider example cache and memory speeds. If a resource requested by the processor is available in the level-one (L1) cache, it can be obtained in 1-3 clock cycles. If it is available in the L2 cache, it can be obtained in 10-50 clock cycles. A resource in the L3 cache can be obtained in roughly 20-100 clock cycles, a resource in memory in roughly 500 clock cycles, and a resource in storage only after a significantly longer time. Required resources are paged into memory from storage (e.g., 4K bytes at a time), and smaller units such as cache lines (e.g., 64-128 bytes) are brought into the cache on demand. Again, these examples do not govern this discussion or limit it to a context of present or future actual or relative timings; they are intended purely to provide context.
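Using the purely illustrative latencies above, a back-of-the-envelope effective access time can be computed. The hit rates below are invented for the example; they only show how improved locality shifts the average.

```python
# Back-of-the-envelope effective access time using the illustrative latencies
# above (L1: ~2 cycles, L2: ~30, L3: ~60, RAM: ~500). Hit-rate numbers are
# assumptions made up for this sketch.

def effective_cycles(hit_rates, latencies=(2, 30, 60, 500)):
    """hit_rates: fraction of accesses satisfied at each level; must sum to 1."""
    assert abs(sum(hit_rates) - 1.0) < 1e-9
    return sum(r * c for r, c in zip(hit_rates, latencies))


poor_locality = effective_cycles((0.80, 0.10, 0.05, 0.05))
good_locality = effective_cycles((0.95, 0.03, 0.015, 0.005))
print(round(poor_locality, 1))  # → 32.6
print(round(good_locality, 1))  # → 6.2
```

Even a modest shift of hits toward the L1 cache cuts the average access cost several-fold, which is why packing hot objects densely onto a few pages pays off.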
Thus, the system instruments a program to monitor objects accessed during execution (access monitoring); monitors performance indicators that trigger heap optimization 320 (performance monitoring); and reorganizes accessed objects onto a single page 322 in memory, thereby improving program performance through improved data locality (data locality optimization). Furthermore, once the data has been optimized for locality, the system resumes monitoring the optimized program. The system is therefore dynamic and ongoing.
Exemplary Compilation
Fig. 4 is a diagram of an exemplary method 400 of creating executable code. In one example, source code is compiled 404 as needed, distributed 406 in executable form (e.g., x86), and loaded into memory. For example, some binaries can run in any compatible environment; in such an example, a compatible binary is a portable executable (PE) that can run in any x86 environment. This model is found in programs written in higher-level programming languages such as C and C++.
In another example, source code 402 is compiled into intermediate-level code 408 (e.g., MSIL, Java, etc.), which can be further compiled 410 into native executable code 412 for the processor that will execute it (e.g., x86, x64, etc.). In such an example, a just-in-time (JIT) compiler translates the intermediate code into native code when execution is needed (e.g., when the code is loaded into memory from storage, the network, etc.).
Memory management that includes garbage collection is used in many computing environments, including those contemplated in Fig. 4, and a single computing environment can use any of these compilation methods. For example, existing binaries can be instrumented 414 to record which objects are accessed during execution, or the instrumentation can be performed during JIT compilation 416. Interestingly, the locality optimization described herein is often particularly useful with JIT compilation, because such languages typically rely on memory management to determine which objects a program no longer references (e.g., via reachability analysis, reference counting, etc.).
Some higher-level languages require the programmer to explicitly allocate and deallocate the memory the program expects to need. Other languages or runtimes allow the programmer to rely on garbage collection for deallocation. Garbage collection involves identifying which objects in the heap are still in use and discarding any object that is no longer referenced.
In the past, garbage collection was invoked only when there was not enough space to satisfy a new allocation request; the space occupied by objects determined to be no longer referenced was freed and made available for new objects. GC was therefore viewed mainly as a method of reclaiming memory on demand. For example, when a program needed more memory than was currently available, the heap was traversed to find which memory on the heap was still reachable. Reachable memory was gathered onto pages, and the freed space was made available for the resources requested by the processor. Thus, GC was triggered only when a new allocation request could not be satisfied.
In the present techniques, GC is viewed as a method of improving data locality and thereby improving performance. In such an example, GC is regarded first as a method of improving locality for performance purposes, and only second as a method of freeing memory on demand. Although there is overhead in tracking hot objects and arranging accessed objects together, GC still typically yields a net performance improvement. In one example, even when performing 50% more GCs than would be performed purely to reclaim space, net performance is still better. Because hot objects are gathered onto pages, the likelihood increases that a needed object has already been moved into the cached portion of a hot page. Clustering objects onto a page or group of pages therefore improves spatial locality not only on the heap, but also in the cache holding portions of those pages. A side effect of assembling pages of hot objects is increased cache utilization.
Interestingly, setting a single bit per accessed object is sufficient to track accessed objects. Counting accesses per object with several bits, however, indicates which objects are accessed most frequently. This allows finer granularity in deciding where objects are clustered onto pages.
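The multi-bit variant can be sketched as a small saturating counter per object; the 2-bit width and the function names are assumptions for illustration.

```python
# Hypothetical sketch: a 2-bit saturating counter per object distinguishes
# the most frequently accessed objects from merely-touched ones.

MAX_COUNT = 3  # 2-bit saturating counter


def touch(counters, obj):
    """Record one access, saturating at MAX_COUNT."""
    counters[obj] = min(counters.get(obj, 0) + 1, MAX_COUNT)


def hottest_first(counters):
    """Order objects by access count so the hottest land on the first page(s)."""
    return sorted(counters, key=lambda o: counters[o], reverse=True)


counters = {}
for obj in ["a", "b", "a", "c", "a", "a", "b"]:
    touch(counters, obj)
print(counters)                 # → {'a': 3, 'b': 2, 'c': 1}
print(hottest_first(counters))  # → ['a', 'b', 'c']
```

A one-bit scheme would report only the set {a, b, c}; the counters additionally rank them, so the hottest objects can be packed most densely.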
Exemplary Locality
Applications often exhibit temporal locality when referencing memory: if an object was accessed recently, it is likely to be accessed again soon. In addition, an application that accesses objects near the objects it accessed recently is said to have good spatial locality. Poor spatial locality can spread object accesses across all of memory, causing inefficient use of the TLB, caches, and RAM, and poor execution performance. By monitoring which objects were accessed most recently and invoking the optimization method to co-locate those hot objects, spatial locality is likely to improve in later stages of execution. Because the set of hot objects may change as the program runs, continued monitoring and continuous or periodic optimization are important.
This memory optimization technique need not be invoked during garbage collection, or from the garbage collector at all. In one embodiment, locality optimization operates independently of garbage collection. In such an example, memory pressure causes the existing garbage collection method to free memory, but, as described herein, locality optimization can be invoked whenever it is needed.
In another example, the locality optimization method is provided as an additional optimization within garbage collection. This is convenient because the existing methods and data structures for managing the heap can be reused to support locality optimization. For example, during garbage collection, when live objects are identified, objects that are both live and hot can also be identified and gathered onto hot pages.
Fig. 5 is a schematic diagram of exemplary objects spread across pages in memory. In this example, each cell 502 represents a page of heap 500, and the objects inside the cells represent live objects 504 or discarded (e.g., freed or no-longer-referenced) objects 506. The heap typically occupies a portion of RAM. At any given time, depending on the size of the available caches, some portions of the heap are available in faster cache memory. The TLB maps the virtual address space to physical addresses and indicates which portions of the virtual address space can be quickly translated into physical addresses in RAM.
As time passes, live objects become spread across the whole heap, and with dead objects interleaved among the live ones, memory becomes increasingly fragmented, which makes matters worse. This causes poor spatial locality. Side effects of poor data locality include the processor spending more time paging needed blocks of memory into the cache, and updating DTLB entries when accessing addresses not present in the DTLB.
Eventually, when the heap has filled with objects and more object allocations are requested, the memory-pressure-based garbage collection method is triggered.
In one example, to overcome this poor spatial locality, a bit 508 is set in each accessed object. When locality optimization is triggered, the objects with set bits are gathered onto a separate group of contiguous pages in memory. In another example, a hot-bit table outside the objects indicates which objects were accessed. Either way, this hotness data indicates which objects are to be placed on the hot pages.
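The out-of-object variant can be sketched as a bit vector held outside the heap, indexed by an object's slot. The slot granularity and class layout are illustrative assumptions.

```python
# Hypothetical sketch: a hot-bit table kept outside the objects. One bit per
# object slot records whether the slot was accessed since the last clear.

class HotTable:
    def __init__(self, num_slots):
        self.bits = bytearray((num_slots + 7) // 8)

    def mark(self, slot):
        """Set the hot bit for an accessed object slot."""
        self.bits[slot // 8] |= 1 << (slot % 8)

    def is_hot(self, slot):
        return bool(self.bits[slot // 8] & (1 << (slot % 8)))

    def clear(self):
        """Reset all hot bits after a reorganization."""
        for i in range(len(self.bits)):
            self.bits[i] = 0


table = HotTable(16)
for slot in (3, 9):
    table.mark(slot)
print([s for s in range(16) if table.is_hot(s)])  # → [3, 9]
table.clear()
print(any(table.is_hot(s) for s in range(16)))    # → False
```

Keeping the bits outside the objects avoids dirtying object pages just to record accesses, at the cost of an extra table lookup.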
Exemplary Locality Optimization Supported by Garbage Collection
Fig. 6 is a diagram of an exemplary generational garbage collection method and system. The heap 600 is logically divided into generations (e.g., G0, G1, and G2). The generations are regarded as portions of the heap; a heap is commonly divided logically into three generations. For example, there is a youngest generation 602 with free space, holding the most recently allocated objects; an older generation of objects 604; and an oldest generation of objects 606. Typically, objects are considered according to the time since their original allocation, with the newest objects and the available free space regarded logically as the first generation 602. The logical view of the heap is kept in data structures indicating the generation boundaries 608, along with an indication of where the free space begins.
As free space shrinks due to new allocations, garbage collection is triggered based on memory pressure (e.g., an actual or anticipated demand for more memory), for example when an object allocation is requested and available memory is too small, or when available memory falls below some preferred or desired threshold.
During a memory-pressure-induced garbage collection, live objects are identified; garbage (e.g., objects that are no longer reachable, objects that are no longer referenced, etc.) is collected and removed from the heap; and, as memory is freed, surviving objects are moved logically toward older generations, with the size of each generation adjusted accordingly.
For example, several garbage collections of the first generation 602 may be performed. When collecting the youngest generation no longer yields free memory above a desired threshold, an older generation 604 can be collected. For example, when a triggered collection of the first generation no longer produces enough free memory, it triggers a collection of the second generation. Several second-generation collections may occur before any third-generation collection takes place. Once first- and second-generation collections fail to reclaim enough free memory, a third-generation collection is triggered. Over time, dead objects are removed and live objects are compacted.
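The escalation policy just described can be sketched as follows. The per-generation freed amounts and the threshold are invented numbers; the function is a simplification of how a generational collector decides how far to escalate.

```python
# Hypothetical sketch: collect the youngest generation first, escalating to
# older generations only while the reclaimed memory stays below a threshold.

def collect(generations, freed_by_gen, threshold):
    """Collect gen 0, then 1, then 2, stopping once enough memory is freed.
    Returns (generations actually collected, total memory freed)."""
    freed, collected = 0, []
    for gen in generations:
        freed += freed_by_gen[gen]
        collected.append(gen)
        if freed >= threshold:
            break
    return collected, freed


# Young-generation collection alone frees too little, so gen 1 is also collected.
print(collect([0, 1, 2], {0: 10, 1: 50, 2: 200}, threshold=40))  # → ([0, 1], 60)
# A later cycle where collecting the nursery alone suffices.
print(collect([0, 1, 2], {0: 64, 1: 50, 2: 200}, threshold=40))  # → ([0], 64)
```

The point of the policy is that most collections touch only the cheap young generation, and full collections remain rare.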
There are various garbage collection methods, such as "mark and sweep," in which dead objects are placed on a free list, and "mark and compact," in which the live objects in the heap are compacted together. Upon reading this specification, those skilled in the art can adapt or enhance any of these garbage collection methods (and their variants, combinations, and improvements) to support the locality optimization described herein.
One side effect of these existing garbage collection techniques (for example, generational garbage collection) is slightly improved data locality. For example, merely removing dead objects from the heap — regardless of which generation they are removed from — increases the likelihood that each page provides better spatial locality. For purposes of illustration only, this discussion continues with generational copying garbage collection.
The net effect of generational copying garbage collection is that objects are retained on the heap in approximately allocation order, with further logical partitioning by generation; this can be adjusted over time in various ways. Dead objects are removed, and the live objects behind them are moved up, so allocation order is preserved. This rests on the theory that allocation order provides the best locality, but that theory does not always hold.
For example, Chilimbi et al., "Using Generational Garbage Collection to Implement Cache-Conscious Data Placement," October 1998 (Chilimbi), focuses on cache-friendly methods of laying out data with generational garbage collection to improve program efficiency. For example, Chilimbi monitors the sequence in which objects are accessed to decide in what order objects should be placed together. This concept requires monitoring the order in which objects are accessed and attempting to rearrange the objects in that order. Gathering all this information and analyzing it at runtime usually incurs too much overhead.
In contrast, the described optimization mechanism monitors which objects are accessed between two optimizations or during a time interval, and when an optimization is triggered, groups those objects onto one or more hot pages on the heap. Interestingly, although no profile data is collected with the direct aim of improving cache utilization, one side effect of placing hot objects on the same page or pages of the heap as described herein is that the cache also ends up being used more intelligently.
Generational copying garbage collection is therefore an interesting environment in which to adapt the optimizations described herein. It traverses every object to identify all live objects and compacts them in allocation order; this traversal can be leveraged to identify hot objects and, optionally, group them together in any desired order. Thus, these optimizations are not constrained by generational copying garbage collection but are simply supported by it.
Exemplary low-overhead profiling
In one example, the JIT compiler is adapted to instrument operations that access the heap. The instrumented code monitors when objects on the heap are accessed. In one example, a memory area is introduced (for example, appended) inside each object to indicate whether the object has been accessed. In another example, a known unused memory area inside each object (for example, a bit, or several bits) is marked to indicate that the object has been accessed. For example, an object header has free space that can be used for various purposes. A bit is set or, if several bits are used, each access is counted. Between two optimizations, or during a given time interval, accessed objects are assembled onto hot pages, and the access bits (or access counters) are cleared so that the next time hot objects are collected they record whether they have been accessed again (or how many times since being reset to 0). Setting a hot-object indication bit in the header is optional, however; the bit (or bits) can be placed anywhere in the object or elsewhere. Note that no ordering information needs to be recorded here.
For example, a hot-object bit table can hold one bit per object. This requires more memory, but may be preferable in many examples because, after hot objects have been identified, the bits are easier to clear.
In another example, a bit vector is created that is a scaled-down version of the heap. In such an example, each bit in the bit vector corresponds to an address or region of the heap, and a bit in the vector is set to indicate that a certain address or region of the heap has been accessed. The bit vector can simply be reset, for example, between every two hot-page optimizations, or when a garbage collection is performed due to memory pressure.
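The bit-vector variant described above can be sketched as follows: one bit covers a fixed-size heap region, the bit is set on access, and the whole vector is cleared wholesale between optimizations. The region granularity and addresses are illustrative assumptions.

```python
REGION = 256  # bytes of heap covered by one bit (assumed granularity)

class AccessBitVector:
    """Scaled-down shadow of the heap: bit i covers heap region i."""

    def __init__(self, heap_base, heap_size):
        self.base = heap_base
        self.bits = bytearray((heap_size // REGION + 7) // 8)

    def note_access(self, addr):
        i = (addr - self.base) // REGION
        self.bits[i // 8] |= 1 << (i % 8)

    def is_hot(self, addr):
        i = (addr - self.base) // REGION
        return bool(self.bits[i // 8] & (1 << (i % 8)))

    def reset(self):
        # cheap wholesale clear between two hot-page optimizations,
        # or after a memory-pressure GC
        for i in range(len(self.bits)):
            self.bits[i] = 0
```

Because a bit stands for a whole region rather than a single object, the shadow stays small and clearing it does not require traversing live objects.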
As the program evolves over time, the hot objects evolve, and the hot pages evolve with them. The low-overhead profiling described herein allows the hot pages to change dynamically to improve program performance.
Exemplary assembled object pages
As discussed later, one or more metrics, alone or in combination, can be used to decide when to trigger a locality optimization (for example, allocation rate, performance metrics, and the like). Once a data-locality optimization is triggered, live objects whose corresponding bits indicate they are hot are identified and placed on a hot page. In another example, there are multiple pages of hot objects. In one implementation, all hot objects are first copied out of the heap into a temporary buffer, a conventional compacting garbage collection is performed as usual, and the set of hot objects is then placed at the newer end of the heap, so that newly allocated objects will be placed near these recently live hot objects. This implementation greatly improves data locality at very low overhead.
Exemplary optimization of the assembled object pages
In another example, when hot objects are encountered during a locality-purpose GC (for example, as shown by set bits), those hot objects are also evaluated to see which other hot objects they point to. In such an example, when a hot object points to another hot object, not only are both hot objects placed on a hot page, they are also placed close to each other. This provides a low-overhead way of increasing the likelihood that, when part of a hot page is brought into the cache, objects brought into the cache together reference one another. This often improves effective performance.
Exemplary programs with very poor data locality
Table A below shows the page density of four test applications written in C#. These numbers were obtained by running the applications under a dynamic translator and recording memory reads and writes; they do not include references to stack pages. The density metric equals the number of unique bytes read or written on a page divided by the page size, averaged over intervals of 10^6 references. Table A shows that the data pages are used very inefficiently, which usually implies very poor spatial locality.
Table A
Application (C#)       Pages accessed per interval   Average page density
Test application 1     600                           7.7%
Test application 2     588                           6.5%
Xaml parser test       602                           6.0%
SAT solver             1703                          28%
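The density metric of Table A can be computed as sketched below: the unique bytes touched on each page during one interval of references, divided by the page size. The page size and trace format are assumptions for illustration.

```python
PAGE_SIZE = 4096  # typical page size assumed in the discussion

def page_density(accesses):
    """accesses: iterable of (address, length) reads/writes in one interval.
    Returns {page_number: fraction of that page's bytes touched}."""
    touched = {}  # page -> set of distinct byte offsets touched
    for addr, length in accesses:
        for b in range(addr, addr + length):
            touched.setdefault(b // PAGE_SIZE, set()).add(b % PAGE_SIZE)
    return {page: len(offs) / PAGE_SIZE for page, offs in touched.items()}
```

A page on which only a few scattered fields are read scores a low density, exactly the situation Table A reports for the C# test applications.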
Exemplary system
In one example, a generational copying garbage collector is optimized to improve data locality. For example, the method can be used in a virtual machine whose memory management includes garbage collection. In this example, the system is agnostic to most other details of the virtual machine.
Fig. 7 shows an architectural overview 700 of one possible embodiment of the system. A just-in-time (JIT) compiler 702 is configured to take an intermediate-language representation 704 (for example, MSIL) and compile it into machine code 706 for a particular architecture. The JIT compiler can be modified to insert lightweight instrumentation into the compiled code. The instrumented code 708 marks recently accessed objects. Monitoring code can be inserted into the runtime (for example, the Common Language Runtime (CLR) or a Java runtime) to collect metrics while the application runs. The monitoring code can use the monitored data and heuristics to trigger a locality-purpose GC. During a locality-purpose GC, objects marked as recently accessed (hot) can be identified and co-located on one or more pages separate from the rest of the heap. A locality-purpose GC can be triggered independently of the conventional GC triggered when memory pressure is detected.
Exemplary page optimization versus cache optimization
When arranging data for locality, there are two choices: optimize for page locality, or optimize for cache locality.
In one example, increasing the page density of data pages is favored. For example, the cost of collecting profile information for page optimization may be lower. Because a page (typically 4 kilobytes) is orders of magnitude larger than a cache line (typically 64-128 bytes), effective clustering does not require the precise temporal order of data accesses. Because the "collector" granularity is larger, data can be clustered more loosely. Note that merely increasing page density also increases the chance of better cache utilization (by removing intervening cold objects). For many programs, this yields significant cache benefits as a free side effect of page optimization.
Moreover, the cost of a page fault or TLB miss is usually much higher than the cost of an L2 cache miss, so the potential savings of page optimization are much larger than the potential savings of cache optimization. Of course, this cuts both ways: a single hot object left on a cold page causes that page to be faulted in, wiping out much of the optimization's benefit. It may therefore be useful to ensure good coverage of the hot data set.
In some cases, the L2 cache is indexed by physical memory address rather than by virtual address (true, for example, on all x86 architectures). Thus, if a page's entry is missing from the TLB, having the data in the L2 cache may be of little help.
Exemplary instrumentation model
To increase page density rather than cache utilization, the precise temporal relationship between pairs of data elements is not needed. Instead, in some embodiments, it is enough to record only which objects are frequently accessed between two triggered dynamic optimizations or during a time interval. These accessed objects (for example, objects, methods, procedures, data structures, and so on) are considered hot. During an optimization, the hot objects are gathered onto a page (or group of pages) in some portion of the heap.
In one example, counters are used to determine which objects are hot. In another embodiment, a compiler (or JIT compiler) inserts read barriers at key instructions that access heap data. In such an example, the read-barrier code may comprise a call to a helper routine that updates a counter if a single-bit counter is not already set. The write barriers generated automatically by the compiler to support generational GC can be modified to insert conditional counter updates.
In such an example, the system comprises an implementation of the counters, an implementation of the read barrier, and object instrumentation (for example, of operations that read/write the heap), so that reads and/or writes are counted.
The object access counter can be implemented in different ways. For example, the counter can be embedded in the object. In another example, it can be implemented as a separate table. In another example, a 1-bit counter embedded in the object is used: if an object is accessed, a bit is set to reflect that the object was accessed at least once during the interval. In another example, several bits corresponding to an object can serve as a counter reflecting how many times the object was accessed during the interval; each access increments the counter corresponding to the accessed object, and a threshold number of accesses determines whether the object is hot. One or more intervals of accesses can be used to identify hot objects. Hot objects can also include objects initialized (created) during the interval. The bit or bits corresponding to an object can be stored in the object itself or elsewhere. If a counter corresponding to an object lies outside the object, it is preferably located somewhere quickly accessible, because the overhead of counting should be minimized.
In another example, the counter (one or more bits) is stored in the object itself. For example, in the CLR each object has a 4-byte object header that can be used for various purposes (for example, implementing lightweight locks). In some embodiments, one or more of these 32 bits may be used as the counter (or as the hot bit).
Table B is example code for one version of the read-barrier code. For example, profiling code of this kind can be used to mark accessed objects. In this example, rg is the register holding the object's address, and the object header is at offset -4 from the start of the object. OBJECT_ACCESSED_BIT is a bitmask used to set a single bit in the object header.
Table B
        test    dword ptr [rg-4], OBJECT_ACCESSED_BIT
        jnz     Bit_set
        lock or dword ptr [rg-4], OBJECT_ACCESSED_BIT   ; atomic update
Bit_set:
        ret
In such an example, the bit is set with an interlocked operation, because the object header may be modified concurrently by other threads (for example, when locking the object). Interlocked operations can be expensive on x86 architectures (for example, 20-30 clock cycles). In addition, they may dirty the cache line during what is logically a read operation, which can hurt the scalability of the application on multiprocessors. Therefore, in another example, the unconditional read barrier is replaced with a conditional read barrier, even though the conditional barrier increases the size of the read-barrier code. In another example, to reduce the growth in code size, the read barrier is not inlined; instead, the read-barrier code is implemented as helper routines (for example, one per register).
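The conditional barrier of Table B can be rendered in Python as follows: test the header bit first and take the expensive atomic update only when the bit is still clear, so repeated accesses to an already-marked object pay almost nothing. The lock object stands in for the hardware interlocked operation, and the operation counter is purely for illustration.

```python
import threading

OBJECT_ACCESSED_BIT = 0x1

class Header:
    """Stand-in for an object header word shared between threads."""

    def __init__(self):
        self.word = 0
        self._lock = threading.Lock()  # models the x86 `lock or`
        self.atomic_ops = 0            # how many interlocked updates ran

def read_barrier(header):
    if header.word & OBJECT_ACCESSED_BIT:   # cheap test, no atomic op
        return
    with header._lock:                      # atomic update, taken rarely
        header.word |= OBJECT_ACCESSED_BIT
        header.atomic_ops += 1
```

The design choice mirrors the text: the fast path is a plain read, and the interlocked write (the part that dirties the cache line) executes at most a handful of times per object per interval.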
In another example, an optimization algorithm reduces the number of read barriers, improving performance and/or reducing the amount of added code. In one example, the read barrier used here differs from a conventional access barrier in that a call to it is not inserted at every access point. For example, common subexpression elimination (CSE) can be applied to calls to the read-barrier code. In another example, because exceptions are rare, profiling calls are not inserted into exception-handling code. Similarly, another example ignores non-inlined constructors.
Also consider when it is desirable to reset the counters (or hot bits). In one example, when the counter is embedded in the object, all live objects cannot be scanned cheaply just to clear the counter bits. In such an example, the counters are cleared during garbage collection (GC), as objects are traversed. In another example, a counter is cleared each time a hot object is encountered during a locality-purpose GC.
In one generational copying garbage collection example, clearing counters during a locality-purpose GC works well for objects in the younger generations, which are collected more frequently and more cheaply. Because the older generations are collected less often, their access bits may grow stale over time. Therefore, in an alternative example, it is preferable to provide a way to clear counters without traversing the reachability graph or the global stack. For example, a card table (corresponding to objects) can be used that allows counters to be cleared without relying on a full traversal of the reachability graph or the global stack. In another example, a hot-bit table corresponding to pages and/or the heap, or a dedicated field, helps reduce the time needed to clear the counters/bits.
Exemplary sampling
In one example, the instrumentation model described above has low enough overhead to speed up overall performance. In some situations, however, dynamic heap reorganization may not improve application performance (for example, if the data set is small enough to fit in available memory). For such applications, the cost of the instrumented code (for example, the read barriers) may be too high (degrading the application by as much as 40%).
To further reduce the instrumentation overhead, one example profiles code only intermittently. For example, if a method is instrumented with profiling read barriers, a second copy of the method without instrumentation can be generated. The instrumented version is used during profiling (that is, monitoring, sampling, and so on); during normal operation, the uninstrumented method is used. The prolog of each method is extended with a check that transfers control to the instrumented or uninstrumented version. In some embodiments, back edges are not modified. Surprisingly, this simplification may not reduce the effectiveness of the method (for example, on the benchmarks below, except for some synthetic methods with long-running hot loops). As a further optimization, the two copies can be placed in two separate code heaps.
Several parameters control the sampling: for example, how frequently the instrumented version is used and, once sampling begins, how long the instrumented version runs. By tuning these two parameters, useful profile information can be obtained at reasonably low profiling overhead.
In one example, the conventional version of the code runs normally, and the instrumented version runs only periodically and briefly. For example, the instrumented code runs for 10 milliseconds out of every 10,000 milliseconds. This produces information about which objects were accessed during the periodic samples, and that information is used to assemble the hot pages.
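The prolog check behind this schedule can be sketched as a duty-cycle test: run the instrumented copy for 10 ms out of every 10,000 ms and the plain copy otherwise. The two figures come from the example above; the dispatch function itself is an assumption.

```python
SAMPLE_PERIOD_MS = 10_000   # one sampling period (from the example above)
SAMPLE_LEN_MS = 10          # instrumented window within each period

def use_instrumented(now_ms):
    """Method-prolog check: which copy of the method should run now?"""
    return now_ms % SAMPLE_PERIOD_MS < SAMPLE_LEN_MS
```

At this setting the instrumented code runs for only 0.1% of elapsed time, which is how the approach keeps the read-barrier cost from degrading the application.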
Exemplary heap reorganization
The CLR GC implements a variant of a generational mark-compact garbage collector, with the small-object heap divided into three generations. In one example, locality-purpose heap reorganization can be limited to generations greater than 0. One reason not to reorganize the heap during a generation-0 collection is that most objects in generation 0 have just been allocated. Because they were allocated recently, they are likely still in the cache or the working set, and therefore unlikely to gain much from locality improvement. In one embodiment, during GC the system identifies all objects that (a) have been marked hot since the last locality collection and (b) belong to a generation at or below the generation currently being collected. In this example, only these objects are candidates for locality optimization. After all candidate objects are identified, the locality optimization can decide how to lay them out and where in the GC heap the hot objects should be put back.
In one example, the layout of hot objects is accomplished with two copying phases. First, hot objects are copied from the heap into a temporary buffer in hierarchical decomposition order (for example, if a hot object contains pointers to other hot objects, those hot objects are grouped together), so that some cache-locality benefit is obtained along with the page-locality benefit, at no extra overhead. The original locations are marked free and reclaimed by the collector. Second, the rearranged set of hot objects is placed back at the newer end of the heap. In another example, the double copy is avoided (for example, by reserving a dedicated portion of the heap). In another example, the layout mechanism does not mix objects from different generations.
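The two-phase layout can be sketched on a flat-list heap model: phase 1 copies hot objects into a temporary buffer, grouping hot referents immediately after their referrer (the hierarchical decomposition order mentioned above); phase 2 appends the buffer at the newer end of the heap. The list-of-ids heap and the function names are simplifying assumptions.

```python
def reorganize(heap, hot, refs):
    """heap: object ids, oldest first; hot: set of hot ids;
    refs: id -> list of ids that object points to.
    Returns the reorganized heap with hot objects at the newer end."""
    buffer, seen = [], set()

    def copy_out(obj):                 # phase 1: heap -> temporary buffer
        if obj in seen:
            return
        seen.add(obj)
        buffer.append(obj)
        for tgt in refs.get(obj, []):
            if tgt in hot:
                copy_out(tgt)          # group hot referents with referrer

    for obj in heap:
        if obj in hot:
            copy_out(obj)
    survivors = [o for o in heap if o not in seen]  # old slots reclaimed
    return survivors + buffer          # phase 2: hot set at the newer end
```

Placing the buffer at the newer end matches the rationale given next: space is available there, no extra promotion occurs, and new allocations land near the hot set.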
Placing the set of hot objects at the newer end of the heap has several potential benefits. For example, there is likely enough space there to hold the hot-object set. In addition, it is preferable not to promote objects further, because collecting an older generation costs more than collecting a younger one. Finally, some long-lived objects tend to die soon after heavy reuse, and demoting them can speed up reclaiming the space they occupy. Depending on the embodiment, demoting many objects may be undesirable; however, selectively demoting only the hot objects (which comprise a small fraction of the heap) may still be acceptable. In addition, it is important not to create too many cross-generation pointers.
Exemplary optimization trigger policies
Another consideration is deciding when to trigger the locality-purpose optimization — and also detecting when it is ineffective, so that it is not continued in vain once its net performance benefit diminishes.
There are several possibilities, or combinations thereof, for deciding when to trigger the locality-purpose optimization. In one example, hardware performance counters are monitored to determine the DTLB and L2 cache miss rates; when the miss rates rise, the locality-purpose optimization is triggered.
In another example, the rate of memory-pressure GC collections is monitored. In such an example, the hot-object locality optimization is performed during every second, third, ..., Nth memory-triggered collection. In some examples, it is advantageous to perform the locality optimization on every memory-pressure-triggered GC.
In another example, the object allocation rate is monitored. When the allocation of new objects drops significantly, the application is assumed to be reusing objects rather than allocating new ones. Once the allocation rate drops and becomes relatively stable, the locality-purpose optimization is triggered.
In another example, it is advantageous to examine the access counts in the objects. A very high access count indicates that the same object is being accessed repeatedly, and optimization is likely to prove beneficial. But if the access counts are very low, optimization is unlikely to prove beneficial, because the processor is not requesting the same objects repeatedly.
In another example, the allocation rate and the performance counters are monitored together. While the allocation rate is rising and high, no optimization is performed. But once the allocation rate falls and levels off, the program's data structures have probably been built, and objects are more likely to be accessed repeatedly, creating an opportunity to benefit from locality optimization. Moreover, once the allocation rate is low and relatively stable, a high DTLB or L2 cache miss rate, as indicated by the performance counters, likely provides a good optimization trigger: since there is little new allocation, the DTLB and L2 misses indicate that blocks are being brought into RAM and cache and accessed rapidly, and clustering the hot objects will likely reduce overall DTLB and L2 misses.
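The combined heuristic just described can be sketched as a predicate over recent measurements: trigger a locality GC only once the allocation rate is low and stable and the miss rate is high. All thresholds below are invented for illustration, not taken from the patent.

```python
def should_trigger(alloc_rates, miss_rate,
                   alloc_cap=100, stability=0.1, miss_floor=0.05):
    """alloc_rates: recent per-interval allocation counts (newest last).
    miss_rate: current DTLB/L2 miss rate from performance counters.
    Trigger only when allocation is low AND stable AND misses are high."""
    recent = alloc_rates[-3:]
    if len(recent) < 3:
        return False                       # not enough history yet
    low = max(recent) <= alloc_cap
    spread = max(recent) - min(recent)
    stable = spread <= stability * max(max(recent), 1)
    return low and stable and miss_rate >= miss_floor
```

Note how the predicate encodes the text's reasoning: a high or fluctuating allocation rate vetoes the trigger even when misses are high, because data structures are still being built.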
In another example, the data-locality optimization can be triggered at regular or variable time intervals. In another example, if the previous optimization produced very little performance improvement (for example, the same DTLB miss rate after the optimization, and so on) and the interval between two optimizations is too small, the interval before the next optimization can be increased. If the last optimization improved the DTLB miss rate beyond some threshold and the interval is too large, the next optimization can be triggered early.
Combinations of these heuristics are also possible. One drawback of the performance-counter approach is that the counters are usually not virtualized per process (that is, they count globally), so the numbers may be distorted by other applications running on the system. On some chips, however, they do have the advantage of zero extra cost: they count in parallel with processing. In some embodiments, the allocation rate may be a fairly reliable metric for triggering locality optimization.
In one example, it is advantageous to monitor the benefit of performing the optimizations. For example, performance counters are used to measure the data TLB and L2 cache miss rates after an optimization. If, relative to some baseline, the data TLB or cache miss rates do not improve, it may be desirable to stop or reduce the triggering of locality-purpose GCs, or to do so until results show for some period that the benefit justifies the cost.
Thus, anything that directly or indirectly triggers reorganization of the heap to cluster accessed objects is considered program behavior. When a monitored program behavior so indicates, the heap is reorganized (for example, clustered). One or more monitored program-behavior indicators can trigger reorganization of the heap's live hot objects.
Exemplary increased GC rate
Much of the literature on triggering GC focuses on reducing the number of expensive GC collections, because GC is triggered by memory pressure. In contrast, the described methods and systems indicate when more GC collections improve locality: even as the number of GC collections grows, overall performance can still improve. In one example (the Xaml parser test described below (XamlParserTest)), GC collections increased by as much as 50% while overall performance still improved.
Exemplary performance counters
Some processors provide various versions of performance counters. In one example, the performance counters provide a way to determine which objects were accessed (for example, on IA64 processors). In this example, instrumentation code may not be needed. The performance counters indicate which objects (for example, which addresses in the heap) were accessed, periodically or since a counter reset. This would be advantageous, because operations accessing the heap would not need to be instrumented to identify hot objects. For example, the addresses of TLB misses can be provided this way.
In addition, instructions are rumored to be forthcoming that record whenever an object in memory is accessed (for example, read or written). These addresses or objects would be recorded into an internal circular buffer and made available.
These accessed objects provide the information needed to create hot pages.
Exemplary clustering
Objects that have been allocated are "live," but only recently accessed live objects are "hot." Previously, when garbage collection was performed, hot objects were unknown or not considered. During garbage collection, live objects are compacted and dead objects (for example, de-allocated or freed objects) are collected to provide new free space. With the described techniques, live hot objects are clustered on the heap even when no garbage collection is otherwise required. Interestingly, not all accessed objects need to be placed in the same region of the heap for the technique to provide substantial benefit. For example, if a sampling period is used to monitor accessed objects and set access bits, then only objects accessed during that sampling period will be placed in a hot cluster on the heap. Thus, one or more clusters on the heap that do not include all recently accessed objects still fall within the scope of this technique. Further, even when object-access monitoring is continuous (as opposed to periodic sampling, or otherwise), not all accessed objects need to be placed in the same cluster to obtain significant benefit. For example, if two clusters are created anywhere in the heap, letting hot objects reside in those two (or even several) locations, rather than spread across all pages of the heap, will still provide significant benefit. It is therefore contemplated that the heap's live hot objects are clustered into one or more locations with a noticeably higher concentration of hot live objects. Of course, if the hot objects are placed in a single contiguous cluster, even better results are likely. For example, a cluster that fits within a page of the heap may be considered desirable, but it is certainly not essential. In another example, if the working set of hot objects is large enough to occupy a plurality of pages on the heap, it may be beneficial for those pages to be placed close to one another, though this is optional. In another example, it is helpful if the cache is large enough to hold a cluster of hot objects. In such an example, if the working set of hot objects occupies several clusters, then whenever one of those clusters is brought into the cache, there is a probability that another hot object in that cluster is about to be accessed. Moreover, swapping such high-probability clusters into or out of the cache will likely require few swaps, regardless of where in the heap each individual hot cluster resides. All of these ideas contemplate clustering recently accessed objects. Finally, clustering near recently allocated objects provides an extra benefit. For example, in a generational garbage-collected heap, clustering near the youngest generation of the heap is helpful, because recently allocated objects are generally hotter than older objects.
Exemplary experimental results
In one example, experimental results were observed. A prototype triggered hot-object reorganization of the GC heap, and the reorganization yielded favorable performance results. The prototype used was the workstation version of a commercial CLR implementation, with concurrent GC disabled, on the Windows XP operating system. Experiments were run on several machines with different memory configurations, cache sizes, and CPU speeds. Unsurprisingly, the locality-purpose GC improved performance most on the machines with smaller L2 caches and less memory.
Four micro-benchmarks were created, and two applications written in C# were obtained for use in the analysis. The four micro-benchmarks (tree, array, s-list, and hash table) were written to test the performance improvement of locality-directed GC. Each micro-benchmark creates its own data structure, interleaving the creation of a large amount of randomly generated data with some junk data. After a training loop and one forced GC, each benchmark repeatedly searches a set of data. A test application called the "Xaml parser test" reads from a XAML file three times to measure the performance of different components of the parser. XAML (Extensible Application Markup Language) is XML-based. The input file used contains a single but deeply nested node structure, 11,000 levels deep. Another application, called the "SAT solver", is a SAT analyzer implementation converted to C# from its C++ original. Its input file describes a problem instance in CNF (conjunctive normal form) with 246403250 variables.
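The micro-benchmark shape described above (build a structure interleaved with junk allocations, then repeatedly search it) can be sketched roughly as follows. All parameters here (`n_items`, `junk_ratio`, the 10% key sample) are invented for illustration and do not reproduce the reported measurements.

```python
import random

def run_microbenchmark(n_items=1000, lookups=5, junk_ratio=4):
    """Build a hash table interleaved with throw-away allocations,
    then repeatedly search a fixed key set (hypothetical parameters)."""
    random.seed(0)
    table = {}
    junk = None
    for i in range(n_items):
        table[i] = [i] * 8                             # live data
        for _ in range(junk_ratio):
            junk = bytearray(random.randint(16, 256))  # junk data, soon dead
    keys = random.sample(range(n_items), n_items // 10)
    hits = 0
    for _ in range(lookups):                           # repeated search passes
        for k in keys:
            if k in table:
                hits += 1
    return hits
```

The interleaved junk allocations scatter the live data across the heap, which is exactly the poor-locality situation the clustering optimization targets.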
Exemplary performance results
The execution time of each benchmark is shown in Table C. For all four created micro-benchmarks, the optimization performs as expected. For the two obtained applications, locality-directed GC substantially improved the Xaml parser test but improved the SAT solver only slightly. The performance benefit is reduced in pointer-intensive applications. In that example, even with no optimization performed (that is, no monitoring and no hot-object reorganization during GC), the overhead of garbage collection itself is very high, accounting for roughly one-sixth to one-third of the execution time.
Table C

  Benchmark           Original (s)   Optimized (s)   Improvement
  Xaml parser test    117.98         66.25           43.8%
  SAT solver          138.00         132.50          4.0%
  Tree                4.20           3.16            24.8%
  Array               17.58          7.53            57.2%
  S-List              11.93          8.80            26.3%
  Hash table          6.67           3.28            50.8%
Exemplary profiling overhead
The instrumentation overhead may vary depending on whether an always-on mode or a sampling mode is used. Table D compares the profiling overhead of the always-on and sampling modes. Sampling makes the profiling overhead acceptable for the optimization.
Table D

  Benchmark           Original (s)   Always-on profiling       Sampled profiling
                                     Time (s)    Overhead      Time (s)    Overhead
  Xaml parser test    117.98         123.10      4.3%          119.66      1.4%
  SAT solver          138.00         204.90      48.5%         138.90      0.7%
  Maximum             97.70          115.80      18.5%         98.30       0.6%
  GenIBC              6.50           7.70        18.5%         6.90        6.2%
  Tree                4.20           5.30        26.2%         4.40        4.8%
  Array               17.58          20.03       13.9%         17.98       2.3%
  S-List              11.93          14.90       24.9%         12.53      5.0%
  Hash table          6.67           7.26        8.8%          6.81        2.1%
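A rough sketch of the two instrumentation modes compared in Table D, with invented names and window/period sizes: an always-on read barrier records every heap access, while a sampled barrier records only during short windows, which is why its overhead stays low.

```python
class AccessProfiler:
    """Toy model of always-on vs. sampled access recording (invented names)."""

    def __init__(self, sampled=False, window=100, period=10000):
        self.sampled = sampled
        self.window = window      # accesses recorded at the start of each period
        self.period = period      # total accesses per on/off cycle
        self.count = 0
        self.recorded = 0

    def read_barrier(self, obj_id):
        """Called on every heap read by the instrumented code."""
        self.count += 1
        if not self.sampled or (self.count % self.period) < self.window:
            self.recorded += 1    # here a real system would set obj_id's access bit
```

With `window=100` and `period=10000`, the sampled mode performs the recording work on roughly 1% of accesses, mirroring the order-of-magnitude overhead reduction shown in Table D.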
Exemplary page density improvements
Table E shows the average number of pages accessed per time interval, and the average page density, before and after the optimization (that is, before and after triggering a GC that reorganizes hot objects onto one or more pages). The measurement process captures a trace of memory references, divides the execution into intervals of 1,000,000 references each (10,000,000 references per interval for the hash-table micro-benchmark), and computes the number of pages accessed in each interval and the percentage of data on each page that was actually accessed over the last 1,000 intervals. In general, by reorganizing objects according to the program's access patterns, the optimization can reduce the working set and increase the amount of "useful" data per page. For the SAT solver, the number of pages accessed per interval increases with the optimization because the optimization involves many more GCs, each of which scans part or all of the heap; accesses performed by the garbage collector itself are excluded from the computation.
Table E

  Application         Pages accessed per    Average page        Pages accessed per     Average page
                      interval (original)   density (original)  interval (optimized)   density (optimized)
  Xaml parser test    602                   6.0%                258                    17.8%
  SAT solver          1703                  28.0%               1736                   32.9%
  Tree                2566                  13.2%               1922                   26.8%
  Array               1996                  16.2%               808                    21.0%
  S-List              672                   17.8%               240                    53.6%
  Hash table          336                   5.9%                253                    18.3%
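The page-count and page-density measurement described above can be approximated as follows. The 64-byte granularity, `PAGE_SIZE`, and function name are assumptions made for illustration; the real measurement works over a captured reference trace in the same spirit.

```python
PAGE_SIZE = 4096  # bytes per page (assumed)

def page_stats(trace, interval=1_000_000):
    """Split an address trace into fixed-size intervals and report, per
    interval, (pages touched, page density), where density approximates the
    fraction of each touched page's data actually referenced (64B chunks)."""
    LINE = 64
    results = []
    for start in range(0, len(trace), interval):
        chunk = trace[start:start + interval]
        pages = {}
        for addr in chunk:
            pages.setdefault(addr // PAGE_SIZE, set()).add((addr % PAGE_SIZE) // LINE)
        touched = len(pages)
        lines_per_page = PAGE_SIZE // LINE
        density = (sum(len(s) for s in pages.values()) / (touched * lines_per_page)
                   if touched else 0.0)
        results.append((touched, density))
    return results
```

Clustering hot objects raises the density number directly: the same accesses land on fewer pages, so each touched page contributes more referenced chunks.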
Exemplary locality improvements
To verify the optimization's improvement of the working set and page density, data on the number of DTLB misses was also collected, as shown in Table F. As a result of the working-set and page-density improvements, the optimization also reduces the number of DTLB misses. The number of L2 cache misses for each benchmark is also shown in Table F. Although the optimization does not focus on cache locality, it has a beneficial effect on cache locality as well.
Table F

  Benchmark           DTLB misses                           L2 cache misses
                      Original    Optimized   Improvement   Original    Optimized   Improvement
  Xaml parser test    262178      71125       72.9%         1269248     30787       97.6%
  SAT solver          1594246     1452614     8.9%          1189753     1049775     11.8%
  Tree                112636      58435       48.1%         40696       39493       3.0%
  Array               1172048     521741      55.5%         173048      9268        94.6%
  S-List              999362      173410      82.6%         265106      98713       62.8%
  Hash table          72176       48266       33.1%         36570       23714       35.2%
Exemplary computing environment
Fig. 8 and the following discussion are intended to provide a brief, general description of a suitable computing environment for an implementation. Although the invention is described in the general context of computer-executable instructions of a computer program running on a computer and/or a network device, those skilled in the art will recognize that the invention can also be implemented in combination with other program modules. Generally, program modules include routines, programs, components, data structures, and the like that perform particular tasks or implement particular abstract data types. Moreover, those skilled in the art will appreciate that the invention can be practiced with other computer system configurations, including multiprocessor systems, microprocessor-based electronic devices, minicomputers, mainframe computers, network appliances, wireless devices, and the like. The various extensions can be practiced in networked computing environments or on stand-alone computers.
With reference to Fig. 8, an exemplary system for an implementation includes a conventional computer 820 (such as a personal computer, laptop computer, server, mainframe computer, or other computer), which includes a processing unit 821, a system memory 822, and a system bus 823 that couples various system components, including the system memory, to the processing unit 821. The processing unit can be any of various commercially available processors, including Intel x86, Pentium, and compatible microprocessors from Intel and other companies, including Cyrix, AMD, and Nexgen; Alpha from Digital; MIPS from MIPS Technologies, NEC, IDT, Siemens, and other companies; SPARC from Sun and other companies; and PowerPC from IBM and Motorola. Dual-microprocessor and other multiprocessor architectures can also be used as the processing unit 821.
The system bus can be any of several kinds of bus structures, including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of available bus architectures including, to name a few, PCI, VESA, AGP, MicroChannel, ISA, and EISA. The system memory includes read-only memory (ROM) 824 and random-access memory (RAM) 825. A basic input/output system (BIOS), containing the basic routines that help to transfer information between elements within the computer 820, such as during startup, is stored in the non-volatile memory ROM 824.
The computer 820 also includes a hard disk drive 827, a magnetic disk drive 828 to read from and write to, for example, a removable magnetic disk 829, and an optical disk drive 830 to read, for example, a CD-ROM disk 831 or to read from and write to other optical media. The hard disk drive 827, magnetic disk drive 828, and optical disk drive 830 are connected to the system bus 823 by a hard disk drive interface 832, a magnetic disk drive interface 833, and an optical drive interface 834, respectively. The drives and their associated computer-readable media provide non-volatile storage of data, data structures, computer-executable instructions, and the like for the computer 820. Although the description of computer-readable media above refers to a hard disk, a removable magnetic disk, and a CD, those skilled in the art will appreciate that other types of computer-readable media, such as magnetic cassettes, flash memory cards, digital video disks, and Bernoulli cartridges, can also be used in the exemplary operating environment.
A number of program modules can be stored in the drives and in RAM 825, including, in addition to an implementation 856 of the described monitoring and optimization, an operating system 835, one or more application programs 836, other program modules 837, and program data 838.
A user can enter commands and information into the computer 820 through a keyboard 840 and a pointing device such as a mouse 842. These and other input devices are often connected to the processing unit 821 through a serial port interface 846 that is coupled to the system bus, but they can also be connected by other interfaces, such as a parallel port, a game port, or a universal serial bus (USB). A monitor 847 or other type of display device is also connected to the system bus 823 via an interface such as a video adapter 848. In addition to the monitor, computers typically include other peripheral output devices (not shown), such as speakers and printers.
The computer 820 operates in a networked environment using logical connections to one or more remote computers, such as a remote computer 849. The remote computer 849 can be a server, a router, a peer device, or another common network node, and it typically includes many or all of the elements described above relative to the computer 820, although only a memory storage device 850 is illustrated. The illustrated logical connections include a local area network (LAN) 851 and a wide area network (WAN) 852. Such networking environments are commonplace in office and enterprise-wide computer networks, intranets, and the Internet.
When used in a LAN networking environment, the computer 820 is connected to the local network 851 through a network interface or adapter 853. When used in a WAN networking environment, the computer 820 typically includes a modem 854 or other means for establishing communications over the wide area network 852, such as the Internet (for example, via the LAN 851 and a gateway or proxy server 855). The modem 854, which can be internal or external, is connected to the system bus 823 via the serial port interface 846. In a networked environment, program modules described relative to the computer 820, or portions thereof, can be stored in a remote memory storage device. It will be appreciated that the network connections shown are exemplary and that other means of establishing a communications link between the computing devices, whether wireless or wired, can be used.
Alternatives
Having described and illustrated the principles of our invention with reference to the illustrated examples, it will be recognized that the examples can be modified in arrangement and detail without departing from such principles. Further, as will be apparent to an ordinary computer scientist, portions of the examples, or complete examples, can be combined, in whole or in part, with other portions of other examples. It should be understood that the programs, processes, or methods described herein are not related or limited to any particular type of computer apparatus, unless indicated otherwise. Various types of general-purpose or specialized computer apparatus may be used with, or perform operations in accordance with, the teachings described herein. Elements of the illustrated embodiments shown in software may be implemented in hardware, and vice versa. Techniques from one example can be incorporated into any of the other examples.
In view of the many possible embodiments to which the principles of our invention may be applied, it should be recognized that the details are illustrative only and should not be taken as limiting the scope of our invention. Rather, we claim as our invention all such embodiments as may come within the scope and spirit of the following claims and equivalents thereto.

Claims (36)

1. A computerized method, characterized in that it comprises:
monitoring an executing program to determine recently accessed objects;
manipulating at least one bit to indicate an object access;
monitoring a program behavior indicator;
invoking an optimization based on the monitored program behavior indicator; and
performing the optimization, comprising:
clustering the accessed objects in memory, and
resetting the at least one manipulated bit for each accessed object.
2. the method for claim 1 is characterized in that, described at least one bit is a bit counter more than, and described manipulation increases progressively described many bit counter.
3. the method for claim 1 is characterized in that, the accessed object of being trooped is positioned at the newer end with the heap of generation garbage collection.
4. the method for claim 1 is characterized in that, it is undertaken by virtual machine.
5. the method for claim 1 is characterized in that, described at least one bit is positioned at the head of institute's access object.
6. the method for claim 1 is characterized in that, described at least one bit is positioned at beyond institute's access object.
7. the method for claim 1, it is characterized in that, institute's access object is gathered with following order: if first accessed object comprises the pointer that points to second accessed object, in one troops described first and second group of objects are lumped together so.
8. the method for claim 1 is characterized in that, the accessed object of being trooped comprises one or more pages or leaves of heap memory.
9. the method for claim 1 is characterized in that, the accessed object of being trooped comprises a plurality of continuous pages or leaves of heap memory.
10. the method for claim 1 is characterized in that, the object of being trooped comprises that a plurality of of accessed object of the not adjacent domain that is arranged in described heap troop separately.
11. the method for claim 1 is characterized in that, the program behavior indicator is a performance counter.
12. the method for claim 1 is characterized in that, the program behavior indicator is a N external memory pressure garbage collection.
13. the method for claim 1 is characterized in that, the program behavior indicator is a partition coefficient.
14. the method for claim 1 is characterized in that, the program behavior indicator is the object reference counter.
15. the method for claim 1 is characterized in that, the program behavior indicator is the compiling of a plurality of behavior indicators.
16. the method for claim 1 is characterized in that, described method monitors two or more program behavior indicators.
17. A computer system, characterized in that it comprises:
a memory and a central processing unit executing a monitored program; and
a module for monitoring and optimizing the monitored program, comprising:
an instrumentation module for instrumenting the program to record object accesses during program execution, and
an optimization module for monitoring program behavior and invoking an optimization in response to the monitored program behavior, the optimization comprising clustering recorded recently accessed objects in memory.
18. The computer system of claim 17, characterized in that the monitored program behavior is a DTLB cache miss rate.
19. The computer system of claim 17, characterized in that the monitored program behavior is an L2 cache miss rate.
20. The computer system of claim 17, characterized in that the monitored program behavior is an allocation rate.
21. The computer system of claim 17, characterized in that the optimization module is part of a generational copying garbage collection module.
22. The computer system of claim 17, characterized in that the instrumentation module comprises a JIT compiler.
23. The computer system of claim 17, characterized in that a recorded object access comprises a bit set in correspondence with an accessed object.
24. The computer system of claim 17, characterized in that the optimization module further traverses the objects on the heap and identifies, via the bits set in the objects, which objects were recently accessed.
25. The computer system of claim 17, characterized in that a recently accessed object that points to another recently accessed object is placed close to that object.
26. The computer system of claim 17, characterized in that the optimization further comprises determining that the clustered recorded recently accessed objects occupy more memory than a single heap page can hold, so that surplus recently accessed objects spill into a cluster on a second heap page.
27. The computer system of claim 26, characterized in that the instrumented program records object accesses in a bit vector corresponding to heap addresses.
28. A computer-readable medium having computer-readable instructions thereon, characterized in that the computer instructions comprise:
instructions for instrumenting an application program to record object accesses;
instructions for monitoring the behavior of the application program;
instructions for invoking a heap optimization based on the monitored behavior of the application program; and
the heap optimization comprising instructions for clustering recently accessed objects at positions close to each other on the heap.
29. The computer-readable medium of claim 28, characterized in that it further comprises instructions for instrumenting objects to count how many times each object is accessed between heap optimizations.
30. The computer-readable medium of claim 28, characterized in that it further comprises instructions for instrumenting an object so that a bit in its header is set when a data field of the object is accessed.
31. A system for improving data locality for an application program, characterized in that the system comprises:
a just-in-time compiler configured to take an intermediate-language representation of the application program and compile it into machine code for a particular architecture, wherein the just-in-time compiler is configured to generate instrumented code, and wherein the instrumented code is configured to mark recently accessed objects; and
monitor code configured to collect metrics while the application program runs, wherein the monitor code is configured to monitor the marked objects and to trigger a locality-directed garbage collection, wherein the locality-directed garbage collection comprises:
placing objects marked as recently accessed onto a page separated from the remainder of the heap.
32. The system of claim 31, characterized in that the locality-directed garbage collection is triggered independently of a conventional garbage collection triggered to reclaim space.
33. The system of claim 31, characterized in that the instrumented code is configured to mark an object by updating an object reference counter embedded in the object.
34. The system of claim 31, characterized in that the instrumented code is configured to mark an object by updating an object reference counter stored in a table separate from the object.
35. The system of claim 32, characterized in that the just-in-time compiler generates read barriers for profiling at key heap-data-accessing instructions in a method.
36. The system of claim 31, characterized in that the just-in-time compiler generates two versions of a method, wherein a first version of the method and a second version of the method are placed in separate code heaps.
CNA2005101040168A 2004-09-10 2005-09-09 Increasing data locality of recently accessed resource Pending CN1838090A (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US60873404P 2004-09-10 2004-09-10
US60/608,734 2004-09-10
US10/968,577 2004-10-18

Publications (1)

Publication Number Publication Date
CN1838090A true CN1838090A (en) 2006-09-27

Family

ID=37015493

Family Applications (1)

Application Number Title Priority Date Filing Date
CNA2005101040168A Pending CN1838090A (en) 2004-09-10 2005-09-09 Increasing data locality of recently accessed resource

Country Status (1)

Country Link
CN (1) CN1838090A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108694136A (en) * 2017-03-31 2018-10-23 英特尔公司 System, apparatus and method for surmounting the instruction processing based on non-locality
CN110058938A (en) * 2018-01-19 2019-07-26 阿里巴巴集团控股有限公司 A kind of internal memory processing method, device, electronic equipment and readable medium
CN110058938B (en) * 2018-01-19 2023-08-01 斑马智行网络(香港)有限公司 Memory processing method and device, electronic equipment and readable medium
EP4078378A4 (en) * 2019-12-20 2023-09-06 INTEL Corporation Managing runtime apparatus for tiered object memory placement
CN117992857A (en) * 2024-04-03 2024-05-07 浪潮电子信息产业股份有限公司 Hot page and cold page identification method, device, equipment and storage medium
CN117992857B (en) * 2024-04-03 2024-07-16 浪潮电子信息产业股份有限公司 Hot page and cold page identification method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
US7769974B2 (en) Increasing data locality of recently accessed resources
US8108448B2 (en) Improving locality with parallel hierarchical copying garbage collection
Gidra et al. NumaGiC: A garbage collector for big data on big NUMA machines
US6330556B1 (en) Data structure partitioning to optimize cache utilization
US6321240B1 (en) Data structure partitioning with garbage collection to optimize cache utilization
US8601036B2 (en) Handling persistent/long-lived objects to reduce garbage collection pause times
Bond et al. Bell: Bit-encoding online memory leak detection
Marlow et al. Multicore garbage collection with local heaps
Hirzel et al. Understanding the connectivity of heap objects
Adl-Tabatabai et al. Prefetch injection based on hardware monitoring and object metadata
US20060080372A1 (en) Compiler-driven dynamic memory allocation methodology for scratch-pad based embedded systems
JP2006092532A5 (en)
EP1695212B1 (en) Methods and apparatus to dynamically insert prefetch instructions based on garbage collector analysis and layout of objects
US7257685B2 (en) Memory management
Hirzel Data layouts for object-oriented programs
CN1838090A (en) Increasing data locality of recently accessed resource
Zhao et al. Deconstructing the garbage-first collector
Blackburn et al. Profile-based pretenuring
US7606989B1 (en) Method and apparatus for dynamically pre-tenuring objects in a generational garbage collection system
Xian et al. AS-GC: An efficient generational garbage collector for Java application servers
Singer et al. Towards intelligent analysis techniques for object pretenuring
Gang et al. Analyzing XML Parser Memory Characteristics: Experiments towards Improving Web Services Performance
Makor Run-time data analysis in dynamic runtimes
Bruno et al. NG2C: Pretenuring N-Generational GC for HotSpot Big Data Applications
Briggs et al. Cold object identification in the Java virtual machine

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Open date: 20060927