The following garbage collection scheme is loosely based on a generational mark-sweep garbage collector with compaction. The motivation for compacting garbage collection is that to compact an individual generation, one does not need to reserve a semi-space of equal size to the entire heap to perform a full collection, in the event that a full collection occurs, and so greatly reduces memory usage without much if any cost to performance. The fact that it is optionally non-relocating (can sweep without compaction) also allows it to be made partially incremental unless an actual compaction needs to occur. The mark-sweep heap is divided into generations. The size of these generations provides an upper bound on the number of forwarding blocks needed to compact a generation. A card-marking system is used to keep track of any intergenerational roots in the system. Whenever a store to an object that might possibly be in a non-young generation is done, the card corresponding to the object is marked. During the marking phase, the GC scans the mark vector looking for marked cards. Any objects in the marked cards are scanned for reference to young objects. If any are found, the marking phase pushes those objects onto the mark stack for tracing. Similarly, during the compaction phase, the card marks may be used to later relocate any pointers young objects. The marks are stored in a separate linear structure for the entire heap to make marking the cards cheap. *** Options for card-marking include: 1) coarse, where each card represents a fixed region (of anywhere from 128-512 bytes) of heap space In this scheme the marks are stored in a bit/byte vector. However, care is necessary to deal with any objects storing raw data within the cards, as these will potentially masquerade as pointers or even object headers, making the headers of all objects within the card difficult to find. So, either 1-2 bytes of data would also need to be stored for every card stating the offset of the first valid tagged pointer within the card, or otherwise all objects holding raw data (byte arrays, float arrays, maps, etc.) could simply be stored in separate generations that don't need to be marked or scanned for intergenerational references, as they can't hold any pointers to other objects anyway. Possibilities are to use a bit vector where native bit diddling instructions are provided (such as x86). Otherwise, byte vectors on cheaper to access on most other architectures, but consume slightly more space. 2) fine-grained, using 1 bit for every word in the heap On writing to a slot, the bit corresponding to the exact word that was modified is marked. Uses more space, but it avoids the need to scan entire objects, and also avoids the need to treat raw objects specially since the exact pointers that need to be examined are marked. This also allows for marking objects independently of the object header in one single data structure, rather than necessitating a second one. In this scheme, 1 card of approximately 32 words (128 bytes) would necessarily correspond to 1 word of memory, or 3% of total heap space, which may be worth the possible benefits. However, one drawback is decreased throughput in scanning for dirty cards because of a less compact representation. One way to deal with this may be to align all groups of data slots in objects to 2 or 4 word boundries, thus increasing throughput by 2-4x, but this may not be at all necessary (throughput worries may be a red herring), or could end up being as complex as using the smarter coarse-grained card scheme. 3) remembered sets, as in Squeak's GC Just mark in the object's header whether it is root. Modifying an object checks if the root bit is set in the object. If it is not, and if the object is old, the object is added to the remembered set and its root bit is set. Cheap trick: all young objects can just have their root bit set, but not actually be added to the root set, so that this can be inlined in very few instructions into the site of the store. One drawback is that large remembered sets that overflow the set size will trigger full GCs. Possibly more precise than cards, but less precise than word-marking. *** Options for the tracing phase include: 1) put the mark bit in the object header Aggravates virtual memory, because marking dirties all pages in an entire generation, which is not a big deal if compaction is always done after every mark-sweep. 2) put the mark bits in a separate bit vector, storing 1 bit for every word in the heap Doesn't have the same problems with virtual memory dirtying since all the mark bits are stored in one place. Allows for the fine-grained word-marking scheme to be done with the same exact bit vector used for tracing the heap. Really only beneficial if used in conjunction with fine-grained word marking. 3) Use a marking stack. This is orthogonal to the above schemes. If the stack is overflowed, just note that it has done so and continue marking, just dropping any objects that won't fit on the marking stack. When the stack is emptied again, sweep the entire generation looking for any objects that have been marked but that have unmarked children - push those onto the stack and let the GC rip again! This is WAAAY simpler than pointer reversal, and WAAAY faster. Only requires a fixed size marking stack which can be used for forwarding tables on subsequent phases as well. *** Options for the compaction phase include: 1) Do as the Squeakers do and use any unused space for the forwarding table during compaction. Drawbacks to this mean very poor compaction efficiency when the heap is full (very little room for forwarding entries), unless a configurable section of the heap is reserved for forwarding purposes at all times, which adds tuning complexity, but on the other hand allows the amount of memory wasted to be tuned. Another drawbacks is it requires possible assumptions about where free regions of memory are in the generation, requiring more frequent compaction. This more or less is the same as the next option, unless generations are ditched. 2) Compact one generation at a time, and reserve extra space separate from the generation, which need only be as big as the largest generation for best performance, also allowing for less if compaction is to be done in smaller units than the size of a generation. This requires no tuning if generations are all fixed size, as an empty generation can be reserved at all times for relocation purposes, simply using the generation that is being compacted as the actual forwarding table - like a copy collector, but without the drawbacks in terms of reserving a semi-space large enough to copy the whole heap due to the added marking phase. 3) Use a break table. This algorithm operates completely in-place and requires no extra space. Only draw-back is relocation can be very slow if there are a lot of disparate unused areas of the heap. Using generations on top of this could seriously cover up noticeable speed problems, but would probably be no better than the second option in the end. This is most useful in generations where the death rate is either really high - increasing the chances of large swaths of unused space that are not fragmented being available - or is really low - increasing the chance that there are very few distinct areas of unused space that needs to be relocated around. This probably is not useful at all except in really, really memory challenged embedded environments where every drop of space counts. 4) Squashing. This scheme is really orthogonal and works with both 2 and 3. The idea is simply that, when compacting, all references from higher addressed objects to lower addressed objects are relocated at the exact same time. Any references from lower addressed objects to higher addressed objects are marked in the card table, pretending as if they were roots. This way, the second compaction pass only needs to visit part of the young generation to relocate it, rather than all of it, and this can be folded into the relocation of inter-generational references. This can, in the best case, compact a generation in a single sweeping pass. See Clarke's paper for reference. *** Suggested approaches. 1) Definitely try out squashing with any of the generational schemes. Seems like a big win for generational collectors. 2) 1 byte mark per 128 byte card, 1 byte valid tagged pointer offset, mark bit stored in object, reserved space for forwarding tables Contain a table of marks, 1 byte for each card, as above. Then, use another table with 1 byte each per card, that marks the offset to the first valid tagged pointer in the card. This second table would mostly be zeroed and never touched, except when allocating an object containing raw data over a card boundry. This is probably the simplest and most general way to handle stuff on diverse hardware. 3) word-marking scheme, mark bit stored in word-marking vector, reserved space for forwarding tables Contain a bit vector of marks, 1 bit for each word. Mark objects in the word-marking vector. This could work exceptionally well for hardware with bit diddling instructions like x86. Extremely precise, but trades 1/4 the throughput in scanning the dirty bits at the expense of only having to check 1/32th of a card (1 word) for an intergenerational reference. With large distributions of intergenerational references, this is will require much less work scanning individual cards for references. Also requires much less dirtying of virtual memory pages (which is very good in the hypothetical SlateOS sense). The write barrier is just as fast on x86 as the first scheme, but possibly up to 2x as expensive on hardware without bit diddling instructions. On the right hardware (x86), this scheme probably could be awesome for its simplicity and performance, but on other platforms the write barrier speed may become prohibitive. However, the speed gained in card scanning may offset the write barrier slowdown for these platforms. 4) remembered sets, mark bit in objects, reserved space for forwarding tables, i.e. leave in the Squeak GC Remembered sets are fastest for scanning dirty objects, but may require tuning, or otherwise it devolves into full GC once the heap gets large, and never recovers unless the whole heap is rebuilt or a shitload of memory is freed. This is not going to be good for longer lived systems. Probably will exhibit worse performance on large or old heaps because of the need of many full GCs. 5) Have a copying nursery! Rather orthogonal. This seems pretty obvious, regardless of what GC scheme is used. Since the nurseries are fixed size, there is negligible memory waste. This allows young object allocation to still be fast and cheap, i.e. a pointer bump. This only needs to be on the first generation and has benefits for concurrency, since allocation need not be as tightly synchronized. Another way to implement this is just to allocate a fixed size chunk of memory from a mark sweep generation, or just allocate an entire empty mark-sweep generation and use this for allocation purposes. When this fills up, simply dump it back into the mark-sweep heap and let the GC do its job as normal. Refs: "Compacting Garbage Collection Can Be Fast and Simple." Charles Clarke and David Mason. non-journal version at http://www.sarg.ryerson.ca/sarg/papers/199602-spe/