<HTML><HEAD><TITLE>Garbage collector scalability</TITLE></HEAD><BODY>
<H1>Garbage collector scalability</h1>
In its default configuration, the Boehm-Demers-Weiser garbage collector
is not thread-safe. It can be made thread-safe for a number of environments
by building the collector with the appropriate
<TT>-D</tt><I>XXX</i><TT>-THREADS</tt> compilation
flag. This has primarily two effects:
<OL>
<LI> It causes the garbage collector to stop all other threads when
it needs to see a consistent memory state.
<LI> It causes the collector to acquire a lock around essentially all
allocation and garbage collection activity.
</ol>
Since a single lock is used for all allocation-related activity, only one
thread can be allocating or collecting at any given time. This inherently
limits the performance of multi-threaded applications on multiprocessors.
<P>
On most platforms, the allocator/collector lock is implemented as a
spin lock with exponential back-off. Longer wait times are implemented
by yielding and/or sleeping. If a collection is in progress, the pure
spinning stage is skipped. This has the advantage that uncontested, and
thus most uniprocessor, lock acquisitions are very cheap. It has the
disadvantage that the application may sleep for small periods of time
even when there is work to be done, and threads may be unnecessarily
woken up for short periods. Nonetheless, this scheme empirically
outperforms native queue-based mutual exclusion implementations in most
cases, sometimes drastically so.
<H2>Options for enhanced scalability</h2>
Version 6.0 of the collector adds two facilities to enhance collector
scalability on multiprocessors. As of 6.0alpha1, these are supported
only under Linux on X86 and IA64 processors, though ports to other,
otherwise supported, Pthreads platforms should be straightforward.
They are intended to be used together.
<UL>
<LI>
Building the collector with <TT>-DPARALLEL_MARK</tt> allows the collector to
run the mark phase in parallel in multiple threads, and thus on multiple
processors. The mark phase typically consumes the large majority of the
collection time. Thus this largely parallelizes the garbage collector
itself, though not the allocation process. Currently the marking is
performed by the thread that triggered the collection, together with
<I>N</i>-1 dedicated threads, where <I>N</i> is the number of processors
detected by the collector. The dedicated threads are created once at
initialization time.
<P>
A second effect of this flag is to switch to a more concurrent
implementation of <TT>GC_malloc_many</tt>, so that free lists can be
built, and memory can be cleared, by more than one thread concurrently.
<LI>
Building the collector with <TT>-DTHREAD_LOCAL_ALLOC</tt> adds support
for thread-local allocation. It does not, by itself, cause thread-local
allocation to be used. It simply allows the use of the interface in
<TT>gc_local_alloc.h</tt>.
<P>
Memory returned from thread-local allocators is completely interchangeable
with that returned by the standard allocators. It may be used by other
threads. The only difference is that, if the thread allocates enough
memory of a certain kind, it will build a thread-local free list for
objects of that kind, and allocate from that. This greatly reduces
locking. The thread-local free lists are refilled using
<TT>GC_malloc_many</tt>, as sketched below.
<P>
An important side effect of this flag is to replace the default
spin-then-sleep lock with a spin-then-queue based implementation.
This <I>reduces performance</i> for the standard allocation functions,
though it usually improves performance when thread-local allocation is
used heavily, and thus the number of short-duration lock acquisitions
is greatly reduced.
</ul>
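<P>
To illustrate the refill mechanism, here is a minimal sketch of how a
client (or a thread-local allocator) might consume a
<TT>GC_malloc_many</tt> result. <TT>GC_malloc_many</tt> and the
<TT>GC_NEXT</tt> macro, which accesses the link word of a list element,
are declared in <TT>gc.h</tt>; the helper and variable names here are
invented for illustration, and a real per-thread version would keep the
list head in thread-specific data rather than in a global.
<PRE>
#include "gc.h"

/* Head of a local free list of 16-byte objects, linked through their	*/
/* first words, as returned by GC_malloc_many.				*/
static void *free_list_16 = 0;

/* Return one 16-byte collectable object, refilling the local list	*/
/* when it runs out.							*/
void *local_alloc_16(void)
{
    void *result;

    if (0 == free_list_16) {
        /* One call returns a whole list of objects, so the		*/
        /* allocation lock is acquired once per refill, not once	*/
        /* per object.							*/
        free_list_16 = GC_malloc_many(16);
        if (0 == free_list_16) return 0;   /* Out of memory.	*/
    }
    result = free_list_16;
    free_list_16 = GC_NEXT(result);   /* Pop the first element.	*/
    GC_NEXT(result) = 0;              /* Clear the stale link word.	*/
    return result;
}
</PRE>
This is essentially the structure that <TT>-DTHREAD_LOCAL_ALLOC</tt>
arranges automatically, per thread and per object kind.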
<P>
The easiest way to switch an application to thread-local allocation
(see the example below) is to
<OL>
<LI> Define the macro <TT>GC_REDIRECT_TO_LOCAL</tt>,
and then include the <TT>gc.h</tt>
header in each client source file.
<LI> Invoke <TT>GC_thr_init()</tt> before any allocation.
<LI> Allocate using <TT>GC_MALLOC</tt>, <TT>GC_MALLOC_ATOMIC</tt>,
and/or <TT>GC_GCJ_MALLOC</tt>.
</ol>
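<P>
A minimal client following these steps might look like the sketch below.
It assumes the collector itself was built with the flags discussed above;
the allocation loop is only a placeholder for real client work.
<PRE>
/* Define before including gc.h, so that GC_MALLOC and friends	*/
/* are redirected to their thread-local equivalents.		*/
#define GC_REDIRECT_TO_LOCAL
#include "gc.h"

#include &lt;stdio.h&gt;

int main(void)
{
    long i;
    void *p = 0;

    GC_thr_init();      /* Must precede any allocation.	*/
    for (i = 0; i &lt; 1000000; ++i) {
        /* Once this thread has allocated enough 16-byte objects,	*/
        /* these calls are served from a thread-local free list.	*/
        p = GC_MALLOC(16);
    }
    printf("last object at %p\n", p);
    return 0;
}
</PRE>
In a multi-threaded client, each client thread would allocate in
exactly the same way, building up its own free lists.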
<H2>The Parallel Marking Algorithm</h2>
We use an algorithm similar to
<A HREF="http://www.yl.is.s.u-tokyo.ac.jp/gc/">that developed by
Endo, Taura, and Yonezawa</a> at the University of Tokyo.
However, the data structures and implementation are different,
and represent a smaller change to the original collector source,
probably at the expense of extreme scalability. Some of
the refinements they suggest, <I>e.g.</i> splitting large
objects, were also incorporated into our approach.
<P>
The global mark stack is transformed into a global work queue.
Unlike the usual case, it never shrinks during a mark phase.
The mark threads remove objects from the queue by copying them to a
local mark stack and changing the global descriptor to zero, indicating
that there is no more work to be done for this entry.
This removal is done with no synchronization. Thus it is possible for
more than one worker to remove the same entry, resulting in some work
duplication (see the sketch below).
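<P>
The unsynchronized removal step can be made concrete with the following
greatly simplified sketch. This is not the collector's actual code: the
entry layout, the queue representation, and all names are invented for
illustration, and issues such as memory ordering on weakly ordered
processors are ignored.
<PRE>
#define QUEUE_SIZE 1024

/* One work-queue entry: an object reference plus a mark descriptor.	*/
/* A zero descriptor means the entry has already been claimed.		*/
typedef struct {
    void *obj;
    long descr;
} mark_entry;

mark_entry global_queue[QUEUE_SIZE];

/* Copy unclaimed entries to this marker's local mark stack, zeroing	*/
/* each global descriptor to claim it. No lock is held: two markers	*/
/* may occasionally copy the same entry before either zeroes it,	*/
/* which merely duplicates a little marking work and is harmless.	*/
void remove_entries(mark_entry *local_stack, int *local_top,
                    int local_limit)
{
    int i;

    for (i = 0; i &lt; QUEUE_SIZE; ++i) {
        long d;

        if (*local_top == local_limit) break;  /* Local stack full.	*/
        d = global_queue[i].descr;
        if (d != 0) {
            local_stack[*local_top].obj = global_queue[i].obj;
            local_stack[*local_top].descr = d;
            ++*local_top;
            global_queue[i].descr = 0;   /* Claim the entry.	*/
        }
    }
}
</PRE>
Returning work from a local stack to the global queue, described next,
is the one step that does require synchronization.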
<P>
The global work queue grows only if a marker thread decides to
return some of its local mark stack to the global one. This
is done if the global queue appears to be running low, or if
the local stack is in danger of overflowing. It does require
synchronization, but should be relatively rare.
<P>
The sequential marking code is reused to process local mark stacks.
Hence the amount of additional code required for parallel marking
is minimal.
<P>
It should be possible to use generational collection in the presence of the
parallel collector, by calling <TT>GC_enable_incremental()</tt>.
This does not result in fully incremental collection, since parallel mark
phases cannot currently be interrupted, and doing so may be too
expensive.
<P>
Gcj-style mark descriptors do not currently mix with the combination
of local allocation and incremental collection. They should work correctly
with one or the other, but not both.
<P>
The number of marker threads is set on startup to the number of
available processors (or to the value of the <TT>GC_NPROCS</tt>
environment variable). If only a single processor is detected,
parallel marking is disabled.
<P>
Note that setting <TT>GC_NPROCS</tt> to 1 also causes some lock acquisitions
inside the collector to immediately yield the processor instead of
busy-waiting first. In the case of a multiprocessor and a client with
multiple simultaneously runnable threads, this may have disastrous
performance consequences (e.g. a factor of 10 slowdown).
<H2>Performance</h2>
We conducted some simple experiments with a version of
<A HREF="gc_bench.html">our GC benchmark</a> that was slightly modified to
run multiple concurrent client threads in the same address space.
Each client thread does the same work as the original benchmark, but they
share a heap.
This benchmark involves very little work outside of memory allocation.
This was run with GC 6.0alpha3 on a dual processor Pentium III/500 machine
under Linux 2.2.12.
<P>
Running with a thread-unsafe collector, the benchmark ran in 9
seconds. With the simple thread-safe collector,
built with <TT>-DLINUX_THREADS</tt>, the execution time
increased to 10.3 seconds, or 23.5 elapsed seconds with two clients.
(The times for the <TT>malloc/free</tt> version
with glibc <TT>malloc</tt>
are 10.51 (standard library, pthreads not linked),
20.90 (one thread, pthreads linked),
and 24.55 seconds respectively. The benchmark favors a
garbage collector, since most objects are small.)
<P>
The following table gives execution times for the collector built
with parallel marking and thread-local allocation support
(<TT>-DGC_LINUX_THREADS -DPARALLEL_MARK -DTHREAD_LOCAL_ALLOC</tt>). We tested
the client using either one or two marker threads, and running
one or two client threads. Note that the client uses thread-local
allocation exclusively. With <TT>-DTHREAD_LOCAL_ALLOC</tt> the collector
switches to a locking strategy that is better tuned to less frequent
lock acquisition. The standard allocation primitives thus perform
slightly worse than without <TT>-DTHREAD_LOCAL_ALLOC</tt>, and should be
avoided in time-critical code.
<P>
(The results of using <TT>pthread_mutex_lock</tt>
directly for allocation locking would have been worse still, at
least for older versions of linuxthreads.
With <TT>THREAD_LOCAL_ALLOC</tt>, we first repeatedly try to acquire the
lock with <TT>pthread_mutex_trylock()</tt>, busy-waiting between attempts.
After a fixed number of attempts, we use <TT>pthread_mutex_lock()</tt>.)
<P>
These measurements do not use incremental collection, nor was prefetching
enabled in the marker. We used the C version of the benchmark.
All measurements are in elapsed seconds on an unloaded machine.
<P>
<TABLE BORDER ALIGN="CENTER">
<TR><TH>Number of client threads</th><TH>1 marker thread (secs.)</th><TH>2 marker threads (secs.)</th></tr>
<TR><TD>1 client</td><TD ALIGN="CENTER">10.45</td><TD ALIGN="CENTER">7.85</td></tr>
<TR><TD>2 clients</td><TD ALIGN="CENTER">19.95</td><TD ALIGN="CENTER">12.3</td></tr>
</table>
<P>
The execution time for the single-threaded case is slightly worse than with
simple locking. However, the single-client benchmark runs faster than even
the thread-unsafe version if a second processor is available.
The execution time for two clients with thread-local allocation is
only 1.4 times the sequential execution time for a single thread in a
thread-unsafe environment, even though it involves twice the client work.
That represents close to a
factor of 2 improvement over the 2-client case with the old collector.
The old collector clearly
still suffered from some contention overhead, in spite of the fact that the
locking scheme had been fairly well tuned.
<P>
Full linear speedup (i.e. the same execution time for 1 client on one
processor as 2 clients on 2 processors)
is probably not achievable on this kind of
hardware even with such a small number of processors,
since the memory system is
a major constraint for the garbage collector: the processors usually
share a single memory bus, and thus the aggregate memory bandwidth does
not increase in proportion to the number of processors.
<P>
These results are likely to be very sensitive to both hardware and OS
issues. Preliminary experiments with an older Pentium Pro machine running
an older kernel were far less encouraging.
</body></html>