rather than size: cons cell (c), floating point number (f), structure or vector (s), and other (o).} of the memory allocated by each function as a fraction of the memory allocated by all functions. In this example, 99\% of the memory allocated by the program was allocated in {\tt make\_widget} for medium-sized objects. Blank columns indicate values less than 1\%. The other size breakdown given in the direct allocation table is for the memory that was allocated and never freed. The {\tt \%~all~kept} field contains a size breakdown of the memory not freed by a particular function as a fraction of all the memory not freed. In the example, 99\% of the unfreed memory was of medium-sized objects allocated by {\tt make\_widget}.

\subsection{The Allocation Call Graph}

Understanding the memory allocation behavior of a program sometimes requires more information than just knowing the functions that are directly responsible for memory allocation. Sometimes, as happens in Figure~\ref{C example figure}, the same allocation function is called by several different functions for different purposes. The allocation call graph shows all the functions that were indirect callers of functions that allocated memory.

Because the dynamic caller/callee relations of a program are numerous, the dynamic call graph is a complex table with many entries. Often, the information provided by the first three tables is enough to allow programmers to understand the memory allocation of their program. Nevertheless, for a full understanding of the allocation behavior of programs the allocation call graph is useful. Figure~\ref{call graph figure} contains the allocation call graph for the producer/consumer example and corresponds to the call graph profile generated by {\tt gprof}.
\begin{figure}[ht]
\begin{singlespace}
{\scriptsize\include{extable2}}
\end{singlespace}
\caption{Allocation Call Graph for Producer/Consumer Example}
\label{call graph figure}
\end{figure}

The allocation call graph is a large table with an entry for each function that was on a call chain when memory was allocated. Each table entry is divided into three parts: the line for the function itself (called the {\em entry function\/}); lines above that line, each of which represents a caller of the entry function (the ancestors); and lines below that line, each of which represents a function called by the entry function (the descendents). The entry function is easy to identify in each table entry because a large rule appears in the {\tt frac} column on that row. In the first entry of Figure~\ref{call graph figure}, {\tt main} is the entry function; there are no ancestors and two descendents.

The entry function line of the allocation call graph contains information about the function itself. The {\tt index} field provides a unique index to help users navigate through the call graph. The {\tt self~$+$~desc} field contains the percent of total memory allocated that was allocated in this function and its descendents. The call graph is sorted by decreasing values in this field. The {\tt self} field contains the number of bytes that were allocated directly in the entry function. The {\tt size-func} fields contain a size breakdown of the memory allocated in the function itself. Some functions, like {\tt main} (index 0), allocated no memory directly, so the {\tt size-func} fields are all blank. The {\tt called} field shows the number of times this function was called during a memory allocation, with the number of recursive calls recorded in the adjacent field.

Each caller of the entry function is listed on a separate line above it. A summary of all callers is given on the top line of the entry if there is more than one ancestor.
The {\tt self} field of ancestors lists the number of bytes that the entry function and its descendents allocated on behalf of the ancestor. The {\tt size-ances} field breaks down those bytes into size categories, while the {\tt frac-ances} field shows the size breakdown of the bytes requested by this ancestor as a fraction of bytes allocated at the request of all ancestors. For example, in the entry for function {\tt make\_widget} (index 1), the ancestor {\tt make\_red\_widget} can be seen to have requested 1,023,876 bytes of data from {\tt make\_widget}, 99\% of which was of medium-sized objects. Furthermore, calls from {\tt make\_red\_widget} accounted for 50\% of the total memory allocated by {\tt make\_widget} and its descendents. Other fields show how many calls the ancestor made to the entry function and how many calls the ancestor made in total. In a similar fashion, information about the function's descendents appears below the entry function.

Had the memory leak table not already told us what objects were not being freed, we could use the allocation call graph for the same purpose. The direct allocation table told us that {\tt make\_widget} allocated 1,023,876 bytes of unfreed memory, all for medium-sized objects. From the allocation call graph, we can see that the function {\tt make\_red\_widget} was the function calling {\tt make\_widget} that requested 1,023,876 bytes of medium-sized objects.

Cycles in the call graph are not illustrated in Figure~\ref{call graph figure}. As described in the next section, cycles obscure allocation information among functions that are members of a cycle. When the parent/child relationships that appear in the graph are between members of the same cycle, most of the fields in the graph must be omitted.

\section{Implementation}
\label{implementing mprof}

We have implemented {\tt mprof} for use with C and Common Lisp programs.
Since the implementations are quite similar, the C implementation will be described in detail, and the minor differences in the Lisp implementation will be noted at the end of the section.

\subsection{The Monitor}

The first phase of {\tt mprof} is a monitor that is linked into the executing application. The monitor includes modified versions of {\tt malloc} and {\tt free} that record information each time they are invoked. Along with {\tt malloc} and {\tt free}, {\tt mprof} provides its own {\tt exit} function, so that when the application program exits, the data collected by the monitor is written to a file. The monitor maintains several data structures needed to construct the tables.

To construct the leak table, the monitor associates a list of the last five callers in the call chain, the {\em partial call chain\/}, with the object allocated. {\tt mprof} augments every object allocated with two items: an integer which is the object size as requested by the user (since the allocator may allocate an object of a different size for convenience), and a pointer to a structure that contains the object's partial call chain and a count of allocations and frees of objects with that call chain. A hash table is used to map a partial call chain to the structure containing the counters. When an object is allocated, its partial call chain is used as a hash key to retrieve the structure containing the counters. A pointer to the structure is placed in the allocated object and the allocation counter is incremented. When the object is later freed, the pointer is followed and the counter of frees is incremented. Any partial call chain in which the number of allocations does not match the number of frees indicates a memory leak and is printed in the leak table.

To construct the allocation bin table, the monitor has a 1026-element array of integers to count allocations and another 1026-element array to count frees.
When objects of a particular size from 0--1024 bytes are allocated or freed, the appropriate bin is incremented. Objects larger than 1024 bytes are grouped into the same bin.

The construction of the direct allocation table falls out directly from maintaining the allocation call graph information, which is described in the next section.

\subsection{Constructing the Allocation Call Graph}

To construct the allocation call graph, the monitor must associate the number of bytes allocated with every function on the current dynamic call chain, each time {\tt malloc} is called. Consider the sample call chain in Figure~\ref{cchain:fig}, which we abbreviate: {\tt main}-$>${\tt foo}-$>${\tt bar}(24).

\begin{figure}[htbp]
\begin{verbatim}
     CALL STACK:                     MPROF RECORDS:
     main calls foo                  24 bytes over main -> foo
       foo calls bar                 24 bytes over foo -> bar
         bar calls malloc(24)        24 bytes allocated in bar
\end{verbatim}
\caption{Example of a Dynamic Call Chain}
\label{cchain:fig}
\end{figure}

In {\tt mprof}, the monitor traverses the entire call chain by following return addresses. This differs from {\tt gprof}, where only the immediate caller of the current function is recorded. {\tt gprof} makes the assumption that each call takes an equal amount of time and uses this assumption to reconstruct the complete dynamic call graph from information only about the immediate callers. In {\tt mprof}, we actually traverse the entire dynamic call chain and need to make no assumptions.

In choosing to traverse the entire call chain, we have elected to perform an operation that is potentially expensive both in time and space. One implementation would simply record every function in every chain and write the information to a file (i.e., in the example we would output [{\tt main}-$>${\tt foo}-$>${\tt bar}, 24]).
Considering that many programs execute millions of calls to {\tt malloc} and that the depth of a call chain can be hundreds of functions, the amount of information could be prohibitive.

An alternative to recording the entire chain of callers is to break the call chain into a set of caller/callee pairs, and associate the bytes allocated with each pair in the chain. For the call in the example, we could maintain the pairs [{\tt main}, {\tt foo}] and [{\tt foo}, {\tt bar}], and associate 24 bytes with each pair. Conceptually, the data structure our monitor maintains is an association between caller/callee pairs and the cumulative bytes allocated over the pair, which we denote ([{\tt main}, {\tt foo}], 24). To continue with the example, if the next allocation was {\tt main}-$>${\tt foo}-$>${\tt otherbar}(10), where this is the first call to {\tt otherbar}, we would update the byte count associated with the [{\tt main}, {\tt foo}] pair to 34 from 24. Furthermore, we would create a new association between [{\tt foo}, {\tt otherbar}] and the byte count, 10. A disadvantage of this implementation is that the exact call chains are no longer available. However, from the pairs we can construct the correct dynamic call graph of the program, which is the information that we need for the allocation call graph.

For the overhead imposed by the monitor to be reasonable, we have to make the association between caller/callee pairs and cumulative byte counts fast. We use a hash table in which the hash function is a simple byte-swap XOR of the callee address. Each callee has a list of its callers and the number of allocated bytes associated with each pair. In an effort to decrease the number of hash lookups, we noted that from allocation to allocation, most of the call chain remains the same. Our measurements show that on the average, 60--75\% of the call chain remains the same between allocations.
This observation allows us to cache the pairs associated with the current caller chain and to use most of these pairs the next time a caller chain is recorded. Thus, on any particular allocation, only a few addresses need to be hashed. Here are the events that take place when a call to {\tt malloc} is monitored:
\begin{enumerate}
  \item The chain of return addresses is stored in a vector.
  \item The new chain is compared with the previous chain, and the point at which they differ is noted.
  \item For the addresses in the chain that have not changed, the caller/callee byte count for each pair is already available and is incremented.
  \item For new addresses in the chain, each caller/callee byte count is looked up and updated.
  \item For the tail of the chain (i.e., the function that called {\tt malloc} directly), the direct allocation information is recorded.
\end{enumerate}

Maintaining allocation call graph information requires a byte count for every distinct caller/callee pair in every call chain that allocates memory. Our experience is that there are a limited number of such pairs, even in very large C programs, so that the memory requirements of the {\tt mprof} monitor are not large (see section~\ref{mmeas:sec}).

\subsection{Reduction and Display}

The second phase of {\tt mprof} reads the output of the monitor, reduces the data to create a dynamic call graph, and displays the data in four tables. The first part of the data reduction is to map the caller/callee address pairs to actual function names. A program {\tt mpfilt} reads the executable file that created the monitor trace (compiled so that symbol table information is retained), and outputs a new set of function caller/callee relations. These relations are then used to construct the subset of the program's dynamic call graph that involved memory allocation.

The call graph initially can contain cycles due to recursion in the program's execution.
Cycles in the call graph introduce spurious allocation relations, as is illustrated in Figure~\ref{recursive call figure}. In this example, {\tt main} is credited as being indirectly responsible for 10 bytes, but because we only keep track of caller/callee pairs, {\tt F} appears to have requested 20 bytes from {\tt G}, even though only 10 bytes were allocated.

\begin{figure}[htbp]
\begin{verbatim}
        CALL STACK:                     MPROF RECORDS:
        main calls F                    (10 bytes over main -> F)
          F calls G                     (10 bytes over F -> G)
            G calls F                   (10 bytes over G -> F)
              F calls G                 (10 MORE bytes over F -> G)
                G calls malloc(10)      (10 bytes allocated in G)
\end{verbatim}
\caption{Problems Caused by Recursive Calls}
