Accounting for memory use in a program is often difficult. The TfMallocTag system is designed to track memory use in a hierarchical fashion. Memory use in this context refers specifically to allocations that you make using malloc() (and its variants) and release using free(). This includes the C++ new and delete operators, since they generally call through to malloc() and free() (but see Best Profiling Results as well).
The basic idea is that at any point during program execution, you can push a memory "tag" onto the local call stack. Allocations made while that tag is active are "billed" to that tag; at any point, you can query and see how much outstanding memory is due to that particular tag. If additional tags are pushed onto the stack, memory allocations are billed to these "children" tags but are also included in the bill to the parent tags.
Each line of code in the program that pushes a tag is called a call-site. A sequence of call-sites is called a path node and describes the hierarchy under which allocations take place. You use an object whose constructor pushes the tag and whose destructor pops the tag to control the lifetime of a tag. Consider the following example:
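A minimal sketch of such a program, assuming the call structure described in the walkthrough that follows. Here `ScopedTag` and `BilledAlloc` are hypothetical stand-ins for TfAutoMallocTag and malloc(), used so the listing compiles without the Tf headers. The bookkeeping is illustrative only and is not how TfMallocTag is implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Stand-in bookkeeping (assumption): not the real TfMallocTag internals.
struct Bill {
    std::size_t total = 0;   // bytes billed to this path node and its descendants
    std::size_t direct = 0;  // bytes allocated while this path node was innermost
};

static std::vector<std::string> g_tagStack;  // currently active call-sites
static std::map<std::string, Bill> g_bills;  // path node ("Top/A/B") -> bill

// Hypothetical RAII tag: pushes on construction, pops on destruction.
class ScopedTag {
public:
    explicit ScopedTag(const std::string& name) { g_tagStack.push_back(name); }
    ~ScopedTag() {
        assert(!g_tagStack.empty());
        g_tagStack.pop_back();
    }
    ScopedTag(const ScopedTag&) = delete;
    ScopedTag& operator=(const ScopedTag&) = delete;
};

// Stand-in for malloc(): records the request instead of allocating.
void BilledAlloc(std::size_t nBytes) {
    std::string path;
    for (std::size_t i = 0; i < g_tagStack.size(); ++i) {
        if (i > 0) path += "/";
        path += g_tagStack[i];
        g_bills[path].total += nBytes;  // every enclosing path node is billed
    }
    if (!path.empty()) g_bills[path].direct += nBytes;  // innermost path only
}

void FuncB() {
    ScopedTag tag("B");
    BilledAlloc(100);
}

void FuncA() {
    ScopedTag tag("A");
    BilledAlloc(100);
    FuncB();
}

void TopFunction() {
    ScopedTag tag("Top");
    FuncA();
    FuncB();
    BilledAlloc(100);  // note1
}
```

After one call to TopFunction(), `g_bills` in this sketch holds four entries, one per path node.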
Suppose now that you invoke TopFunction(). The program has three different call-sites, Top, A and B. However, running TopFunction() generates the following distinct path nodes: Top, Top/A, Top/A/B, and Top/B.
The total memory billed to path node Top is 400 bytes: calling TopFunction() results in allocations in itself, in its direct calls to FuncA() and FuncB(), and in the indirect call to FuncB() from FuncA(). The direct memory billed to path node Top is only 100 bytes, which is the allocation on the line marked note1 in the example. Even though that allocation comes after FuncA() has been called, the call-site tag A is no longer active by then, since it was popped off the stack when FuncA() exited.
Continuing the analysis, the total memory billed to Top/A is 200 bytes, while Top/A/B and Top/B are each billed 100 bytes. To access these statistics, call TfMallocTag::GetCallTree(). Note that the system does not begin any actual accounting until TfMallocTag::Initialize() is called. Any memory allocations that occur prior to this point are "off the radar."
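To make the relationship between direct and total bytes concrete, here is a small sketch that rebuilds the tree from the walkthrough above and derives each total from the direct amounts. The `Node` struct and both helper functions are hypothetical; they do not reproduce the result type that GetCallTree() fills in, so check the TfMallocTag header for the actual fields.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical path-node record (assumption): not the real call-tree type.
struct Node {
    std::string siteName;
    std::size_t directBytes;
    std::vector<Node> children;
};

// Inclusive total: direct bytes plus the totals of all children.
std::size_t TotalBytes(const Node& node) {
    std::size_t total = node.directBytes;
    for (const Node& child : node.children) {
        total += TotalBytes(child);
    }
    return total;
}

// Append one line per path node, indented two spaces per level.
void Report(const Node& node, int depth, std::ostringstream& out) {
    assert(depth >= 0);
    out << std::string(2 * static_cast<std::size_t>(depth), ' ')
        << node.siteName
        << " total=" << TotalBytes(node)
        << " direct=" << node.directBytes << "\n";
    for (const Node& child : node.children) {
        Report(child, depth + 1, out);
    }
}

// The path nodes Top, Top/A, Top/A/B and Top/B, each with 100 direct bytes.
Node MakeExampleTree() {
    Node bUnderA{"B", 100, {}};
    Node a{"A", 100, {bUnderA}};
    Node bUnderTop{"B", 100, {}};
    return Node{"Top", 100, {a, bUnderTop}};
}
```

Report() recomputes each subtree total on the way down, which is fine for a tree this small. A real report would more likely store the totals once.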
The following example shows a typical use of tags in a library. Note that memory tags in a program need to be distinct from other people's tags, so you should follow the same basic guidelines that apply to avoiding name-conflicts in functions and classes.
Thus far, all of the tags shown in examples have been string literals. Sometimes it is useful to let tags be constructed on the fly, as in the following example:
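A minimal sketch of a tag name assembled at run time. `ScopedTag` is a hypothetical stand-in for TfAutoMallocTag that only records the names it sees, and `LoadPlugin` is an invented function used for illustration.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

static std::vector<std::string> g_activeTags;  // stand-in tag stack
static std::set<std::string> g_callSites;      // every distinct name seen so far

// Hypothetical stand-in for TfAutoMallocTag (assumption).
class ScopedTag {
public:
    explicit ScopedTag(const std::string& name) {
        g_activeTags.push_back(name);
        g_callSites.insert(name);
    }
    ~ScopedTag() {
        assert(!g_activeTags.empty());
        g_activeTags.pop_back();
    }
    ScopedTag(const ScopedTag&) = delete;
    ScopedTag& operator=(const ScopedTag&) = delete;
};

// The tag name is built from the argument each time the function runs.
void LoadPlugin(const std::string& pluginName) {
    ScopedTag tag("LoadPlugin " + pluginName);
    // ... allocations made here would be billed to this call-site ...
}
```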
This technique can generate any number of different call-sites (each distinct name passed in generates a different call-site) and thus a whole sequence of different path nodes.
Note that even if the tagging system is not being used (see Performance, below), the cost of building the call-site name string is still incurred in the above example. While this should almost never be a problem, extremely performance-critical code should probably pass a string literal or a previously constructed string rather than constructing a string (even an empty one).
Occasionally, using local variables to delimit the scope of a tag isn't possible. You can make manual calls to Push() and Pop(), but whenever possible you should use TfAutoMallocTag.
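One reason to prefer the scoped form is that an early exit can skip a manual pop. The sketch below shows this with a stand-in tag stack. `PushTag`, `PopTag`, and `ScopedTag` are hypothetical substitutes for Push(), Pop(), and TfAutoMallocTag, and the failing operation is invented for the demonstration.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

static std::vector<std::string> g_tagStack;  // stand-in tag stack

// Hypothetical stand-ins for the manual push and pop calls (assumption).
void PushTag(const std::string& name) { g_tagStack.push_back(name); }
void PopTag() {
    assert(!g_tagStack.empty());
    g_tagStack.pop_back();
}

// Hypothetical scoped guard: the pop runs on every exit path.
class ScopedTag {
public:
    explicit ScopedTag(const std::string& name) { PushTag(name); }
    ~ScopedTag() { PopTag(); }
    ScopedTag(const ScopedTag&) = delete;
    ScopedTag& operator=(const ScopedTag&) = delete;
};

void MayThrow(bool fail) {
    if (fail) throw std::runtime_error("load failed");
}

// Manual version: an exception skips PopTag(), leaving the tag active.
void ManualTagged(bool fail) {
    PushTag("Manual");
    MayThrow(fail);
    PopTag();
}

// Scoped version: the destructor pops the tag during stack unwinding.
void ScopedTagged(bool fail) {
    ScopedTag tag("Scoped");
    MayThrow(fail);
}

// Returns the tag stack depth left behind after a failing call.
std::size_t DepthAfterFailure(void (*fn)(bool)) {
    g_tagStack.clear();
    try {
        fn(true);
    } catch (const std::runtime_error&) {
        // Swallowed: only the leftover stack depth matters here.
    }
    return g_tagStack.size();
}
```

In this sketch the manual version leaves one stale tag on the stack after the exception, so later allocations would be billed to the wrong call-site. The scoped version leaves none.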
The memory cost of using the TfMallocTag system is as follows:
Obviously, a program that does nothing but allocate memory can be slowed substantially by turning tagging on. For typical applications, however (such as RenderMan, or loading models in Menv30), the actual runtime hit has so far proven to be in the 2-3 percent range when tagging is active. Programs that contain tags but have not called Initialize() show no measurable increase in running time.
The above statement applies to ptmalloc3. For the jemalloc allocator, it's unclear right now what performance impact this might have for prman.
The memory statistics can be fooled by programs which contain their own allocators: in this case, the memory requested by the allocator itself does not correlate well with what tags are active. It is best to turn off as many internal allocators as possible.
In particular, the C++ STL maintains its own allocator for small requests, as does the Tf library. The former can be turned off by setting the environment variable GLIBCXX_FORCE_NEW to any value. For simplicity, setting this variable also deactivates the Tf library allocator (see TfFixedSizeAllocator).
The TfMallocTag system is completely thread-safe. Each thread has its own local stack of call-site objects. You can call TfMallocTag::GetCallTree() or TfMallocTag::GetTotalBytes() at any time and from any thread.