Wednesday, June 06, 2012

perf: the good, the bad, the ugly

For the past few months, I've spent a good deal of time using the Linux utility perf to profile PostgreSQL.  On the whole, I've been favorably impressed.  Before I started using perf, trying to profile generally meant using oprofile or gprof.  Unfortunately, both of those tools have serious problems.

gprof requires recompiling your entire application, which is not an intolerable burden for the types of performance work that I do, but it imposes so much overhead that the behavior of your application may be quite different with profiling enabled than without.  In fact, it's hard to overstate the overhead: it's huge.  There are times you can get away with this and get some useful numbers regardless... but not many.
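
For reference, the basic gprof workflow looks roughly like this (a sketch; the file names are made up):

    # compile and link with -pg so the binary is instrumented for gprof
    gcc -pg -O2 -o myprog myprog.c
    # run the program normally; it writes gmon.out to the current directory on exit
    ./myprog
    # turn gmon.out into a human-readable report
    gprof myprog gmon.out > gprof-report.txt

Every object file that should appear in the profile needs to be built with -pg, which is why the whole application ends up being recompiled.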

oprofile, like perf, makes use of facilities provided by the Linux kernel to do statistical profiling, which makes it much more efficient.  Unfortunately, I found that a number of the machines I had available for performance testing simply wouldn't run oprofile; and on the flip side, one 64-core machine that did run oprofile crashed under heavy load (64 concurrent clients, mostly CPU-bound).  Even after I drastically ramped back the sampling rate, the profiling overhead was still massive, so it was hard to be sure that the results were reflective of what would have happened without the profiling.  A more significant problem is that oprofile has a terrible user interface.  Changing configuration parameters didn't seem to work reliably unless I unloaded and reloaded the kernel module, and it was always difficult to be sure it was doing what I wanted it to.

perf, in contrast, has a pleasantly simple user interface.  You run 'perf record somecommand', and it profiles the command and writes the output to a file called perf.data.  If you want to profile the whole system (yes, please!) then you run 'perf record -a somecommand' and it profiles the system for as long as somecommand is running.  If you throw in -g, it gives you a call graph.  Even with -g, the overhead seems to be a lot less than for oprofile.  To view the results, you can simply run 'perf report', which at least on my system brings up a curses-based interface that lets you incrementally expand the call graph, if there is one, or zoom into functions and see which individual machine-language instructions are hot.  All of this is great.  It took me a while to figure out how to produce text-format output suitable for posting to the pgsql-hackers mailing list, but someone eventually tipped me off: just redirect the output of perf report to a file, and it'll produce an ASCII-format report instead of bringing up the interactive viewer.
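
To make that concrete, a typical invocation looks roughly like this (a sketch; the pgbench arguments are just an example workload):

    # profile the whole system, with call graphs, while pgbench runs
    perf record -a -g pgbench -c 32 -j 32 -T 300 pgbench
    # browse the results interactively
    perf report
    # or dump a plain-text report, suitable for mailing
    perf report > report.txt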

But that brings me to the biggest weakness of perf, as I see it, which is that the documentation is very thin.  The tool is simple enough that the basics don't need much documentation.  However, as soon as you go beyond the basics, you do need documentation, and it isn't there.  For example, I learned pretty quickly that the 'perf list' command can be used to list all the events that perf can collect, and it's very easy to select whichever ones you want: just run 'perf record -e EVENTNAME' and you're there.  But there's no explanation - anywhere, as far as I can find - as to what those events are actually measuring.  Some of the events are hardware events, like L1-icache-loads, and I can guess that L1-icache-loads means loads into the CPU's level-one instruction cache, and I even have some idea what that means.  But what does LLC-loads mean?  Beats me, and Google's not much help either: it's not a limited liability company, and it's not IEEE 802.2 either.  What does node-loads mean?  No clue.
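
For what it's worth, the mechanics really are that simple (a sketch; the sleep just bounds a system-wide run):

    # list the events this kernel and CPU combination expose
    perf list
    # sample on a specific hardware event instead of the default
    perf record -a -g -e L1-icache-loads sleep 30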

The sort-of-good news about hardware events is that if you can figure out what the hardware event names mean, then you can go read a book about CPU architecture and have some idea what they are measuring.  But perf also includes software events like cpu-clock and task-clock that have no meaning apart from the Linux kernel, and there's no documentation about what those mean, either.  I get approximately equivalent results from profiling with the default event, cycles, with task-clock, and with cpu-clock, and I haven't been able to find an explanation anywhere as to what the difference is.  For perf to be broadly useful, beyond the circle of developers who created it, it needs documentation explaining what all of these events are, and how to make effective use of them.
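
For what it's worth, one quick way to look at all three side by side is perf stat (a sketch; 'somecommand' is a placeholder, and the raw counts don't do much to explain the distinction):

    perf stat -e cycles,task-clock,cpu-clock somecommand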

On the plus side, there are some events for which the meaning is not too difficult to guess.  So far, by far the most useful of these has been the context-switches event, which can be abbreviated to cs.  'perf record -g -e cs' allows you to see what's causing your program to context switch, and this turns out to be useful in troubleshooting lwlock contention within the PostgreSQL backend.  This command led directly to a patch to avoid repeated CLOG access from heap_hot_search_buffer, because in looking through the context switch call graph I discovered an unexpected and unnecessary contributor to lock contention.  The problem was easy to fix once spotted, but without perf, I wouldn't have realized that we had the problem in the first place.
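
Concretely, the recipe looks something like this (a sketch; the sleep just bounds a system-wide run):

    # profile on the context-switch event, system-wide, with call graphs
    perf record -a -g -e cs sleep 30
    perf report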

One limitation of this technique is that it only finds context switches based on frequency.  If we're context-switching out at one point in the code very frequently, but only for short periods of time, and at some other point in the code less frequently, but for much longer periods of time, only the former is visible with this technique.  What I'd really like to have is a profile based on wall-clock time.  I'd like to have an event that would allow me to see the places where the program spends most of its elapsed time, as opposed to most of its cycles.   I thought that perhaps task-clock or cpu-clock would do this, but they don't seem to; thus far, the only profiling tool I've found that does this is HP's Caliper.  If anyone out there can figure out a way to make perf do this, I owe you a beer.

Similarly, while perf can trace system calls, it can only tell you which system calls are happening frequently, not which ones are taking the most system time.  It seems to get overloaded pretty easily when tracing system calls, too.  Fortunately, for perf, getting overloaded means losing some data rather than (as with oprofile) taking down the machine.
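
One way to count system calls with perf is via the kernel's syscall tracepoints, roughly like this (a sketch; it assumes the kernel exposes those tracepoints, and it generally needs root):

    # count entries to each system call, system-wide, for ten seconds
    perf stat -e 'syscalls:sys_enter_*' -a sleep 10

That gives you call counts per syscall, but says nothing about how long the calls took, which is exactly the limitation described above.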

Now that I've started to feel comfortable with the core perf record/perf report functionality, I've been branching out and trying some of the other tools that come as part of the package.  I tried 'perf sched', but didn't find the results very informative; it seems like this may only be intended to troubleshoot bugs in the kernel scheduler.  I also tried 'perf diff', but it segfaulted.  So, instead of using the Fedora 17 packages, I downloaded the Linux kernel sources and built perf from there; that did fix the segfault, but it still didn't do what I wanted.  'perf diff' is supposed to allow differential profiling - essentially, you can subtract the results of two profiles and see what the differences are.  However, it's really only intended to let you difference the results of two runs against the same set of binaries.  If you have, say, binary A, built from the upstream branch tip, and binary B, built with one or more patches, and you do a profiling run with each one, perf diff won't realize that those binaries are "the same program" and tell you about the symbols that are hotter in one profile than the other.  Instead, it'll observe that binary A isn't called at all in the profiling run using B, so when diffing A against B, all the deltas for the symbols in B are the same as their absolute values.  There are probably situations in which the actual behavior here is quite useful, but it wasn't what I was looking for.
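
For completeness, the intended perf diff workflow seems to be something like this (a sketch; './workload' is a placeholder):

    # baseline run, saved under a non-default name
    perf record -o perf.data.old ./workload
    # second run, writing the default perf.data
    perf record ./workload
    # compare the two profiles symbol by symbol
    perf diff perf.data.old perf.data

That works well enough when both runs use the same binaries; as described above, it falls apart when they don't.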

All in all, perf seems to be, by a considerable margin, the best of the Linux profiling tools currently available, and I think its developers have made good decisions about functionality and user interface.  The lack of proper documentation is probably the tool's biggest weakness right now, but hopefully that is something that will be addressed over time.

22 comments:

  1. Matthew Woodcraft, June 06, 2012 10:20 AM

    For what it's worth, "LLC" stands for "last-level cache".

    Replies
    1. i.e. if your CPU has an L1, L2, and L3 cache, LLC means L3? Makes sense, but hard to guess that.

    2. I just came across an explanation of FLC, MLC, and LLC (the levels of a cache hierarchy) that should clear this up:
      FLC - first level cache, the innermost cache. "FLC" is rarely used, since there are usually split L1 I$ and D$.
      MLC - middle level cache. Usually an L2$.
      LLC - last level cache. Usually an L3$; the outermost cache.

  2. Don't know if it will be useful, but try looking at the information in tools/perf/Documentation/* in the kernel source tree.

    Replies
    1. Direct link to the git source tree for the 3.4 kernel branch: http://goo.gl/Ldf9P

    2. I've seen that documentation, but as you'll see if you look at it, there's no description at all of (for example) what the different events mean. Also, some of it is fairly opaque: perf bench is documented as 'general framework for benchmark suites', but it's really 'a suite of simple benchmarks that some kernel developers thought would be useful to people running perf'. Based on the description, I thought it was something to help me run MY benchmark suites more easily, but it isn't that at all. As documentation goes, it's clearly better than nothing, and it does tell you what options are available and what they do, but it's still a bit rough in places, and general information on what the tools do and how to use them is extremely sparse.

    3. Sorry for the noise then :/

      Unfortunately, I think there's no other way of getting more documentation apart from mining the kernel mailing list archives, but obviously this is very time-consuming.

    4. [prev comments deleted due to grammar errors]

      A "maybe" useful link that I've stumbled across,
      just in case you haven't already found it:

      http://web.eecs.utk.edu/~vweaver1/projects/perf-events/

    5. @ Andrea Suisani:

      "The Unofficial Linux Perf Events Web-Page
      because the perf_events developers don't seem that excited about writing documentation"

      That paragraph there made my day. :)

      Thanks for the link.

    6. Updated URL of "The Unofficial Linux Perf Events Web-Page": http://web.eece.maine.edu/~vweaver/projects/perf_events/

  3. For tracing system calls, strace -c -p shows you the # of calls to each syscall, total system time in the syscall, avg time in syscall. If you use -f to also trace forked processes I think the aggregation is across all processes being traced.
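
    For example (the PID here is just a placeholder):

      strace -c -f -p 12345
      # ... let it run for a while, then hit Ctrl-C to get the per-syscall summary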

    Not sure if that's the info you were trying to get...?

    Replies
    1. strace -c and strace -cf are very useful, but unfortunately they have much higher overhead than perf. So it would be nice to have perf be able to do it.

    2. Ah right. I haven't had to be concerned with the overhead as much in the times I've done this sort of thing. What about system tap?

    3. Haven't tried systemtap yet...

  4. @Robert

    Any reason why you're not using DTrace on Postgres?

    Also, any updates on FreeBSD?

  5. Hi Robert

    I too would love to see you spend more time on FreeBSD (as it relates to Postgres).

    And as Rick above points out, there's no better platform for performance-testing Postgres than FreeBSD, using the tools it provides via DTrace.

  6. +1 for DTrace and FreeBSD.

  7. Event descriptions can't be put into the profiling tool. You can find them in the manual for your processor. For example, for Intel, search Google for 'intel optimization manual'; you'll find the 'Intel® 64 and IA-32 Architectures Optimization Reference Manual', which has an appendix B, 'Using performance monitoring events'.

  8. “It took me a while to figure out how to produce text-format output suitable for posting to the pgsql-hackers mailing list”

    --stdio option to perf-report
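
    For example:

      perf report --stdio > profile.txt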

    “ I learned pretty quickly that the 'perf list' command can be used to list all the events that oprofile can collect,”

    s/oprofile/perf/?

    “But perf also includes software events like cpu-clock and task-clock that have no meaning apart from the Linux kernel, and there's no documentation about what those mean, either. I get approximately equivalent results from profiling with the default event, cycles, with task-clock, and with cpu-clock, and I haven't been able to find an explanation anywhere as to what the difference is.”

    cpu-clock is for profiling using high-res timers. It is the fallback for systems without a means of doing hardware profiling. task-clock is similar but imagine a clock that only runs when the task is running.


    “’perf record -g -e cs' allows you to see what's causing your program to context switch, . . . One limitation of this technique is that it only finds context switches based on frequency.”

    add the -c 1 argument, which tells perf “I want all context switches” (i.e., a count of 1, so all occurrences of the event).
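
    For example (a sketch):

      perf record -a -g -e cs -c 1 sleep 30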

    “What I'd really like to have is a profile based on wall-clock time. I'd like to have an event that would allow me to see the places where the program spends most of its elapsed time, as opposed to most of its cycles. I thought that perhaps task-clock or cpu-clock would do this, but they don't seem to; thus far, the only profiling tool I've found that does this is HP's Caliper. If anyone out there can figure out a way to make perf do this, I owe you a beer.”

    profiling means taking samples at some frequency — which means delta-time. H/W profiling with cycles event, and S/W profiling with cpu-clock and task-clock all do that. Time based sampling. Can you elaborate on what you mean by ‘profile based on wall-clock time’?


    “perf can trace system calls, it can only tell you which system calls are happening frequently, not which ones are taking the most system time. It seems to get overloaded pretty easily when tracing system calls, too.”

    Download perf for 3.12 and try out ‘perf trace’. Should address that problem.

    And in 3.13 you will get the syscall summary along with statistics — min, max, average. You can get that now by grabbing Arnaldo’s tree (git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux and checking out the perf/core branch). And features coming soon include tagging system calls that blocked or caused a context switch and an option to dump callstacks (so many features, so little time).

    “I tried 'perf sched', but didn't find the results very informative; it seems like this may only be intended to troubleshoot bugs in the kernel scheduler.”

    Try out:
    https://github.com/dsahern/linux/tree/perf-sched-timehist-3.11

    perf sched record -g; perf sched timehist --events

    “I also tried 'perf diff', but it seg faulted.”

    Another one that has received a lot of attention recently. Try out a new version.

    Replies
    1. > task-clock is similar but imagine a clock that only runs when the task is running.

      I'm not sure: task-clock is consistently (but slightly) higher than cpu-clock.

  9. It is pretty easy to make google-perftools profile on SIGALRM events instead of SIGPROF, which it uses by default. Do a search and replace:

    ITIMER_PROF -> ITIMER_REAL
    SIGPROF -> SIGALRM

    Note that this will not end well if the subject uses SIGALRM for some other purpose.

  10. This is a helpful post on your experiences with profiling and perf for PostgreSQL. Thank you Robert.
