For the past few months, I've spent a good deal of time using the Linux utility perf to profile PostgreSQL. On the whole, I've been favorably impressed. Before I started using perf, trying to profile generally meant using oprofile or gprof. Unfortunately, both of those tools have serious problems.
gprof requires recompiling your entire application, which is not an intolerable burden for the types of performance work that I do, but it imposes so much overhead that the behavior of your application may be quite different with profiling enabled than without. In fact, it's hard to overstate the overhead: it's huge. There are times when you can get away with this and still get some useful numbers regardless... but not many.
oprofile, like perf, makes use of facilities provided by the Linux kernel to do statistical profiling, which makes it much more efficient. Unfortunately, I found that a number of the machines I had available for performance testing simply wouldn't run oprofile; and on the flip side, one 64-core machine that did run oprofile crashed under heavy load (64 concurrent clients, mostly CPU-bound). Even after I drastically ramped back the sampling rate, the profiling overhead was still massive, so it was hard to be sure that the results were reflective of what would have happened without the profiling. A more significant problem is that oprofile has a terrible user interface. Changing configuration parameters didn't seem to work reliably unless I unloaded and reloaded the kernel module, and it was always difficult to be sure it was doing what I wanted it to.
perf, in contrast, has a pleasantly simple user interface. You run 'perf record somecommand', and it profiles the command and writes the output to a file called perf.data. If you want to profile the whole system (yes, please!) then you write 'perf record -a somecommand' and it profiles the system for so long as somecommand is running. If you throw in -g, it gives you a call graph. Even with -g, the overhead seems to be a lot less than for oprofile. To view the results, you can simply run 'perf report', which at least on my system brings up a curses-based interface that lets you incrementally expand the call graph, if there is one, or zoom into functions and see which individual machine-language instructions are hot. All of this is great. It took me a while to figure out how to produce text-format output suitable for posting to the pgsql-hackers mailing list, but someone eventually tipped me off: just redirect the output of perf report to a file, and it'll produce an ASCII-format report instead of using the GUI.
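To make the workflow above concrete, here's roughly what those invocations look like (the pgbench flags are just an illustrative workload; adjust to taste):

```shell
# Profile a single command, with call graphs (-g):
perf record -g pgbench -c 32 -T 300

# Or profile the whole system for as long as the command runs:
perf record -a -g pgbench -c 32 -T 300

# Browse the results interactively (curses interface):
perf report

# Redirect the output to get a plain-text report instead:
perf report > profile.txt
```

The samples land in perf.data in the current directory, and perf report reads from there by default.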
But that brings me to the biggest weakness of perf, as I see it, which is that the documentation is very thin. The tool is simple enough that the basics don't need much documentation. However, as soon as you go beyond the basics, you do need documentation, and it isn't there. For example, I learned pretty quickly that the 'perf list' command can be used to list all the events that perf can collect, and it's very easy to select whichever ones you want: just run 'perf record -e EVENTNAME' and you're there. But there's no explanation - anywhere, as far as I can find - of what those events are actually measuring. Some of the events are hardware events, like L1-icache-loads, and I can guess that L1-icache-loads means loads into the CPU's level-one instruction cache, and I even have some idea what that means. But what does LLC-loads mean? Beats me, and Google's not much help either: it's not a limited liability company, and it's not IEEE 802.2 either. What does node-loads mean? No clue.
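For reference, event selection looks like this (somecommand is a placeholder, as above, and the event names are ones perf list reports on my hardware - yours may differ):

```shell
# List every event perf knows about on this machine:
perf list

# Record using a specific event instead of the default (cycles):
perf record -e L1-icache-loads -g somecommand

# -e can be repeated to sample several events in one run:
perf record -e cycles -e cs somecommand
```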
The sort-of-good news about hardware events is that if you can figure out what the hardware event names mean, then you can go read a book about CPU architecture and have some idea what they are measuring. But perf also includes software events like cpu-clock and task-clock that have no meaning apart from the Linux kernel, and there's no documentation about what those mean, either. I get approximately equivalent results from profiling with the default event, cycles, with task-clock, and with cpu-clock, and I haven't been able to find an explanation anywhere as to what the difference is. For perf to be broadly useful, beyond the circle of developers who created it, it needs documentation explaining what all of these events are, and how to make effective use of them.
On the plus side, there are some events for which the meaning is not too difficult to guess. So far, by far the most useful of these has been the context-switches event, which can be abbreviated to cs. 'perf record -g -e cs' allows you to see what's causing your program to context switch, and this turns out to be useful in troubleshooting lwlock contention within the PostgreSQL backend. This command led directly to a patch to avoid repeated CLOG access from heap_hot_search_buffer, because in looking through the context switch call graph I discovered an unexpected and unnecessary contributor to lock contention. The problem was easy to fix once spotted, but without perf, I wouldn't have realized that we had the problem in the first place.
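The context-switch profiling technique described above amounts to something like the following (again, the pgbench workload is illustrative; you could equally attach to a running backend with -p PID):

```shell
# Sample a call graph at each context switch, system-wide,
# while the workload runs:
perf record -g -e cs -a -- pgbench -c 32 -T 300

# Then see which call paths are switching most often:
perf report
```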
One limitation of this technique is that it only finds context switches based on frequency. If we're context-switching out at one point in the code very frequently, but only for short periods of time, and at some other point in the code less frequently, but for much longer periods of time, only the former is visible with this technique. What I'd really like to have is a profile based on wall-clock time. I'd like to have an event that would allow me to see the places where the program spends most of its elapsed time, as opposed to most of its cycles. I thought that perhaps task-clock or cpu-clock would do this, but they don't seem to; thus far, the only profiling tool I've found that does this is HP's Caliper. If anyone out there can figure out a way to make perf do this, I owe you a beer.
Similarly, while perf can trace system calls, it can only tell you which system calls are happening frequently, not which ones are taking the most system time. It seems to get overloaded pretty easily when tracing system calls, too. Fortunately, for perf, getting overloaded means losing some data rather than (as with oprofile) taking down the machine.
Now that I've started to feel comfortable with the core perf record/perf report functionality, I've been branching out and trying some of the other tools that come as part of the package. I tried 'perf sched', but didn't find the results very informative; it seems like this may only be intended to troubleshoot bugs in the kernel scheduler. I also tried 'perf diff', but it seg faulted. So, instead of using the Fedora 17 packages, I downloaded the Linux kernel and compiled from source, and that did fix the seg fault, but it still didn't do what I wanted. 'perf diff' is supposed to allow differential profiling - essentially, you can subtract the results of two profiles and see what the differences are. However, it's really only intended to let you difference the results of two runs against the same set of binaries. If you have, say, binary A, built from the upstream branch tip, and binary B, built with one or more patches, and you do a profiling run with each one, perf diff won't realize that those binaries are "the same program" and tell you about the symbols that are hotter in one profile than the other. Instead, it'll observe that binary A isn't called at all in the profiling run using B, so when diffing A against B, all the deltas for the symbols in B are the same as their absolute values. There are probably situations in which the actual behavior here is quite useful, but it wasn't what I was looking for.
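For what it's worth, the intended perf diff workflow - two profiling runs against the same binary - is a sketch like this (perf.data.old and perf.data are, as far as I can tell, the default file names perf diff looks for):

```shell
# Baseline run:
perf record -o perf.data.old somecommand

# Second run, e.g. after changing a setting (not the binary):
perf record -o perf.data somecommand

# Report per-symbol deltas between the two profiles:
perf diff perf.data.old perf.data
```

The key constraint, as noted above, is that the symbols only line up when both profiles were taken against the same binaries.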
All in all, perf seems to be the best of the Linux profiling tools that are currently available by a considerable margin, and I think that they've made good decisions about the functionality and user interface. The lack of proper documentation is probably the tool's biggest weakness right now, but hopefully that is something that will be addressed over time.