SystemProfiling

From LucidDB Wiki
Jump to: navigation, search

This page describes how to set up and use OProfile, a system-wide native-code profiler, for use with Eigenbase on Linux.


Contents

Setup

On a Debian-based system, you can install OProfile with apt-get install oprofile.

Normally, you should only profile optimized code. Build Farrago and/or Fennel using the --with-optimization --with-debug combination since OProfile gives more details when it has debug symbols available.

Don't use OProfile on a laptop; the performance counters usually aren't very reliable there.


Running

OProfile requires root access to control its daemon. Here's the usual sequence:

  1. Dry-run whatever you plan to profile to make sure it works, and also to get all images cached so load time will be less significant during profiling.
  2. opcontrol --no-vmlinux. (Tell OProfile that you don't have a kernel image. Unless you do, in which case see below.) You only need to do this once.
  3. opcontrol --reset. (Clears out any accumulated data.)
  4. opcontrol --start. (Starts collection.)
  5. Run something, e.g. a unit test or SQL script, to generate a profile
  6. opcontrol --stop. (Stops collection.)
  7. opcontrol --dump. (Flushes collected data.)
  8. opreport > system.txt. (Produces a system-wide profile showing sample counts for all executables running during the collection period.)
  9. opreport -l > symbols.txt. (Produces a detailed breakdown with symbols; add -g if you want source/line info as well.)
  10. opcontrol --shutdown. Do this once you're all done with profiling so the daemon will shut down and stop stealing resources.
  11. opcontrol --deinit. Fully unload oprofile (this is a guaranteed way to clear out any changes you've made to settings).

Example Output

Below is a portion of a profile collected while running a SQL test. It shows that most of the time was spent in the JVM and the kernel; a few Fennel libraries (libfarrago-0.5.0.so and libfennel_exec-0.5.0.so) also show up.

TIMER:0|
samples| %|
------------------
11197 43.0356 libjvm.so
11116 42.7243 no-vmlinux
724 2.7827 libc-2.3.5.so
606 2.3292 libpthread-2.3.5.so
595 2.2869 libverify.so
360 1.3837 libzip.so
251 0.9647 libjava.so
202 0.7764 ld-2.3.5.so
143 0.5496 Xorg
131 0.5035 libfarrago-0.5.0.so
83 0.3190 libvte.so.4.4.0
75 0.2883 libglib-2.0.so.0.800.3
68 0.2614 libfennel_exec-0.5.0.so

Variations

  • Sometimes there's a lot of setup/shutdown time involved in whatever you want to profile, but you don't want that overhead to obscure the code of interest. If you can't break it down so that you can script the setup/shutdown outside of the collection window, do it manually instead: get the system warmed up, turn on collection for a while, and then turn off collection before the test finishes.
  • If you want call-graph profiling, add --callgraph=32 (or whatever depth you need) to the opcontrol --start line, and add -c to the opreport -l line.
  • If a significant amount of time is spent in the kernel (usually shows up as the no-vmlinux image), you may want to get a breakdown to see what portions of the kernel are burning time. Consult the OProfile docs for how to do this. Depending on your distro, you'll need to install extra kernel packages such as source or debuginfo, or in the worst case compile your kernel from source.
  • There are lots of other options, like using opannotate to annotate source files with line-by-line sample counts, or restricting profiles to a single program. Read the OProfile docs.

Events

Typically, CPU cycles are of main interest, but oprofile can also be used to track other important counters such as processor cache misses. For example, here's how to set it to track L2 cache misses on an AMD64 machine:

  • opcontrol --event=L2_CACHE_MISS:10000

This tells it to bump up a counter after every ten thousand L2 cache misses.

Here's a little program you can use to test this out:

#include <stdlib.h>

static void thrash(int n)
{
  int *x = new int[n];
  for (int j = 0; j < 1000000000; ++j) {
    int i = rand() % n;
    ++x[i];
  }
}

int main(int, char**)
{
  thrash(10000000);
}

Example results:

real    3m26.299s
user    3m25.772s
sys     0m0.256s
----
CPU: AMD64 processors, speed 1000 MHz (estimated)
Counted L2_CACHE_MISS events (L2 Cache Misses) with a unit mask of 0x07 (multip\
le flags) count 10000
L2_CACHE_MISS:...|
  samples|      %|
------------------
    92589 90.0076 a.out
     5914  5.7491 libc-2.5.so
     3531  3.4326 no-vmlinux

That's about what we expect for 1 billion cache misses (92589 => roughly 100000 * 10000).

If you reduce the memory-size parameter n to thrash(100000), then everything can fit into a 1MB L2 cache, and the misses go away (even though the same number of array access operations are performed):

real    1m5.399s
user    1m5.349s
sys     0m0.010s
----
CPU: AMD64 processors, speed 1000 MHz (estimated)
Counted L2_CACHE_MISS events (L2 Cache Misses) with a unit mask of 0x07 (multip\
le flags) count 10000
L2_CACHE_MISS:...|
  samples|      %|
------------------
      200 45.4545 no-vmlinux
       39  8.8636 libc-2.5.so
Personal tools
Product Documentation