		Function Graph Tracing

	   Documentation written by Tim Bird


Function graph tracing is a variant of function tracing, which traces
both the entry and exit of each kernel function.

Preparation
-----------

The function graph tracing feature is compiled in using the
CONFIG_FUNCTION_GRAPH_TRACER option. Tracing is disabled by default, so it is
safe to have this set to yes. On platforms without dynamic tracing capability
(e.g. ARM in 2.6.30), function tracing (and function graph tracing) add
significant overhead to function execution in the Linux kernel).  On these
platforms it would be unwise to leave function tracing turned on in
production environments.

Note that function graph tracing is supported on SMP systems.

Usage Quick Reference
---------------------

 $ mount -t debugfs debugfs /debug
 $ echo function_graph > /debug/tracing/current_tracer
 $ echo funcgraph-abstime >/debug/tracing/trace_options
 $ echo 1 >/debug/tracing/tracing_enabled
 $ <do something> ; echo 0 >/debug/tracing/tracing_enabled
 $ cat /debug/tracing/trace > /tmp/trace-something.txt
 $ echo nop > /debug/tracing/current_tracer

Usage
-----

Make sure debugfs is mounted to /debug. If not, (requires root privileges)
 $ mount -t debugfs debugfs /debug

Activate function graph tracing (requires root privileges):
 $ echo function_graph > /debug/tracing/current_tracer

Enable tracing (if not already enabled)
 $ echo 1 >/debug/tracing/tracing_enabled

Do something, and quickly disable tracing, to avoid overrunning the
related events in the trace log.  Note that the trace log uses a ring
buffer, which continually overwrites older events in the log, until
tracing is disabled.

 $ <do something> ; echo 0 >/debug/tracing/tracing_enabled

Store the trace:
$ cat /debug/tracing/trace > /tmp/trace-something.txt

Extra Tip:
During tracing you can place comments (markers) into the trace by

 $ echo "foo happened" > /debug/tracing/trace_marker

This makes it easier to see which part of the (huge) trace corresponds to
which action. It is recommended to place descriptive markers about what you
do.  (I'm not sure how effective this is for function graph tracing.  The
trace buffer fills so quickly that any comment made in "human time" will
likely get overrun in the trace buffer before a human has a chance to stop
the trace. - this tip was copied from mmiotrace.txt)

Shut down function graph tracing (requires root privileges):
 $ echo nop > /debug/tracing/current_tracer

If it doesn't look like sufficient data was captured, you can enlarge the
buffers and try again. Buffers are enlarged by first seeing how large the
current buffers are:

 $ cat /debug/tracing/buffer_size_kb

gives you a number. Approximately double this number and write it back.

For instance:
 $ echo 128000 > /debug/tracing/buffer_size_kb

Then start again from the top.

How function graph tracing works
--------------------------------
The function tracer (upon which the function graph tracer is dependent) works
by having the compiler put a call to a special instrumentation routine (called
'mcount') in the prologue of every kernel function.  (Well, almost every
kernel function - some functions can not be traced for one reason or another.)
The function graph tracer records the entry time for each function traced.
Then, it arranges to capture the function exit, by setting up a return
trampoline.  This is done by recording the real return address, and
substituting the trampoline address for the return address in the process
stack.  When the function returns to the tracer, the time is recorded again,
and the real caller is returned to.

Trace Log Format
----------------

The log format for the function graph tracer is text and is unfortunately NOT
easily filtered with e.g. grep and awk.  The output is easily read by a human,
and is useful for showing how functions nest within other functions.  But
since the duration of each function is printed at the end of the routine (on a
line with no name, just a closing brace), you have to scan through the log to
match up the duration times with the matching functions.

The function graph tracer consists of a header showing the tracer name, and
the fields that are configured for display on each line.  Then lines are shown
for function entry and exit events.  Also, when a process change occurs (that
is, a schedule occurs in the kernel), a special sequence is shown).

Here is a sample showing the log header and a few sample trace lines.

# tracer: function_graph
#
#      TIME       CPU  DURATION                  FUNCTION CALLS
#       |         |     |   |                     |   |   |   |
 2105.963678 |   0)               |      mutex_unlock() {
 2105.963682 |   0)   5.715 us    |        __mutex_unlock_slowpath();
 2105.963693 |   0) + 14.700 us   |      }
 2105.963698 |   0)               |      dnotify_parent() {

A process change line shows the process name and pid for the outgoing and
incoming tasks.  The outgoing task is shown on the left, followed by '=>'
then the incoming task.

Here is a sample of a process change:
 ------------------------------------------
 0)     ls-595     =>    sed-596
 ------------------------------------------

You can configure the items displayed for each trace element, by setting
/debug/tracing/trace_options.  (See Trace Display Options below)

The following elements are available for display:

TIME - this is the (absolute) time since the machine started in seconds, with
a decimal portion showing resolution to microseconds.  The actual resolution
of the time (and of the tracer timings in general) will depend on the specific
clock used by ftrace. This option is off by default.

CPU - indicates the CPU on which the function was executed

TASK/PID - shows the task name and PID (process ID) for each trace entry.  The
entry has the format <task>-<pid> (e.g. "sh-443").  This option is off by
default.

OVERHEAD (not labeled) - is a flag indicator, showing functions whose
duration exceeded certain threshholds:
   space (or nothing) - the function executed in less than 10 microseconds
   + - the function lasted longer than 10 microseconds
   ! - the function lasted longer than 100 microseconds

DURATION - this is printed for function exit trace events, and shows
 the time in microseconds that the function executed.  This is "wall time"
for the function, and includes the time spent in any sub-routines or
with the CPU scheduled away from this routine (this routine or a child
waiting for a something.)

FUNCTION CALLS - this shows the name of the function for function entry events,
and a closing brace '}' for function exit events. For function entry (only,
not a leaf) the line consists of a function name and an open brace.  For
function exit, the line consists of a closing brace.  Note that the function
name and closing brace are indented to indicate their depth on the call stack.
This allows you to match up function exits with their matching functions, and
to easily identify the call tree (graph) for a particular function.

Trace Display Options
---------------------

The following display options are available for customizing the function
graph trace output:

   abstime - show TIME
   cpu - show CPU
   overhead - show OVERHEAD indicator
   proc - show TASK and PID
   duration - show DURATION
   overrun - shows if the trace had an overrun (used for debugging the tracer)

To set an option echo a string to /debug/tracing/trace_options, using the format:
"funcgraph-<opt_name>".  To unset a particular option, use the format:
"nofuncgraph-<opt_name>".

 $ echo funcgraph-abstime >/debug/tracing/trace_options
 $ echo nofuncgraph-cpu >/debug/tracing/trace_options
 $ cat trace

Trace filter options
--------------------

The function graph tracer supports a few filtering options which are
useful for analyzing a subset of function events in the system.

Filter by depth
---------------

Filtering by depth==0 allows you to see only the functions where the kernel
was called directly from user space.  This is useful to see syscalls and page
faults (only).  This can be used like a system-wide strace - except it shows
faults as well as system calls.

To do this, set a function entry and function exit filter for a depth of 0.

 $ echo "graph_ent.depth == 1" >/debug/tracing/events/ftrace/funcgraph_entry/filter
 $ echo "ret.depth == 1" >/debug/tracing/events/ftrace/funcgraph_exit/filter

Then capture your trace as usual.

Tutorial/Samples
----------------

For an interesting first trace, try the following:
(make sure no other processes are active, if you can)

 $ mount -t debugfs debugfs /debug
 $ cd /debug/tracing
 $ echo function_graph > current_tracer
 $ echo funcgraph-abstime > trace_options
 $ echo 10000 > buffer_size_kb
 $ echo 1 > tracing_enabled ; ls /bin | sed s/[aeiou]/z/g ; \
	echo "marker test!" > trace_marker ; echo 0 >tracing_enabled
 $ cat trace > /tmp/trace1.txt

You might need to change the buffer size to 20000 (20 M) if
you don't capture the entire trace (if the top of the buffer
starts in the middle of ls or sed execution).

Now examine the data:
 $ ls -l /tmp/trace1.txt
 $ vi /tmp/trace1.txt

Note that the trace data is quite large, but probably only covers one
or two seconds of execution time.

Here are some things to look for in the trace:

 * Watch as the shell forks and execs ls and sed.
 * Watch for timer interrupts which cause context switches between the three
 * processes (sh, ls and sed).
 * Watch for page faults during ls and sed startup, as the processes are paged
   into memory.
 * Look for routines starting with "sys_" which are at the top level of the
   call graphs. These are system calls.
 * Look for the duration value for the system calls, on the lines with the
   closing braces at the same indentation level.  You can see how long
   different system calls took to execute, and what sub-routines were called
   internally in the kernel.  Also, you can determine if a long-running
   system call was interrupted (and the process scheduled out, before
   returning)
 * Look for the comment "marker test!" near the end of the trace.


