uprofile(1)
NAME
uprofile, kprofile - Profile a program (uprofile) or kernel (kprofile)
with Alpha on-chip performance counters
SYNOPSIS
uprofile [-v] [-quiet] [-dirname path] [-[no]pids] [-all | -each |
-one] [-stride n] [-average] [-pixie] [-display | prof-option...]
[statistic...] program [argument...]
kprofile [-v] [-quiet] [-dirname path] [-[no]pids] [-all | -each |
-one] [-stride n] [-average] [-pixie] [-display | prof-option...]
[-k kernel_name] [-t] [-ra] [statistic...] [program [argument...]]
DESCRIPTION
See prof_intro(1) for an introduction to the application performance
tuning tools provided with Tru64 UNIX.
The uprofile command uses the Alpha on-chip performance counters to
produce a fine-grained program-counter profile of a user program. The
command runs the program you specify with the arguments you specify,
collecting the selected statistics on the program's process and its
descendants. It writes the profile data to the umon.out file, by
default. If the program calls shared libraries, those libraries are not
profiled.
The kprofile command uses the Alpha on-chip performance counters to
produce a detailed program-counter profile of the kernel. If you spec‐
ify a program, kprofile runs the program with the arguments you spec‐
ify, and it collects the selected statistics on the kernel for the
duration of the program's execution. If you do not specify a program,
kprofile collects the selected statistics on the kernel until you enter
Ctrl/C or the kprofile process receives a SIGTERM signal. Note that if
SIGINT (usually generated by entering a Ctrl/C at the controlling ter‐
minal) is currently being ignored, it will continue to be ignored and
SIGTERM must be used to terminate data collection. kprofile writes the
profile data to the kmon.out file, by default.
If you specify -display or any of the prof-options, the uprofile and
kprofile commands display the profile by running the prof tool (with
any specified prof-options).
You can also run the prof command separately, to help analyze the data
in the umon.out or kmon.out file. The following examples show how to
invoke the prof command to analyze data in the respective files:
     % prof a.out umon.out
     % prof /vmunix kmon.out
The CPU-time profile displayed by prof will not be accurate if the
processors that executed the application do not all run at the same
speed, as in certain multiprocessor systems containing EV67 or later
processors. The inaccuracy may be avoided by using the hiprof
(sampling) or cc -p/-pg profilers, or by running the application on a
subset of the processors:
•    Select a single processor using the runon command.
•    Check the processor speeds using the psrinfo -v command and run the
     application in a processor set comprising only processors that run
     at the same speed (see processor_sets(4)).
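The speed check described above can be scripted. The following is a minimal sketch in POSIX shell, not part of uprofile or kprofile. The function name cpu_speeds_match is hypothetical, and the sketch assumes that psrinfo -v reports each processor's speed as a number followed by "MHz"; verify that layout against the output on your own system before relying on it.

```sh
# cpu_speeds_match
# Reads `psrinfo -v` style text on standard input and prints:
#   same     every reported speed is identical
#   mixed    at least two different speeds were reported
#   unknown  no "<number> MHz" pair was found
# Assumption: each processor's speed appears as a number immediately
# followed by a word beginning with "MHz"; check this against real output.
cpu_speeds_match() {
    awk '
        {
            # Record the field that precedes each "MHz" token.
            for (i = 2; i <= NF; i++)
                if ($i ~ /^MHz/)
                    speeds[$(i - 1)] = 1
        }
        END {
            n = 0
            for (s in speeds)
                n++
            if (n == 0)
                print "unknown"
            else if (n == 1)
                print "same"
            else
                print "mixed"
        }
    '
}
```

If the assumed format holds, running psrinfo -v | cpu_speeds_match and seeing "mixed" suggests restricting the run with runon or a processor set before trusting a CPU-time profile.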
OPERANDS
statistic
     The name of an event that your particular Alpha hardware can
     profile, as detailed in the STATISTICS section, below. If no
     statistic is named, machine cycles are counted, giving a CPU-time
     profile. One statistic can be specified for each of the hardware
     counters on your machine.
program
     The name of the executable to run while profiling operations are
     being performed.
argument
     An argument to pass to the program that is run. Multiple arguments
     can be specified, as needed by the program.
OPTIONS
Options can be abbreviated to three characters, except the
prof-options, which can be abbreviated (usually to one character) as in
a prof command. For example, -qui is interpreted as -quiet, but -q is
interpreted as -quit. (See the -display option for the supported
prof-options.)
For options that specify a procedure name (proc), C++ procedures can
omit the argument type list, though this will match all overloaded
procedures with that name. To select a specific procedure, specify the
full symbol name (as printed by the nm command). Symbol names
containing spaces, *, and so on must be quoted.
-v
     Engages verbose mode, which prints some useful information about
     the program being profiled.
-quiet
     Prevents informational and progress messages from being printed.
-dirname path
     Specifies the directory path in which the profiling data file or
     files are created.
-[no]pids
     [Disables] or enables the addition of the process-id number to the
     name of the profiling data file or files.
-all | -each | -one
     Specifies which mode to use for profiling on multiprocessor
     machines. Using the -all option (the default) aggregates the data
     for all CPUs into one umon.out file. Using the -each option
     collects separate profiles for each CPU and writes the output into
     a set of files named umon.out.n, where n is the CPU number. Using
     the -one option profiles only the current CPU. For the -one option
     to work, the uprofile or kprofile program must be run using the
     runon command.
-stride n
     Sets the granularity of the sample counts, where n is the number
     of consecutive instructions grouped together for each sample
     count. The default is -stride 4. The -asm, -heavy, and -lines
     prof-options need a separate sample count for each instruction
     (for their reports to be precise enough), so these options imply
     -stride 1. This makes the output file four times bigger than the
     default size. The -stride argument must be a power of two (for
     example, 1, 2, 4, 8).
-average
     Attempts to average samples within basic blocks so that each
     instruction within a basic block will show the same number of
     samples. Ensures fine grain profiles by setting stride to 1.
-pixie
     Produces files similar to those produced by running an executable
     instrumented with pixie (see pixie(1)). Uses cycles0 statistic
     (freq on EV67) by default. Ensures fine grain profiles by setting
     stride to 1.
-k kernel_name
     Overrides the name of the kernel to profile. (The default is the
     booted kernel.)
-t
     Enables triggered mode for kprofile. This option sets up all
     required information for running the performance counters, but
     does not invoke them. See the STATISTICS section for additional
     information.
-ra
     Enables PCNTCALLER mode for kprofile. Collects profiling data on
     the caller of certain kernel utility routines (for example, bcopy,
     bzero, simple_lock), instead of the routine itself.
-display
     Runs prof on the resulting profile data file(s). The following
     prof options are supported:
     -asm
          Reports the profile as an annotated disassembly.
     -exclude proc
          Excludes procedure proc from the profile but includes its CPU
          time or other statistic in the total.
     -Exclude proc
          Excludes procedure proc from the profile and from the total.
     -heavy
          Profiles source lines, printing those with the highest CPU
          time or other statistic first.
     -lines
          Reports the profile per source line within each procedure.
     -merge file
          Merges all profile data files into file.
     -numbers
          Prints each procedure's starting line number.
     -only proc
          Includes only procedure proc in the profile, but totals all
          procedures.
     -Only proc
          Includes only procedure proc in the profile and in the total.
     -procedures
          Profiles procedures, printing those with the highest CPU time
          or other statistic first.
     -quit n
          Truncates the reports after n lines or after (cumulative) n
          percent of the whole.
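The power-of-two requirement on the -stride argument can be checked in a wrapper script before a long profiling run is started. The following is a minimal sketch in POSIX shell; the function name check_stride is hypothetical and is not part of uprofile or kprofile, which apply their own argument checking.

```sh
# check_stride N
# Prints "valid" and returns 0 if N is a positive power of two
# (1, 2, 4, 8, ...); otherwise prints "invalid" and returns 1.
# Hypothetical helper for wrapper scripts, not part of uprofile.
check_stride() {
    n=$1
    case $n in
        ''|*[!0-9]*|0*)
            # Empty, non-numeric, zero, or leading-zero input is rejected.
            echo invalid
            return 1
            ;;
    esac
    # Halve n while it is even; only a power of two reduces to exactly 1.
    while [ $((n % 2)) -eq 0 ]; do
        n=$((n / 2))
    done
    if [ "$n" -eq 1 ]; then
        echo valid
    else
        echo invalid
        return 1
    fi
}
```

For example, check_stride 8 prints "valid", while check_stride 6 prints "invalid".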
STATISTICS
You specify the statistics that you want to collect for the program
being profiled in one or more statistic operands.
If you specify multiple statistics, uprofile and kprofile accumulate
their results. You cannot then view the results of any single statistic
separately. Because collected data is merged into a single buffer,
interpretation of multiply collected statistics may be difficult.
The Alpha architecture implemented on your machine determines which
statistics can be collected and the number of counters available for
collecting multiple statistics at the same time. The implementation is
indicated by the Alpha chip number, which can be displayed with the
show config console command before booting Tru64 UNIX, or, after boot‐
ing, by using the psrinfo -v command, or by calling getsysinfo
(GSI_PROC_TYPE). Also, if the uprofile command is run without argu‐
ments, it will show how many counters and what statistics are available
on your machine.
All of the chips in the EV4 family (21064 [EV4], 21064A [EV45],
21066/21068 [LCA4]) have two performance counter registers, each of
which can be separately programmed. The statistics that each counter
can collect are shown in the following table:
──────────────────────────────
Counter0Stats   Counter1Stats
──────────────────────────────
0disabled       1disabled
issues          dcache
pipedry         icache
loads           dualissues
pipefrozen      mispredicts
branches        floatops
cycles          intops
PALcycles       stores
nonissues       novictims
                victims
──────────────────────────────
All of the chips in the EV5 family (21164 [EV5], 21164A [EV56], and
21164PC [PCA56]) have three performance counter registers, each of
which can be separately programmed. Some of the counters are common to
all EV5 implementations, some are specific to EV5 and EV56, and some
are specific to PCA56.
The statistics that each of the common EV5 counters can collect are
shown in the following table:
──────────────────────────────────────────────────
Counter0Stats   Counter1Stats   Counter2Stats
──────────────────────────────────────────────────
0disabled       1disabled       2disabled
cycles0         nonissues       longstalls
issues          splitissue      pcmispredicts
                pipedry         branchmispredicts
                replay          icachemisses
                singleissues    itbmisses
                dualissues      dcacheldmisses
                tripleissues    dtbmisses
                quadissues      ldsmerged
                flowchanges     ldureplays
                intops          fullreplays
                floatops        externalinput
                loads           cycles2
                stores          memorybarriers
                icacheacc       lockedloads
                dcacheacc
──────────────────────────────────────────────────
The statistics that each of the EV5- and EV56-specific counters can
collect are shown in the following table:
───────────────────────────────────
Counter1Stats   Counter2Stats
───────────────────────────────────
scacheacc       scachemisses
scachereads     scachereadmisses
scachewrites1   scachewritemisses
scachevictim    scachesharedwrites
bcacheref       scachewrites2
bcachevictim    bcachemisses
sysreqs         systeminvalidates
                systemreadrequests
───────────────────────────────────
The statistics that each of the PCA56-specific counters can collect are
shown in the following table:
──────────────────────────────────────────
Counter1Stats Counter2Stats
──────────────────────────────────────────
bcachereads bcachedreads
bcachedreadhits bcachereadhits
bcachedreadfills bcachereadfills
bcachewrites bcachewritehits
bcachecleanwritehits bcachewritefills
bcachevictims sysreadflushhits
readmisstwo sysreadflushmisses
readmissthree
──────────────────────────────────────────
The EV6 chip has two performance counter registers, each of which can
be separately programmed. The statistics that each of the EV6-specific
counters can collect are shown in the following table:
──────────────────────────────
Counter0Stats   Counter1Stats
──────────────────────────────
0disabled       1disabled
cycles0         cycles1
retinst         retcondbranch
                retdtb1miss
                retdtb2miss
                retitbmiss
                retunaltrap
                replay
──────────────────────────────
The default is to gather cycle statistics in the 0th counter and to
disable other counters.
The EV67 chip has two kinds of performance counters: traditional aggre‐
gate counters and profile-me counters. The traditional aggregate sta‐
tistics that each of the EV67-specific counters can collect are shown
in the following table. Any one statistic or statistic combination may
be selected.
──────────────────────────────
Counter0Stats Counter1Stats
──────────────────────────────
0disabled 1disabled
cycles0 replay
retinst cycles1
retinst bcachemisses
──────────────────────────────
If no aggregate statistics are selected, one profile-me statistic may
be selected:
─────────────────────────────────────────────────────────────────────────────
Profile-me Statistics
─────────────────────────────────────────────────────────────────────────────
2disabled           abort               abort_per_ret       arith_trap
cbr_taken           cbr_taken_per_ret   cycles              cycles_per_ret
delay               delay_per_ret       dstream_fault       dtb_miss
dtb_miss_per_ret    dtb_miss3           dtb_miss4           early_kill
early_kill_per_ret  fp_disabled         freq                icache_miss
icache_miss_per_ret icache_parity       inflt_bcache        inflt_replays
inflt_retires       interrupt           istream_accvio      itb_miss
ldst_order          ldst_unalign        map_stall           map_stall_per_ret
mispredict          mispredict_per_ret  opcdec              replay_trap
replay_trap_per_ret retire              trap                trap_per_ret
valid
─────────────────────────────────────────────────────────────────────────────
The default is to gather cycle statistics in the 0th counter and to
disable other counters.
For descriptions of the statistics for all EV4, EV5, and EV6 implemen‐
tations, refer to pfm(7).
You can disable any counter by specifying 0disabled, 1disabled, or
2disabled as the counter statistic. You can use this feature to iso‐
late specific event types, such as loads, without extraneous data being
generated. You cannot disable all counters at the same time, choose two
statistics for the same counter, or disable a counter once its statis‐
tic is specified.
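The selection rules above can be illustrated with a small checker for the EV4 family. This is a sketch only: the function names ev4_counter_of and check_ev4_selection are hypothetical, the statistic names are taken from the EV4 table, the victims statistic is assumed to belong to counter 1, and the validation that uprofile and kprofile actually perform may differ in detail and in its messages.

```sh
# ev4_counter_of STAT
# Prints the counter (0 or 1) that can collect STAT on an EV4-family chip,
# using the statistic names from the EV4 table, or "unknown".
# Assumption: "victims" is a counter 1 statistic.
ev4_counter_of() {
    case $1 in
        0disabled|issues|pipedry|loads|pipefrozen|branches|cycles|PALcycles|nonissues)
            echo 0 ;;
        1disabled|dcache|icache|dualissues|mispredicts|floatops|intops|stores|novictims|victims)
            echo 1 ;;
        *)
            echo unknown ;;
    esac
}

# check_ev4_selection STAT...
# Prints "ok" if the statistics obey the selection rules, else an error.
check_ev4_selection() {
    stat0=
    stat1=
    for stat in "$@"; do
        case $(ev4_counter_of "$stat") in
            0)
                # One statistic per counter; this also rejects disabling
                # a counter after a statistic was chosen for it.
                if [ -n "$stat0" ]; then
                    echo "error: counter 0 already has $stat0, cannot add $stat"
                    return 1
                fi
                stat0=$stat ;;
            1)
                if [ -n "$stat1" ]; then
                    echo "error: counter 1 already has $stat1, cannot add $stat"
                    return 1
                fi
                stat1=$stat ;;
            *)
                echo "error: unknown EV4 statistic $stat"
                return 1 ;;
        esac
    done
    # Disabling every counter leaves nothing to profile.
    if [ "$stat0" = 0disabled ] && [ "$stat1" = 1disabled ]; then
        echo "error: all counters disabled"
        return 1
    fi
    echo ok
}
```

For example, check_ev4_selection loads 1disabled prints "ok" (loads isolated on counter 0), while check_ev4_selection loads issues is rejected because both statistics belong to counter 0.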
When you specify no counter statistics, uprofile and kprofile count
cycles on counter 0 by default, and display (through prof) a profile in
terms of seconds used by each procedure in the program, except for any
shared libraries.
For noncycle statistics, the displayed profile shows the number of sam‐
ples recorded, the sampling interval (events per second), and the total
number of events that this implies. Most noncycle statistics of the EV5
family CPUs are recorded about six cycles after the instruction that
triggered the sample. So, when using prof's -asm or -lines option, the
samples should be associated with one of the few previously executed
instructions or lines. The icacheacc, icachemisses, and dtbmisses
statistics are usually attributed precisely.
To perform a detailed analysis of short sections of kernel code, use
the kprofile command with triggered mode (invoked with the -t option).
When you use this mode, kprofile performs all of the required setup for
enabling the counters as normal, but does not invoke them. You can
insert counter start or stop commands into the kernel code to be
instrumented as follows:
     Turn counters on:   wrperfmon (PFOPT, 1)
     Turn counters off:  wrperfmon (0)
You can turn the counters on and off repeatedly to collect data over
many iterations or multiple sections of code.
The macro PFOPT is defined in <sys/pfcntr.h>.
NOTES
The interrupt load that profiling places on the system may affect per‐
formance, but usually the effect is insignificant.
The kernel in use must have the pfm pseudo-device configured into it.
To do this, use one of the following methods:
•    Add the following line to the kernel configuration file, and
     rebuild the kernel. Do not use this method if CPU hot-swap is
     supported by the system, because it does not allow pfm to be
     easily unconfigured, as required for a hot-swap; instead, use the
     sysconfig method below.
          pseudo-device   pfm
•    Enter the following command from the root account. Do not
     configure pfm if CPU hot-swap is anticipated.
          # sysconfig -c pfm
     If pfm is configured, the CPU hot-swap procedure requires that
     it be unconfigured, using the following command, before any CPU
     is swapped:
          # sysconfig -u pfm
     The autosysconfig program can be used to automatically load the
     configurable pfm device at each system startup.
The format of the data files produced by uprofile in Tru64 UNIX is dif‐
ferent from the format produced in versions of DIGITAL UNIX prior to
Version 4.0. The Tru64 UNIX data files include the names of selected
statistics in profile displays. To convert these data files to the
industry-standard format, at the expense of losing the names of the
statistics, use the pdtostd command.
RESTRICTIONS
The EV4 victim and novictim statistics rely on the external performance
counter pin connections as described in the EV4 chip specification. The
DEC 3000/400, /500, /600, and /800 workstations have these connections.
Attempts to display either of these statistics on other platforms
(while allowed) will typically generate empty data.
The uprofile command is only supported on EV4 Pass 3 or later proces‐
sors. Attempts to use it on a Pass 2 processor will gather PC samples
for every process running on the system.
Using kprofile to generate statistics for a single command is only pos‐
sible on EV4 Pass 3 or later processors. Attempts to do this on a Pass
2 processor will gather statistics for the entire system, as if no com‐
mand had been specified.
Using kprofile with triggered mode also requires an EV4 Pass 3 or later
processor and cannot be performed with per-process monitoring.
Only one tool can use the performance counters at a time. A message
similar to “the counter device is busy” indicates that some other tool
is using the performance counters (or has used them but not cleaned up
properly). If you are sure no one else is using the performance coun‐
ters, running uprofile/kprofile with superuser privilege will attempt
to reset the busy status and proceed.
FILES
The performance counter device file.
umon.out
     The statistics file(s) generated by uprofile.
kmon.out
     The statistics file(s) generated by kprofile.
The statistics file(s) generated with the -pids option.
The default kernel to profile.
SEE ALSO
Introduction: prof_intro(1)
pdtostd(1), pfm(7), prof(1), runon(1), psrinfo(1), sysconfig(8),
autosysconfig(8), processor_sets(4)
Programmer's Guide