Many embedded systems require hard or soft real-time execution that must
meet
rigid timing constraints. Further complicating the issue is that for a
variety of reasons, most of these same embedded systems have very
limited processing power; it is not uncommon for them to be using an
8-bit or 16-bit processor operating at 10 MHz or less.
Real-time systems theory advocates the use of an appropriate
scheduling algorithm and performing a schedulability
analysis prior to building the system. Adherence to this
theory
alone does not lead to working embedded systems, and thus use of this
theory is often dismissed by practitioners.
Practitioners, on the other hand, spend days, if not weeks, testing
and debugging hard-to-find and difficult-to-replicate problems because
their system is not performing
to specifications. Often, these problems are related to the system's
timing, because functional testing was done using good tools, and the
system usually produces a correct response.
There exists a balance between theory and practice, where proper
design of real-time code enables the real-time analysis of it.
Systematic techniques for measuring execution time can then be used
alongside the guidelines provided by real-time systems theory to help
an engineer design, analyze, and if necessary quickly fix timing
problems in real-time embedded systems.
This series of two articles discusses techniques for measuring and
optimizing real-time code, and analyzing performance by correlating the
measurements with the real-time specifications through use of real-time
systems theory. Since this paper is directed towards practitioners,
simple rules of thumb that encapsulate the knowledge of complex
theories and proofs are presented.
Several other activities of the development process can benefit from
estimating and measuring execution time using the methods described
here. This includes debugging hard-to-find timing errors that result in
hiccups in the system, estimating processing needs of software, and
determining the hardware needs when enhancing functionality of an
existing system or reusing code in subsequent generations of embedded
systems.
Overview of Measurement Techniques
Many different methods exist to measure execution time, but there is no
single best technique. Rather, each technique is a compromise between
multiple attributes, such as resolution, accuracy, granularity, and
difficulty. A summary of the key attributes follows:
Resolution is
a representation of the limitations of the timing hardware. For
example, a stop watch measures with a 0.01 sec resolution, while a
logic analyzer might be able to measure with a resolution of 50 nsec.
Accuracy
is the closeness of the measured value using a given method of
measuring, as compared to the actual time if a perfect measurement was
obtained. If a particular measurement is repeated several times, there
is usually some amount of error in the measurements. Thus, measurements
could yield answers of the form x +/- y. In this case, y is the
accuracy of the measurement x.
Granularity
is the part of the code that can be measured, and is usually specified in
a subjective manner. For example, coarse granularity (also called
coarse-grain) methods would generally measure execution time on a
per-process, per-procedure, or per-function basis.
In contrast, a method that has fine granularity (also called
fine-grain) can be used to measure execution time of a loop, small code
segment, or even a single instruction. Important to note is that some
fine-grain techniques can also be used to perform coarse-grain
measurements, although the effort in doing so could be much greater
than using a coarse-grain method.
Difficulty
subjectively defines the effort to obtain measurements. A method that
requires the user to simply run the code and it produces an instant
answer or a table of results is considered easy. A method that requires
usage of instrumentation such as a logic analyzer and filtering of data
to obtain the answers is considered hard.
Typically, software-only methods are easier, but yield only
coarse-grain results. Hardware-assisted methods are hard, but they can
provide fine-grain results with high accuracy.
Table 1: Summary of methods to measure execution time
Table 1, above, summarizes
the methods and attributes of each method as presented in this paper.
Note that in many cases the attributes are approximated or subjective,
not exact values; however comparing the attributes of different methods
should provide sufficient information to help choose the best
measurement for a particular need.
The method of choice can also depend on the hardware features and
instrumentation tools available. For example, some methods require
special hardware features like a digital output port, while other
techniques require a specific software application or measurement
instrumentation to be available. In some cases, the hardware or tools
needed can be quite expensive and the cost and lack of availability can
prevent using a particular method.
On the other hand, having access to the right tools can
significantly decrease the amount of effort needed to obtain needed
measurements, and thus obtaining the tools most suited to a project's
needs could be a worthwhile investment.
The design of the software can also have a major impact on the
ability to obtain measurements of execution time, but it is not
classified as an attribute, as there is no way to quantify or qualify
every possible variation.
In particular, the execution time of software designed in an ad-hoc
manner (also known as "spaghetti code")
is very difficult to measure because the starting and stopping points
of the code are not easy to identify, and if there are multiple and
inconsistent entry or exit points to the same piece of code, then
obtaining accurate measurements is near impossible.
On the other hand, software designed so that it is "analyzable"
clearly has a single entry and exit point for any part of it that needs
to be measured, and those entry and exit points are defined
consistently for all code segments that have similar functionality.
Selecting a Method
To select which measurement method to use, first consider the reason
for measuring execution time. The most common reasons for measuring
execution time are to refine estimates, optimize code, analyze
real-time performance, and to debug timing errors.
Refining estimates is usually done during the design phase or early
in the implementation phase. The estimates might be used to select
which processor to use, or to obtain ballpark figures on how many
iterations of a particular function can be executed per second.
Coarse-grain measurements can provide some of these answers fairly
quickly. Sometimes, the measurements can even be made on the host
processor, with an approximated scale factor for the target processor
(such as "embedded processor X is about 18 times slower than host
processor Y".)
Optimizing code could use coarse-grain methods or fine-grain
methods, depending on what is being optimized. If optimization is at a
global scale, such as deciding whether it would be faster to use arrays
or linked lists in a particular application, then a coarse-grain
technique to measure execution time of complete functions is usually
sufficient.
On the other hand, for localized optimizations, such as those that
are specific to a target processor and occur during the late stages of
development or when trying to fine-tune an application, a fine-grain
technique that can measure execution time of a single line of code is
usually needed.
Analyzing real-time performance can use a coarse-grain technique,
but often only fine-grain techniques can provide the necessary
accuracy. The accuracy needs to be at least five to ten times finer
than the period of the fastest task.
Thus, if the fastest task in the system has a period of 10 msec,
then a measurement technique that provides an accuracy of at least 1 to
2 msec for functions is needed to provide fairly good answers. More
accuracy is better, especially if the Central Processing Unit (CPU) is
either overloaded or operating at almost 100% utilization. In these
cases, a technique with microsecond accuracy is needed.
Debugging timing errors usually needs a fine-grain method with
maximum resolution. It is often necessary to measure not only user
code, but also real-time operating system (RTOS) code, and to detect
any anomalies that might be occurring, such as missed deadlines or
tasks not executing at the desired rate.
Measurement Methods
Of the measurement techniques that were summarized in Table 1, a few methods are quite
straightforward, but most are only applicable to UNIX-based systems,
such as most embedded versions of Linux.
The software analyzer method is an all-encompassing description of
some features provided by commercial RTOS and tools. The techniques
described towards the end of this tutorial are the ones based on
hardware,
and can be used independent of the RTOS. These can provide the most
accurate results, but also involve the most complexity.
Stop-watch.
A stop watch is only suitable for non-interactive programs, preferably
running on single-tasking systems. It can be used to measure time of
things like numerical code which may take minutes or hours to execute,
and when measurements only need to be approximations (e.g. to nearest
second).
The method simply involves using the chronograph feature of a
digital wrist-watch (or other equivalent timing device). When the
program starts, start the watch. When the program ends, stop the watch,
and read the time.
Date command. The
date command is useful when using a UNIX-based system or any other RTOS
that has a command that displays the current date and time.
The date command is used like a stopwatch, except it uses the
built-in clock of the computer instead of an external stopwatch. This
method is more accurate than a stop-watch, but has the same granularity
of only being able to accurately measure non-interactive processes.
A typical way to use the command is to wrap the program that is
being measured in a shell script or alias with the following commands:
date > output
program >> output
date >> output
As with the stop-watch method, this will only provide an estimate of
how long the full program takes to execute. It does not take into
consideration preemption, interrupts, or I/O. Most accurate answers are
obtained on non-preemptive systems. This method is useful if the output
serves as a log, so that the start and end time of each execution is
logged into the file. A sample use is for long simulations that run in
the background overnight, where the log shows precisely when each
run ended.
Time command (UNIX). The
time command is useful when using a UNIX-based system. Other RTOSes
might
provide a similar command. Execution time measurement is activated by
prefixing time to a command line. This command not only measures the
time between beginning and end of the program, but it also computes the
execution time used by the specific program, taking into consideration
preemption, I/O, and other activities that cause the process to give up
the CPU.
The output depends on which version of the time command is being
used. In some cases, the time command is part of the shell. In other
cases, it can be found in /usr/bin/time. In each case the output is the
same information, just the format is different. For example:
% time program
8.400u 0.040s 0:18.40 56.1%
Interpreting the output, the first item (with a u appended, u=user), is the
execution time of program, shown here as 8.4 sec. This is the amount of
time the CPU was actually executing the program. Any time spent
preempted, blocked for I/O, or performing RTOS functions is excluded.
The second item (with s appended,
s=system), is the execution time used by the RTOS while running
the program. This includes execution time for items such as device
drivers, interrupt handlers, or other system calls directly associated
with the program. The example shows that 0.04 sec of execution time was
for system functions.
The third item is the total time that the program was executing in
the system, whether it be running or blocked or waiting on the ready
queue. In this case, it was 18.4 sec. This time is about the same
time that would be reported using the date method above.
The fourth item is the average percentage of CPU time used when the
task was ready or running. The value primarily depends on the load of
the system, and has little meaning as far as measuring execution time.
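The same user/system/elapsed breakdown can also be collected from inside a program. The following is a minimal sketch, assuming a POSIX system that provides times() and sysconf(); the names cpu_split, measure_split, and busy_work are hypothetical and chosen only for illustration.

```c
#include <sys/times.h>
#include <unistd.h>

/* Hypothetical container for the three values the time command reports. */
struct cpu_split {
    double user_sec;    /* CPU time spent executing the program itself   */
    double system_sec;  /* CPU time spent in the OS on the program's behalf */
    double elapsed_sec; /* wall-clock time, including time spent blocked */
};

/* Example workload: a short busy loop standing in for real code. */
volatile unsigned long sink;
void busy_work(void)
{
    unsigned long i;
    for (i = 0; i < 1000000UL; i++)
        sink += i;
}

/* Run work() once and report how its time was spent. Returns 0 on success. */
int measure_split(void (*work)(void), struct cpu_split *out)
{
    struct tms t0, t1;
    clock_t w0, w1;
    long ticks = sysconf(_SC_CLK_TCK);   /* clock ticks per second */

    if (ticks <= 0)
        return -1;
    w0 = times(&t0);
    work();
    w1 = times(&t1);
    if (w0 == (clock_t)-1 || w1 == (clock_t)-1)
        return -1;

    out->user_sec    = (double)(t1.tms_utime - t0.tms_utime) / (double)ticks;
    out->system_sec  = (double)(t1.tms_stime - t0.tms_stime) / (double)ticks;
    out->elapsed_sec = (double)(w1 - w0) / (double)ticks;
    return 0;
}
```

As with the time command, the values are only as fine as the system clock tick, so short workloads may report zero.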
Prof and Gprof (UNIX). The previous methods can
only be used to measure a complete program. Many times, it is necessary
to measure execution time at a finer granularity.
One method to measure execution on a per function basis is to use
the prof or gprof profiling mechanisms available in UNIX. Profiling
means to obtain a set of timing measurements for all (or a large part)
of the code. The granularity of a profile depends on the method. In
this case, both prof and gprof measure execution time with the
granularity of a function. The resolution is usually that of the system
clock, meaning on the order of 10 msec.
Both prof and gprof do similar things, except
that gprof gives much more
detailed results than prof. The measured time properly takes into
account preemption, such that if a process is preempted, the clock
stops until the process starts to execute again. This profiling
mechanism, however, does slow down execution of the program by a
non-negligible amount.
So the execution time measured when using prof or gprof will be greater than the real
execution time of the program when it is not being profiled. Despite
this inaccuracy, the method can be useful to identify which functions
in the program are using the most execution time, to identify where
optimizations might need to be made the most.
To use prof, compile with
the -p option, then run the program as follows (other compiler options can
be used too; this is just an example).
% gcc -p -o program program.c
% program
When the program terminates, the file mon.out is automatically
created. It is a binary file that contains the timing data by function
for the program. To view the timing data, type the following:
% prof program
A more detailed profile report can be obtained using gprof, by compiling with the -pg
option as follows:
% gcc -pg -o program program.c
% program
Running the program creates the file gmon.out, which can be viewed
as follows:
% gprof program
For information that describes the format of the statistics and what
each entry means, look at the online UNIX manuals for prof and gprof
for the specific operating system version being used.
Clock(). Although the prof/gprof method provides more
detailed information than the first few methods presented, it is often
necessary to measure execution time with finer granularity than a
function.
Suppose prof was used and
it shows that 90% of the time is spent in one subroutine. That
subroutine becomes the primary target for optimization. But if the
routine includes several loops, the next step is then to identify the
most time-consuming parts within that subroutine.
A possible approach is to use the clock()
function, as provided by many operating systems, including UNIX. In
this case, however, the program must be instrumented such that the
clock is read at the beginning and end of the code segment(s) being
measured.
Instrumenting the code means adding lines of code explicitly to
perform the timing measurements. Such lines of code are temporary, and
are removed once the desired data has been collected.
This method is useful for fine-grain measurements, such as a code
segment or loop, but it is not as convenient as prof/gprof to obtain
measurements of multiple functions or processes at once. Here is an
example of a program that uses clock().
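#include <stdio.h>
#include <time.h>

int main(void)
{
    clock_t start, finish;
    double total;

    start = clock();
    /* do stuff: the code segment being measured goes here */
    finish = clock();
    /* CLOCKS_PER_SEC is the standard name for the older CLK_TCK macro */
    total = (double)(finish - start) / (double)CLOCKS_PER_SEC;
    printf("Total = %f\n", total);
    return 0;
}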
There are several issues that must be taken into account when using clock(). The issues stem from the
fact that there is no standard implementation of this function, thus it
can produce different results for different operating systems.
For example, it can provide a value in microseconds, seconds, or
clock ticks. The reference manual for the particular operating system
should be consulted prior to using the
clock() function.
Depending on the system, clock() might
behave differently if the system is preemptive. In some cases, if the
task is preempted, the value returned by clock() will include the time spent
by the other task too. In other cases, it will only include time used
by its own process.
The clock() function is
certainly more useful when the implementation properly deals with
preemption. But even if it does not, see descriptions in the following
sections on how to deal with preemption.
It is also important to note the resolution. Even though clock() might report time in
microseconds, the resolution is usually the same as the system clock,
which can be computed as 1/sysconf(_SC_CLK_TCK). On many UNIX systems, this is
10 msec or longer. Calling the function sysconf() with the argument
_SC_CLK_TCK (whose numeric value is 3 on some systems, but is not portable)
returns the number of system clock ticks per second.
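The query described above can be sketched as follows, assuming a POSIX system; the function names system_tick_msec and clock_unit_usec are hypothetical. The sketch separates the two quantities that are easily confused: the unit in which clock() reports time, and the period of the underlying system clock tick.

```c
#include <time.h>
#include <unistd.h>

/* Period of the system clock tick in milliseconds, or -1.0 if unknown. */
double system_tick_msec(void)
{
    long ticks_per_sec = sysconf(_SC_CLK_TCK);

    if (ticks_per_sec <= 0)
        return -1.0;
    return 1000.0 / (double)ticks_per_sec;
}

/* Unit in which clock() reports time, in microseconds per count. */
double clock_unit_usec(void)
{
    return 1.0e6 / (double)CLOCKS_PER_SEC;
}
```

On a system that reports 100 ticks per second, system_tick_msec() returns 10.0, even if clock_unit_usec() reports a much smaller unit.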
If more resolution than 10 msec is needed, then one of two
approaches can be used:
1) Create a loop around what
needs to be measured, that executes 10, 100, or 1000 times or more.
Measure execution time to the nearest 10 msec. Then divide that time by
the number of times the loop executed. If the loop executed 1000 times
using a 10 msec clock, you obtain a resolution of 10 µsec for the
loop.
2) Use a hardware-based
method.
The advantage of the loop method is that it does not require any
special hardware. The disadvantage is that it forces a change in the
code; the change might affect the functionality, and could even cause
the program to crash.
At the very least, the code slows down by the number of iterations
performed just to get a reading, and thus real-time performance is
lost. If this is not acceptable, then one of the other methods must be
used.
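The loop approach can be sketched as follows. This is only an illustration under the assumption that the code segment can safely be executed repeatedly; the names time_per_iteration, effective_resolution_usec, and sample_segment are hypothetical.

```c
#include <time.h>

/* Example code segment standing in for the code being measured. */
volatile unsigned long segment_sink;
void sample_segment(void)
{
    segment_sink += 1;
}

/* Average time in seconds for one execution of segment(), obtained by
   running it 'iterations' times and dividing the total by the count. */
double time_per_iteration(void (*segment)(void), unsigned long iterations)
{
    clock_t start, finish;
    unsigned long i;

    if (iterations == 0)
        return 0.0;
    start = clock();
    for (i = 0; i < iterations; i++)
        segment();
    finish = clock();
    return ((double)(finish - start) / (double)CLOCKS_PER_SEC)
           / (double)iterations;
}

/* Effective resolution in microseconds: the clock resolution (in msec)
   divided by the number of loop iterations. */
double effective_resolution_usec(double clock_resolution_msec,
                                 unsigned long iterations)
{
    return clock_resolution_msec * 1000.0 / (double)iterations;
}
```

With a 10 msec clock and 1000 iterations, effective_resolution_usec(10.0, 1000) gives 10 usec. Note that the result also includes the overhead of the loop and of the function call, which matters for very small segments.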
Software Analyzer. The term
software analyzer is used as an all-encompassing phrase for software
tools provided by a variety of RTOS and tool vendors designed
specifically for measuring execution time. Examples include
TimeTrace [6], and WindView [7].
It is beyond the scope of this tutorial to describe how to use any
such tools, or to even recommend one tool over the other. Rather, this
section provides a general discussion to aid in understanding
capabilities of these tools.
The first step in using a software analyzer is to determine the
resolution and granularity. The resolution should be one of the
specifications of the product. It can also be determined experimentally
by slowly increasing execution time of a code segment and noting the
smallest increment by which the measured value changes. That increment
is typically the resolution.
If the software analyzer is based on the system clock, then the
resolution will likely be on the order of a millisecond. If the
analyzer is based on some other hardware-based method, such as using an
onboard timer/counter chip, then the resolution might be in the
microseconds range.
The granularity is another important item to identify. Some software
analyzers will be like prof/gprof, and only be able to provide
information on a per-function or per-process basis. As with prof/gprof,
such analyzers are good if coarse-grain measurements are satisfactory,
but not very useful when optimizing localized code segments or tracking
down timing or synchronization errors.
A good software analyzer will not only provide information on
per-function or per-process basis, but it will also contain a means for
measuring execution time for smaller segments, such as a loop, block of
code, or even a single statement. The ability to measure execution time
of interrupt handlers and the RTOS overhead are also a bonus.
Some software analyzers provide a timing trace to show precisely
what process is executing at what time. Such a timing trace could be
helpful to an expert when debugging timing and synchronization errors,
but they do not offer data in a convenient format to analyze real-time
performance.
Also, if the timing trace is not correlated to the source code, then
it is not possible to identify what part of code is responsible for
extended periods of execution when such an event is detected in the
timing trace. Instead, a state mode that provides tabular data that can
be analyzed or downloaded is needed.
Another issue to consider when using software analyzers is the
resources used. Some analyzers add overhead, and thus slow down code.
Most analyzers require lots of memory to log data, making the tool
ineffective when an embedded system's memory is already fully
allocated. In such cases, the hardware-based methods described below
can instead be used.
Timer/Counter Chip. Most
embedded computers have timer/counter
chips that are user programmable. If such a chip is available,
then it can be used to obtain fine-grain measurements of code segments.
The method presented here, however, is not very useful for coarse-grain
measurements, such as total execution time used by a function or
process.
This method is similar to using the clock() method described
earlier, in that the starting and stopping points of the code being
measured are instrumented directly into the code. At the beginning of
the code segment, the current countdown (or count-up) value of the
timer/counter is read. At the end of the code, the value is read again.
The difference between these two values represents how many timer ticks
have elapsed.
It is then necessary to determine the value of a timer tick. The
value of the timer tick is typically a multiple of the microprocessor's
clock period (cycle time). It could be fixed, or user-programmable.
For example, an 8 MHz microcontroller has a cycle time of 125 nsec.
A timer-chip on this microcontroller has the timer tick user
programmable as 1x, 4x, 16x, or 256x, depending on the bit-pattern
written to one of the timer's control registers. Suppose 16x is chosen.
This means the timer-tick is 16 times 125 nsec, or 2 µsec.
This yields a mechanism with a resolution of 2 µsec, and
usually an accuracy of twice the resolution, meaning 4 µsec. With
this accuracy, it is possible to measure execution time of rather small
code segments.
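The tick arithmetic in this example can be written out as a small helper. This is a sketch only; the function names timer_tick_nsec and ticks_to_usec are hypothetical, and the prescaler values are those of the example above rather than of any particular timer chip.

```c
#include <stdint.h>

/* Timer tick period in nanoseconds, given the CPU clock in Hz and the
   prescaler (1, 4, 16, or 256 in the example). Returns 0 if cpu_hz is 0. */
uint32_t timer_tick_nsec(uint32_t cpu_hz, uint32_t prescaler)
{
    if (cpu_hz == 0)
        return 0;
    return (uint32_t)(((uint64_t)prescaler * 1000000000ULL) / cpu_hz);
}

/* Convert a difference of two timer readings into microseconds. */
uint32_t ticks_to_usec(uint32_t ticks, uint32_t tick_nsec)
{
    return (uint32_t)(((uint64_t)ticks * tick_nsec) / 1000U);
}
```

For the 8 MHz microcontroller with the 16x setting, timer_tick_nsec(8000000, 16) yields 2000 nsec, so a difference of 10 ticks corresponds to 20 usec.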
If an RTOS is being used, there is a possibility that the RTOS has
already configured the timer/counter chip. In such a case, either use a
second timer/counter chip if one is available, or use the same chip as
the RTOS, but only read it. Do not change the timer configuration in
any way, as that can cause the RTOS to crash.
A question arises as to where the answers go. In the clock()
example earlier, a print statement displayed results. But on a system
in which this timer/counter method is used, there is a good possibility
that a video display is not available.
If a small display is available (even a simple 4-digit 7-segment LCD
display), then values can be shown on the display. An alternative is to
send the data out on an output port, and collect it using a chart
recorder or logic analyzer. A third possibility is to store the data in
memory at a known location, then to peek into that memory using a
debugging tool or a processor's built-in monitor.
One issue that needs to be considered is overflow. If the timer is
16 bits, and its resolution is programmed to be 2 µsec, then it
will reset and start over every 130 msec. As a rule of thumb, the
method should be restricted to measuring code segments that are at most
10% of this maximum range, meaning up to about 13 msec for a 16-bit
timer with 2 µsec resolution.
In such a case, if the measurement is continuing on a periodic
basis, approximately 1 in 10 readings will be wrong, as it coincides
with the timer overflowing. That reading needs to be spotted and
discarded. This is quite easy to do as long as the code segment takes
about the same amount of time every time, in which case the data
reading that is discarded is the one that does not make sense.
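The overflow period quoted above can be checked with a short calculation. The sketch below also shows a common alternative to discarding readings, which is not part of the method described here: if the timer is a free-running 16-bit count-up timer and the code segment is shorter than one full wrap, unsigned subtraction yields the correct tick count even when the timer overflows between the two readings. Both function names are hypothetical.

```c
#include <stdint.h>

/* Time in microseconds for a timer of 'timer_bits' bits (32 or fewer)
   to wrap around, given the tick period in microseconds. */
uint64_t timer_wrap_usec(unsigned timer_bits, uint32_t tick_usec)
{
    return (1ULL << timer_bits) * (uint64_t)tick_usec;
}

/* Elapsed ticks between two readings of a free-running 16-bit count-up
   timer. The result is reduced modulo 65536, so it is correct across at
   most one overflow. For a countdown timer, swap the two arguments. */
uint16_t elapsed_ticks16(uint16_t start, uint16_t stop)
{
    return (uint16_t)(stop - start);
}
```

A 16-bit timer with a 2 usec tick wraps every 131072 usec, which is the "about 130 msec" figure. A reading that starts at 65000 and stops at 500 spans 1036 ticks, not a negative number. This does not help if the timer is reloaded or reconfigured between readings, in which case discarding the affected reading remains the safer choice.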
Another issue occurs in a preemptive environment or when interrupts
are present. If the code segment being measured can be preempted, then
false data readings will be provided every time such a preemption
occurs within the code segment.
Several possibilities exist. One is to disable interrupts whenever a
measurement begins, then re-enable interrupts when the measurement
ends. This could affect real-time performance by causing priority
inversion and cause the application to not meet the specifications.
But often it is acceptable to do during the testing phase in order
to get the measurements of various code segments. A second alternative
is to discard readings that are much longer than the average reading,
as they represent measurements that include preemption. Anytime
readings are discarded, care must be taken to not accidentally keep an
incorrect reading and discard a valid reading.
As a general rule, any discarding of data must always be done with
great care. Only discard a value if there is a reasonable explanation.
If there is concern that a good value might accidentally be discarded,
and such a mistake cannot be tolerated, then use a different method
that is not subject to the overflow of the timer chip, or more suited
to account for preemption.
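The second alternative, discarding readings that are much longer than the average, might be sketched as follows. The threshold factor is an assumption that must be chosen for the application, and the function name filter_long_readings is hypothetical. The number of discarded readings is reported so that each one can be reviewed rather than silently dropped.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy every reading that is no greater than 'factor' times the average
   of all readings into kept[], which must have room for n entries.
   Returns the number kept; if 'discarded' is not NULL, it receives the
   number of readings that were rejected. */
size_t filter_long_readings(const uint32_t *readings, size_t n,
                            double factor, uint32_t *kept,
                            size_t *discarded)
{
    double sum = 0.0, limit;
    size_t i, k = 0;

    for (i = 0; i < n; i++)
        sum += (double)readings[i];
    limit = (n > 0) ? factor * (sum / (double)n) : 0.0;

    for (i = 0; i < n; i++) {
        if ((double)readings[i] <= limit)
            kept[k++] = readings[i];
    }
    if (discarded != NULL)
        *discarded = n - k;
    return k;
}
```

For readings of 100, 102, 98, 101, and 400 ticks with a factor of 1.5, the average is 160.2, the limit is 240.3, and only the 400-tick reading is rejected. If a large fraction of the readings is rejected, that is a sign the segment is preempted too often for this approach to be trusted.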
Logic Analyzers
A logic analyzer is one of the best tools for accurately measuring
execution time with microsecond resolution, especially when accurate
timing is essential. The drawback is that it requires specialized
hardware and more effort than the techniques described
above.
There are two approaches to using a logic analyzer. One approach is
to hook up the probes to the CPU pins. Connecting the logic analyzer to
a CPU emulator or using a bus analyzer has the same effect. While this
method is least obtrusive on the real-time code, it is also the most
difficult, as it requires reverse engineering the code to correlate
logic analyzer measurements with the source code.
Some logic analyzers provide processor disassembly support, but that
only provides correlation to the assembly code, and not necessarily to
the source code. This approach is not advocated, as it is very
difficult and does not yield answers that are any better than the other
approach described next.
However, a variation of this approach is to monitor only a single
memory location, in which case this becomes the same as the other
approach.
The other approach is to send strategic signals to an output port,
which are read by the logic analyzer as events. The code is
instrumented to send signals at the start and end of each code segment.
The instrumentation is encapsulated within a macro, so that
redefining the macro to an empty statement disables the instrumentation
without the need to change any part of the application code. This
approach is compatible both with large applications that use commercial
RTOS and smaller systems based on custom executives or even ad-hoc
code.
Necessary Embedded Hardware
Features
To use a logic analyzer to measure code, signals must be sent from the
software to the analyzer. The easiest way is to use a digital output
port. It is highly recommended that any embedded application be
designed with at least one such port dedicated to testing and
debugging. A single 8-bit or 16-bit port can serve as a gateway for
seeing inside the program, saving tremendous development time.
Some embedded hardware features more sophisticated windows to the
inside, such as JTAG and BDM. However, each of these requires
additional engines to drive the mechanism, and while very useful for
debugging functional code, their use can greatly affect real-time
performance, and thus they are not recommended when measuring execution time.
An 8-bit digital output port is usually sufficient for most
applications. A 16-bit port might be desirable for larger applications
as it enables encoding more information to send to the logic
analyzer. If at least an 8-bit port is not available, there are other
alternatives.
If there is access to the CPU's address and data lines (for example,
if an emulator is attached to the system), then only a single memory
location on the CPU needs to be reserved. The address of that memory
location is used to trigger the logic analyzer, while the data lines
contain the information that would otherwise be sent to the digital
output port.
A similar method can be used if a bus analyzer (such as a VMEbus or PCIbus analyzer) is present in
the system. A bus analyzer only monitors accesses to external memory that
go over the bus. Thus the single memory location that is selected must
be an external memory location that can be captured by the bus
analyzer.
A bus analyzer is in fact a logic analyzer, with all the probes
permanently affixed to each wire on the bus. Therefore the techniques
for using a bus analyzer are the same as described in this section when
using a logic analyzer.
Even if there is only a single bit of output available or even a
single serial or digital-to-analog output port, it is still possible to
measure execution time, although it is much more difficult. If the
output is analog, then an oscilloscope is needed instead of a logic
analyzer.
Logic Analyzer Features
A logic analyzer must be set up to capture the data being sent to the
digital output port or over the address lines. However, not every logic
analyzer is the same. Some key features can greatly simplify collecting
data for the purpose of measuring execution time and real-time
performance.
First, the logic analyzer should support state mode. That is, it
displays collected data as a list of hexadecimal numbers, one line per
entry in the analyzer's buffer. All but the lowest cost analyzers
usually have this mode. It is still possible to measure execution time
using timing graphs, but this is much more difficult, and forces each
measurement to be performed manually.
The logic analyzer should support automatic detection of transitions
(often called transitional mode). That is, it monitors the data lines,
and collects one entry every time it detects the output on the data
lines has changed. Even some high-end analyzers do not have this
capability, while some low-end analyzers do. If
the logic analyzer does not support this mode, then a more
sophisticated external triggering combined with setting up the analyzer
in sequence mode is needed. Here, it is assumed that transitional mode
is available.
A deep buffer on the analyzer is highly desirable. The more data
that can be collected during a single execution, the more different
items that can be measured, and the more measurements of periodic or
repeated code. This leads to higher confidence in measurements of
average and worst-case execution times. A deep buffer also increases
the ability to measure rare events, like an occasional interrupt. Some
logic analyzers have buffers that are one or two million events. The
general rule is the more the better.
To measure execution time, only 16 channels are needed, and if only
an 8-bit output port is used, then only 8 channels are needed. Most
logic analyzers, even the lowest-end ones, have this many channels. For
purposes of measuring execution time, additional channels are not
needed. Measuring execution time can become tedious, thus automating
parts of it is highly desirable.
Automating some of the data filtering requires a computer.
Thus some form of output from the analyzer, such as Ethernet, GPIB, or
high-speed serial, is very helpful. Alternatively, one of the newer
generations of logic analyzers with a built-in host computer can also be
used.
A search option that enables typing in a data pattern, and
displaying only the data that matches the pattern, is also very helpful
for quickly viewing some results. Lack of the search option, however,
does not invalidate use of the analyzer, as the same effect can be
achieved after uploading data from the analyzer to a host computer.
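As a rough illustration of host-side filtering, the sketch below keeps only the entries whose data value matches a pattern under a bit mask. The trace_entry layout and the name filter_trace are assumptions made for this example; they do not correspond to any particular analyzer's export format, which would first need to be parsed into such an array.

```c
#include <stddef.h>
#include <stdint.h>

/* One uploaded entry. Hypothetical layout, not an analyzer export format. */
typedef struct {
    uint8_t data;     /* value seen on the 8 data lines */
    double  time_us;  /* absolute timestamp in microseconds */
} trace_entry;

/* Copy every entry with (data & mask) == pattern into out and return how
   many were kept. The caller must provide room for n entries in out. */
size_t filter_trace(const trace_entry *in, size_t n,
                    uint8_t mask, uint8_t pattern, trace_entry *out)
{
    size_t kept = 0;
    for (size_t i = 0; i < n; i++) {
        if ((in[i].data & mask) == pattern) {
            out[kept++] = in[i];
        }
    }
    return kept;
}
```

A mask of 0xFF selects one exact value, while a narrower mask selects a group of related values in a single pass.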
Once an appropriate logic analyzer is selected, connecting it is
straightforward. Simply connect the 16 bits of the digital output port
to the corresponding first 16 channels of the logic analyzer. If an
8-bit output port is used, then only connect the first 8 channels. For
simplicity, be sure that bit 0 of the output port is connected to
channel 0 of the logic analyzer, bit 1 to channel 1, etc.
Next, measuring execution time for a single code segment is
described. The method is then expanded in Section 4.1 for instrumenting
complete tasks to measure code for an entire application at once.
Measuring Time for Code Segments
The following discussion assumes C or C++. If using any other language,
including assembly language, it should be fairly obvious how to
adapt the method to that language.
The first step is to set up macros for writing to the output port. This
step is recommended because different architectures and different
output devices may require different methods of writing output.
However, it is desirable to become accustomed to a single set of
commands.
Suppose the macros are called MEZ_START and MEZ_STOP, and a
definition is created for an 8-bit output port. Following is a sample
definition:
/* Upper four bits: event code (0x5 = start, 0x6 = stop); lower four: id */
#define MEZ_START(id) output(dioport, 0x50 | ((id) & 0xF))
#define MEZ_STOP(id)  output(dioport, 0x60 | ((id) & 0xF))
These definitions assume a multitasking system. id is an identification number
that enables measuring execution time for multiple code segments at
once. Each code segment is simply given a separate id number. This
macro assumes a maximum of 16 ids (numbered 0 through 15).
The 0x50 and 0x60 codes are arbitrarily defined; they can be any
values that use only the upper four bits of the 8-bit value, as the
bottom four bits are used for the id. A full profiling of an
application might encompass a dozen or so codes. The encoding is quite
flexible, although reserving the top four bits as the event code and the
bottom four bits as the id makes it easy to view the items in
hexadecimal on the logic analyzer.
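To make the encoding concrete, the following sketch splits a captured 8-bit value back into its event code and id, and packs the two the same way the sample macros do. The helper names (mez_event, mez_id, mez_pack) and the MEZ_CODE_ constants are invented for this illustration and are not part of the original macros.

```c
#include <stdint.h>

#define MEZ_CODE_START 0x50u  /* start code used by the sample macros */
#define MEZ_CODE_STOP  0x60u  /* stop code used by the sample macros */

/* Event code: the upper four bits, left in place (e.g. 0x50 or 0x60). */
static inline uint8_t mez_event(uint8_t value)
{
    return (uint8_t)(value & 0xF0u);
}

/* Code segment id: the lower four bits (0 through 15). */
static inline uint8_t mez_id(uint8_t value)
{
    return (uint8_t)(value & 0x0Fu);
}

/* Combine an event code and an id; an id above 15 is truncated to 4 bits. */
static inline uint8_t mez_pack(uint8_t event, uint8_t id)
{
    return (uint8_t)((event & 0xF0u) | (id & 0x0Fu));
}
```

Because each field occupies exactly one hexadecimal digit, a value such as 0x62 can be read directly as "stop, segment 2".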
The code whose execution time is to be measured is then instrumented
to include MEZ_START and MEZ_STOP macros. For example:
:
MEZ_START(1);
funcA();
MEZ_STOP(1);
MEZ_START(2);
y = a + b * c;
MEZ_STOP(2);
:
In this example, two code segments are being measured
simultaneously. The first is to obtain the execution time of the
function funcA(). The second is to obtain the execution time of the
operation y=a+b*c.
The code is compiled. Prior to executing it, the logic analyzer is
turned on and set up in transitional mode to collect data from the
output port. The code is then executed. Data collection on the logic
analyzer is halted, and the output displayed.
Depending on the analyzer, there could be many different columns.
Two columns are most important for this task: the data column and the
time column.
The data column will show the data codes that are output as a result
of the MEZ_START and MEZ_STOP macros. For the above
example, the data should be 0x51, 0x61, 0x52, and 0x62, in that order.
The logic analyzer automatically time-stamps every event. The
timestamp can generally be displayed as relative or absolute. Relative
means that the time column shows the amount of time that elapsed since
the reading on the previous line. Absolute is a cumulative time.
For example, assuming that funcA() took 358 µsec and the calculation
to determine y took 14 µsec, the output would appear as follows (both
relative and absolute time mode shown):
The u represents microseconds; this is a typical convention used by
most logic analyzers. Other common abbreviations are n for nanoseconds,
m for milliseconds, and s for seconds. From this output, the measured
execution time is readily obtained. Relative mode is usually easiest to
use if the start and stop operations are consecutive. Absolute mode is
useful when nesting measurements.
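Once the trace has been uploaded to a host computer, the same subtraction can be automated. The sketch below is one possible approach, assuming the 0x50/0x60 codes from the sample macros and absolute timestamps; the structure layouts and the name measure_segments are hypothetical. It pairs each stop event with the most recent start event of the same id, so nested measurements of different ids are handled.

```c
#include <stddef.h>
#include <stdint.h>

#define MEZ_MAX_IDS 16

typedef struct {
    uint8_t data;     /* code seen on the output port, e.g. 0x51 */
    double  time_us;  /* absolute timestamp in microseconds */
} trace_entry;

typedef struct {
    double   total_us;  /* sum of all start-to-stop intervals */
    double   max_us;    /* longest single interval observed */
    unsigned count;     /* number of complete intervals */
} seg_stats;

/* Scan the trace once. stats must point to MEZ_MAX_IDS elements. */
void measure_segments(const trace_entry *t, size_t n, seg_stats *stats)
{
    double start_us[MEZ_MAX_IDS] = {0};
    int    is_open[MEZ_MAX_IDS] = {0};

    for (int i = 0; i < MEZ_MAX_IDS; i++) {
        stats[i].total_us = 0.0;
        stats[i].max_us = 0.0;
        stats[i].count = 0;
    }
    for (size_t k = 0; k < n; k++) {
        unsigned id = t[k].data & 0x0Fu;
        unsigned ev = t[k].data & 0xF0u;
        if (ev == 0x50u) {                       /* start event */
            start_us[id] = t[k].time_us;
            is_open[id] = 1;
        } else if (ev == 0x60u && is_open[id]) { /* matching stop event */
            double d = t[k].time_us - start_us[id];
            stats[id].total_us += d;
            if (d > stats[id].max_us) {
                stats[id].max_us = d;
            }
            stats[id].count++;
            is_open[id] = 0;
        }
    }
}
```

The average for a segment is total_us divided by count. A stop event with no preceding start is ignored, which can happen when capture begins in the middle of a measured segment.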
Using this method, any code segment(s) in the application can be
measured. Measuring individual code segments is especially helpful when
optimizing code. The code can be measured prior to optimization then
again after the optimization, and the amount of savings (if any) is
readily known.
When optimizing code, execution time should always be measured, to
avoid making changes that appear to be optimizations but in
reality either do not affect execution time or, worse, increase
it.
There are caveats to measuring execution time in this manner. In
particular, it does not account for preemption or interrupts, and thus
measured values could be misleading.
Furthermore, it is incomplete, in that only a few code segments are
measured. That is insufficient if trying to measure real-time
performance for the entire application. Variations on this
technique, discussed later, do take into account preemption and other
related issues.
Collecting Data Through a Single Bit
Some embedded systems are so restrictive that a spare 8-bit digital
output port is an unavailable luxury. Measuring execution time can
still occur with as little as a single output bit available.
The primary limitation with using only a single bit is that only one
item can easily be measured at once. The MEZ_START macro is modified to set
the bit to 1, while the MEZ_STOP macro
resets the bit to 0.
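A minimal sketch of the single-bit variant follows. It assumes the spare output is bit 0 of a byte-wide port; the variable mez_port is only a stand-in for the real memory-mapped register, whose address and access method depend on the hardware. Keeping the unused id argument is a choice made here so that already-instrumented code compiles unchanged.

```c
#include <stdint.h>

/* Stand-in for the output register (hypothetical). On a real target this
   would be replaced by the device's port register. */
static volatile uint8_t mez_port;

#define MEZ_BIT 0x01u  /* assumption: the spare output is bit 0 */

/* id is evaluated and discarded; only one segment is measured at a time. */
#define MEZ_START(id) \
    ((void)(id), mez_port = (uint8_t)(mez_port | MEZ_BIT))
#define MEZ_STOP(id) \
    ((void)(id), mez_port = (uint8_t)(mez_port & ~MEZ_BIT))
```

The other bits of the port are preserved by the read-modify-write. That sequence is not atomic, so if an interrupt handler writes the same port, the update may need protection.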
If more than one bit but fewer than 8 bits are available, various
encoding strategies can be used to measure more than a single item at a
time. For example, with 3 bits, two of the bits can be used to identify
the code segment, thus allowing four code segments to be measured at a
time. The third bit is toggled as in the single-bit case, to provide
measurements.
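The 3-bit case might be encoded as in the sketch below. The bit assignment (bit 2 as the start/stop bit, bits 1 and 0 as the id) and the mez_port stand-in for the hardware register are assumptions made for illustration; any assignment of the three bits would work equally well.

```c
#include <stdint.h>

/* Stand-in for the output register holding the 3 spare bits (hypothetical). */
static volatile uint8_t mez_port;

#define MEZ_ACTIVE  0x04u  /* assumption: bit 2 is 1 while a segment runs */
#define MEZ_ID_MASK 0x03u  /* assumption: bits 1..0 carry the id (0 to 3) */
#define MEZ_ALL     (MEZ_ACTIVE | MEZ_ID_MASK)

/* Write the id and set the start/stop bit, leaving other port bits alone. */
#define MEZ_START(id) \
    (mez_port = (uint8_t)((mez_port & ~MEZ_ALL) | MEZ_ACTIVE | \
                          ((id) & MEZ_ID_MASK)))

/* Write the id and clear the start/stop bit. */
#define MEZ_STOP(id) \
    (mez_port = (uint8_t)((mez_port & ~MEZ_ALL) | ((id) & MEZ_ID_MASK)))
```

With this layout, starting and stopping segment 2 produces the 3-bit values 6 and then 2, so each event changes the lines and is captured in transitional mode.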
When using only one or two bits, an oscilloscope can replace the
logic analyzer. Another alternative if an oscilloscope is available is
to use a digital-to-analog output port.
Several different clearly-distinguishable analog levels are
pre-defined, with the occurrence of each one representing an event. The
resolution of an analog output is generally a function of the
conversion time. Values in the range of 10 to 50 µsec are not
uncommon, while very accurate ones (like a 20-bit converter) could be
in the millisecond range.
This represents much lower resolution as compared to using digital
outputs. Nevertheless, if this is the only means of sending signals
from the embedded processor to a measurement instrument, then it is
still better than not having such an ability.
Next, Part 2 of this tutorial focuses on real-time analysis and
various techniques for analyzing real-time performance.
David B. Stewart is Director of
Software Engineering at InHand Electronics.