Magnetometer IV: Code Size

Last time, I presented a working proof-of-concept of a digital compass based on a three-axis magnetometer. That version was running in userland on a Raspberry Pi running Raspbian, which is a whole lot more computer than the Atmel ATtiny85 I eventually want to target. It was also coded for clarity rather than for speed or size.

In this post, I’ll look at some quick-and-dirty ways to estimate program size for an AVR version (as well as some simple things we can do to save space). Read on for more.

Update: This article is part of a larger series on building an AVR-based compass. You may be interested in the other articles in the series.

Motivation

The rules of optimization are usually stated something like:

Don’t.
Don’t, yet. (Experts only.)
Profile first.

(That’s also what Knuth was getting at when he said that premature optimization is the root of all evil.)

With that firmly in mind, I realized this project would fall into one of three categories:

It’s already plenty small enough, and any effort spent on making it smaller would be wasted / better spent elsewhere.
It’s too big to fit in its current state, and some effort will be required to make it work on the target platform.
It’s just way too big to ever fit in under 8K, and I need to pick either a totally different approach, or switch to a different platform (like an ATMega328P) with more resources.

My suspicion going in was that I was dealing with the middle case, but I also had enough experience to know how fallible my instincts could be. Without doing a whole lot of work, how could I get a good quantitative estimate of the scale of the problem (if any)?

Porting

Retargeting the proof-of-concept code for AVR was relatively simple. I already had a working libraries for doing the I²C and HMC5883L stuff. While not quite a drop-in replacement for the Pi version, the core data structure (hmc5883l_pos_t — a C structure representing a 3-vector) was the same.

I discarded the assertion-checking code (debug.h) completely, figuring I’d do debugging, testing and static analysis on the Pi whenever possible.

The diagnostic output functions (diag.c as well as printf()(3) and friends) were out. I could use the 8×8 LED matrix (for which I also had a working library) for output.

I very deliberately didn’t do any of the obvious things to save space by re-arranging the code (such as combining the three rot_find_R*() functions into one, or eliminating the jitter calculation).

A weird hackish impulse led to the poscp() macro (used to replace calls to memcpy()(3) where the source and destination are different hmc5883l_pos_t structures — a common case). This was absolutely a textbook case of premature optimization, and I shouldn’t have done it. But it saved a few bytes, so I left it in…

Here’s the source: compass-tst3.tar.gz (19KB gzipped tarball)

Warning!: Do not use the hmc5883l library included in the sources linked just above. The hmc5883l_read() function in said library reads the data from the sensor properly, but completely screws up turning those data into three sixteen-bit coordinates. You will get what looks like real data that change over time, but it’s complete garbage. Fixed version coming soon.

Just to be absolutely clear: This example compiles, but doesn’t actually do anything. It hasn’t been tested, because there’s no platform on which it could be run!

Finding the Size

Now that there’s (a fake but at least vaguely realistic stand-in for) an AVR version of the working compass code, the next task is to find out how big it is. Easy, right? Type “make” and:

/usr/lib/gcc/avr/4.5.3/../../../avr/bin/ld: compass.elf section `.text' will not fit in region `text'
/usr/lib/gcc/avr/4.5.3/../../../avr/bin/ld: region `text' overflowed by 7444 bytes
collect2: ld returned 1 exit status
make: *** [compass.elf] Error 1

That does give us an answer: The .text section (roughly speaking, the program code) is 8KB + 7444B, which is something like twice as much as we can fit on the ATtiny85. (If we want to target the Adafruit Trinket — which I do — and keep the nice bootloader, there’s only about 5.5K to work with.)

That sounds bad, but it’s actually really promising, especially considering that my first attempt was using desktop-style “who cares about code size?” compiler options (like -O3). I’m highly optimistic that it can be made to fit on my first choice of platform.

But before we start making it smaller, it would sure be nice if it were easier to see the size (and if we could see everything, not just the .text section). Is there any way to lie to the linker about how much memory our target has?

It turns out that there is: First we add -Wl,–verbose to the linker options to make it dump the link script it’s using, then edit that to change the LENGTH attribute of the text region. We then specify the new link script on the command line with -T filename.

The new link script looks like this: size.ld (6KB text file)

The only (non-comment) line I changed was this one:

text   (rx)   : ORIGIN = 0, LENGTH = 64K

Now, when we run make again, we get:

avr-size compass.elf
   text	   data	    bss	    dec	    hex	filename
  15636	    264	     26	  15926	   3e36	compass.elf

Obviously, we can’t actually flash this into our microcontroller, but it’s good information. We can’t optimize what we can’t measure! (In the “real computer” world, most of the time when we talk about optimization we mean speed not size. There are great profiling tools for execution speed and finding “hot spots” in code, and much effort spent on making compilers perform better with respect to speed benchmarks. Optimizing for code size is a little less “plug and play” as far as the tools are concerned.)

Low-Hanging Fruit

Making the actual code better is hard work. As Larry Wall points out in the Camel Book, laziness is one of the three great virtues of a programmer. So before doing anything that actually requires thinking, let’s see where we can get by just changing around some of the command-line arguments for the compiler and linker.

The first and most obvious thing is to compile with -Os (optimize for code size) instead of -O3 (optimize for speed). That’s actually two changes: stop doing something that makes the code bigger, plus start doing something that make it smaller. However, the util/delay.h stuff won’t build right with no optimization, so we’ll do them both at once. After a “make clean && make”, we get:

avr-size compass.elf
   text	   data	    bss	    dec	    hex	filename
  10284	    264	     26	  10574	   294e	compass.elf

Almost 4K difference (in the good direction); encouraging. Granted, using -O3 in the first place could be considered cheating for dramatic effect, but it’s still an edifying example. (In case you’re curious: using -O2 gives a size of 13204.)

In the examples above, I’m building on Ubuntu 12.04.4 LTS, which uses avr-gcc version 4.5.3. This version is somewhat infamous for yielding larger code than either 4.3.3 or 4.7+. I’ve got another box handy with Ubuntu 13.10, which uses avr-gcc version 4.7.2. With the same options (just -Os), the total binary size (“dec” column in the above) is 10398 — substantially better.

Going in, I thought “Well, obviously we want -Os if we’re trying to make small binaries — that’s what it’s there for!”. But I got a big surprise when I started creating the table you’ll see a little later: with avr-gcc 4.7.2, -O1 actually makes smaller code (at least for this input) than -Os does. I hesitate to go so far as to call it a compiler bug, but it sure isn’t the outcome I expected.

That’s why we experiment and measure. Follow the evidence wherever it leads… for Science!

Sticking with the newer avr-gcc, let’s see what other options and combinations of options can do for us:

[table]option,total size,difference vs. just -O1

-O3 (instead of -O1),14910,+4732

none (other than -O1),10178,0

-Os (instead of -O1),10398,+220

-fomit-frame-pointer,10178,0

-ffast-math,10296,+118

-ffinite-math-only,10264,+86

-fno-math-errno,10178,0

-fsingle-precision-constant,10178,0

-fno-inline-small-functions,10178,0

-fno-inline,9664,-514

-fdelete-null-pointer-checks,10178,0

-mcall-prologues,9550,-628

“-ffunction-sections -fdata-sections~~-Wl,–gc-sections (during link)”,10090,-88

“-Wl,–relax (during link)”,10174,-4

-s (during link),10178,0

“-fno-inline~~-mcall-prologues~~-ffunction-sections -fdata-sections~~-Wl,–gc-sections (during link)~~-Wl,–relax (during link)”,8796,-1382

[/table]

Some interesting things happened here. The most obvious is that -Os doesn’t always produce the smallest binary (though this is likely specific to both avr-gcc 4.7.2 and the specific code I’m using for a test case). Anyway, I know I’ll be trying both -Os and -O1 in the future and comparing the results. Another is that options meant to simplify floating-point math may make it faster, but don’t help with code size (and in some case, make the code bigger).

That leaves us with four options that seem to help (out of those I’ve thought to try, and gcc is complicated enough that I have no doubt there are other good candidates I’ve missed):

-fno-inline — This tells the compiler to always keep function calls as actual function calls, rather than inlining the code. It’s reasonably clear why this makes code smaller: you don’t end up with multiple copies of the code from a function that’s called in two or more places. It also suggests that additional improvement may be possible in cases where a function is called from exactly one place. In such cases, we can mark the function with the always_inline attribute (which will save the function call overhead, and still give us exactly one copy of the function body). For functions where the body is actually smaller than the call overhead, we can either do this or make them preprocessor macros.
-mcall-prologues — Normally the compiler emits function prologues and epilogues (to spill and load registers) as inline code in the function bodies. This option tells it to do those things with function calls instead. It costs time but saves code space.
-ffunction-sections -fdata-sections -Wl,–gc-sections — This eliminates “dead code” (e.g., public functions that are never called). If you’re curious about the specifics of how it works, or if you want to know about some important pitfalls of using it outside of the embedded world, see Denys Vlasenko’s presentation on the topic.
-Wl,–relax — The linker documentation is remarkably vague about this, saying only “On some platforms the `–relax‘ option performs target specific, global optimizations that become possible when the linker resolves addressing in the program, such as relaxing address modes, synthesizing new instructions, selecting shorter version of current instructions, and combinig constant values.”

After all that, the program is almost small enough to fit in the 8KB program space (if we’re willing to give up on the Trinket bootloader). And, I still haven’t made any code changes or done anything requiring actual effort.

Profiling for Code Size

When you’re optimizing for speed, there are handy profiling tools like gprof that can give you good information about where your program is spending most of its time. There’s no sense trying to speed up a function that gets called a dozen times and takes 5ms each time, when another function that take 5ms gets called a thousand times, or some third function gets called once but takes 5 minutes.

What we need is something analogous for code size. It makes no sense to work on making part of the program smaller unless it contributes a substantial amount to the overall size. Fortunately, the right tool for the job already exists: the map file produced by the linker (just called “map” in my example). It contains (among other things) detailed information about the location of each symbol, from which we can infer the size.

A little Perl-fu and sorting later, and we have a nice list of how much space is taken up by each part of the code, in order from the worst offenders to the most minor. For the example code, we know the actual function names (since each function had its very own section). For library code, all we know is the name of the object file (and if there are multiple functions in that object, they’re all lumped together into a single row). Fortunately, in most cases, the module names make it pretty obvious what’s going on. Here’s the list:

856 get_nwc
852 _addsub_sf.o
448 _mul_sf.o
414 calibrate
404 _pack_sf.o
380 rot_posn
328 jitter
316 _div_sf.o
296 main
284 rot_find_Rz
284 rot_find_Ry
284 rot_find_Rx
274 heading
266 rot_mult
222 _unpack_sf.o
210 mulsf3x.o
204 hmc5883l_read
194 _fpcmp_parts_sf.o
192 addsf3x.o
188 sqdist
176 USI_TWI_Start_Transceiver_With_Data
176 _si_to_sf.o
90 hmc5883l_init
84 _lt_sf.o
84 _gt_sf.o
80 matrix8x8_draw
80 fp_rempio2.o
80 atan.o
74 fp_powser.o
68 fp_split3.o
56 _prologue.o
54 show
54 _epilogue.o
50 USI_TWI_Master_Transfer
42 matrix8x8_pixels
36 _mulhi3.o
36 _clzhi2.o
34 fp_sinus.o
34 fp_round.o
34 fp_powsodd.o
30 USI_TWI_Master_Stop
28 fp_mpack.o
24 matrix8x8_clear
22 USI_TWI_Master_Initialise
18 matrix8x8_cmd
14 inverse.o
14 fp_zero.o
14 fp_pscB.o
14 fp_pscA.o
14 _clzsi2.o
12 sin.o
12 fp_inf.o
6 square.o
6 fp_nan.o
6 cos.o

That seems pretty reasonable. Adding up all the values yields 8552, which is close to the actual total. (The difference lies in things like the table of interrupt vectors that we can’t easily avoid.)

The worst single offender is get_nwc(), but that’s kind of expected as it’s our biggest and most complicated function. The find_rot_R*() functions are pretty big, and there are three of them, all of which are nearly identical. That suggests that combining them into one would save a worthwhile amount of space. The I2C and LED matrix drivers are already pretty small; any improvement there is likely to be a lot of effort for very little gain. Likewise the HMC5883L initialization function. The hmc5883l_read() is surprisingly big — probably because of the floating point math.

Speaking of which: The floating-point stuff is huge. It accounts for the second-, third- and fifth-largest individual items, plus 28 other rows. Adding it all up, I get 3798 bytes — almost half of the total code size (and that’s leaving aside the fact the using floats also makes our own functions considerably bigger).

Conclusions

It looks like a really good bet that this application can indeed be made to fit on the ATtiny85.
… but it will be tight.
… and doing so will require some clever coding.
Floating point math is, as suspected, a major contributor to binary size.
Replacing the floating point math with fixed point is worth investigating, even though it will involve having to think.
Experiment with compiler and linker options; they can make a big difference in how much space your program takes up in an embedded environment.
Said compiler and linker options are highly situational. Which help and which hurt depend on your specific program and the versions of the tools you’re using.
Profiling is just as important when optimizing for size as when optimizing for speed. The first step toward eliminating bloat is quantifying how much of it is present, and in what places.

Check back ~~in a while~~ right now for our next exciting episode, where we clean up the code and roll our own fixed-point math library!

Edited by DGH on 2014 Aug 25 to add warning about buggy hmc5883l library.