Code Gems Part 1
This text comes from IMPHOBIA Issue IX - February 1995
Welcome to the first part of Code Gems! Here follow some nice tricks for the
Intel processor series. The aim of this article is to introduce those
few-byte-long jewels of assembly coding which undoubtedly prove: "There's
always a better way."
A respectable part of them was debugged out of various products, while others
were worked out experimentally by me or by one of my friends, and I'm pretty
sure that you already knew many of them. So there are no unambiguous credits
for this.
* ECX-Loop in 16-bit code *
By default, TASM doesn't support real mode LOOPs using ECX. So we have to write
our own little macro:
ELOOP   macro _label
        db 67h                  ; address-size prefix: LOOP counts in ECX
        loop _label
        endm
The same macro gives a CX-LOOP in 32-bit code too, because the prefix simply
switches LOOP to the other counter size. Similar macros can be written for
LOOPE and LOOPNE. It works even if JUMPS is activated; but in my opinion JUMPS
isn't so good for optimizing, it rather serves convenience.
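The LOOPE and LOOPNE variants could look like the sketch below. The macro names ELOOPE and ELOOPNE are made up here, following the pattern of ELOOP; the usage part assumes VGA hardware and a timeout value chosen only for illustration.

```asm
; Hypothetical macro names, built the same way as ELOOP.
ELOOPE  macro _label
        db 67h                  ; address-size prefix: count in ECX
        loope _label            ; loop while ECX <> 0 and ZF = 1
        endm

ELOOPNE macro _label
        db 67h
        loopne _label           ; loop while ECX <> 0 and ZF = 0
        endm

; Usage sketch in 16-bit code (needs .386 because of ECX):
; poll for vertical retrace with a timeout that doesn't fit in CX.
        mov dx,3DAh             ; VGA input status register
        mov ecx,100000          ; example timeout value
wait_retrace:
        in al,dx
        test al,8               ; ZF = 1 while the retrace bit is clear
        ELOOPE wait_retrace     ; poll until the bit is set or ECX = 0
```

After the loop, ZF tells which exit was taken: ZF = 0 means the bit was seen, ZF = 1 means the counter ran out.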
* Rejecting JUMPS *
It's nice that the 386 knows long conditional jumps, but there is no long LOOP.
When a long LOOP is needed (and JUMPS is on), TASM compiles this:
        loop cycle_temp
        jmp cycle_end
cycle_temp:
        jmp cycle_start
cycle_end:
From the optimization point of view (both size & speed) this isn't good. What I
do is keep JUMPS turned on until the final version, then compile without JUMPS
and fix the remaining LOOPs by hand. Be careful when using JUMPS in
286-compatible code too. With a little brainwork another dozen bytes can be
saved. Take a look at this piece of initializing code:
test_config:
        [VGA checking code]
        jne bad_config
        [286 checking code]
        jne bad_config
        [mouse checking code]
        jne bad_config
        [soundcard checking code]
        jne bad_config
        ...
If bad_config is too far from this code, every conditional jump will be
expanded into two instructions. So if we put a
bad_config_collector:
        jmp bad_config
instruction close enough to TEST_CONFIG and replace every JNE BAD_CONFIG with
JNE BAD_CONFIG_COLLECTOR, we save another few bytes. Of course, this only pays
off when BAD_CONFIG can't be brought any closer.
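Put together, the collector layout might look like this sketch. The check_* routines are hypothetical stand-ins for the checking code, each assumed to return with ZF = 1 when its test passes, and test_config is assumed to be called as a near procedure.

```asm
; Hypothetical check_* routines: ZF = 1 on return means "test passed".
test_config:
        call check_vga
        jne bad_config_collector    ; short jump, 2 bytes
        call check_286
        jne bad_config_collector
        call check_mouse
        jne bad_config_collector
        call check_soundcard
        jne bad_config_collector
        ret                         ; everything is fine
bad_config_collector:               ; sits behind the RET, nothing falls into it
        jmp bad_config              ; the only long jump, 3 bytes
```

In 286-compatible code under JUMPS, each out-of-range JNE becomes a JE over a near JMP (2+3 = 5 bytes), so four checks cost 20 bytes of jumps. With the collector they cost 4*2+3 = 11 bytes.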
* Nested Loops *
Sometimes there's a need for little nested loops. One solution:
        mov cl,outer_cycle_num
outer_cycle:
        [outer cycle code]
        mov ch,inner_cycle_num
inner_cycle:
        [inner cycle code]
        dec ch
        jne inner_cycle
        loop outer_cycle
CH is zero when the inner loop ends, so CX equals CL and LOOP can count the
outer loop. This is two bytes shorter than the DEC CL/JNE combination. It was
invented by TomCat / AbaddoN while developing a bootsector intro.
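To see where the two bytes come from, here is the same loop with concrete numbers and its machine code. The counts 3 and 5 and the INC AX body are stand-ins chosen for this sketch; the bytes shown assume 16-bit code starting at offset 0.

```asm
        mov cl,3                ; B1 03   outer count (example value)
outer_cycle:
        mov ch,5                ; B5 05   inner count (example value)
inner_cycle:
        inc ax                  ; 40      stand-in for the inner cycle code
        dec ch                  ; FE CD
        jne inner_cycle         ; 75 FB
        loop outer_cycle        ; E2 F7   CH = 0 here, so CX = CL
```

Closing the outer loop with DEC CL / JNE OUTER_CYCLE would take FE C9 75 xx, four bytes instead of the two of LOOP. The body runs 3*5 = 15 times.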
* Optimizing with ESP *
Using ESP as a general-purpose register isn't common, because interrupts have
to be disabled while it holds something other than a stack pointer. But check
this out: in real mode (and sometimes in protected mode) the stack operations
ignore the upper word of ESP. (Except if a protected mode program forgot to
reload the segment rights & limits. This is why I recommend restoring the
default real mode settings.) So, when SP=0000h, the first word pushed will be
placed at SS:FFFEh. Ergo, if we initialize ESP to 00010000h, we have a wonderful
32-bit number, stack operations refer to the top of the stack segment, and
interrupts can stay enabled. For example, let's assume we want a big nested
loop (ECX is the only free register, and ESP=00010000h):
        mov ecx,(outer_num)*10000h
outer_cycle:
        [outer cycle code]
        mov cx,inner_num
inner_cycle:
        [inner cycle code]
        loop inner_cycle
        sub ecx,esp
        jne outer_cycle
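The ESP-based loop above can be traced with concrete numbers. This is only a sketch: the counts 3 and 4 and the INC AX body are example values, and it assumes nothing on the old stack is needed any more once ESP is reloaded.

```asm
; Needs .386. The counts 3 and 4 are example values.
        mov esp,00010000h       ; SP = 0000h, upper word of ESP = 0001h
        xor ax,ax               ; AX counts the inner passes
        mov ecx,3*10000h        ; outer count in the upper word of ECX
outer_cycle:
        mov cx,4                ; inner count in the lower word
inner_cycle:
        inc ax
        loop inner_cycle        ; leaves CX = 0
        sub ecx,esp             ; ECX: 00030000h -> 00020000h -> 00010000h -> 0
        jne outer_cycle
        ; AX = 12 here
```

The SUB only gives the right result while nothing is left pushed on the stack, since a pending PUSH changes SP. Interrupts are harmless because they restore SP before returning.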
This can be combined with the other nested loop method. It is also a possible
technique for using the upper words of the 32-bit registers without a couple of
SHRs. The disadvantage is that CX must be zero when the SUB ECX,ESP occurs. But
if we restrict the usage of the upper word to 15 bits, the lower word can be
anything. Example:
        mov ebx,(cyclenum-1)*10000h
cycle:
        [cycle code]
        sub ebx,00010000h
        jns cycle
BX can hold any value, and the loop counter won't touch it. Cyclenum-1 can be
at most 8000h.
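As a sketch of what the free lower word buys us, here BX serves as a pointer while the upper word of EBX counts the passes. The 4-byte buffer at DS:buffer is a hypothetical name introduced for this example.

```asm
; Hypothetical 4-byte buffer at DS:buffer; needs .386 for EBX.
        mov ebx,(4-1)*10000h    ; upper word = 3: four passes
        mov bx,offset buffer    ; lower word = pointer, upper word unchanged
cycle:
        mov byte ptr [bx],0     ; clear one byte
        inc bx                  ; 16-bit INC never carries into the upper word
        sub ebx,00010000h       ; upper word 3 -> 2 -> 1 -> 0 -> FFFFh
        jns cycle               ; leave once the upper word wraps to FFFFh
```

The loop body runs with the upper word at 3, 2, 1 and 0, so the four bytes of the buffer are cleared with a single register doing both jobs.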
Another small thing concerning the stack: on 386+, after any instruction that
modifies SS, interrupts are held off until the next instruction has executed.
So we can save that CLI/STI pair.
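A stack switch relying on this could be sketched as follows. The names new_stack_seg and new_stack_top are hypothetical; the point is only that the SP load sits directly behind the SS load.

```asm
; Hypothetical names: new_stack_seg and new_stack_top.
        mov ax,new_stack_seg
        mov ss,ax               ; interrupts are blocked for one instruction
        mov sp,new_stack_top    ; must follow immediately, no CLI/STI needed
```

If anything else is placed between the two loads, an interrupt can arrive with the new SS and the old SP, so the pair has to stay together.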
* REP zeroes CX *
Usual problem: mem->mem copy. DS:SI, ES:DI, and CX are prepared but
        rep movsb
is slow... And
        shr cx,1
        jnb copyeven
        movsb
copyeven:
        je copyready
        rep movsw
copyready:
is also slow...
Then comes the light:
        shr cx,1
        rep movsw
        adc cx,cx
        rep movsb
sounds good.
I found it in the SSI Spring '94 Software demo by Future Crew. Yes, I debugged
it! And it was worth it... Remember, LOOP also zeroes (E)CX.
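Why the four-instruction copy works becomes clear when it is traced with an odd count. In this sketch src_buf and dst_buf are hypothetical 7-byte buffers, and DS = ES is assumed.

```asm
; Hypothetical buffers src_buf / dst_buf, 7 bytes each; assumes DS = ES.
        cld                     ; string instructions count upwards
        mov si,offset src_buf
        mov di,offset dst_buf
        mov cx,7                ; odd byte count
        shr cx,1                ; CX = 3, CF = 1 (the odd bit)
        rep movsw               ; copies 6 bytes, CX = 0, CF still 1
        adc cx,cx               ; CX = 0 + 0 + 1 = 1
        rep movsb               ; copies the 7th byte, CX = 0 again
```

REP MOVSW leaves CX at zero and doesn't touch the flags, so ADC CX,CX turns the carry from SHR into a count of 0 or 1 without any jump.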
* Puzzle *
Let's assume that EAX contains 0, except for the lowest 8 bits (AL). How many
instructions are needed to fill the upper 3 bytes of EAX with AL (without any
pre-calculated tables)? E.g. if EAX=000000E3h, it should be transformed to
E3E3E3E3h.
If you wish to work it out yourself, think about it a little before you read
further.
Note: The solution to this will be in Code Gems 2!