Sunday, April 1, 2012

I found this sequence, which is used to clear a byte of memory:

clr r2 # clocks: 10 bytes: 2
movb r2, *r1 # clocks: 4+14+6 = 24 bytes: 2
total: 34 clocks, 4 bytes

For indexed memory locations:

clr r2 # clocks: 10 bytes: 2
movb r2, @1(r1) # clocks: 4+14+8 = 26 bytes: 4
total: 36 clocks, 6 bytes

I can do better than that using subtract instructions:

sb *r1, *r1 # clocks: 4+14+6+6 = 30 bytes: 2
total: 30 clocks, 2 bytes

sb @1(r1), @1(r1) # clocks: 4+14+8+8 = 34 bytes: 6
total: 34 clocks, 6 bytes

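The first form is about 12% faster (34 down to 30 clocks) and two bytes shorter; the second is about 6% faster (36 down to 34 clocks) at the same size. Neither form needs r2 as a scratch register. This isn't a huge improvement, but it's still better. This is probably best added as a peephole optimization.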

And now it's in there.
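
For anyone who wants to check the percentages, here is a small C sketch that redoes the arithmetic. The clock and byte figures are the ones from the listings above; the struct and function names are made up for this example and have nothing to do with the compiler source.

```c
#include <assert.h>
#include <stdio.h>

/* Clock and byte totals for one way of clearing a byte. */
struct seq_cost {
    const char *name;
    int clocks;
    int bytes;
};

/* Percentage of clocks saved when 'before' is replaced by 'after'. */
double percent_saved(int before, int after)
{
    assert(before > 0);
    return 100.0 * (before - after) / before;
}

/* Print the old and new cost for one addressing mode. */
void print_comparison(struct seq_cost before, struct seq_cost after)
{
    printf("%s: %d clocks, %d bytes\n", before.name, before.clocks, before.bytes);
    printf("%s: %d clocks, %d bytes\n", after.name, after.clocks, after.bytes);
    printf("  saves %.1f%% of the clocks and %d bytes\n",
           percent_saved(before.clocks, after.clocks),
           before.bytes - after.bytes);
}

/* Totals copied from the listings above (per-instruction sums). */
void report_clear_byte_costs(void)
{
    struct seq_cost indirect_old = { "clr + movb *r1",    10 + 24, 2 + 2 };
    struct seq_cost indirect_new = { "sb *r1, *r1",       30,      2     };
    struct seq_cost indexed_old  = { "clr + movb @1(r1)", 10 + 26, 2 + 4 };
    struct seq_cost indexed_new  = { "sb @1(r1), @1(r1)", 34,      6     };

    print_comparison(indirect_old, indirect_new);
    print_comparison(indexed_old, indexed_new);
}
```

Calling report_clear_byte_costs() from main prints both comparisons. The percentages are relative to the original sequence, which is how the 12% and 6% figures were rounded.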

There's also this sequence, which I'm not happy about. It's the result of:

unsigned char a = (((unsigned char)val & (char)0x0F) + (char)'0');

mov r2, r5
swpb r5
srl r5, 8
andi r5, >F
ai r5, >30
swpb r5

I think this would be better:

mov r2, r5
andi r5, >F
ai r5, >30
swpb r5
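
To convince myself that dropping the swpb/srl pair is safe, both sequences can be modelled in C on a 16-bit register and compared against the source expression for every possible input. This is only a sketch: the function names are invented for the example, status flags are ignored, and it assumes byte values are kept in the high byte of a register, which is what the trailing swpb suggests.

```c
#include <assert.h>
#include <stdint.h>

/* swpb: swap the two bytes of a 16-bit register. */
uint16_t swpb(uint16_t r)
{
    return (uint16_t)((r << 8) | (r >> 8));
}

/* The sequence the compiler currently emits. */
uint16_t digit_original(uint16_t r2)
{
    uint16_t r5 = r2;              /* mov  r2, r5  */
    r5 = swpb(r5);                 /* swpb r5      */
    r5 = (uint16_t)(r5 >> 8);      /* srl  r5, 8   */
    r5 &= 0x000F;                  /* andi r5, >F  */
    r5 = (uint16_t)(r5 + 0x30);    /* ai   r5, >30 */
    return swpb(r5);               /* swpb r5      */
}

/* The shorter sequence proposed above. */
uint16_t digit_proposed(uint16_t r2)
{
    uint16_t r5 = r2;              /* mov  r2, r5  */
    r5 &= 0x000F;                  /* andi r5, >F  */
    r5 = (uint16_t)(r5 + 0x30);    /* ai   r5, >30 */
    return swpb(r5);               /* swpb r5      */
}

/* Check both sequences against the C expression for all 16-bit inputs. */
void check_digit_sequences(void)
{
    for (uint32_t v = 0; v <= 0xFFFF; v++) {
        uint16_t val = (uint16_t)v;
        unsigned char a = (unsigned char)(((unsigned char)val & (char)0x0F) + (char)'0');
        uint16_t expected = (uint16_t)(a << 8);  /* byte result sits in the high byte */

        assert(digit_original(val) == expected);
        assert(digit_proposed(val) == expected);
    }
}
```

The andi with >F already throws away everything above the low nibble, so the zero extension done by swpb and srl beforehand has no effect on the result.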

So I've added an optimization for (int)X = (unsigned char)((int)X). This replaces:
mov r2, r5 # clocks:14
swpb r5 # clocks:10
srl r5, 8 # clocks:12+16
# total=52 clocks

with:
mov r2, r5 # clocks: 14
andi r5, >00FF # clocks: 14+4
# total=32 clocks

This is about 1.6 times as fast (52 clocks down to 32), and in the case where no MOV is needed it is more than twice as fast (38 clocks down to 18). This makes me happy again.
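
The replacement only works because a byte swap followed by a logical shift right by 8 leaves exactly the low byte of the register, which is what the >00FF mask produces. Here is a brute-force C sketch of that claim over all 16-bit values. The function names are illustrative only, status flags are ignored, and it assumes the shift is a logical one (zero fill), as in the srl listing further up.

```c
#include <assert.h>
#include <stdint.h>

/* swpb r5 followed by a logical shift right by 8. */
uint16_t zero_extend_by_shift(uint16_t r)
{
    uint16_t swapped = (uint16_t)((r << 8) | (r >> 8));  /* swpb */
    return (uint16_t)(swapped >> 8);                     /* shift right 8, zero fill */
}

/* andi r5, >00FF */
uint16_t zero_extend_by_mask(uint16_t r)
{
    return (uint16_t)(r & 0x00FF);
}

/* Assert that the two forms agree for every 16-bit input. */
void check_zero_extend_equivalence(void)
{
    for (uint32_t v = 0; v <= 0xFFFF; v++) {
        uint16_t r = (uint16_t)v;
        assert(zero_extend_by_shift(r) == zero_extend_by_mask(r));
    }
}
```

An arithmetic shift would not pass this check, since it would copy bit 7 of the byte into the upper half of the register.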
