Tuesday, August 16, 2011

I was looking a 32-bit division, and I was trying to find a way to take advantage of the 16-bit DIV instruction. Here's what I came up with:

X: numerator
Y: denominator
Q: ratio
N: 16-bit radix (2^16)

X/Y = Q
X = A*N+B
Y = C*N+D
N = 1<<16

Replace terms:
(A*N+B)/(C*N+D) = Q

B will be lost due to significant figures
Multiply denominator by ((1/C)/(1/C)), this eliminates 32-bit division:
(A*N)/((N+D/C)/C) = Q

Multiply numerator by ((1/(N+D/C))/(1/(N+D/C))):
((A*N)/((N+D/C))/C = Q

Multiply numerator by ((1/2)/(1/2)), this ensures all partial terms fit into a 16-bit quantity:
((A*N/2)/((N+D/C)/2))/C = Q

Decompose into partial terms:
V1 = D/C
V2 = N/2+V1/2
V3 = (A*N/2)/V2
P = V3/C

Due to the stackup of integer truncation, there will be rounding errors. Testing over the range of valid inputs shows that the result is accurate to +-1. Another step is required to fix this approximation.

Account for the approximation (Z is error due to truncation):
Q = P+Z

Replace Q with earlier equation
P+Z = A*N/(C*N+D)

Multiply both sides by (C*N+D):
A*N = (P*C*N+P*D)+(Z*C*N+Z*D)

Divide both sides by N, solve for Z.
Terms involving D are lost due to significant figures
A - P*C = Z*C

If Z<=0, any truncation error is covered by rounding error
If Z>0, the estimate "P" is one greater than the true result

So:
if(A > P*C), P := P-1

Untested assembly below. X Passed on [r1,r2], Y Passed on [r3,r4], result passed in [r1,r2], this assumes unsigned operands

# Cycles
mov r4, r5 # 14 : Copy D to temp register
clr r4 # 10 : Clear high word, prepare for division

div r3, r4 # 124 : R4 = C/D {V1}

srl r4, 1 # 14
ai r4, >1000 # 18 : R4 = N/2+V1/2 {V2}

mov r1, r5 # 14 : Save unmodified A for later

mov r1, r2 # 14
src r2, 1 # 14
ai r2, >1000 # 18
srl r1, 1 # 14 : [R1,R2] = A*N/2
div r1, r4 # 124 : R1 = (A*N/2)/V2 {V3}

mov r1, r2 # 14
clr r1 # 10
div r3, r1 # 124 : R1 = V3/C {P}

mpy r1, r3 # 52 : [R3,R4] = P*C
c r1, r5 # 14 : Compare A to P*C
jle +2 # 8 :
dec r1 # 10 :

mov r1, r2 # 14 : Move result into proper registers
clr r1 # 10 : [R1,R2] = P

total 634 clocks, 44 bytes (722 clocks including instruction loads)

This compares well to an earlier method using shifts and subtracts. That method uses a maximum of 7394 clocks and 50 bytes. (10082 clocks including instruction loads)

Assuming this all works out, this approach is fourteen times faster with a smaller footprint. I think I have a winner.

No comments:

Post a Comment