*This article is part of a series of hands-on tutorials introducing FFA, or the Finite Field Arithmetic library. FFA differs from the typical “Open Sores” abomination, in that — rather than trusting the author blindly with their lives — prospective users are expected to read and fully understand every single line. In exactly the same manner that you would understand and pack your own parachute. The reader will assemble and test a working FFA with his own hands, and at the same time grasp the purpose of each moving part therein.*

- Chapter 1: Genesis.
- Chapter 2: Logical and Bitwise Operations.
- Chapter 3: Shifts.
- Chapter 4: Interlude: FFACalc.
- Chapter 5: “Egyptological” Multiplication and Division.
- Chapter 6: “Geological” RSA.
- Chapter 7: “Turbo Egyptians.”
- Chapter 8: Interlude: Randomism.
- Chapter 9: “Exodus from Egypt” with Comba’s Algorithm.
**Chapter 10: Introducing Karatsuba’s Multiplication.**

You will need:

**All**of the materials from Chapters 1 – 9.- ffa_ch10_karatsuba.vpatch
- ffa_ch10_karatsuba.vpatch.asciilifeform.sig

Add the above *vpatch* and *seal* to your V-set, and *press* to *ffa_ch10_karatsuba.vpatch*.

You should end up with the same directory structure as previously.

Now compile *ffacalc*:

cd ffacalc gprbuild

But **do not run it quite yet.**

First things first:

- The
**mail bag**from Chapter 9 will be dealt with in Chapter 11. - This Chapter may
**require a cup of coffee**. - This Chapter is dedicated to my most dedicated readers, such that I did not even know that I had, who
**publicly raged**when it did not appear on schedule.

Recall the algebraic identities in Mul_Word of Chapter 9, where a single *Word* × *Word* multiplication is turned into four *HalfWord* × *HalfWord* ones. Let’s consider the general case.

Suppose that *b* is a positive integer, and we would like to multiply a *2b*-bit-wide *X* by a *2b*-bit-wide *Y*. We can express the multiplicands algebraically as follows:

X = X_{Lo} + 2^{b}X_{Hi}

Y = Y_{Lo} + 2^{b}Y_{Hi}

…where *N _{Lo}* consists of any

*2b*-wide integer

*N*’s least-significant

*b*binary digits, while

*N*consists of its

_{Hi}*b*most-significant binary digits. The multiplication itself can then be written as:

XY = (X_{Lo} + 2^{b}X_{Hi})(Y_{Lo} + 2^{b}Y_{Hi})

.. = X_{Lo}Y_{Lo} + 2^{b}X_{Lo}Y_{Hi} + 2^{b}X_{Hi}Y_{Lo} + 2^{b}X_{Hi}2^{b}Y_{Hi}

.. = X_{Lo}Y_{Lo} + 2^{b}(X_{Lo}Y_{Hi} + X_{Hi}Y_{Lo}) + 2^{2b}X_{Hi}Y_{Hi}

…or pictured “physically”, drawn to scale (suppose that the most junior bit position in a given register is at the left edge of the rectangle representing said register; the most senior — at the right edge) :

X_{Lo} |
X_{Hi} |
||||

× | Y_{Lo} |
X_{Hi} |
|||

= | |||||

X_{Lo}Y_{Lo} |
|||||

+ | X_{Lo}Y_{Hi} |
||||

+ | X_{Hi}Y_{Lo} |
||||

+ | X_{Hi}Y_{Hi} |
||||

= | XY |

Performing this “schoolbook” *2b* × *2b* multiplication costs *four* *b* × *b*-wide multiplications (multiplications by powers of two are performed using inexpensive position shifting and do not figure towards this count) and *three* *2b*-wide additions. This cost, as we learned in previous chapters, is proportional to the *square* of *b*. But in fact it *is* possible to do better — using a *subquadratic* multiplication algorithm.

We will make use of the oldest, best-known, and **simplest** such algorithm: the classic one derived from A. A. Karatsuba’s (1937 – 2008) “Multiplication of many-digital numbers by automatic computers”, Dokl. Akad. Nauk SSSR, 145:2 (1962), 293–294.

After working through this Chapter, the reader will possess a *fits-in-head*, **constant-spacetime** and *provably-correct* computerized implementation of Karatsuba’s algorithm.

First, let’s specify that our multiplicands *X* and *Y* are *FZ*-integers, and each of them occupies exactly *L* *Word*s.

Recall that in Chapter 4, we have mandated that no *FZ* must come into existence with a bitwise length that is not a positive power of two. Therefore, at all times, a *FZ* will break cleanly into two halves of equal bitnesses, as many times as desired until halves each consisting of a *FZ* one *Word* in length are formed.

And so, at every recursive (see further below) invocation of our procedure, we are able to begin by finding a positive integer *k* (where *L = 2k*) i.e. the length of a multiplicand cut in half.

Thereby we can take *b* = *MachineBitness* × *k*, where *k* = *L* / 2.

Let’s also define, for brevity, the following recurring terms:

LL = X_{Lo}Y_{Lo}

HH = X_{Hi}Y_{Hi}

MM = X_{Lo}Y_{Hi} + X_{Hi}Y_{Lo}

And trivially obtain the equation:

XY = LL + 2^{b}MM + 2^{2b}HH

Karatsuba’s discovery is the fact that it is possible to compute the *MM* middle term using *only one* *b × b* multiplication in place of two. In the most common implementations of the algorithm, this appears as the equivalence:

MM = X_{Lo}Y_{Hi} + X_{Hi}Y_{Lo}

.. = (X_{Lo} + X_{Hi})(Y_{Lo} + Y_{Hi}) - HH - LL

This algebraic identity replaces an *expensive* operation (multiplication) with three *cheap* (addition/subtraction) arithmetical operations. However, the possibility of generating a carry during addition when computing the terms of the binomial makes the mechanism slightly slower and uglier than it strictly has to be. So we will instead use another variant of the algorithm, sometimes called *subtractive Karatsuba*. It makes use of the following trivial equivalence:

MM = LL + HH - (X_{Lo} - X_{Hi})(Y_{Lo} - Y_{Hi})

In *FFA*’s realization, this will take the following — at first, seemingly gnarlier — form:

Dx = |X_{Lo} - X_{Hi}|

Dy = |Y_{Lo} - Y_{Hi}|

DD = Dx × Dy

MM = LL + HH - (-1^{CX})(-1^{CY})DD

…where C_{X} is the sign (in the convention where it equals 1 if X_{Lo} is less than X_{Hi}, and 0 otherwise) of X_{Lo} – X_{Hi} and C_{Y} is the sign (similarly) of Y_{Lo} – Y_{Hi}.

And it is not difficult to show that, if we take

DD_{Sub} = C_{X} XNOR C_{Y}

then:

MM = LL + HH + (-1^{DDSub})DD

Or, for kindergarten:

Or, for the thick, for the impatient, | ||||

C_{X} |
C_{Y} |
DD_{Sub} |
(-1^{DDSub})DD |
MM = |
---|---|---|---|---|

0 | 0 | 1 | -DD | LL + HH - DD |

0 | 1 | 0 | DD | LL + HH + DD |

1 | 0 | 0 | DD | LL + HH + DD |

1 | 1 | 1 | -DD | LL + HH - DD |

…i.e. the *DD* term ends up **subtracted** if (X_{Lo} – X_{Hi})(Y_{Lo} – Y_{Hi}) is positive; otherwise it gets **added**. Which is as it should be. And this gives us the formula:

XY = LL + 2^{b}(LL + HH + (-1^{DDSub})DD) + 2^{2b}HH

And now let’s draw a possible shape for this mechanism “electrically”. First, let’s do the obvious Right Thing and begin by stuffing *LL* and *HH* into the lower and upper, respectively, halves of the register in which we are computing the sum. And after that, we will add the “middle” terms, at the requisite offset (i.e. at the *b*-th binary place) :

LL | HH | |||

+ | LL | 0 | ||

+ | HH | 0 | ||

+ | (-1^{DDSub})DD |
0 | ||

= | XY |

The above would work as a physical realization, but our 2*b*-sized additions sadly became 3*b*-sized, because of the need to ripple out the carries (recall that the addition of any two *s*-sized integers is physically *s+1*-sized, as there is the possibility of overflow.) But can we do anything about this, or do we have to live with it?

Turns out: there is a pill. One economical way to make the additions stay *2b*-sized, is to accumulate the carries. We will put them in a *Word*-sized register called *TC* (i.e. *tail carry*). After the three terms of the original equation have been added into the result *XY*, we will make the result correct by rippling out (at the cost of walking over a *b*-sized space) the accumulated carry in *TC* into the *tail* (i.e. most senior quarter segment) of the result *XY*.

The new scheme looks like this:

LL | HH | TC := 0 | ||||

+ | LL | TC += Carry | ||||

+ | HH | TC += Carry | ||||

+ | (-1^{DDSub})DD |
TC += Carry – DD_{Sub} |
||||

+ | TC | |||||

= | XY |

The reason why, after the third addition, we must…

TC += Carry - DD_{Sub}

…is not uninteresting, and I am quite tempted to leave it as **an exercise.** And in fact I think I will:

####
**Chapter 10, Exercise 1: ***Why is it necessary to subtract DD*_{Sub} from TC ?

*Why is it necessary to subtract DD*

_{Sub}from TC ?Observe that we can simplify the program we are about to write, by proving the lemma:

TC >= 0

The fact that *TC* is greater than or equal to zero by the time it gets rippled, trivially follows from the fact that:

MM = X_{Lo}Y_{Hi} + X_{Hi}Y_{Lo}

There is no way for the value of above expression to become negative, given that no subtraction happens in it, and that both of the multiplicands in both of the terms, are positive! From which it necessarily follows that, given as…

MM = LL + HH + (-1^{DDSub})DD

…it must then at all times be true that:

LL + HH >= (-1^{DDSub})DD

And therefore it will never become necessary to propagate a “borrow” when rippling *TC* into the *tail*, and we can use a strictly “additive” formulation in the pertinent code.

But what is the **maximum** possible value of *TC* ? It so happens that the answer is: **two**. And we can show… but wait. Why not leave this as **an exercise** ! So :

####
**Chapter 10, Exercise 2: ***Can the value of TC ever become greater than 2 ? Give an example of a Karatsubaization where it does; or if it cannot, prove that it cannot. *

*Can the value of TC ever become greater than 2 ? Give an example of a Karatsubaization where it does; or if it cannot, prove that it cannot.*

And now let’s put all of this together into a **mechanism**:

*FZ_Mul.adb*:

procedure Mul_Karatsuba(X : in FZ; Y : in FZ; XY : out FZ) is -- L is the wordness of a multiplicand. Guaranteed to be a power of two. L : constant Word_Count := X'Length; -- An 'LSeg' is the same length as either multiplicand. subtype LSeg is FZ(1 .. L); -- K is HALF of the length of a multiplicand. K : constant Word_Index := L / 2; -- A 'KSeg' is the same length as HALF of a multiplicand. subtype KSeg is FZ(1 .. K); -- The three L-sized variables of the product equation, i.e.: -- XY = LL + 2^(K*Bitness)(LL + HH + (-1^DD_Sub)*DD) + 2^(2*K*Bitness)HH LL, DD, HH : LSeg; -- K-sized terms of Dx * Dy = DD Dx, Dy : KSeg; -- Dx = abs(XLo - XHi) , Dy = abs(YLo - YHi) -- Subtraction borrows, signs of (XL - XH) and (YL - YH), Cx, Cy : WBool; -- so that we can calculate (-1^DD_Sub) -- Bottom and Top K-sized halves of the multiplicand X. XLo : KSeg renames X( X'First .. X'Last - K ); XHi : KSeg renames X( X'First + K .. X'Last ); -- Bottom and Top K-sized halves of the multiplicand Y. YLo : KSeg renames Y( Y'First .. Y'Last - K ); YHi : KSeg renames Y( Y'First + K .. Y'Last ); -- L-sized middle segment of the product XY (+/- K from the midpoint). XYMid : LSeg renames XY( XY'First + K .. XY'Last - K ); -- Bottom and Top L-sized halves of the product XY. XYLo : LSeg renames XY( XY'First .. XY'Last - L ); XYHi : LSeg renames XY( XY'First + L .. XY'Last ); -- Topmost K-sized quarter segment of the product XY, or 'tail' XYHiHi : KSeg renames XYHi( XYHi'First + K .. XYHi'Last ); -- Whether the DD term is being subtracted. DD_Sub : WBool; -- Carry from individual term additions. C : WBool; -- Tail-Carry accumulator, for the final ripple TC : Word; begin -- Recurse: LL := XL * YL FZ_Multiply(XLo, YLo, LL); -- Recurse: HH := XH * YH FZ_Multiply(XHi, YHi, HH); -- Dx := |XL - XH| , Cx := Borrow (i.e. 1 iff XL < XH) FZ_Sub_Abs(X => XLo, Y => XHi, Difference => Dx, Underflow => Cx); -- Dy := |YL - YH| , Cy := Borrow (i.e. 1 iff YL < YH) FZ_Sub_Abs(X => YLo, Y => YHi, Difference => Dy, Underflow => Cy); -- Recurse: DD := Dx * Dy FZ_Multiply(Dx, Dy, DD); -- Whether (XL - XH)(YL - YH) is positive, and so DD must be subtracted: DD_Sub := 1 - (Cx xor Cy); -- XY := LL + 2^(2 * K * Bitness) * HH XYLo := LL; XYHi := HH; -- XY += 2^(K * Bitness) * HH, but carry goes in Tail Carry accum. FZ_Add_D(X => XYMid, Y => HH, Overflow => TC); -- XY += 2^(K * Bitness) * LL, ... FZ_Add_D(X => XYMid, Y => LL, Overflow => C); -- ... but the carry goes into the Tail Carry accumulator. TC := TC + C; -- XY += 2^(K * Bitness) * (-1^DD_Sub) * DD FZ_Not_Cond_D(N => DD, Cond => DD_Sub); -- invert DD if 2s-complementing FZ_Add_D(OF_In => DD_Sub, -- ... and then must increment X => XYMid, Y => DD, Overflow => C); -- carry will go in Tail Carry accumulator -- Compute the final Tail Carry for the ripple TC := TC + C - DD_Sub; -- Barring a cosmic ray, 0 < = TC <= 2 . pragma Assert(TC <= 2); -- Ripple the Tail Carry into the tail. FZ_Add_D_W(X => XYHiHi, W => TC, Overflow => C); -- Barring a cosmic ray, the tail ripple will NOT overflow. pragma Assert(C = 0); end Mul_Karatsuba;

But… Does this routine **work**? How **fast** does it run? What do the **new procedures** like *FZ_Not_Cond_D* **do**? Where and when do we **recurse** ? Why did **Comba’s Multiplier** have to **change** at all !? And, most importantly, **is the proof “proofy” enough** ?? **Find out in Chapter 11!**

*~To be continued!~*