Published on: Tuesday February 13 2018
This article is part of a series of handson tutorials introducing FFA, or the Finite Field Arithmetic library. FFA differs from the typical “Open Sores” abomination, in that — rather than trusting the author blindly with their lives — prospective users are expected to read and fully understand every single line. In exactly the same manner that you would understand and pack your own parachute. The reader will assemble and test a working FFA with his own hands, and at the same time grasp the purpose of each moving part therein.
You will need:
Add the above vpatch and seal to your Vset, and press to ffa_ch10_karatsuba.vpatch.
You should end up with the same directory structure as previously.
Now compile ffacalc:
But do not run it quite yet.
First things first:
 The mail bag from Chapter 9 will be dealt with in Chapter 11.
 This Chapter may require a cup of coffee.
 This Chapter is dedicated to my most dedicated readers, such that I did not even know that I had, who publicly raged when it did not appear on schedule.
Recall the algebraic identities in Mul_Word of Chapter 9, where a single Word × Word multiplication is turned into four HalfWord × HalfWord ones. Let’s consider the general case.
Suppose that b is a positive integer, and we would like to multiply a 2bbitwide X by a 2bbitwide Y. We can express the multiplicands algebraically as follows:
X = X_{Lo} + 2^{b}X_{Hi}
Y = Y_{Lo} + 2^{b}Y_{Hi}
…where N_{Lo} consists of any 2bwide integer N’s leastsignificant b binary digits, while N_{Hi} consists of its b mostsignificant binary digits. The multiplication itself can then be written as:
XY = (X_{Lo} + 2^{b}X_{Hi})(Y_{Lo} + 2^{b}Y_{Hi})
.. = X_{Lo}Y_{Lo} + 2^{b}X_{Lo}Y_{Hi} + 2^{b}X_{Hi}Y_{Lo} + 2^{b}X_{Hi}2^{b}Y_{Hi}
.. = X_{Lo}Y_{Lo} + 2^{b}(X_{Lo}Y_{Hi} + X_{Hi}Y_{Lo}) + 2^{2b}X_{Hi}Y_{Hi}
…or pictured “physically”, drawn to scale (suppose that the most junior bit position in a given register is at the left edge of the rectangle representing said register; the most senior — at the right edge) :

X_{Lo} 
X_{Hi} 

× 
Y_{Lo} 
X_{Hi} 


= 


X_{Lo}Y_{Lo} 

+ 

X_{Lo}Y_{Hi} 

+ 

X_{Hi}Y_{Lo} 

+ 

X_{Hi}Y_{Hi} 
= 
XY 
Performing this “schoolbook” 2b × 2b multiplication costs four b × bwide multiplications (multiplications by powers of two are performed using inexpensive position shifting and do not figure towards this count) and three 2bwide additions. This cost, as we learned in previous chapters, is proportional to the square of b. But in fact it is possible to do better — using a subquadratic multiplication algorithm.
We will make use of the oldest, bestknown, and simplest such algorithm: the classic one derived from A. A. Karatsuba’s (1937 – 2008) “Multiplication of manydigital numbers by automatic computers”, Dokl. Akad. Nauk SSSR, 145:2 (1962), 293–294.
After working through this Chapter, the reader will possess a fitsinhead, constantspacetime and provablycorrect computerized implementation of Karatsuba’s algorithm.
First, let’s specify that our multiplicands X and Y are FZintegers, and each of them occupies exactly L Words.
Recall that in Chapter 4, we have mandated that no FZ must come into existence with a bitwise length that is not a positive power of two. Therefore, at all times, a FZ will break cleanly into two halves of equal bitnesses, as many times as desired until halves each consisting of a FZ one Word in length are formed.
And so, at every recursive (see further below) invocation of our procedure, we are able to begin by finding a positive integer k (where L = 2k) i.e. the length of a multiplicand cut in half.
Thereby we can take b = MachineBitness × k, where k = L / 2.
Let’s also define, for brevity, the following recurring terms:
LL = X_{Lo}Y_{Lo}
HH = X_{Hi}Y_{Hi}
MM = X_{Lo}Y_{Hi} + X_{Hi}Y_{Lo}
And trivially obtain the equation:
XY = LL + 2^{b}MM + 2^{2b}HH
Karatsuba’s discovery is the fact that it is possible to compute the MM middle term using only one b × b multiplication in place of two. In the most common implementations of the algorithm, this appears as the equivalence:
MM = X_{Lo}Y_{Hi} + X_{Hi}Y_{Lo}
.. = (X_{Lo} + X_{Hi})(Y_{Lo} + Y_{Hi})  HH  LL
This algebraic identity replaces an expensive operation (multiplication) with three cheap (addition/subtraction) arithmetical operations. However, the possibility of generating a carry during addition when computing the terms of the binomial makes the mechanism slightly slower and uglier than it strictly has to be. So we will instead use another variant of the algorithm, sometimes called subtractive Karatsuba. It makes use of the following trivial equivalence:
MM = LL + HH  (X_{Lo}  X_{Hi})(Y_{Lo}  Y_{Hi})
In FFA’s realization, this will take the following — at first, seemingly gnarlier — form:
Dx = X_{Lo}  X_{Hi}
Dy = Y_{Lo}  Y_{Hi}
DD = Dx × Dy
MM = LL + HH  (1^{CX})(1^{CY})DD
…where C_{X} is the sign (in the convention where it equals 1 if X_{Lo} is less than X_{Hi}, and 0 otherwise) of X_{Lo} – X_{Hi} and C_{Y} is the sign (similarly) of Y_{Lo} – Y_{Hi}.
And it is not difficult to show that, if we take
DD_{Sub} = C_{X} XNOR C_{Y}
then:
MM = LL + HH + (1^{DDSub})DD
Or, for kindergarten:
Or, for the thick, for the impatient, 
C_{X} 
C_{Y} 
DD_{Sub} 
(1^{DDSub})DD 
MM = 
0 
0 
1 
DD 
LL + HH  DD 
0 
1 
0 
DD 
LL + HH + DD 
1 
0 
0 
DD 
LL + HH + DD 
1 
1 
1 
DD 
LL + HH  DD 
…i.e. the DD term ends up subtracted if (X_{Lo} – X_{Hi})(Y_{Lo} – Y_{Hi}) is positive; otherwise it gets added. Which is as it should be. And this gives us the formula:
XY = LL + 2^{b}(LL + HH + (1^{DDSub})DD) + 2^{2b}HH
And now let’s draw a possible shape for this mechanism “electrically”. First, let’s do the obvious Right Thing and begin by stuffing LL and HH into the lower and upper, respectively, halves of the register in which we are computing the sum. And after that, we will add the “middle” terms, at the requisite offset (i.e. at the bth binary place) :

LL 
HH 
+ 

LL 
0 
+ 

HH 
0 
+ 

(1^{DDSub})DD 
0 
= 
XY 
The above would work as a physical realization, but our 2bsized additions sadly became 3bsized, because of the need to ripple out the carries (recall that the addition of any two ssized integers is physically s+1sized, as there is the possibility of overflow.) But can we do anything about this, or do we have to live with it?
Turns out: there is a pill. One economical way to make the additions stay 2bsized, is to accumulate the carries. We will put them in a Wordsized register called TC (i.e. tail carry). After the three terms of the original equation have been added into the result XY, we will make the result correct by rippling out (at the cost of walking over a bsized space) the accumulated carry in TC into the tail (i.e. most senior quarter segment) of the result XY.
The new scheme looks like this:

LL 
HH 
TC := 0 
+ 

LL 

TC += Carry 
+ 

HH 

TC += Carry 
+ 

(1^{DDSub})DD 

TC += Carry – DD_{Sub} 
+ 

TC 

= 
XY 

The reason why, after the third addition, we must…
TC += Carry  DD_{Sub}
…is not uninteresting, and I am quite tempted to leave it as an exercise. And in fact I think I will:
Chapter 10, Exercise 1: Why is it necessary to subtract DD_{Sub} from TC ?
Observe that we can simplify the program we are about to write, by proving the lemma:
TC >= 0
The fact that TC is greater than or equal to zero by the time it gets rippled, trivially follows from the fact that:
MM = X_{Lo}Y_{Hi} + X_{Hi}Y_{Lo}
There is no way for the value of above expression to become negative, given that no subtraction happens in it, and that both of the multiplicands in both of the terms, are positive! From which it necessarily follows that, given as…
MM = LL + HH + (1^{DDSub})DD
…it must then at all times be true that:
LL + HH >= (1^{DDSub})DD
And therefore it will never become necessary to propagate a “borrow” when rippling TC into the tail, and we can use a strictly “additive” formulation in the pertinent code.
But what is the maximum possible value of TC ? It so happens that the answer is: two. And we can show… but wait. Why not leave this as an exercise ! So :
Chapter 10, Exercise 2: Can the value of TC ever become greater than 2 ? Give an example of a Karatsubaization where it does; or if it cannot, prove that it cannot.
And now let’s put all of this together into a mechanism:
FZ_Mul.adb:
procedure Mul_Karatsuba(X : in FZ;
Y : in FZ;
XY : out FZ) is
 L is the wordness of a multiplicand. Guaranteed to be a power of two.
L : constant Word_Count := X'Length;
 An 'LSeg' is the same length as either multiplicand.
subtype LSeg is FZ(1 .. L);
 K is HALF of the length of a multiplicand.
K : constant Word_Index := L / 2;
 A 'KSeg' is the same length as HALF of a multiplicand.
subtype KSeg is FZ(1 .. K);
 The three Lsized variables of the product equation, i.e.:
 XY = LL + 2^(K*Bitness)(LL + HH + (1^DD_Sub)*DD) + 2^(2*K*Bitness)HH
LL, DD, HH : LSeg;
 Ksized terms of Dx * Dy = DD
Dx, Dy : KSeg;  Dx = abs(XLo  XHi) , Dy = abs(YLo  YHi)
 Subtraction borrows, signs of (XL  XH) and (YL  YH),
Cx, Cy : WBool;  so that we can calculate (1^DD_Sub)
 Bottom and Top Ksized halves of the multiplicand X.
XLo : KSeg renames X( X'First .. X'Last  K );
XHi : KSeg renames X( X'First + K .. X'Last );
 Bottom and Top Ksized halves of the multiplicand Y.
YLo : KSeg renames Y( Y'First .. Y'Last  K );
YHi : KSeg renames Y( Y'First + K .. Y'Last );
 Lsized middle segment of the product XY (+/ K from the midpoint).
XYMid : LSeg renames XY( XY'First + K .. XY'Last  K );
 Bottom and Top Lsized halves of the product XY.
XYLo : LSeg renames XY( XY'First .. XY'Last  L );
XYHi : LSeg renames XY( XY'First + L .. XY'Last );
 Topmost Ksized quarter segment of the product XY, or 'tail'
XYHiHi : KSeg renames XYHi( XYHi'First + K .. XYHi'Last );
 Whether the DD term is being subtracted.
DD_Sub : WBool;
 Carry from individual term additions.
C : WBool;
 TailCarry accumulator, for the final ripple
TC : Word;
begin
 Recurse: LL := XL * YL
FZ_Multiply(XLo, YLo, LL);
 Recurse: HH := XH * YH
FZ_Multiply(XHi, YHi, HH);
 Dx := XL  XH , Cx := Borrow (i.e. 1 iff XL < XH)
FZ_Sub_Abs(X => XLo, Y => XHi, Difference => Dx, Underflow => Cx);
 Dy := YL  YH , Cy := Borrow (i.e. 1 iff YL < YH)
FZ_Sub_Abs(X => YLo, Y => YHi, Difference => Dy, Underflow => Cy);
 Recurse: DD := Dx * Dy
FZ_Multiply(Dx, Dy, DD);
 Whether (XL  XH)(YL  YH) is positive, and so DD must be subtracted:
DD_Sub := 1  (Cx xor Cy);
 XY := LL + 2^(2 * K * Bitness) * HH
XYLo := LL;
XYHi := HH;
 XY += 2^(K * Bitness) * HH, but carry goes in Tail Carry accum.
FZ_Add_D(X => XYMid, Y => HH, Overflow => TC);
 XY += 2^(K * Bitness) * LL, ...
FZ_Add_D(X => XYMid, Y => LL, Overflow => C);
 ... but the carry goes into the Tail Carry accumulator.
TC := TC + C;
 XY += 2^(K * Bitness) * (1^DD_Sub) * DD
FZ_Not_Cond_D(N => DD, Cond => DD_Sub);  invert DD if 2scomplementing
FZ_Add_D(OF_In => DD_Sub,  ... and then must increment
X => XYMid,
Y => DD,
Overflow => C);  carry will go in Tail Carry accumulator
 Compute the final Tail Carry for the ripple
TC := TC + C  DD_Sub;
 Barring a cosmic ray, 0 < = TC <= 2 .
pragma Assert(TC <= 2);
 Ripple the Tail Carry into the tail.
FZ_Add_D_W(X => XYHiHi, W => TC, Overflow => C);
 Barring a cosmic ray, the tail ripple will NOT overflow.
pragma Assert(C = 0);
end Mul_Karatsuba;
But… Does this routine work? How fast does it run? What do the new procedures like FZ_Not_Cond_D do? Where and when do we recurse ? Why did Comba’s Multiplier have to change at all !? And, most importantly, is the proof “proofy” enough ?? Find out in Chapter 11!
~To be continued!~