You will need:
Add the above vpatch and seal to your V-set, and press to ffa_ch10_karatsuba.vpatch.
You should end up with the same directory structure as previously.
Now compile ffacalc:
cd ffacalc gprbuild
But do not run it quite yet.
First things first:
Recall the algebraic identities in Mul_Word of Chapter 9, where a single Word × Word multiplication is turned into four HalfWord × HalfWord ones. Let’s consider the general case.
Suppose that b is a positive integer, and we would like to multiply a 2b-bit-wide X by a 2b-bit-wide Y. We can express the multiplicands algebraically as follows:
X = X_{Lo} + 2^{b}X_{Hi}
Y = Y_{Lo} + 2^{b}Y_{Hi}
…where N_{Lo} consists of any 2b-wide integer N’s least-significant b binary digits, while N_{Hi} consists of its b most-significant binary digits. The multiplication itself can then be written as:
XY = (X_{Lo} + 2^{b}X_{Hi})(Y_{Lo} + 2^{b}Y_{Hi})
.. = X_{Lo}Y_{Lo} + 2^{b}X_{Lo}Y_{Hi} + 2^{b}X_{Hi}Y_{Lo} + 2^{b}X_{Hi}2^{b}Y_{Hi}
.. = X_{Lo}Y_{Lo} + 2^{b}(X_{Lo}Y_{Hi} + X_{Hi}Y_{Lo}) + 2^{2b}X_{Hi}Y_{Hi}
…or pictured “physically”, drawn to scale (suppose that the most junior bit position in a given register is at the left edge of the rectangle representing said register; the most senior — at the right edge) :
X_{Lo} | X_{Hi} | ||||
× | Y_{Lo} | X_{Hi} | |||
= | |||||
X_{Lo}Y_{Lo} | |||||
+ | X_{Lo}Y_{Hi} | ||||
+ | X_{Hi}Y_{Lo} | ||||
+ | X_{Hi}Y_{Hi} | ||||
= | XY |
Performing this “schoolbook” 2b × 2b multiplication costs four b × b-wide multiplications (multiplications by powers of two are performed using inexpensive position shifting and do not figure towards this count) and three 2b-wide additions. This cost, as we learned in previous chapters, is proportional to the square of b. But in fact it is possible to do better — using a subquadratic multiplication algorithm.
We will make use of the oldest, best-known, and simplest such algorithm: the classic one derived from A. A. Karatsuba’s (1937 – 2008) “Multiplication of many-digital numbers by automatic computers”, Dokl. Akad. Nauk SSSR, 145:2 (1962), 293–294.
After working through this Chapter, the reader will possess a fits-in-head, constant-spacetime and provably-correct computerized implementation of Karatsuba’s algorithm.
First, let’s specify that our multiplicands X and Y are FZ-integers, and each of them occupies exactly L Words.
Recall that in Chapter 4, we have mandated that no FZ must come into existence with a bitwise length that is not a positive power of two. Therefore, at all times, a FZ will break cleanly into two halves of equal bitnesses, as many times as desired until halves each consisting of a FZ one Word in length are formed.
And so, at every recursive (see further below) invocation of our procedure, we are able to begin by finding a positive integer k (where L = 2k) i.e. the length of a multiplicand cut in half.
Thereby we can take b = MachineBitness × k, where k = L / 2.
Let’s also define, for brevity, the following recurring terms:
LL = X_{Lo}Y_{Lo}
HH = X_{Hi}Y_{Hi}
MM = X_{Lo}Y_{Hi} + X_{Hi}Y_{Lo}
And trivially obtain the equation:
XY = LL + 2^{b}MM + 2^{2b}HH
Karatsuba’s discovery is the fact that it is possible to compute the MM middle term using only one b × b multiplication in place of two. In the most common implementations of the algorithm, this appears as the equivalence:
MM = X_{Lo}Y_{Hi} + X_{Hi}Y_{Lo}
.. = (X_{Lo} + X_{Hi})(Y_{Lo} + Y_{Hi}) - HH - LL
This algebraic identity replaces an expensive operation (multiplication) with three cheap (addition/subtraction) arithmetical operations. However, the possibility of generating a carry during addition when computing the terms of the binomial makes the mechanism slightly slower and uglier than it strictly has to be. So we will instead use another variant of the algorithm, sometimes called subtractive Karatsuba. It makes use of the following trivial equivalence:
MM = LL + HH - (X_{Lo} - X_{Hi})(Y_{Lo} - Y_{Hi})
In FFA’s realization, this will take the following — at first, seemingly gnarlier — form:
Dx = |X_{Lo} - X_{Hi}|
Dy = |Y_{Lo} - Y_{Hi}|
DD = Dx × Dy
MM = LL + HH - (-1^{CX})(-1^{CY})DD
…where C_{X} is the sign (in the convention where it equals 1 if X_{Lo} is less than X_{Hi}, and 0 otherwise) of X_{Lo} – X_{Hi} and C_{Y} is the sign (similarly) of Y_{Lo} – Y_{Hi}.
And it is not difficult to show that, if we take
DD_{Sub} = C_{X} XNOR C_{Y}
then:
MM = LL + HH + (-1^{DDSub})DD
Or, for kindergarten:
Or, for the thick, for the impatient, | ||||
C_{X} | C_{Y} | DD_{Sub} | (-1^{DDSub})DD | MM = |
---|---|---|---|---|
0 | 0 | 1 | -DD | LL + HH - DD |
0 | 1 | 0 | DD | LL + HH + DD |
1 | 0 | 0 | DD | LL + HH + DD |
1 | 1 | 1 | -DD | LL + HH - DD |
…i.e. the DD term ends up subtracted if (X_{Lo} – X_{Hi})(Y_{Lo} – Y_{Hi}) is positive; otherwise it gets added. Which is as it should be. And this gives us the formula:
XY = LL + 2^{b}(LL + HH + (-1^{DDSub})DD) + 2^{2b}HH
And now let’s draw a possible shape for this mechanism “electrically”. First, let’s do the obvious Right Thing and begin by stuffing LL and HH into the lower and upper, respectively, halves of the register in which we are computing the sum. And after that, we will add the “middle” terms, at the requisite offset (i.e. at the b-th binary place) :
LL | HH | |||
+ | LL | 0 | ||
+ | HH | 0 | ||
+ | (-1^{DDSub})DD | 0 | ||
= | XY |
The above would work as a physical realization, but our 2b-sized additions sadly became 3b-sized, because of the need to ripple out the carries (recall that the addition of any two s-sized integers is physically s+1-sized, as there is the possibility of overflow.) But can we do anything about this, or do we have to live with it?
Turns out: there is a pill. One economical way to make the additions stay 2b-sized, is to accumulate the carries. We will put them in a Word-sized register called TC (i.e. tail carry). After the three terms of the original equation have been added into the result XY, we will make the result correct by rippling out (at the cost of walking over a b-sized space) the accumulated carry in TC into the tail (i.e. most senior quarter segment) of the result XY.
The new scheme looks like this:
LL | HH | TC := 0 | ||||
+ | LL | TC += Carry | ||||
+ | HH | TC += Carry | ||||
+ | (-1^{DDSub})DD | TC += Carry – DD_{Sub} | ||||
+ | TC | |||||
= | XY |
The reason why, after the third addition, we must…
TC += Carry - DD_{Sub}
…is not uninteresting, and I am quite tempted to leave it as an exercise. And in fact I think I will:
Observe that we can simplify the program we are about to write, by proving the lemma:
TC >= 0
The fact that TC is greater than or equal to zero by the time it gets rippled, trivially follows from the fact that:
MM = X_{Lo}Y_{Hi} + X_{Hi}Y_{Lo}
There is no way for the value of above expression to become negative, given that no subtraction happens in it, and that both of the multiplicands in both of the terms, are positive! From which it necessarily follows that, given as…
MM = LL + HH + (-1^{DDSub})DD
…it must then at all times be true that:
LL + HH >= (-1^{DDSub})DD
And therefore it will never become necessary to propagate a “borrow” when rippling TC into the tail, and we can use a strictly “additive” formulation in the pertinent code.
But what is the maximum possible value of TC ? It so happens that the answer is: two. And we can show… but wait. Why not leave this as an exercise ! So :
And now let’s put all of this together into a mechanism:
FZ_Mul.adb:
procedure Mul_Karatsuba(X : in FZ; Y : in FZ; XY : out FZ) is -- L is the wordness of a multiplicand. Guaranteed to be a power of two. L : constant Word_Count := X'Length; -- An 'LSeg' is the same length as either multiplicand. subtype LSeg is FZ(1 .. L); -- K is HALF of the length of a multiplicand. K : constant Word_Index := L / 2; -- A 'KSeg' is the same length as HALF of a multiplicand. subtype KSeg is FZ(1 .. K); -- The three L-sized variables of the product equation, i.e.: -- XY = LL + 2^(K*Bitness)(LL + HH + (-1^DD_Sub)*DD) + 2^(2*K*Bitness)HH LL, DD, HH : LSeg; -- K-sized terms of Dx * Dy = DD Dx, Dy : KSeg; -- Dx = abs(XLo - XHi) , Dy = abs(YLo - YHi) -- Subtraction borrows, signs of (XL - XH) and (YL - YH), Cx, Cy : WBool; -- so that we can calculate (-1^DD_Sub) -- Bottom and Top K-sized halves of the multiplicand X. XLo : KSeg renames X( X'First .. X'Last - K ); XHi : KSeg renames X( X'First + K .. X'Last ); -- Bottom and Top K-sized halves of the multiplicand Y. YLo : KSeg renames Y( Y'First .. Y'Last - K ); YHi : KSeg renames Y( Y'First + K .. Y'Last ); -- L-sized middle segment of the product XY (+/- K from the midpoint). XYMid : LSeg renames XY( XY'First + K .. XY'Last - K ); -- Bottom and Top L-sized halves of the product XY. XYLo : LSeg renames XY( XY'First .. XY'Last - L ); XYHi : LSeg renames XY( XY'First + L .. XY'Last ); -- Topmost K-sized quarter segment of the product XY, or 'tail' XYHiHi : KSeg renames XYHi( XYHi'First + K .. XYHi'Last ); -- Whether the DD term is being subtracted. DD_Sub : WBool; -- Carry from individual term additions. C : WBool; -- Tail-Carry accumulator, for the final ripple TC : Word; begin -- Recurse: LL := XL * YL FZ_Multiply(XLo, YLo, LL); -- Recurse: HH := XH * YH FZ_Multiply(XHi, YHi, HH); -- Dx := |XL - XH| , Cx := Borrow (i.e. 1 iff XL < XH) FZ_Sub_Abs(X => XLo, Y => XHi, Difference => Dx, Underflow => Cx); -- Dy := |YL - YH| , Cy := Borrow (i.e. 1 iff YL < YH) FZ_Sub_Abs(X => YLo, Y => YHi, Difference => Dy, Underflow => Cy); -- Recurse: DD := Dx * Dy FZ_Multiply(Dx, Dy, DD); -- Whether (XL - XH)(YL - YH) is positive, and so DD must be subtracted: DD_Sub := 1 - (Cx xor Cy); -- XY := LL + 2^(2 * K * Bitness) * HH XYLo := LL; XYHi := HH; -- XY += 2^(K * Bitness) * HH, but carry goes in Tail Carry accum. FZ_Add_D(X => XYMid, Y => HH, Overflow => TC); -- XY += 2^(K * Bitness) * LL, ... FZ_Add_D(X => XYMid, Y => LL, Overflow => C); -- ... but the carry goes into the Tail Carry accumulator. TC := TC + C; -- XY += 2^(K * Bitness) * (-1^DD_Sub) * DD FZ_Not_Cond_D(N => DD, Cond => DD_Sub); -- invert DD if 2s-complementing FZ_Add_D(OF_In => DD_Sub, -- ... and then must increment X => XYMid, Y => DD, Overflow => C); -- carry will go in Tail Carry accumulator -- Compute the final Tail Carry for the ripple TC := TC + C - DD_Sub; -- Barring a cosmic ray, 0 < = TC <= 2 . pragma Assert(TC <= 2); -- Ripple the Tail Carry into the tail. FZ_Add_D_W(X => XYHiHi, W => TC, Overflow => C); -- Barring a cosmic ray, the tail ripple will NOT overflow. pragma Assert(C = 0); end Mul_Karatsuba;
But… Does this routine work? How fast does it run? What do the new procedures like FZ_Not_Cond_D do? Where and when do we recurse ? Why did Comba’s Multiplier have to change at all !? And, most importantly, is the proof “proofy” enough ?? Find out in Chapter 11!
~To be continued!~
]]>The original source was an ancient piece of MS-Windows nagware. The perpetrator of this item (I hesitate to dignify his actions with the word “author”) took the — very clearly pre-1989, public domain — material, and embedded it in a crypted EXE “viewer” turd. Why he did this, can only be guessed at, but a certain observation by Mircea Popescu may shed some light on this particular type of psychopathology:
So : you have found somewhere something exceptional. A treasure trove. Looky, you were just walking down the road and found a chestful of wonder. What to do ? Well, confronted with such an occurence, some people proceed in the following manner : they run off to town, gather the folk and explain : “looky, I’ve found this chestful of wonder, so and so and back and forth, let’s go pick it up, stick it in a museum, so it may be inspected, admired, discussed, preserved for future generations so they may also gain by the said wonders”. Other… let’s call them people by virtue of bipedalism, proceed in a different manner : they cloink a coupla with the sledgehammer so as to break down the find into shards the size they can fit in a pocket, after which they stick it on their oxcart. Nevermind that those things aren’t made to go with oxcarts, nor do they help the oxcart in any way, the peon sits proudly on his pile of dung. And if anyone asks where he found the things, for they don’t really go with his general appearance, he answers just as proudly : he raised them since they were but pistols. (This is an ancient joke, saying that a gypsy is picked up for stealing a rifle. Before the judge, he sees a farmer brought under charges of having stolen a cow. “Why’ve you stolen the cow ?” asks the judge. “I’ve not stolen it your Honor, I raised it myself since it were a calf”. He’s eventually released, and when the gypsy’s turn comes, “Why’ve you stolen the rifle ?” asks the judge. “I’ve not stolen it your honor, I raised it myself since it were a pistol”.)
– “What happens when you add a drop of sewage to a bottle of fine wine ?”
And so, with a big, fat, Fuck You to the oxcart rider, all of it (decrypted, deduplicated, and minus some obvious rubbish) is out in the open, permanently:
File | Description | SHA256 |
---|---|---|
ro_eng_ascii.txt | Romanian < -> English Dictionary. | b10ac91292746466433b2cdb4ff67268b556463fe144e38cd20bde4057c0d5e3 |
ro_eng_tech.txt | Romanian < -> English “Technical” Dictionary. | 82f44453fe55847d5ee0e1b80697cf8a1e6fa088bf2d823ab79bb7b2a8ce8fc3 |
ro_eng_expr.txt | Romanian < -> English “Expressions” Dictionary. | c085054bf9f51d17d8d7c6a610b496b5086647391114c07cfb28a264fcb4dee0 |
The first of these items (the general-purpose dictionary) is also available as a V-Genesis:
ro_eng_genesis.vpatch.asciilifeform.sig
The diacritics are conspicuous in their absence.
I suspect that the sledgehammer-wielder had removed them, for whatever reason of his own.
You can also flip any of these Romanian -> English dictionaries into an English -> Romanian one, e.g.:
paste -d'`' < (cut -f2 -d'`' ro_eng_ascii.txt) <(cut -f1 -d'`' ro_eng_ascii.txt) | sort -n > eng_ro_ascii.txt
In all three files, the backtick (`) is used to separate each line into the two respective languages; parsing and searching is quite trivial.
Edit: files now mirrored here.
]]>You will need:
Add the above vpatch and seal to your V-set, and press to ffa_ch9_exodus.vpatch.
You should end up with the same directory structure as previously.
Now compile ffacalc:
cd ffacalc gprbuild
But do not run it quite yet.
The “Egyptian” algorithm for multiplication (in its simplest Chapter 5 version, and likewise in the refined form given in Chapter 7) has a time cost that is proportional to the square of the operand bitness — i.e. it runs in O(N^2), or quadratic time. In Chapter 10, we will meet a substantially faster — subquadratic — algorithm for multiplication. However, for reasons which will become apparent to the reader, we would first like to obtain the fastest quadratic-time multiplier that we possibly can (without compromising any of the FFA design principles laid out in Chapter 1).
The ancient Egyptians came up with an easily-taught and reasonably efficient-method for multiplication. However, they ran it “on” human scribes, rather than automatic machinery. And it turns out that the latter, especially in the form of recently-made CPUs, is quite sensitive to the particular time-and-space “shape” of an algorithm. But we will expand on this point later.
Mathematician and astronomer Paul G. Comba (1926 — 2017) devised an elegant multiplication algorithm which superficially resembles the familiar “grade school” long-multiplication — but with the important difference that it does not require the generation of intermediate per-row temporary results. Instead, it generates each column — or digit — (or, in FFA parlance, Word) of the product, sequentially, from lowest to highest, using only a handful of words of memory for temporary storage. Let’s implement Comba’s algorithm:
FZ_Mul.ads:
with FZ_Type; use FZ_Type; package FZ_Mul is pragma Pure; -- Comba's multiplier. procedure FZ_Mul_Comba(X : in FZ; Y : in FZ; XY_Lo : out FZ; XY_Hi : out FZ); pragma Precondition(X'Length = Y'Length and XY_Lo'Length = XY_Hi'Length and XY_Lo'Length = ((X'Length + Y'Length) / 2)); end FZ_Mul;
FZ_Mul.adb:
with Words; use Words; with Word_Ops; use Word_Ops; with W_Mul; use W_Mul; package body FZ_Mul is -- Comba's multiplier. procedure FZ_Mul_Comba(X : in FZ; Y : in FZ; XY_Lo : out FZ; XY_Hi : out FZ) is -- Words in each multiplicand L : constant Word_Index := X'Length; -- Length of Product, i.e. double the length of either multiplicand LP : constant Word_Index := 2 * L; -- 3-word Accumulator A2, A1, A0 : Word := 0; -- Register holding Product; indexed from zero XY : FZ(0 .. LP - 1); -- Type for referring to a column of XY subtype ColXY is Word_Index range XY'Range; -- Compute the Nth (indexed from zero) column of the Product procedure Col(N : in ColXY; U : in ColXY; V : in ColXY) is -- The outputs of a Word multiplication Lo, Hi : Word; -- Carry for the Accumulator addition C : WBool; -- Sum for Accumulator addition Sum : Word; begin -- For lower half of XY, will go from 0 to N -- For upper half of XY, will go from N - L + 1 to L - 1 for j in U .. V loop -- Hi:Lo := j-th Word of X * (N - j)-th Word of Y Mul_Word(X(X'First + j), Y(Y'First - j + N), Lo, Hi); -- Now add Hi:Lo into the Accumulator: -- A0 += Lo; C := Carry Sum := A0 + Lo; C := W_Carry(A0, Lo, Sum); A0 := Sum; -- A1 += Hi + C; C := Carry Sum := A1 + Hi + C; C := W_Carry(A1, Hi, Sum); A1 := Sum; -- A2 += A2 + C A2 := A2 + C; end loop; -- We now have the Nth (indexed from zero) word of XY XY(N) := A0; -- Right-Shift the Accumulator by one Word width: A0 := A1; A1 := A2; A2 := 0; end Col; pragma Inline_Always(Col); begin -- Compute the lower half of the Product: for i in 0 .. L - 1 loop Col(i, 0, i); end loop; -- Compute the upper half (sans last Word) of the Product: for i in L .. LP - 2 loop Col(i, i - L + 1, L - 1); end loop; -- The very last Word of the Product: XY(XY'Last) := A0; -- Output the Product's lower and upper FZs: XY_Lo := XY(0 .. L - 1); XY_Hi := XY(L .. XY'Last); end FZ_Mul_Comba; pragma Inline_Always(FZ_Mul_Comba); end FZ_Mul;
I will omit the detailed analysis of why Comba’s Multiplication works, and leave the proof as an exercise for the reader. However, do observe that this routine demands a Mul_Word procedure. Which the reader will correctly guess, lives in the unit W_Mul:
W_Mul.ads:
with Words; use Words; package W_Mul is pragma Pure; -- The bitness of a Half-Word HalfBitness : constant Positive := Bitness / 2; subtype HalfWord is Word range 0 .. 2**HalfBitness; -- The number of bytes in a Half-Word HalfByteness : constant Positive := Byteness / 2; -- Multiply half-words X and Y, producing a Word-sized product function Mul_HalfWord(X : in HalfWord; Y : in HalfWord) return Word; -- Get the bottom half of a Word function BottomHW(W : in Word) return HalfWord; -- Get the top half of a Word function TopHW(W : in Word) return HalfWord; -- Carry out X*Y mult, return lower word XY_LW and upper word XY_HW. procedure Mul_Word(X : in Word; Y : in Word; XY_LW : out Word; XY_HW : out Word); end W_Mul;
W_Mul.adb (”Iron” Variant):
with W_Shifts; use W_Shifts; package body W_Mul is function Mul_HalfWord(X : in HalfWord; Y : in HalfWord) return Word is begin return X * Y; end Mul_HalfWord; pragma Inline_Always(Mul_HalfWord); -- Get the bottom half of a Word function BottomHW(W : in Word) return HalfWord is begin return W and (2**HalfBitness - 1); end BottomHW; pragma Inline_Always(BottomHW); -- Get the top half of a Word function TopHW(W : in Word) return HalfWord is begin return Shift_Right(W, HalfBitness); end TopHW; pragma Inline_Always(TopHW); -- Carry out X*Y mult, return lower word XY_LW and upper word XY_HW. procedure Mul_Word(X : in Word; Y : in Word; XY_LW : out Word; XY_HW : out Word) is -- Bottom half of multiplicand X XL : constant HalfWord := BottomHW(X); -- Top half of multiplicand X XH : constant HalfWord := TopHW(X); -- Bottom half of multiplicand Y YL : constant HalfWord := BottomHW(Y); -- Top half of multiplicand Y YH : constant HalfWord := TopHW(Y); -- XL * YL LL : constant Word := Mul_HalfWord(XL, YL); -- XL * YH LH : constant Word := Mul_HalfWord(XL, YH); -- XH * YL HL : constant Word := Mul_HalfWord(XH, YL); -- XH * YH HH : constant Word := Mul_HalfWord(XH, YH); -- Carry CL : constant Word := TopHW(TopHW(LL) + BottomHW(LH) + BottomHW(HL)); begin -- Get the bottom half of the Product: XY_LW := LL + Shift_Left(LH + HL, HalfBitness); -- Get the top half of the Product: XY_HW := HH + TopHW(HL) + TopHW(LH) + CL; end Mul_Word; pragma Inline_Always(Mul_Word); end W_Mul;
Mul_Word, it turns out, is merely a Word*Word -> double-Word “schoolbook” multiplier. And: it is abundantly obvious that it could be replaced with a single IMUL instruction on x86/x86-64. But we won’t do this in standard FFA — recall the design principles from Chapter 1! We will not be “marrying” any particular iron.
A very sharp reader may have already noticed the “problematic” element in the above Mul_Word routine. Namely, Mul_HalfWord as given here creates a danger of timing sidechannel leakage on certain machine architectures!
CPUs widely known to suffer from this “helpful” “optimization” include x86-compatible VIA, old Intels (386, 486), and most of the commonly-met PowerPC and ARM devices. Others archs, as well as obscure implementations of well-known archs, may also belong on this list. (Please write in if you happen to know of one such that I have not listed here.)
If FFA were to continue with the above variant (let’s call it iron) of Mul_HalfWord, every machine on which it is put to use would have to be first probed for non-constanttime iron multiplication: every newly-purchased CPU is arguably “guilty until proven innocent”. But it turns out that there is a simple and relatively-inexpensive cure for this headache:
W_Mul.adb (”Soft” Variant, non-unrolled):
-- Multiply half-words X and Y, producing a Word-sized product function Mul_HalfWord(X : in HalfWord; Y : in HalfWord) return Word is -- X-Slide XS : Word := X; -- Y-Slide YS : Word := Y; -- Gate Mask GM : Word; -- The Product XY : Word := 0; begin -- For each bit of the Y-Slide: for b in 1 .. HalfBitness loop -- Compute the gate mask GM := 0 - (YS and 1); -- Perform the gated addition XY := XY + (XS and GM); -- Crank the next Y-slide bit into position YS := Shift_Right(YS, 1); -- Advance the X-slide by 1 bit XS := Shift_Left(XS, 1); end loop; -- Return the Product return XY; end Mul_HalfWord; pragma Inline_Always(Mul_HalfWord);
The above Mul_HalfWord replaces the “iron” multiplication with a miniature version of the Chapter 5 “Egyptian” algorithm. Because there is no multiwordism and no carry calculation, this safe (bitwise logical operations and shifts by one bit are constant-time on all CPUs known to me) alternative to the “leaky” MUL, when implemented in the partially unrolled form as follows — comes at a cost of a less than 6 percent (see further below) increase in required CPU time(see comments):
W_Mul.adb (”Soft” Variant, unrolled):
-- Multiply half-words X and Y, producing a Word-sized product function Mul_HalfWord(X : in HalfWord; Y : in HalfWord) return Word is -- X-Slide XS : Word := X; -- Y-Slide YS : Word := Y; -- Gate Mask GM : Word; -- The Product XY : Word := 0; -- Performed for each bit of HalfWord's bitness: procedure Bit is begin -- Compute the gate mask GM := 0 - (YS and 1); -- Perform the gated addition XY := XY + (XS and GM); -- Crank the next Y-slide bit into position YS := Shift_Right(YS, 1); -- Advance the X-slide by 1 bit XS := Shift_Left(XS, 1); end Bit; pragma Inline_Always(Bit); begin -- For each bit of the Y-Slide (unrolled) : for b in 1 .. HalfByteness loop Bit; Bit; Bit; Bit; Bit; Bit; Bit; Bit; end loop; -- Return the Product return XY; end Mul_HalfWord; pragma Inline_Always(Mul_HalfWord);
And now, on the same machine as in previous Chapters, timed and plotted logarithmically:
Or, for those who prefer the raw numbers,
FFA Bitness | Ch.6 Egyptian ‘X’ (sec) | Ch.7 Egyptian ‘X’ (sec) | Ch.9 “Iron” Comba ‘X’ (sec) | Ch.9 “Soft” Comba ‘X’ (sec) |
---|---|---|---|---|
256 | 0.072 | 0.040 | 0.028 | 0.029 |
512 | 0.505 | 0.240 | 0.163 | 0.169 |
1024 | 3.685 | 1.672 | 1.125 | 1.177 |
2048 | 27.975 | 12.024 | 7.975 | 8.387 |
4096 | 217.966 | 90.439 | 59.356 | 62.725 |
8192 | 1720.127 | 699.979 | 457.103 | 483.993 |
Despite this small increase in runtime cost, standard FFA will from this point on use only the “soft” variant of Mul_HalfWord, so as to maintain the guarantee of constant-time operation on all known CPU architectures. Individual readers, can, of course, at their own risk — replace the entirety of Mul_Word — or even FZ_Mul_Comba — with an inline-assembly routine known to work safely on a particular CPU. But this type of optimization falls outside the scope of FFA as per the design principles introduced in Chapter 1.
~To be continued!~
Answer to Homework Problem 1 from Chapter 8:
Let’s suppose that you wish to test multiplication by generating a Python program via FFACalc. Your tape could look like this:
tape.txt:
??``*[print 0x]##[ == 0x]#[ * 0x]#
Which would produce a valid Python program that can be fed straight into a Python interpreter, e.g. :
./bin/ffa_calc 1048576 32 /dev/urandom < tape.txt | tr -d '\n' > multest.py
… will pseudorandomly generate two numbers, each being one megabit in size, and produce a statement regarding their (two-megabit-wide, at most) product, that can be verified by the Python interpreter, in turn producing an output after around half a minute on the test box used thus far.
Now let’s feed this output into Python:
$ python < multest.py
Prints:
True
... after less than one fifth of a second, on the same machine.
The bulk of the remaining material in this series will concern methods of bridging the speed gap between our routines and traditional "bignum" mechanisms (such as those used in the Python interpreter) to the extent possible without compromising any of the design principles of FFA.
Homework:
1. Prove that Comba's Algorithm works. And in particular, that the three-word Accumulator space given, will always suffice.
2. Given that both "Egyptian" and Comba's multiplication run in O(N^2), why does Comba's method win speedwise? Does it continue to win for all possible operand bitnesses, or only over a particular range?
3. Write an inline assembly variant of FZ_Mul_Comba for your CPU architecture. Demonstrate how you determined that the resulting program, when operating on the exponentiation test tape from Chapter 6, still runs in constant time (i.e. its CPU time cost is unaffected by any aspect of the input other than the choice of FFA bitness) on your particular machine.
~To be continued!~
]]>You will need:
Add the above vpatch and seal to your V-set, and press to ffa_ch8_randomism.vpatch.
You should end up with the same directory structure as previously.
Now compile ffacalc:
cd ffacalc gprbuild
But do not run it quite yet.
The “mail bag” from Chapter 7… is empty.
Well, not entirely empty: reader Apeloyee observed that FZ_Mod_Exp would produce the wrong answer in a hypothetical scenario where its output Result is permitted to overwrite the location containing the Modulus input operand. A revised variant is therefore as follows:
FZ_ModEx.adb:
-- Modular Exponent: Result := Base^Exponent mod Modulus procedure FZ_Mod_Exp(Base : in FZ; Exponent : in FZ; Modulus : in FZ; Result : out FZ) is -- Working register for the squaring; initially is copy of Base B : FZ(Base'Range) := Base; -- Copy of Exponent, for cycling through its bits E : FZ(Exponent'Range) := Exponent; -- Register for the Mux operation T : FZ(Result'Range); -- Buffer register for the Result R : FZ(Result'Range); begin -- Result := 1 WBool_To_FZ(1, R); -- For each bit of R width: for i in 1 .. FZ_Bitness(R) loop -- T := Result * B mod Modulus FZ_Mod_Mul(X => R, Y => B, Modulus => Modulus, Product => T); -- Sel is the current low bit of E; -- When Sel=0 -> Result := Result; -- When Sel=1 -> Result := T FZ_Mux(X => R, Y => T, Result => R, Sel => FZ_OddP(E)); -- Advance to the next bit of E FZ_ShiftRight(E, E, 1); -- B := B*B mod Modulus FZ_Mod_Mul(X => B, Y => B, Modulus => Modulus, Product => B); end loop; -- Output the Result: Result := R; end FZ_Mod_Exp; pragma Inline_Always(FZ_Mod_Exp);
The ultimate aim of FFA is not merely correctness, but obvious correctness. In that vein, any reasonably inexpensive (from complexity and speed POVs) “cushion” which increases rodent resistance — i.e. cuts down the phase space of possible abuses — is of interest.
As always, I would like to thank my eagle-eyed readers for their sweat. And anyone who observes a similar thing, I encourage to write in, and receive credit in the next Chapter.
In this Chapter, we will be taking a break from the intricacies of Modular Exponentiation. And in fact from FFA per se. (We will return to the subject in Chapter 9.) Instead, we will be introducing an important knob previously missing from FFACalc: random number generation :
FFA_RNG.ads:
with Ada.Sequential_IO; with Words; use Words; with FZ_Type; use FZ_Type; package FFA_RNG is Default_RNG_Path : constant String := "/dev/random"; -- For reading from RNGs: package Word_IO is new Ada.Sequential_IO(Element_Type => Word); -- Represents an RNG Device: type RNG_Device is record F : Word_IO.File_Type; end record; -- Prepare an RNG for use; at given path, or will use default procedure Init_RNG(RNG : out RNG_Device; RNG_Unix_Path : in String := Default_RNG_Path); -- Fill a FZ from RNG procedure FZ_Random(RNG : in RNG_Device; N : out FZ); end FFA_RNG;
Do you have an auditable hardware True Random Number Generator? If so, you will be able to use it with FFACalc. (Note: if your RNG is on, e.g., a serial port, the port must be already initialized prior to using FFACalc.)
Alternatively, it is possible to specify an ordinary file as an “RNG”, for deterministic testing.
The mechanism itself is not complicated:
FFA_RNG.adb:
with OS; use OS; with FZ_Type; use FZ_Type; package body FFA_RNG is -- Prepare an RNG for use; at given path, or will use default procedure Init_RNG(RNG : out RNG_Device; RNG_Unix_Path : in String := Default_RNG_Path) is begin begin -- Open the RNG at the offered path: Word_IO.Open(File => RNG.F, Mode => Word_IO.In_File, Name => RNG_Unix_Path); exception when others => Eggog("Could not open RNG at : " & RNG_Unix_Path & "!"); end; end Init_RNG; -- Fill a FZ from RNG procedure FZ_Random(RNG : in RNG_Device; N : out FZ) is begin begin -- Fill the destination FZ from this RNG: for i in N'Range loop Word_IO.Read(RNG.F, N(i)); end loop; exception when others => Eggog("Could not read from RNG!"); end; end FZ_Random; end FFA_RNG;
Observe that we make no attempt to “massage” the RNG device’s output in any way whatsoever. If your RNG is correctly designed, it is unnecessary; if incorrectly — futile.
The RNG initialization will look like this:
ffa_calc.adb:
-- For RNG: with FFA_RNG; use FFA_RNG; procedure FFA_Calc is Width : Positive; -- Desired FFA Width Height : Positive; -- Desired Height of Stack RNG : RNG_Device; -- The active RNG device. begin if Arg_Count < 3 or Arg_Count > 4 then Eggog("Usage: ./ffa_calc WIDTH HEIGHT [/dev/rng]"); end if; declare Arg1 : CmdLineArg; Arg2 : CmdLineArg; begin -- Get commandline args: Get_Argument(1, Arg1); -- First arg Get_Argument(2, Arg2); -- Second arg if Arg_Count = 4 then -- RNG was specified: declare Arg3 : CmdLineArg; begin Get_Argument(3, Arg3); -- Third arg (optional) -- Ada.Sequential_IO chokes on paths with trailing whitespace! -- So we have to give it a trimmed path. But we can't use -- Ada.Strings.Fixed.Trim, because it suffers from -- SecondaryStackism-syphilis. Instead we are stuck doing this: Init_RNG(RNG, Arg3(Arg3'First .. Len_Arg(3))); end; else -- RNG was NOT specified: Init_RNG(RNG); -- Use the machine default then end if; -- .......snipped.......
To satisfy the Unix insanity of Ada.Sequential_IO, it was necessary to make the Len_Arg function accessible.
FFACalc is invoked quite like before; but now it can take an optional third argument, consisting of the RNG device path. If this optional argument is not given, the default RNG (presently /dev/random. And pointedly not /dev/urandom, which only even exists in Linux because rats gotta rat around.)
And now for the new op itself:
ffa_calc.adb:
-- Push a FZ of RNGolade onto the stack when '?' => Push; FZ_Clear(Stack(SP)); FZ_Random(RNG, Stack(SP));
Now let’s try it:
echo "?#" | ./bin/ffa_calc 4096 32
And one possible result:
A331FBE217BC77C7C6AD90C1BABB18C9C95B5A420650AE30876102ECF3FB47667BC5C3 38857C058ED15554724F71321D950A41CB95E86B7894B73BC61BDBAE8B9E0F24EF400D B71742B96A2D46F64D24C57304423D13F5C3EBE7DC62D1A72A33B424F8C3920905BCD6 D1AAB6868D2350D82F73A4583C695B9D46D1D4CFCC46E8487AFBE55DAA53865084D261 B32880A337AFD5A567E1CBA475B3215F18A625E0343031F153F46070F15A159DCBE628 4FC6E624DE42AE31CE086502419777DDC9133BC07D21930D3B87DE124F8E5B282629C3 95DC240033C8E0EF1790F95FBC4C21E8346ECF335B9F4D0DC9A26D5B05218A03C742DA 2D03667A16FB01BBCC2B4CF29169D0CE02E750265BBA4F568C5EE35103440A54B9D6E6 098490DF7E587BD220D01EA396524DAC92812E417BD79248AD3D4480E3AD9CA76DF2B2 405D79F689830C4101D52AEC26311324D529661F286E90DCBB973B371114102971BFF6 FA2EA2460D6733D2C893AFAC0A81FFA0B39C3F67DE04CF9E56AE82B9C659BC8A8470DE EFE5908C16CC1168F1F0AD00DBF6AEFD08B402B9E742817475DDFF273E99641360B62A 3CCB5CC9343BC936CCECAF4FA926B4BF69D96D286C8308448A17B0FF4342AECD0F84CC 5CD43D7DF5D8A343FABEE3CDD4126B3E7DB94C35ECD111824CA288051C269CBEDBAB2F 368627DDD2D4E8CAFD1E3E00001F88FF4E9E5E5DD94C
Homework:
1. Write an automatic generator of arithmetical statements, e.g. multiplications, which are valid programs in your favourite scripting language, and will print “True” supposing that FFACalc had performed the computation correctly, and “False” otherwise.
2. Observe what you get when invoking the RNG when a file is used as the entropy source. Do the contents of the resulting FZ depend on the machine “endianness” ? What happens if there are insufficiently many bytes in the file to supply the demand of your FFACalc tape?
~To be continued!~
]]>
The OEM crate contains a roll of tape, but — oddly enough — no power supply. Evidently these were expected to be used inside industrial equipment which would supply the necessary 6 volts. (Seiko claims “6.5-7.0V, 2A” but a 6V dongle worked. Keep in mind that the polarity is reversed from the typical: the positive terminal is the outer surface of the cylinder.)
There are a number of uses for a small tape printer, but I will mention two rodenticidal ones in particular:
This is self-explanatory. The enemy cannot edit or delete the log of a machine which sends its log in real time to a printer. (Granted, he can rob your house, or burn it down, but this falls outside of the scope of “computer” security, and is a problem all valuable physical objects, including your own skin, have in common.)
Observe that the enemy does not need to know that a log printer is in use. Let him wonder. But naturally you will want to keep the output brief enough to read and compare with the contents of electronic logs, with some regularity.
AKA Vernam cipher. This should need no explanation; OTP is part of an ordinarily educated computer user’s essential toolkit.
You can plug it straight into your TRNG, if you know how to configure the latter to output a serial stream at 19200 (or below) baud. The Seiko can be configured to print hex dumps natively. (RTFM)
Observe that you can print duplicate “pads” without ever storing a pad on a computer for any length of time, by simply connecting two or more tape printers using a fan-out cable (AFAIK no such thing is sold anywhere, but you can trivially make such a cable yourself in a couple of minutes) to your RNG.
The Seiko is particularly handy in that it does not use USB (works fine, however, with a USB-to-RS232 dongle, if you must) and does not require drivers or specific system support of any kind whatsoever : once the baud rate is configured, it will eat standard RS-232 without complaint until it runs out of tape. (The printer also comes with a “nostalgic” Centronics “parallel port” connector, but I have not tried to use this.)
This printer can be configured to emit 80 columns of text (Alternatively, 40, with larger characters.) AFAIK it is the only thermal printer on the market which can print 80 columns. If anyone knows otherwise, I would like to hear about it.
Now, for the headaches. The Seiko 414 comes with factory defaults in EEPROM that make it virtually unusable with recent iron: it wants input from the Centronics port. And to reconfigure it, you must use “software DIP switches” — Seiko’s way of saving a dime in part cost by making you waste a ~metre of tape:
Fortunately, this only has to be done once.
If one believes the manual, the Seiko 414 also knows how to print graphics (monochrome, 640 pixels wide, times… well, “infinity”-high…) and can therefore print “QR” codes (for, e.g., Bitcoinism) and similar applications. But I have not tried this.
And this is where I note that I have no particular love for Seiko, and will ask any among my readers who might know of a similar-but-cheaper printer; or of one that can print on a wider thermal tape; or even of an entirely different but somehow superior tool for the same job — to please write in.
]]>You will need:
Add the above vpatch and seal to your V-set, and press to ffa_ch7_turbo_egyptians.vpatch.
You should end up with the same directory structure as previously.
Now compile ffacalc:
cd ffacalc gprbuild
But do not run it quite yet.
No one submitted a solution to the “trivial optimizations” problem posed in Chapter 5, but reader Diana Coman did show symptoms of knowing the magic ingredient which happens to be the subject of this Chapter.
The attentive reader may already have begun to suspect that the “Egyptian” multiplication algorithm (FZ_Mul_Egyptian from Ch. 5) is not the final word on the subject of integer multiplication in FFA. But before we can explore subquadratic multiplication (and — much later — clever methods for speeding up modular reduction) it is necessary to set up the arena for a “fair fight”, by making reasonably well-optimized variants of both of the quadratic-runtime “Egyptian” algos, FZ_Mul_Egyptian and FZ_Mod.
Previously, the primary design objectives of the FFA algorithms were correctness and constant-spacetime operation; followed, secondarily, by simplicity (in the fits-in-head sense.) In this chapter, we add another objective: speed.
At no point will correctness or constant-spacetime (branch-free and offsetting-by-secrets-free) operation be sacrificed under any pretext whatsoever. However, in order to obtain, e.g., an RSAtron that is practically usable on commonplace machines, it will be necessary to sacrifice a certain amount of mechanical simplicity. This Chapter, along with the bulk of the remaining material, will be devoted to this painful — but not uninteresting! — task.
Let’s begin with a new, “turbo” FZ_Mul_Egyptian:
FZ_Mul.adb:
-- 'Egyptological' multiplier. XY_Lo and XY_Hi hold result of X*Y. procedure FZ_Mul_Egyptian(X : in FZ; Y : in FZ; XY_Lo : out FZ; XY_Hi : out FZ) is L : constant Indices := X'Length; -- Register holding running product XY : FZ(1 .. X'Length + Y'Length); -- X-Slide XS : FZ(1 .. X'Length + Y'Length); begin -- Product register begins empty FZ_Clear(XY); -- X-Slide initially equals X: XS(1 .. X'Length) := X; XS(X'Length + 1 .. XS'Last) := (others => 0); -- For each word of Y: for i in Y'Range loop declare -- Current word of Y W : Word := Y(i); -- Current cut of XY and XS. Stay ahead by a word to handle carry. Cut : constant Indices := L + i; XYc : FZ renames XY(1 .. Cut); XSc : FZ renames XS(1 .. Cut); begin for b in 1 .. Bitness loop -- If current Y bit is 1, X-Slide Cut is added into XY Cut FZ_Add_Gated(X => XYc, Y => XSc, Sum => XYc, Gate => W and 1); -- Crank the next bit of Y into the bottom position of W W := Shift_Right(W, 1); -- X-Slide := X-Slide * 2 FZ_ShiftLeft(XSc, XSc, 1); end loop; end; end loop; -- Write out the Product's lower and upper FZs: XY_Lo := XY(1 .. XY_Lo'Length); XY_Hi := XY(XY_Lo'Length + 1 .. XY'Last); end FZ_Mul_Egyptian; pragma Inline_Always(FZ_Mul_Egyptian);
Observe that the Y-Slide from Chapter 5’s multiplier is gone. Instead, we now walk through the bits of the multiplicand Y without having to shift the entire thing FZ_Bitness(Y) times: each word of Y is loaded into W, starting with the first; and the “egyptology” is then performed Bitness times, once for each bit of W.
The other optimization is the introduction of the cut concept. Observe that an addition of two FZ integers of identical bitness, can produce a result with an intrinsic bitness that is larger than that of the greater parent’s by a maximum of one bit. The consequence of this is that the Chapter 5 FZ_Mul_Egyptian wastes roughly half of its CPU time shifting and adding words that never, at the particular times they are touched, depart from zero.
Thus, the first iteration of the loop is carried out on a cut of length X’Length + 1; the second, X’Length + 2; and forth; the last iteration is the only one which is performed on the entire XY product-accumulator and the entire XS X-Slide. We “run ahead” of the segment of XY which has been touched, by one word, so as to have a place to which to ripple the carry.
At the expense of a certain amount of “obviousness”, we win a 2x gain in multiplication speed.
And now we will apply exactly the same optimizations to the modulus routine:
FZ_Divis.adb:
-- Modulus. Permits the asymmetric Dividend and Divisor in FZ_Mod_Exp. procedure FZ_Mod(Dividend : in FZ; Divisor : in FZ; Remainder : out FZ) is -- Length of Divisor and Remainder; < = Dividend'Length L : constant Indices := Divisor'Length; -- Remainder register, starts as zero R : FZ(1 .. L) := (others => 0); -- Indices into the words of Dividend subtype Dividend_Index is Word_Index range Dividend'Range; -- Permissible 'cuts' for the Slice operation subtype Divisor_Cuts is Word_Index range 2 .. Divisor'Length; -- Performs Restoring Division on a given segment of Dividend:Divisor procedure Slice(Index : Dividend_Index; Cut : Divisor_Cuts) is begin declare -- Borrow, from comparator C : WBool; -- Left-Shift Overflow LsO : WBool; -- Current cut of Remainder register Rs : FZ renames R(1 .. Cut); -- Current cut of Divisor Ds : FZ renames Divisor(1 .. Cut); -- Current word of Dividend, starting from the highest W : Word := Dividend(Dividend'Last + 1 - Index); begin -- For each bit in the current Dividend word: for b in 1 .. Bitness loop -- Send top bit of current Dividend word to the bottom of W W := Rotate_Left(W, 1); -- Advance Rs, shifting in the current Dividend bit FZ_ShiftLeft_O_I(N => Rs, ShiftedN => Rs, Count => 1, OF_In => W and 1, Overflow => LsO); -- Subtract Divisor-Cut from R-Cut; Underflow goes into C FZ_Sub(X => Rs, Y => Ds, Difference => Rs, Underflow => C); -- If C=1, subtraction underflowed, and we must undo it: FZ_Add_Gated(X => Rs, Y => Ds, Sum => Rs, Gate => C and W_Not(LsO)); end loop; end; end Slice; begin -- Process bottom half of dividend: for i in 1 .. L - 1 loop Slice(i, i + 1); -- stay ahead by a word to handle carry end loop; -- Process top half of dividend for i in L .. Dividend'Length loop Slice(i, L); end loop; -- Output the Remainder. Remainder := R; end FZ_Mod; pragma Inline_Always(FZ_Mod);
In addition to the two optimizations analogous to those we had applied to FZ_Mul_Egyptian, we also get the ability to apply FZ_Mod to a Dividend which exceeds the bitness of the Divisor. This will allow us to abolish the zero-padding of Modulus in Chapter 6’s FZ_Mod_Mul.
Correspondingly, we carefully alter the preconditions of FZ_Mod:
FZ_Divis.ads:
-- Modulus. Permits the asymmetric Dividend and Divisor in FZ_Mod_Exp. procedure FZ_Mod(Dividend : in FZ; Divisor : in FZ; Remainder : out FZ); pragma Precondition(Dividend'Length >= Divisor'Length and Divisor'Length = Remainder'Length);
The only place in FFA where the asymmetric invocation of FZ_Mod will be permitted, is FZ_Mod_Mul, the modular multiplier procedure, which will now look like this:
FZ_ModEx.adb:
-- Modular Multiply: Product := X*Y mod Modulus procedure FZ_Mod_Mul(X : in FZ; Y : in FZ; Modulus : in FZ; Product : out FZ) is -- The wordness of all three operands is equal: L : constant Indices := X'Length; -- Double-width register for multiplication and modulus operations XY : FZ(1 .. L * 2); -- To refer to the lower and upper halves of the working register: XY_Lo : FZ renames XY(1 .. L); XY_Hi : FZ renames XY(L + 1 .. XY'Last); begin -- XY_Lo:XY_Hi := X * Y FZ_Mul_Egyptian(X, Y, XY_Lo, XY_Hi); -- Product := XY mod M FZ_Mod(XY, Modulus, Product); end FZ_Mod_Mul; pragma Inline_Always(FZ_Mod_Mul);
And now, on the same machine as in Ch.6 , timed and plotted logarithmically:
Or, for those who prefer the raw numbers,
FFA Bitness | Ch.6 ‘X’ (sec) | Ch.7 ‘X’ (sec) |
---|---|---|
256 | 0.072 | 0.040 |
512 | 0.505 | 0.240 |
1024 | 3.685 | 1.672 |
2048 | 27.975 | 12.024 |
4096 | 217.966 | 90.439 |
8192 | 1720.127 | 699.979 |
And finally, let’s turn this into a FFACalc tape, and verify the signature of the Chapter 7 material, using itself, just like we did in the previous chapter:
( the seal of ffa_ch7_turbo_egyptians.vpatch : ) . 88A0984C5438EFD811BA8505362EEA572B4057C52662DA8854DF66613C40332EB8F4E3 80932DBF9106F3536643BD2FF69DA90EA0402B4FDC21A9245C3EBCB10C86C1847C9CC9 9C2ECE4245735013D0E59EE905479445B625D76EECC1BDEEBF489DC512715324995D13 EAB27D286EFA86BF8D69C3D93CB65E2F4BA9B805AA392B29D9F6635E9C7D81C0CC4EAA A2DA0F618D83F7DAB4854F24B3A900BFDF35E58E87745A6A8D11868C860334DD80A4A4 D0DCAC4E5EC34E9C16BBD174659008145F1E09556009557277A7DFE74FAA3018995746 8CB25E0B10B5AE8D114E4E89E4B940B1A21A5BB7EFDD75AB59A1D82AC08DBCC9A7B10F 5B708B2BC2453AB9BC6C80 ( my public rsa exponent : ) . 10001 ( my public rsa modulus : ) . CDD49A674BAF76D3B73E25BC6DF66EF3ABEDDCA461D3CCB6416793E3437C7806562694 73C2212D5FD5EED17AA067FEC001D8E76EC901EDEDF960304F891BD3CAD7F9A335D1A2 EC37EABEFF3FBE6D3C726DC68E599EBFE5456EF19813398CD7D548D746A30AA47D4293 968BFBAFCBF65A90DFFC87816FEE2A01E1DC699F4DDABB84965514C0D909D54FDA7062 A2037B50B771C153D5429BA4BA335EAB840F9551E9CD9DF8BB4A6DC3ED1318FF3969F7 B99D9FB90CAB968813F8AD4F9A069C9639A74D70A659C69C29692567CE863B88E191CC 9535B91B417D0AF14BE09C78B53AF9C5F494BCF2C60349FFA93C81E817AC682F0055A6 07BB56D6A281C1A04CEFE1 ( now modularly-exponentiate it : ) X ( ... and print the output .) #
… and play the tape! :
$ time cat ch7_rsa_tape.txt | ./bin/ffa_calc 2048 32
… and unsurprisingly:
0001FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF003051 300D060960864801650304020305000440FCFB42B5342D226261DF36158BB4EA5482F4 B45B8D4259F3873D8B77CD7572D55FBADCBCC1267BECE169F330154F7A9156C60A07FA 8C902BEE43FB27FC33B248 real 0m12.018s user 0m12.004s sys 0m0.002s
Guess what? The correct answer. (Use the patched GPG from Ch. 6, and compare.)
Homework:
1. Prove that the new FZ_Mod produces, for all valid inputs, the equivalent answer to the one in the previous chapter.
2. Formulate a hypothesis regarding why the observed speedup of ‘X’ was greater than two-fold.
~To be continued!~
]]>You will need:
Add the above vpatch and seal to your V-set, and press to ffa_ch6_simplest_rsa.vpatch.
You should end up with the same directory structure as previously.
Now compile ffacalc:
cd ffacalc gprbuild
But do not run it quite yet.
The “mail bag” from Chapter 5… is empty. Congratulations, however, to the successful solvers of the Chapter 4 puzzle!
Now, errata! In the FFACalc extensions of Chapter 5, we have:
ffa_calc.adb:
-- Multiply, give bottom and top halves when '*' => Want(2); MustNotZero(Stack(SP)); -- Ch5: slow and simple 'Egyptological' method: FZ_Mul_Egyptian(X => Stack(SP - 1), Y => Stack(SP), XY_Lo => Stack(SP - 1), XY_Hi => Stack(SP));
After removal of the obvious bug, this becomes:
ffa_calc.adb:
-- Multiply, give bottom and top halves when '*' => Want(2); -- Ch5: slow and simple 'Egyptological' method: FZ_Mul_Egyptian(X => Stack(SP - 1), Y => Stack(SP), XY_Lo => Stack(SP - 1), XY_Hi => Stack(SP));
In this Chapter, we will add a modular multiplication routine to FFA. This in turn will immediately let us write a modular exponentiation routine, which is the fundamental operation of the RSA system of asymmetric cryptography.
Concretely:
FZ_ModEx.ads:
with FZ_Type; use FZ_Type; package FZ_ModEx is pragma Pure; -- Modular Multiply: Product := X*Y mod Modulus procedure FZ_Mod_Mul(X : in FZ; Y : in FZ; Modulus : in FZ; Product : out FZ); pragma Precondition(X'Length = Y'Length and Modulus'Length = X'Length and Product'Length = Modulus'Length); -- Modular Exponent: Result := Base^Exponent mod Modulus procedure FZ_Mod_Exp(Base : in FZ; Exponent : in FZ; Modulus : in FZ; Result : out FZ); pragma Precondition(Base'Length = Exponent'Length and Base'Length = Result'Length and Base'Length = Modulus'Length); end FZ_ModEx;
Because of the extremely simplistic and wholly-unoptimized way in which we implemented Multiplication and Division in Chapter 5, this will be a “geological-time” RSAtron, rather than a battlefield-ready one (though I can conceive of a few speed-insensitive applications where even an algorithm such as this one could suffice.) And at the end of this Chapter, we will answer the question of “exactly how geological” ?
The bulk of the material in the remaining Chapters in this series will concern itself with demonstrably-correct and safe (i.e. constant-spacetimeness-preserving) methods for speeding up Multiplication and Division — and consequently Modular Exponentiation — so as to transform this “geological” RSA into a generally-applicable one.
And so, without further delay, let’s define the new modular multiplication and modular exponentiation operators in the FFACalc system:
ffa_calc.adb:
-- Modular Multiplication when 'M' => Want(3); MustNotZero(Stack(SP)); FZ_Mod_Mul(X => Stack(SP - 2), Y => Stack(SP - 1), Modulus => Stack(SP), Product => Stack(SP - 2)); Drop; Drop; -- Modular Exponentiation when 'X' => Want(3); MustNotZero(Stack(SP)); FZ_Mod_Exp(Base => Stack(SP - 2), Exponent => Stack(SP - 1), Modulus => Stack(SP), Result => Stack(SP - 2)); Drop; Drop;
And now let’s implement the new unit, FZ_ModEx:
FZ_ModEx.adb:
with FZ_Basic; use FZ_Basic; with FZ_Pred; use FZ_Pred; with FZ_Shift; use FZ_Shift; with FZ_Mul; use FZ_Mul; with FZ_Divis; use FZ_Divis; package body FZ_ModEx is
We will begin with the modular multiplication algorithm. It is deliberately made in the most straightforward way I could think of, making use only of “schoolboy” methods:
-- Modular Multiply: Product := X*Y mod Modulus procedure FZ_Mod_Mul(X : in FZ; Y : in FZ; Modulus : in FZ; Product : out FZ) is -- The wordness of all three operands is equal: L : constant Indices := X'Length; -- Double-width register for multiplication and modulus operations XY : FZ(1 .. L * 2); -- To refer to the lower and upper halves of the working register: XY_Lo : FZ renames XY(1 .. L); XY_Hi : FZ renames XY(L + 1 .. XY'Last); -- A zero-padded double-wide copy of Modulus, to satisfy Ch.5's FZ_Mod M : FZ(XY'Range); begin -- Place the Modulus in a double-width M, as FZ_Mod currently demands M(Modulus'Range) := Modulus; M(L + 1 .. M'Last) := (others => 0); -- XY_Lo:XY_Hi := X * Y FZ_Mul_Egyptian(X, Y, XY_Lo, XY_Hi); -- XY := XY mod M FZ_Mod(XY, M, XY); -- The bottom half of XY is our modular product; top half is always 0 Product := XY_Lo; end FZ_Mod_Mul; pragma Inline_Always(FZ_Mod_Mul);
This FZ_Mod_Mul multiplies X and Y using the barbaric FZ_Mul_Egyptian algorithm from Chapter 5; then subsequently divides XY — their product — by Modulus, and returns the remainder — using the FZ_Mod from Chapter 5.
Some of the wastefulness of the above FZ_Mod_Mul will immediately stand out to the reader. In subsequent Chapters, we will methodically strip it away. For now, we are concerned strictly with obtaining a minimal and correct baseline against which to work.
Now we can define a modular exponentiation algorithm:
-- Modular Exponent: Result := Base^Exponent mod Modulus procedure FZ_Mod_Exp(Base : in FZ; Exponent : in FZ; Modulus : in FZ; Result : out FZ) is -- Working register for the squaring B : FZ(Base'Range) := Base; -- Register for cycling through the bits of E E : FZ(Exponent'Range) := Exponent; -- Register for the Mux operation T : FZ(Result'Range); begin -- Result := 1 WBool_To_FZ(1, Result); -- For each bit of Result width: for i in 1 .. FZ_Bitness(Result) loop -- T := Result * B mod Modulus FZ_Mod_Mul(X => Result, Y => B, Modulus => Modulus, Product => T); -- Sel is the current low bit of E; -- When Sel=0 -> Result := Result; -- When Sel=1 -> Result := T FZ_Mux(X => Result, Y => T, Result => Result, Sel => FZ_OddP(E)); -- Advance to the next bit of E FZ_ShiftRight(E, E, 1); -- B := B*B mod Modulus FZ_Mod_Mul(X => B, Y => B, Modulus => Modulus, Product => B); end loop; end FZ_Mod_Exp; pragma Inline_Always(FZ_Mod_Exp);
This is the traditional right-to-left binary method, as seen in e.g. Vol. 2 of D. E. Knuth’s AOCP. The only substantial departure from the schoolbook algorithm is the elimination of branching on secret bits. The multiplication step is always carried out, and FZ_Mux is used to either keep, or throw away the answer.
Specifically: Result is initially equal to 1. B initially equals the Base; E — the Exponent. Now we iterate for a fixed number of cycles, equal to the bitness of Result (which is also equal to the bitness of all three operands.)
In each iteration:
… We multiply Result with the current B, modulo the Modulus, and assign the result to temporary register T.
… And now, if the current bottom bit (determined via FZ_OddP) of E is a zero, we end up discarding (in constant-time) T — the output of the modular multiplication in the previous step, via FZ_Mux (See Chapter 1). But if this bottom bit is a one, then T is not discarded, and becomes the new value of Result.
… Now E, where we had placed the Exponent at the beginning, is cranked right-wise by one bit (i.e. divided by 2). The next highest bit will now be found at the bottom position, for use in the next iteration.
… Finally, the register B is squared modulo the Modulus.
After the final iteration, Result will contain Base raised to the Exponent power modulo the Modulus.
And that’s all for the new unit:
end FZ_ModEx;
Now let’s remember that, as in every other FFA operation, the CPU time consumed by our new modular multiplication and modular exponentiation algorithms is independent of the Hamming weights of the operands. Concretely, e.g., 1 to the 1st power mod 1:
time echo -E ".~.~.~.1.1.1X" | ./bin/ffa_calc 1024 32
… will take — and for any given bitness — the same, within the margin of error of your Unix time utility, amount of time to compute, as maxint to the maxint-th power modulo maxint:
time echo -E ".1.1.1.~.~.~X" | ./bin/ffa_calc 1024 32
Verify this to your satisfaction.
And now, for some empirical data. On a certain 3 GHz Opteron, timed and plotted logarithmically:
Perhaps not quite as “geological” as the reader expected! We can actually fire this primitive RSAtron in anger! Let’s verify the RSA seal of ffa_ch6_simplest_rsa.vpatch, the Chapter 6 code itself, using itself:
First, take the seal:
$ pgpdump -i ffa_ch6_simplest_rsa.vpatch.asciilifeform.sig
… and dump it to a human-readable form:
Old: Signature Packet(tag 2)(284 bytes) Ver 4 - new Sig type - Signature of a binary document(0x00). Pub alg - RSA Encrypt or Sign(pub 1) Hash alg - SHA512(hash 10) Hashed Sub: signature creation time(sub 2)(4 bytes) Time - Sat Jan 6 11:47:07 EST 2018 Sub: issuer key ID(sub 16)(8 bytes) Key ID - 0xB98228A001ABFFC7 Hash left 2 bytes - 62 55 RSA m^d mod n(2047 bits) - 5b 6a 8a 0a cf 4f 4d b3 f8 2e ac 2d 20 25 5e 4d f3 e4 b7 c7 99 60 32 10 76 6f 26 ef 87 c8 98 0e 73 75 79 ec 08 e6 50 5a 51 d1 96 54 c2 6d 80 6b af 1b 62 f9 c0 32 e0 b1 3d 02 af 99 f7 31 3b fc fd 68 da 46 83 6e ca 52 9d 73 60 94 85 50 f9 82 c6 47 6c 05 4a 97 fd 01 63 5a b4 4b fb db e2 a9 0b e0 6f 79 84 ac 85 34 c3 86 13 74 7f 34 0c 18 17 6e 6d 5f 0c 10 24 6a 2f ce 3a 66 8e ac b6 16 5c 20 52 49 7c a2 ee 48 3f 4f d8 d0 6a 99 11 bd 97 e9 b6 72 05 21 d8 72 bd 08 ff 8d a1 1a 1b 8d b1 47 f2 52 e4 e6 9a e6 20 1d 3b 37 4b 17 1d f4 45 ef 2b f5 09 d4 68 fd 57 ce b5 84 03 49 b1 4c 6e 2a aa 19 4d 95 31 d2 38 b8 5b 8f 0d d3 52 d1 e5 96 71 53 9b 42 98 49 e5 d9 65 e4 38 bf 9e ff c3 38 df 9a ad f3 04 c4 13 0d 5a 05 e0 06 ed 85 5f 37 a0 62 42 28 09 7e f9 2f 6e 78 ca e0 cb 97 -> PKCS-1
Now, obtain my public modulus:
$ pgpdump -i ~/.wot/asciilifeform.asc
Old: Public Key Packet(tag 6)(269 bytes) Ver 4 - new Public key creation time - Thu Dec 20 12:49:24 EST 2012 Pub alg - RSA Encrypt or Sign(pub 1) RSA n(2048 bits) - cd d4 9a 67 4b af 76 d3 b7 3e 25 bc 6d f6 6e f3 ab ed dc a4 61 d3 cc b6 41 67 93 e3 43 7c 78 06 56 26 94 73 c2 21 2d 5f d5 ee d1 7a a0 67 fe c0 01 d8 e7 6e c9 01 ed ed f9 60 30 4f 89 1b d3 ca d7 f9 a3 35 d1 a2 ec 37 ea be ff 3f be 6d 3c 72 6d c6 8e 59 9e bf e5 45 6e f1 98 13 39 8c d7 d5 48 d7 46 a3 0a a4 7d 42 93 96 8b fb af cb f6 5a 90 df fc 87 81 6f ee 2a 01 e1 dc 69 9f 4d da bb 84 96 55 14 c0 d9 09 d5 4f da 70 62 a2 03 7b 50 b7 71 c1 53 d5 42 9b a4 ba 33 5e ab 84 0f 95 51 e9 cd 9d f8 bb 4a 6d c3 ed 13 18 ff 39 69 f7 b9 9d 9f b9 0c ab 96 88 13 f8 ad 4f 9a 06 9c 96 39 a7 4d 70 a6 59 c6 9c 29 69 25 67 ce 86 3b 88 e1 91 cc 95 35 b9 1b 41 7d 0a f1 4b e0 9c 78 b5 3a f9 c5 f4 94 bc f2 c6 03 49 ff a9 3c 81 e8 17 ac 68 2f 00 55 a6 07 bb 56 d6 a2 81 c1 a0 4c ef e1 RSA e(17 bits) - 01 00 01 ...~~~crapola snipped~~~...
But let us now “thank” the malignant idiocy of W. Koch (see also)… Because we will have to patch GPG in order to learn the RFC4880 turd that it uses as “the signature” to compare against when verifying a seal:
diff -uNr a/gnupg-1.4.10/cipher/rsa.c b/gnupg-1.4.10/cipher/rsa.c --- a/gnupg-1.4.10/cipher/rsa.c 2008-12-11 11:40:06.000000000 -0500 +++ b/gnupg-1.4.10/cipher/rsa.c 2018-01-06 15:15:55.000000000 -0500 @@ -444,6 +444,16 @@ result = mpi_alloc ( mpi_nlimb_hint_from_nbits (160) ); public( result, data[0], &pk ); rc = mpi_cmp( result, hash )? G10ERR_BAD_SIGN:0; + + /* Enable dumping of raw sig verification operands: */ + if( DBG_CIPHER ) { + log_mpidump(" data = ", data[0] ); + log_mpidump(" pk.n = ", pk.n ); + log_mpidump(" pk.e = ", pk.e ); + log_mpidump(" hash = ", hash ); + log_mpidump(" result= ", result ); + } + mpi_free(result); return rc;
So we built it; and now let’s run the seal verification with this debuggism:
$ ~/gpg-patched/gnupg-1.4.10/g10/gpg --debug-all --verify ch6.vpatch.sig ch6.vpatch
And obtain:
(From this point on, lines were wrapped and reindented for clarity:)
...~~~crapola snipped~~~... gpg: data = 5B6A8A0ACF4F4DB3F82EAC2D20255E4DF3E4B7C799603210766F26EF87C8980E737579 EC08E6505A51D19654C26D806BAF1B62F9C032E0B13D02AF99F7313BFCFD68DA46836E CA529D7360948550F982C6476C054A97FD01635AB44BFBDBE2A90BE06F7984AC8534C3 8613747F340C18176E6D5F0C10246A2FCE3A668EACB6165C2052497CA2EE483F4FD8D0 6A9911BD97E9B6720521D872BD08FF8DA11A1B8DB147F252E4E69AE6201D3B374B171D F445EF2BF509D468FD57CEB5840349B14C6E2AAA194D9531D238B85B8F0DD352D1E596 71539B429849E5D965E438BF9EFFC338DF9AADF304C4130D5A05E006ED855F37A06242 28097EF92F6E78CAE0CB97 gpg: pk.n = CDD49A674BAF76D3B73E25BC6DF66EF3ABEDDCA461D3CCB6416793E3437C7806562694 73C2212D5FD5EED17AA067FEC001D8E76EC901EDEDF960304F891BD3CAD7F9A335D1A2 EC37EABEFF3FBE6D3C726DC68E599EBFE5456EF19813398CD7D548D746A30AA47D4293 968BFBAFCBF65A90DFFC87816FEE2A01E1DC699F4DDABB84965514C0D909D54FDA7062 A2037B50B771C153D5429BA4BA335EAB840F9551E9CD9DF8BB4A6DC3ED1318FF3969F7 B99D9FB90CAB968813F8AD4F9A069C9639A74D70A659C69C29692567CE863B88E191CC 9535B91B417D0AF14BE09C78B53AF9C5F494BCF2C60349FFA93C81E817AC682F0055A6 07BB56D6A281C1A04CEFE1 gpg: pk.e = 10001 gpg: hash = 1FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF003051300 D0609608648016503040203050004406255509399A3AF322C486C770C5F7F6E05E18FC 3E2219A03CA56C7501426A597187468B2F71B4A198C807171B73D0E7DBC3EEF6EA6AFF 693DE58E18FF84395BE gpg: result= 1FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF003051300 D0609608648016503040203050004406255509399A3AF322C486C770C5F7F6E05E18FC 3E2219A03CA56C7501426A597187468B2F71B4A198C807171B73D0E7DBC3EEF6EA6AFF 693DE58E18FF84395BE ...~~~crapola snipped~~~...
And finally, let’s turn this into a FFACalc tape:
( the seal of ffa_ch6_simplest_rsa.vpatch.asciilifeform.sig : ) . 5B6A8A0ACF4F4DB3F82EAC2D20255E4DF3E4B7C799603210766F26EF87C8980E737579 EC08E6505A51D19654C26D806BAF1B62F9C032E0B13D02AF99F7313BFCFD68DA46836E CA529D7360948550F982C6476C054A97FD01635AB44BFBDBE2A90BE06F7984AC8534C3 8613747F340C18176E6D5F0C10246A2FCE3A668EACB6165C2052497CA2EE483F4FD8D0 6A9911BD97E9B6720521D872BD08FF8DA11A1B8DB147F252E4E69AE6201D3B374B171D F445EF2BF509D468FD57CEB5840349B14C6E2AAA194D9531D238B85B8F0DD352D1E596 71539B429849E5D965E438BF9EFFC338DF9AADF304C4130D5A05E006ED855F37A06242 28097EF92F6E78CAE0CB97 ( my public rsa exponent : ) . 10001 ( my public rsa modulus : ) . CDD49A674BAF76D3B73E25BC6DF66EF3ABEDDCA461D3CCB6416793E3437C7806562694 73C2212D5FD5EED17AA067FEC001D8E76EC901EDEDF960304F891BD3CAD7F9A335D1A2 EC37EABEFF3FBE6D3C726DC68E599EBFE5456EF19813398CD7D548D746A30AA47D4293 968BFBAFCBF65A90DFFC87816FEE2A01E1DC699F4DDABB84965514C0D909D54FDA7062 A2037B50B771C153D5429BA4BA335EAB840F9551E9CD9DF8BB4A6DC3ED1318FF3969F7 B99D9FB90CAB968813F8AD4F9A069C9639A74D70A659C69C29692567CE863B88E191CC 9535B91B417D0AF14BE09C78B53AF9C5F494BCF2C60349FFA93C81E817AC682F0055A6 07BB56D6A281C1A04CEFE1 ( now modularly-exponentiate it : ) X ( ... and print the output .) #
… and play the tape! :
$ time cat ch6_rsa_tape.txt | ./bin/ffa_calc 2048 32
… and unsurprisingly:
0001FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF003051 300D0609608648016503040203050004406255509399A3AF322C486C770C5F7F6E05E1 8FC3E2219A03CA56C7501426A597187468B2F71B4A198C807171B73D0E7DBC3EEF6EA6 AFF693DE58E18FF84395BE real 0m27.956s user 0m27.927s sys 0m0.003s
Guess what? The correct answer. Took an entire half-minute, however, to complete on this (same as where the log plot was made) machine.
In the next Chapter, we will investigate some elementary improvements in efficiency!
~To be continued!~
]]>You will need:
Add the above vpatch and seal to your V-set, and press to ffa_ch5_egypt.vpatch.
You should end up with the same directory structure as previously.
Now compile ffacalc:
cd ffacalc gprbuild
But do not run it quite yet.
First, let’s go through the “mail bag” from Chapter 4:
Reader Apeloyee observed that FZ_Valid_Bitness_P can be made slightly shorter and clearer. I have taken the more conservative of his two suggestions.
The rewritten procedure now looks like this:
-- Determine if a proposed FFA Bitness is valid. function FZ_Valid_Bitness_P(B : in Positive) return Boolean is Result : Boolean := False; T : Natural := B; PopCount : Natural := 0; begin -- Supposing we meet the minimal bitness: if B >= FZ_Minimal_Bitness then while T > 0 loop PopCount := PopCount + T mod 2; T := T / 2; end loop; -- Is B a power of 2? if PopCount = 1 then Result := True; end if; end if; return Result; end FZ_Valid_Bitness_P;
I would like to thank again the eagle-eyed Apeloyee for his attentive and skeptical reading.
And now for the solution to the Chapter 4 Puzzle:
Readers who have written solutions of their own to this puzzle, are invited to post them as comments to this post (please try to avoid spoiling it for those who have not yet reached Chapter 5.)
In Chapter 4, we wrote a small and very limited RPN calculator, FFACalc, capable only of addition, subtraction, basic I/O, and a few stack motion operations. In Chapter 5, we will give it the ability to multiply and divide.
The division and multiplication routines in FFA will extend across more than one chapter in this article series, on account of the wide spectrum of possible tradeoffs between simplicity and speed in the known approaches to both problems. We will begin at the extreme of simplicity at all cost. And, as the reader may already have guessed from the title of this Chapter, we will begin with division and multiplication algorithms that were known to the ancient Egyptians.
But first, a few small helper-function additions to LibFFA.
A convenient WBool (See Chapter 1) inverter:
w_pred.ads:
-- Return WBool-complement of N. function W_Not(N : in WBool) return WBool;
w_pred.adb:
-- Return WBool-complement of N. function W_Not(N : in WBool) return WBool is begin return 1 xor N; end W_Not; pragma Inline_Always(W_Not);
A convenient function for determining the bitness of an FZ integer:
fz_type.ads:
-- A count of Bits, anywhere (cannot be 0): subtype Bit_Count is Positive;
fz_basic.ads:
-- Determine the Bitness of N function FZ_Bitness(N : in FZ) return Bit_Count;
fz_basic.adb:
-- Determine the Bitness of N function FZ_Bitness(N : in FZ) return Bit_Count is begin return N'Length * Words.Bitness; end FZ_Bitness; pragma Inline_Always(FZ_Bitness);
And finally, a mechanism for constant-time conditional addition:
fz_arith.ads:
-- Gate = 1: Sum := X + Y; Overflow := Carry -- Gate = 0: Sum := X; Overflow := 0 procedure FZ_Add_Gated_O(X : in FZ; Y : in FZ; Gate : in WBool; Sum : out FZ; Overflow : out WBool); pragma Precondition(X'Length = Y'Length and X'Length = Sum'Length);
fz_arith.adb:
procedure FZ_Add_Gated_O(X : in FZ; Y : in FZ; Gate : in WBool; Sum : out FZ; Overflow : out WBool) is Carry : WBool := 0; Mask : constant Word := 0 - Gate; begin for i in 0 .. Word_Index(X'Length - 1) loop declare A : constant Word := X(X'First + i); B : constant Word := Y(Y'First + i) and Mask; S : constant Word := A + B + Carry; begin Sum(Sum'First + i) := S; Carry := W_Carry(A, B, S); end; end loop; Overflow := Carry; end FZ_Add_Gated_O; pragma Inline_Always(FZ_Add_Gated_O);
This is something like a hybrid between FZ_Add (See Chapter 1) and FZ_Mux. If Gate is equal to 1, Mask is set to “maxint” of our Word, i.e. 11111….1; consequently the quantity B is unmolested, and the routine behaves exactly like FZ_Add. However, if Gate is equal to 0, Mask is then also equal to 0 and B is forced to also equal 0 during each iteration of the loop; the resulting addition is a constant-time “no-op”.
Work out the behaviour of this algorithm on a sheet of paper, as you have done for the material in earlier chapters.
And at last, we are ready for “Egyptological” division:
fz_divis.ads:
with FZ_Type; use FZ_Type; package FZ_Divis is pragma Pure; -- Dividend is divided by Divisor, producing Quotient and Remainder. -- WARNING: NO div0 test here! Caller must test. procedure FZ_IDiv(Dividend : in FZ; Divisor : in FZ; Quotient : out FZ; Remainder : out FZ); pragma Precondition(Dividend'Length = Divisor'Length and Quotient'Length = Remainder'Length and Dividend'Length = Quotient'Length); -- Exactly same thing as IDiv, but keep only the Quotient procedure FZ_Div(Dividend : in FZ; Divisor : in FZ; Quotient : out FZ); pragma Precondition(Dividend'Length = Divisor'Length and Dividend'Length = Quotient'Length); -- Exactly same thing as IDiv, but keep only the Remainder procedure FZ_Mod(Dividend : in FZ; Divisor : in FZ; Remainder : out FZ); pragma Precondition(Dividend'Length = Divisor'Length and Dividend'Length = Remainder'Length); end FZ_Divis;
This algorithm is quite exactly analogous to the “long division” everyone learns in grade school. The “Egyptian” variant is also known as “restoring division” and is discussed as such in, e.g., Vol. 2 of D. E. Knuth’s “Art of Programming”. Observe that, just like every other FFA operation, this procedure does not branch on the operand bits. This is essential in any routine intended for use in cryptographic arithmetic.
The mechanics are quite simple: the Dividend is shifted into a register R “head-first”, and the Divisor is repeatedly subtracted from R. If the subtraction results in an underflow, this is taken to signify that the Divisor “did not go in”, and the subtraction is then undone, in constant-time (via the conditional-adder introduced earlier in this Chapter.) In the latter case, a 0 is revealed as the current bit of the Quotient — whose bits are found backwards, from last (highest) to first (lowest); in the opposite case, a 1 is written. The Remainder obtained from the division is found in R at the end of the procedure, which requires a number of iterations equal to the bitness of the Dividend:
fz_divis.adb:
with Words; use Words; with W_Pred; use W_Pred; with FZ_Basic; use FZ_Basic; with FZ_Arith; use FZ_Arith; with FZ_BitOp; use FZ_BitOp; with FZ_Shift; use FZ_Shift; package body FZ_Divis is -- Dividend is divided by Divisor, producing Quotient and Remainder. -- WARNING: NO div0 test here! Caller must test. procedure FZ_IDiv(Dividend : in FZ; Divisor : in FZ; Quotient : out FZ; Remainder : out FZ) is -- The working register QR : FZ(1 .. Dividend'Length + Divisor'Length); -- Bottom seg of Z will end up containing the Quotient; top - remainder Q : FZ renames QR(1 .. Dividend'Length); -- Quotient R : FZ renames QR(Dividend'Length + 1 .. QR'Last); -- Remainder C : WBool := 0; -- Borrow, from comparator begin Q := Dividend; -- Q begins with the Dividend FZ_Clear(R); -- R begins empty -- For each bit of Dividend: for i in 1 .. FZ_Bitness(Dividend) loop -- Advance QR by 1 bit: FZ_ShiftLeft(QR, QR, 1); -- Subtract Divisor from R; Underflow goes into C FZ_Sub(X => R, Y => Divisor, Difference => R, Underflow => C); -- If C=1, subtraction underflowed, and then Divisor gets added back: FZ_Add_Gated(X => R, Y => Divisor, Gate => C, Sum => R); -- Current result-bit is equal to Not-C, i.e. 1 if Divisor 'went in' FZ_Or_W(Q, W_Not(C)); end loop; Quotient := Q; -- Output the Quotient. Remainder := R; -- Output the Remainder. end FZ_IDiv; pragma Inline_Always(FZ_IDiv);
The above is, to my knowledge, the simplest practical integer division algorithm (simple repeated-subtraction is not practical for large numbers and will not be discussed.) There are several substantial improvements that can be made to it, and we will visit them in later chapters. However, it is worth noting that Division is the most expensive of the familiar arithmetical operations, and — unlike multiplication — no order-of-magnitude speedup for it is known.
Now we will define remainderless division, and modulus operations. These can be optimized to an extent, but their simplest possible formulation is the trivial one below, written in terms of FZ_IDiv:
-- Exactly same thing as IDiv, but keep only the Quotient procedure FZ_Div(Dividend : in FZ; Divisor : in FZ; Quotient : out FZ) is Remainder : FZ(Divisor'Range); pragma Unreferenced(Remainder); begin FZ_IDiv(Dividend, Divisor, Quotient, Remainder); end FZ_Div; pragma Inline_Always(FZ_Div); -- Exactly same thing as IDiv, but keep only the Remainder procedure FZ_Mod(Dividend : in FZ; Divisor : in FZ; Remainder : out FZ) is Quotient : FZ(Dividend'Range); pragma Unreferenced(Quotient); begin FZ_IDiv(Dividend, Divisor, Quotient, Remainder); end FZ_Mod; pragma Inline_Always(FZ_Mod); end FZ_Divis;
And, now: “Egyptological” multiplication.
fz_mul.ads:
with FZ_Type; use FZ_Type; package FZ_Mul is pragma Pure; -- 'Egyptological' multiplier. XY_Lo and XY_Hi hold result of X*Y. procedure FZ_Mul_Egyptian(X : in FZ; Y : in FZ; XY_Lo : out FZ; XY_Hi : out FZ); pragma Precondition(X'Length = Y'Length and XY_Lo'Length = XY_Hi'Length and XY_Lo'Length = ((X'Length + Y'Length) / 2)); end FZ_Mul;
Observe that it is “symmetric” with the Division, i.e. the exact same process “in reverse gear”:
fz_mul.adb:
with Words; use Words; with FZ_Basic; use FZ_Basic; with FZ_Pred; use FZ_Pred; with FZ_Arith; use FZ_Arith; with FZ_Shift; use FZ_Shift; package body FZ_Mul is -- 'Egyptological' multiplier. XY_Lo and XY_Hi hold result of X*Y. procedure FZ_Mul_Egyptian(X : in FZ; Y : in FZ; XY_Lo : out FZ; XY_Hi : out FZ) is -- Register holding running product XY : FZ(1 .. X'Length + Y'Length); -- X-Slide XS : FZ(1 .. X'Length + Y'Length); -- Y-Slide YS : FZ(Y'Range) := Y; begin -- Product register begins empty FZ_Clear(XY); -- X-Slide initially equals X: XS(1 .. X'Length) := X; XS(X'Length + 1 .. XS'Last) := (others => 0); -- For each bit of Y: for i in 1 .. FZ_Bitness(Y) loop -- If lowest bit of Y-Slide is 1, X-Slide is added into XY FZ_Add_Gated(X => XY, Y => XS, Sum => XY, Gate => FZ_OddP(YS)); -- X-Slide := X-Slide * 2 FZ_ShiftLeft(XS, XS, 1); -- Y-Slide := Y-Slide / 2 FZ_ShiftRight(YS, YS, 1); end loop; -- Write out the Product's lower and upper FZs: XY_Lo := XY(1 .. XY_Lo'Length); XY_Hi := XY(XY_Lo'Length + 1 .. XY'Last); end FZ_Mul_Egyptian; pragma Inline_Always(FZ_Mul_Egyptian); end FZ_Mul;
We begin with a working register XY, initially zero, wide enough to hold the product; a “X-Slide” XS, which initially equals multiplicand X; and a “Y-Slide”, YS, which initially equals multiplicand Y. XS slides “upwards”, and so is given room so as to be able to fit the uppermost possible bit of the product. YS slides “downward” and is therefore Y-sized.
The procedure requires a number of iterations equal to the bitness of Y (which, the attentive reader will observe, in current FFA will always equal the bitness of X — but why not specify the general case?) In each iteration, XS is conditionally added in constant time to the running product XY. The condition is determined by the lowest bit of current YS. At the end of each iteration, XS slides “up”, i.e. is multiplied by two, while YS slides “down”, i.e. is divided by two.
After the required number of iterations, XY will equal the product of the multiplicands X and Y. The two halves of this product, FZ integers equal in their bitness to that of the multiplicands, are finally written to XY_Lo and XY_Hi — the lower and upper halves respectively.
Observe that this multiplier runs in O(N^2) — FZ_Add being an O(N) operation, N being the current FZ bitness. This “bearded” algorithm is impractically slow for battlefield use, in e.g. RSA. However it is unbeatable in simplicity. The reader should be able to fully understand why it cannot fail, and why the result will be emitted in constant time (i.e. the required work is wholly independent of the Hamming weights of the multiplicands) and occupy exactly the allotted space.
In subsequent chapters, we will explore several other multiplication algorithms, which make tradeoffs in simplicity (but however pointedly without compromising constant-time operation!) and gain considerable improvements in speed.
Now, let’s add all of the new operations to FFACalc:
First, a helper function that prevents division by zero:
ffa_calc.adb:
-- Ensure that a divisor is not zero procedure MustNotZero(D : in FZ) is begin if FZ_ZeroP(D) = 1 then E("Division by Zero!"); end if; end MustNotZero;
The Division-with-Remainder, Integer Division, Modulus, and Multiplication operations:
First, ‘\’: expects a Dividend, and on top of it, a Divisor; and leaves the stack containing in their place a Quotient, and on top of it, a Remainder:
ffa_calc.adb:
-- Divide and give Quotient and Remainder when '\' => Want(2); MustNotZero(Stack(SP)); FZ_IDiv(Dividend => Stack(SP - 1), Divisor => Stack(SP), Quotient => Stack(SP - 1), Remainder => Stack(SP));
Then, ‘/’: exactly the same thing as the above, but leaves only the Quotient on the stack:
-- Divide and give Quotient only when '/' => Want(2); MustNotZero(Stack(SP)); FZ_Div(Dividend => Stack(SP - 1), Divisor => Stack(SP), Quotient => Stack(SP - 1)); Drop;
And, ‘%’: this is the familiar modulus operation; it performs division but leaves only the remainder on the stack:
-- Divide and give Remainder only when '%' => Want(2); MustNotZero(Stack(SP)); FZ_Mod(Dividend => Stack(SP - 1), Divisor => Stack(SP), Remainder => Stack(SP - 1)); Drop;
Finally, multiplication, using the very slow Egyptian method. Observe that multiplication of two integers results in a product having as its bitness the sum of their bitnesses. Therefore FFACalc multiplication places two FZ integers on the stack, the lower half of the result, followed by the upper half. If the latter is unwanted, it must be dropped using the Drop (”_”) operation (See Chapter 4.) The Overflow flag is in all cases unaffected, as there is room for the entire product — its lower and upper halves occupy the same places on the stack which formerly held the multiplicands.
-- Multiply, give bottom and top halves when '*' => Want(2); MustNotZero(Stack(SP)); -- Ch5: slow and simple 'Egyptological' method: FZ_Mul_Egyptian(X => Stack(SP - 1), Y => Stack(SP), XY_Lo => Stack(SP - 1), XY_Hi => Stack(SP));
Now let’s build the improved FFACalc:
cd ffacalc gprbuild
And try out the new operations:
$ echo -n -E ".FA55F3F5459A9E799FD0913737C3FCBE74C01A9C3CE54F80083E16F27091F65F.F88A509A858786B366D16B9D65BF3991D0\##" | ./bin/ffa_calc 256 32 000000000000000000000000000000A1398B1FBC8872F908C44763C5427EBD0F 000000000000000000000000000000000101D96F0815F0853E1D8D883B93F2D9 $ echo -n -E ".FA55F3F5459A9E799FD0913737C3FCBE74C01A9C3CE54F80083E16F27091F65F.F88A509A858786B366D16B9D65BF3991D0/#" | ./bin/ffa_calc 256 32 000000000000000000000000000000000101D96F0815F0853E1D8D883B93F2D9 $ echo -n -E ".FA55F3F5459A9E799FD0913737C3FCBE74C01A9C3CE54F80083E16F27091F65F.F88A509A858786B366D16B9D65BF3991D0%#" | ./bin/ffa_calc 256 32 000000000000000000000000000000A1398B1FBC8872F908C44763C5427EBD0F $ echo -n -E ".FA55F3F5459A9E799FD0913737C3FCBE74C01A9C3CE54F80083E16F27091F65F.0\#" | ./bin/ffa_calc 256 32 Pos: 67: Division by Zero! $ echo -n -E ".FA55F3F5459A9E799FD0913737C3FCBE74C01A9C3CE54F80083E16F27091F65F.0/#" | ./bin/ffa_calc 256 32 Pos: 67: Division by Zero! $ echo -n -E ".FA55F3F5459A9E799FD0913737C3FCBE74C01A9C3CE54F80083E16F27091F65F.0%#" | ./bin/ffa_calc 256 32 Pos: 67: Division by Zero! $ echo -n -E ".FA55F3F5459A9E799FD0913737C3FCBE74C01A9C3CE54F80083E16F27091F65F.F88A509A858786B366D16B9D65BF3991D0\.A1398B1FBC8872F908C44763C5427EBD0F={[R OK]}{[R SAD]}_[; ].101D96F0815F0853E1D8D883B93F2D9={[Q OK]}{[Q SAD]}_[; ]" | ./bin/ffa_calc 256 32 R OK; Q OK; $ echo -n -E ".FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF.FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF * ## " | ./bin/ffa_calc 256 32 FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFE 0000000000000000000000000000000000000000000000000000000000000001
Homework Problems, in order of difficulty:
~To be continued!~
]]>You will need:
Add the above vpatch and seal to your V-set, and press to ch4_ffacalc.vpatch.
Just like before, you will end up with two directories, libffa and ffademo.
However you will also see a new one, ffacalc.
Now compile ffacalc:
cd ffacalc gprbuild
But do not run it quite yet.
As the title of this chapter suggests, it will not introduce fundamentally new FFA material. Instead, you will meet FFACalc — a program which makes practical use of the routines presented in Chapters 1, 2, and 3. Henceforth every Chapter in this series will build on FFACalc, rather than continuing to expand the rather-uninteresting ffademo. When we reach the final Chapter, the reader will notice that… but let’s not spoil it!
For now, FFACalc is exactly what the name implies: a FFAtronic RPN calculator, capable strictly of addition, subtraction, basic bitwise ops, numeric comparison, a small number of simple stack manipulations (a la Forth), and some elementary I/O.
But first, a few small helper-function additions to libFFA.
Calculators need to accept numeric input, and it is to be processed one hexadecimal digit at a time. Therefore a nibble subtype of Word is called for:
words.ads:
-- Word, restricted to Nibble range. subtype Nibble is Word range 0 .. 16#F#;
Predicate operators produce strictly WBool (see Chapter 1) outputs. FFACalc will operate strictly in FZ integers. Therefore a conversion is required:
fz_basic.ads:
-- Set given FZ to a given truth value procedure WBool_To_FZ(V : in WBool; N : out FZ);
fz_basic.adb:
-- Set given FZ to a given truth value procedure WBool_To_FZ(V : in WBool; N : out FZ) is begin FZ_Clear(N); FZ_Set_Head(N, V); end WBool_To_FZ; pragma Inline_Always(WBool_To_FZ);
Sometimes, we will need to go in the other direction, and produce a WBool from an FZ, based on the Boolean meaning of its value (i.e. whether it is a nonzero.) This will look like this:
w_pred.ads:
-- Return 1 if N is unequal to 0; otherwise return 0. function W_NZeroP(N : in Word) return WBool;
w_pred.adb:
-- Return 1 if N is unequal to 0; otherwise return 0. function W_NZeroP(N : in Word) return WBool is begin return 1 xor W_ZeroP(N); end W_NZeroP; pragma Inline_Always(W_NZeroP);
Now since we are introducing a program where the user is able to control the width of an instantiated FFAtron, we will need a validity predicate. FFA “bitness” is not an arbitrary positive number, but must be an integer multiple of all historical machine word sizes, and additionally a power of two (the reason for the latter restriction will become apparent in the Sub-Quadratic Multiplication Chapter.) And so:
fz_lim.ads:
package FZ_Lim is pragma Pure; FZ_Minimal_Bitness : constant Positive := 256; FZ_Validity_Rule_Doc : constant String := "Must be greater than or equal to 256, and a power of 2."; -- Determine if a proposed FFA Bitness is valid. function FZ_Valid_Bitness_P(B : in Positive) return Boolean; end FZ_Lim;
fz_lim.adb:
package body FZ_Lim is -- Determine if a proposed FFA Bitness is valid. function FZ_Valid_Bitness_P(B : in Positive) return Boolean is Result : Boolean := False; T : Natural := B; PopCount : Natural := 0; begin -- Supposing we meet the minimal bitness: if B >= FZ_Minimal_Bitness then while T > 0 loop if T mod 2 = 1 then PopCount := PopCount + 1; end if; T := T / 2; end loop; -- Is B a power of 2? if PopCount = 1 then Result := True; end if; end if; return Result; end FZ_Valid_Bitness_P; end FZ_Lim;
In a power of 2, the “popcount” (number of 1s) is necessarily 1. The mechanics of this routine should be apparent to the alert reader, nothing more will be said of it.
And that’s all for LibFFA, for the time being. Now returning to FFACalc: we will shed the old ffademo’s dependence on Ada’s standard I/O library, in favour of a more compact approach:
os.ads:
with Interfaces; use Interfaces; with Interfaces.C; use Interfaces.C; package OS is -- Receive a character from the TTY, and True if success (False if EOF) function Read_Char(C : out Character) return Boolean; -- Send a character to the TTY. procedure Write_Char(C : in Character); -- Send a Newline to the TTY. procedure Write_Newline; -- Exit with an error condition report. procedure Eggog(M : String); procedure Quit(Return_Code : Integer); pragma Import (Convention => C, Entity => Quit, External_Name => "exit"); private -- POSIX stdio: EOF : constant int := -1; function GetChar return int; pragma Import(C, getchar); function PutChar(item: int) return int; pragma Import(C, putchar); -- GNATistic procedure To_Stderr(C : Character); pragma Import(Ada, To_Stderr, "__gnat_to_stderr_char"); Sadness_Code : constant Integer := -1; end OS;
os.adb:
package body OS is -- Receive a character from the TTY, and True if success (False if EOF) function Read_Char(C : out Character) return Boolean is i : int; Result : Boolean := False; begin i := GetChar; if i /= EOF then C := Character'Val(i); Result := True; end if; return Result; end Read_Char; -- Send a character to the TTY. procedure Write_Char(C : in Character) is R : int; pragma Unreferenced(R); begin R := PutChar(int(Character'Pos(C))); end Write_Char; -- Send a Newline to the TTY. procedure Write_Newline is begin Write_Char(Character'Val(16#A#)); end Write_Newline; -- Exit with an error condition report. procedure Eggog(M : String) is begin for i in 1 .. M'Length loop To_Stderr(M(I)); end loop; -- Emit LF To_Stderr(Character'Val(16#A#)); -- Exit Quit(Sadness_Code); end; end OS;
Now for a bit of surprise. Did you know that it is impossible to make use of the standard Ada.Command_Line functionality in a program where the
pragma Restrictions(No_Secondary_Stack);
… restriction is in force?
The reason for this becomes apparent in a careful reading of the Standard, where we find the following turd:
function Argument (Number : in Positive) return String;
Indeed, a completely unnecessary invocation of the secondary stack ! Why did the authors of the Standard do this ? I have no idea, but we will have to correct their mistake! Unfortunately it is quite impossible to do this without invoking some GNATisms. And so, we must:
cmdline.ads:
with System; package CmdLine is -- IMHO this is reasonable. CmdLineArg_Length : constant Positive := 256; subtype CmdLineArg is String(1 .. CmdLineArg_Length); function Initialized return Boolean; function Arg_Count return Natural; pragma Import(C, Arg_Count, "__gnat_arg_count"); procedure Get_Argument(Number : in Natural; Result : out String); private procedure Fill_Arg (A : System.Address; Arg_Num : Integer); pragma Import(C, Fill_Arg, "__gnat_fill_arg"); function Len_Arg (Arg_Num : Integer) return Integer; pragma Import(C, Len_Arg, "__gnat_len_arg"); end CmdLine;
cmdline.adb:
with System; use System; package body CmdLine is -- Test if GNAT's cmdline mechanism is available function Initialized return Boolean is gnat_argv : System.Address; pragma Import (C, gnat_argv, "gnat_argv"); begin return gnat_argv /= System.Null_Address; end Initialized; -- Fill the provided string with the text of Number-th cmdline arg procedure Get_Argument(Number : in Natural; Result : out String) is begin if Number >= Arg_Count or (not Initialized) then raise Constraint_Error; end if; declare L : constant Integer := Len_Arg(Number); Arg : aliased String(1 .. L); begin -- Will it fit into the available space? if L > Result'Length then raise Constraint_Error; end if; -- Get this arg string from where GNAT stowed it Fill_Arg(Arg'Address, Number); -- Copy it to Result: Result := (others => ' '); Result(Arg'Range) := Arg; end; end Get_Argument; end CmdLine;
How this is invoked, will soon become quite apparent. Let’s at last proceed to FFACalc !
We make use here of nearly everything we have seen in Chapters 1-3:
ffa_calc.adb:
-- Basics with OS; use OS; with CmdLine; use CmdLine; -- FFA with FZ_Lim; use FZ_Lim; with Words; use Words; with W_Pred; use W_Pred; with FZ_Type; use FZ_Type; with FZ_Basic; use FZ_Basic; with FZ_Arith; use FZ_Arith; with FZ_Cmp; use FZ_Cmp; with FZ_Pred; use FZ_Pred; with FZ_BitOp; use FZ_BitOp; with FZ_Shift; use FZ_Shift; -- For Output with FFA_IO; use FFA_IO;
procedure FFA_Calc is Width : Positive; -- Desired FFA Width Height : Positive; -- Desired Height of Stack begin if Arg_Count /= 3 then Eggog("Usage: ./ffa_calc WIDTH HEIGHT"); end if; declare Arg1 : CmdLineArg; Arg2 : CmdLineArg; begin -- Get commandline args: Get_Argument(1, Arg1); -- First arg Get_Argument(2, Arg2); -- Second arg -- Parse into Positives: Width := Positive'Value(Arg1); Height := Positive'Value(Arg2); exception when others => Eggog("Invalid arguments!"); end; -- Test if proposed Width is permissible: if not FZ_Valid_Bitness_P(Width) then Eggog("Invalid Width: " & FZ_Validity_Rule_Doc); end if;
Above, we see how our replacement for Ada’s standard command-line argument reader works. Instead of demanding the secondary stack, we make use of pre-allocated strings, into which each argument is copied. The reader is invited to try and overflow these: the resulting death is a clean one.
Now for the calculator…
-- The Calculator itself: declare -- The number of Words required to make a FZ of the given Bitness. Wordness : Indices := Indices(Width / Bitness); -------------------------------------------------------- -- State -- -------------------------------------------------------- -- The Stack: subtype Stack_Positions is Natural range 0 .. Height; type Stacks is array(Stack_Positions range <>) of FZ(1 .. Wordness); Stack : Stacks(Stack_Positions'Range); -- Stack Pointer: SP : Stack_Positions := Stack_Positions'First; -- Carry/Borrow Flag: Flag : WBool := 0; -- Odometer: Pos : Natural := 0; -- The current levels of the three types of nestedness: QuoteLevel : Natural := 0; CommLevel : Natural := 0; CondLevel : Natural := 0; --------------------------------------------------------
Observe that the FORTH-like stack is allocated on the stack (your machine’s, that is), and its height is determined by the HEIGHT parameter given in the second command line argument. The width of the FZ integers comprising the elements of this stack, is in turn given by WIDTH, the first command line argument.
Now for some elementary stack-manipulation routines:
-- Clear the stack and set SP to bottom. procedure Zap is begin -- Clear the stack for i in Stack'Range loop FZ_Clear(Stack(i)); end loop; -- Set SP to bottom SP := Stack_Positions'First; -- Clear Overflow flag Flag := 0; end Zap; -- Report a fatal error condition at the current symbol procedure E(S : in String) is begin Eggog("Pos:" & Natural'Image(Pos) & ": " & S); end E; -- Move SP up procedure Push is begin if SP = Stack_Positions'Last then E("Stack Overflow!"); else SP := SP + 1; end if; end Push; -- Discard the top of the stack procedure Drop is begin FZ_Clear(Stack(SP)); SP := SP - 1; end Drop; -- Check if stack has the necessary N items procedure Want(N : in Positive) is begin if SP < N then E("Stack Underflow!"); end if; end Want;
Here we make use of the FZ_ShiftLeft operation we implemented in Chapter 3:
-- Slide a new hex digit into the FZ on top of stack procedure Ins_Hex_Digit(N : in out FZ; D : in Nibble) is Overflow : Word := 0; begin -- Make room in this FZ for one additional hex digit FZ_ShiftLeft_O(N => N, ShiftedN => N, Count => 4, Overflow => Overflow); -- Constants which exceed the Width are forbidden: if W_NZeroP(Overflow) = 1 then E("Constant Exceeds Bitness!"); end if; -- Set the new digit FZ_Or_W(N, D); end;
And now for the "opcodes" comprising our stack machine:
-- Execute a Normal Op procedure Op_Normal(C : in Character) is -- Over/underflow output from certain ops F : Word; begin case C is -------------- -- Stickies -- -------------- -- Enter Commented when '(' => CommLevel := 1; -- Exit Commented (but we aren't in it!) when ')' => E("Mismatched close-comment parenthesis !"); -- Enter Quoted when '[' => QuoteLevel := 1; -- Exit Quoted (but we aren't in it!) when ']' => E("Mismatched close-quote bracket !"); -- Enter a ~taken~ Conditional branch: when '{' => Want(1); if FZ_ZeroP(Stack(SP)) = 1 then CondLevel := 1; end if; Drop; -- Exit from a ~non-taken~ Conditional branch: -- ... we push a 0, to suppress the 'else' clause when '}' => Push; WBool_To_FZ(0, Stack(SP)); ---------------- -- Immediates -- ---------------- -- These operate on the FZ ~currently~ at top of the stack; -- and this means that the stack may NOT be empty. when '0' .. '9' => Want(1); Ins_Hex_Digit(Stack(SP), Character'Pos(C) - Character'Pos('0')); when 'A' .. 'F' => Want(1); Ins_Hex_Digit(Stack(SP), 10 + Character'Pos(C) - Character'Pos('A')); when 'a' .. 'f' => Want(1); Ins_Hex_Digit(Stack(SP), 10 + Character'Pos(C) - Character'Pos('a')); ------------------ -- Stack Motion -- ------------------ -- Push a 0 onto the stack when '.' => Push; FZ_Clear(Stack(SP)); -- Dup when '″' => Want(1); Push; Stack(SP) := Stack(SP - 1); -- Drop when '_' => Want(1); Drop; -- Swap when ''' => Want(2); FZ_Swap(Stack(SP), Stack(SP - 1)); -- Over when '`' => Want(2); Push; Stack(SP) := Stack(SP - 2); ---------------- -- Predicates -- ---------------- -- Equality when '=' => Want(2); WBool_To_FZ(FZ_Eqp(X => Stack(SP), Y => Stack(SP - 1)), Stack(SP - 1)); Drop; -- Less-Than when '< ' => Want(2); WBool_To_FZ(FZ_LessThanP(X => Stack(SP - 1), Y => Stack(SP)), Stack(SP - 1)); Drop; -- Greater-Than when '>' => Want(2); WBool_To_FZ(FZ_GreaterThanP(X => Stack(SP - 1), Y => Stack(SP)), Stack(SP - 1)); Drop; ---------------- -- Arithmetic -- ---------------- -- Subtract when '-' => Want(2); FZ_Sub(X => Stack(SP - 1), Y => Stack(SP), Difference => Stack(SP - 1), Underflow => F); Flag := W_NZeroP(F); Drop; -- Add when '+' => Want(2); FZ_Add(X => Stack(SP - 1), Y => Stack(SP), Sum => Stack(SP - 1), Overflow => F); Flag := W_NZeroP(F); Drop; ----------------- -- Bitwise Ops -- ----------------- -- Bitwise-And when '&' => Want(2); FZ_And(X => Stack(SP - 1), Y => Stack(SP), Result => Stack(SP - 1)); Drop; -- Bitwise-Or when '|' => Want(2); FZ_Or(X => Stack(SP - 1), Y => Stack(SP), Result => Stack(SP - 1)); Drop; -- Bitwise-Xor when '^' => Want(2); FZ_Xor(X => Stack(SP - 1), Y => Stack(SP), Result => Stack(SP - 1)); Drop; -- Bitwise-Not (1s-Complement) when '~' => Want(1); FZ_Not(Stack(SP), Stack(SP)); ----------- -- Other -- ----------- -- mUx when 'U' => Want(3); FZ_Mux(X => Stack(SP - 2), Y => Stack(SP - 1), Result => Stack(SP - 2), Sel => FZ_NZeroP(Stack(SP))); Drop; Drop; -- Put the Overflow flag on the stack when 'O' => Push; WBool_To_FZ(Flag, Stack(SP)); -- Print the FZ on the top of the stack when '#' => Want(1); Dump(Stack(SP)); Drop; -- Zap (reset) when 'Z' => Zap; -- Quit with Stack Trace when 'Q' => for I in reverse Stack'First + 1 .. SP loop Dump(Stack(I)); end loop; Quit(0); ---------- -- NOPs -- ---------- -- Ops we have not yet spoken of -- do nothing when others => null; end case; end Op_Normal; -- Process a Symbol procedure Op(C : in Character) is begin -- First, see whether we are in a state of nestedness: -- ... in a Comment block: if CommLevel > 0 then case C is when ')' => -- Drop a nesting level: CommLevel := CommLevel - 1; when '(' => -- Add a nesting level: CommLevel := CommLevel + 1; when others => null; -- Other symbols have no effect at all end case; -- ... in a Quote block: elsif QuoteLevel > 0 then case C is when ']' => -- Drop a nesting level: QuoteLevel := QuoteLevel - 1; when '[' => -- Add a nesting level: QuoteLevel := QuoteLevel + 1; when others => null; -- Other symbols have no effect on the level end case; -- If we aren't the mode-exiting ']', print current symbol: if QuoteLevel > 0 then Write_Char(C); end if; --- ... in a ~taken~ Conditional branch: elsif CondLevel > 0 then case C is when '}' => -- Drop a nesting level: CondLevel := CondLevel - 1; -- If we exited the Conditional as a result, -- we push a 1 to trigger the possible 'else' clause: if CondLevel = 0 then Push; WBool_To_FZ(1, Stack(SP)); end if; when '{' => -- Add a nesting level: CondLevel := CondLevel + 1; when others => null; -- Other symbols have no effect on the level end case; else -- This is a Normal Op, so proceed with the normal rules. Op_Normal(C); end if; end Op; -- Current Character C : Character; begin -- Reset the Calculator Zap; -- Process characters until EOF: loop if Read_Char(C) then -- Execute Op: Op(C); -- Advance Odometer Pos := Pos + 1; else Zap; Quit(0); -- if EOF, we're done end if; end loop; end; end FFA_Calc;
But rather than describing in detail the operation of FFACalc, I will invite the reader to build it, and solve the following puzzle:
Write a FFACalc tape that will take seven numbers, presumed to be on the top of the stack, and return the largest.
Your answer should work with any legal WIDTH, and any stack HEIGHT large enough to hold the working set.
For instance, suppose file numbers.txt were to contain:
.9.1.7.5.1.1.0
Then the following example invocation:
cd ffacalc gprbuild cat numbers.txt youranswer.txt | ./bin/ffa_calc 256 16
... should produce the output:
0000000000000000000000000000000000000000000000000000000000000009
... and similarly for any other seven numbers.
A solution will be posted in the next Chapter.
~To be continued!~
]]>You will need:
Add the above vpatch and seal to your V-set, and press to ffa_ch3_shifts.vpatch.
Just like before, you will end up with two directories, libffa and ffademo.
Now compile the demo, just as in Chapters 1 and 2:
cd ffademo gprbuild
And, as before, do not run it quite yet.
First, let’s go through the “mail bag” from Chapter 2:
Reader Apeloyee observed : “A bit odd to see FZ_Neg to be named, unlike other SIMD bit operations, differently from the corresponding Ada operation on words. I personally would expect that “Neg” is arithmetic negation, since other operations aren’t named FZ_Conj or FZ_Disj.”
-- NotN := ~N procedure FZ_Neg(N : in FZ; NotN : out FZ) is begin for i in N'Range loop NotN(i) := not N(i); end loop; end FZ_Neg; pragma Inline_Always(FZ_Neg);
He’s quite right, and henceforth the FFA ones-complement operator will be called FZ_Not.
Reader Diana Coman noticed the duplicate assignment to T in FZ_Swap:
-- Exchange X and Y procedure FZ_Swap(X : in out FZ; Y : in out FZ) is T : FZ(X'Range) := X; begin T := X; X := Y; Y := T; end FZ_Swap; pragma Inline_Always(FZ_Swap);
The renaming of the ones-complement operator, and the removal of the redundant assignment to T in FZ_Swap appear in this Chapter’s vpatch.
I would like to thank Apeloyee and Diana Coman for their careful reading; and ask all of my readers not to hesitate in pointing out potentially troublesome (or merely confusing) algorithms, typos, obvious mistakes, etc.
Now let’s proceed to new material: shift operators for FZ integers :
fz_shift.ads:
with Words; use Words; with FZ_Type; use FZ_Type; package FZ_Shift is pragma Pure; -------------------------------------------------------------- -- Shift Right -------------------------------------------------------------- -- ShiftedN := N >> Count (with Overflow Input and Output) procedure FZ_ShiftRight_O_I(N : in FZ; ShiftedN : out FZ; Count : in WBit_Index; Overflow : out Word; OF_in : in Word); pragma Precondition(N'Length = ShiftedN'Length); -- ShiftedN := N >> Count (with Overflow Output only) procedure FZ_ShiftRight_O(N : in FZ; ShiftedN : out FZ; Count : in WBit_Index; Overflow : out Word); pragma Precondition(N'Length = ShiftedN'Length); -- ShiftedN := N >> Count (no Overflow output or input) procedure FZ_ShiftRight(N : in FZ; ShiftedN : out FZ; Count : in WBit_Index); pragma Precondition(N'Length = ShiftedN'Length); -------------------------------------------------------------- -- Shift Left -------------------------------------------------------------- -- ShiftedN := N < < Count (with Overflow Input and Output) procedure FZ_ShiftLeft_O_I(N : in FZ; ShiftedN : out FZ; Count : in WBit_Index; Overflow : out Word; OF_in : in Word); pragma Precondition(N'Length = ShiftedN'Length); -- ShiftedN := N << Count (with Overflow Output only) procedure FZ_ShiftLeft_O(N : in FZ; ShiftedN : out FZ; Count : in WBit_Index; Overflow : out Word); pragma Precondition(N'Length = ShiftedN'Length); -- ShiftedN := N << Count (no Overflow output or input) procedure FZ_ShiftLeft(N : in FZ; ShiftedN : out FZ; Count : in WBit_Index); pragma Precondition(N'Length = ShiftedN'Length); end FZ_Shift;
The basic building blocks for shifts are FZ_ShiftRight_O_I and FZ_ShiftLeft_O_I. The other operators shown are simply shorthand notation for discarding the ability to take the overflow from, or place incoming shifted-in bits into, one of these operations. This convention is followed in order to avoid littering FFA and user code with declarations for unused variables and the pragma Unreferenced(…) statements to go with them.
Note that the overflow inputs and outputs are Words, rather than WBools. The reason for this will become apparent shortly.
Observe that, thus far, it is only possible to shift an FZ by a WBit_Index quantity. Recall from words.ads in Chapter 1:
-- When we must refer to individual bit positions of a machine word: subtype WBit_Index is Natural range 0 .. Bitness - 1;
This means that, in the 64-bit example code provided, left- and right- shifts from 0 to 63 binary positions are permissible.
In a later chapter we will see arbitrary (i.e. up to the full width of an FZ) constant-time shifting algorithms. However for most of FFA's functionality, these are not necessary, so there is no need to introduce them to the reader quite yet.
Let's proceed to the essential moving parts of the Right and Left subword shifts:
fz_shift.adb:
with W_Shifts; use W_Shifts; package body FZ_Shift is -- ShiftedN := N >> Count (with Overflow Input and Output) procedure FZ_ShiftRight_O_I(N : in FZ; ShiftedN : out FZ; Count : in WBit_Index; Overflow : out Word; OF_in : in Word) is Ni : Word; Carry : Word := OF_in; begin for i in reverse N'Range loop Ni := N(i); ShiftedN(i) := Shift_Right(Ni, Count) or Carry; Carry := Shift_Left(Ni, Bitness - Count); end loop; Overflow := Carry; end FZ_ShiftRight_O_I; pragma Inline_Always(FZ_ShiftRight_O_I);
Shifting a FZ integer to the right by Count binary places, is done in the following way. We walk the Words comprising the FZ from highest to lowest (hence the reverse in the looping operator.)
Inside the loop, we make a temporary copy of the current word N(i), Ni. Then we shift Ni by the requested number of binary places, Count; and bitwise-OR the current Carry into its high bits. This Carry is equal to OF_in initially (typically zero, if nothing is being shifted in). The result of the OR is saved to ShiftedN(i) -- we have obtained the current Word of the final output.
For every Word after the first one operated on, Carry consists of that previous Word's "dropped" bits, shifted left so as to end up OR-ed into the upper end of the next Word. The Carry obtained from the last Word's shifting is written to Overflow.
Observe that it is entirely legal to call FZ_ShiftRight_O_I "in-place", i.e. with the input and output locations being the same FZ.
Now we'll define an operator that wraps the above, but does not require bits being "shifted in" to be provided:
-- ShiftedN := N >> Count (with Overflow Output only) procedure FZ_ShiftRight_O(N : in FZ; ShiftedN : out FZ; Count : in WBit_Index; Overflow : out Word) is begin FZ_ShiftRight_O_I(N, ShiftedN, Count, Overflow, 0); end FZ_ShiftRight_O; pragma Inline_Always(FZ_ShiftRight_O);
And here is one that gives neither Overflow output nor demands shifted-in input:
-- ShiftedN := N >> Count (no Overflow output or input) procedure FZ_ShiftRight(N : in FZ; ShiftedN : out FZ; Count : in WBit_Index) is Overflow : Word; pragma Unreferenced(Overflow); begin FZ_ShiftRight_O_I(N, ShiftedN, Count, Overflow, 0); end FZ_ShiftRight; pragma Inline_Always(FZ_ShiftRight);
The process of shifting a FZ integer to the left by Count binary places, is exactly analogous to the right-shift, with one exception: we begin with the lowest Word in the FZ, and proceed "up" :
-- ShiftedN := N < < Count (with Overflow Input and Output) procedure FZ_ShiftLeft_O_I(N : in FZ; ShiftedN : out FZ; Count : in WBit_Index; Overflow : out Word; OF_in : in Word) is Ni : Word; Carry : Word := OF_in; begin for i in N'Range loop Ni := N(i); ShiftedN(i) := Shift_Left(Ni, Count) or Carry; Carry := Shift_Right(Ni, Bitness - Count); end loop; Overflow := Carry; end FZ_ShiftLeft_O_I; pragma Inline_Always(FZ_ShiftLeft_O_I);
It is very much a "mirror image" of the FZ_ShiftRight_O_I algorithm.
Now for the "wrappers", in exactly the same spirit as earlier:
-- ShiftedN := N < < Count (with Overflow Output only) procedure FZ_ShiftLeft_O(N : in FZ; ShiftedN : out FZ; Count : in WBit_Index; Overflow : out Word) is begin FZ_ShiftLeft_O_I(N, ShiftedN, Count, Overflow, 0); end FZ_ShiftLeft_O; pragma Inline_Always(FZ_ShiftLeft_O); -- ShiftedN := N << Count (no Overflow output or input) procedure FZ_ShiftLeft(N : in FZ; ShiftedN : out FZ; Count : in WBit_Index) is Overflow : Word; pragma Unreferenced(Overflow); begin FZ_ShiftLeft_O_I(N, ShiftedN, Count, Overflow, 0); end FZ_ShiftLeft; pragma Inline_Always(FZ_ShiftLeft); end FZ_Shift;
Just as in prior Chapters, I advise the reader to make abundant use of grid paper and pen, to properly fit the mechanics of the given algorithms into his head -- until they are as comfortably understood as a favourite armchair.
Now, let's make a demo for the routines in this Chapter:
demo_ch3.ads:
package Demo_Ch3 is procedure Demo_Shifts; end Demo_Ch3;
demo_ch3.adb:
-- From Ada: with Ada.Text_IO; use Ada.Text_IO; -- From FFA: with Words; use Words; with FZ_Type; use FZ_Type; -- FFA Ch. 3 with FZ_Shift; use FZ_Shift; -- From the Demo: with FFA_IO; use FFA_IO; package body Demo_Ch3 is procedure Demo_Shifts is X : FZ(1 .. 4) := ( 16#083e16f27091f65f#, 16#74c01a9c3ce54f80#, 16#9fd0913737c3fcbe#, 16#fa55f3f5459a9e79# ); Z : FZ(1 .. 4) := ( 0, 0, 0, 0 ); -- Overflow O : Word := 0; begin Put_Line("~~~ Ch. 3 : Shifts ~~~"); New_Line; Put_Line("X ="); Dump(X); New_Line; New_Line; -- Left Shifts: FZ_ShiftLeft_O(X, Z, 1, O); Put_Line("X < < 1 ="); Dump(Z); New_Line; Put_Line("Overflow = "); Dump(O); New_Line; FZ_ShiftLeft_O(X, Z, 13, O); Put_Line("X << 13 ="); Dump(Z); New_Line; Put_Line("Overflow = "); Dump(O); New_Line; FZ_ShiftLeft_O(X, Z, 40, O); Put_Line("X << 40 ="); Dump(Z); New_Line; Put_Line("Overflow = "); Dump(O); New_Line; -- Right Shifts: FZ_ShiftRight_O(X, Z, 1, O); Put_Line("X >> 1 ="); Dump(Z); New_Line; Put_Line("Overflow = "); Dump(O); New_Line; FZ_ShiftRight_O(X, Z, 13, O); Put_Line("X >> 13 ="); Dump(Z); New_Line; Put_Line("Overflow = "); Dump(O); New_Line; FZ_ShiftRight_O(X, Z, 40, O); Put_Line("X >> 40 ="); Dump(Z); New_Line; Put_Line("Overflow = "); Dump(O); New_Line; end Demo_Shifts; end Demo_Ch3;
ffa_demo.adb:
with Demo_Ch1; use Demo_Ch1; with Demo_Ch2; use Demo_Ch2; with Demo_Ch3; use Demo_Ch3; procedure FFA_Demo is begin -- Ch. 1 Demo_Add_Sub; -- Ch. 2 Demo_Word_Preds; Demo_FZ_Basics; Demo_FZ_Various_Ops; -- Ch. 3 Demo_Shifts; end FFA_Demo;
Finally, let's run it:
./bin/ffa_demo
And this is what you should see:
~~~ ... Output from earlier Chapters snipped ... ~~~ ~~~ ............................................ ~~~ ~~~ Ch. 3 : Shifts ~~~ X = FA55F3F5459A9E799FD0913737C3FCBE74C01A9C3CE54F80083E16F27091F65F X < < 1 = F4ABE7EA8B353CF33FA1226E6F87F97CE980353879CA9F00107C2DE4E123ECBE Overflow = 0000000000000001 X << 13 = BE7EA8B353CF33FA1226E6F87F97CE980353879CA9F00107C2DE4E123ECBE000 Overflow = 0000000000001F4A X << 40 = 9A9E799FD0913737C3FCBE74C01A9C3CE54F80083E16F27091F65F0000000000 Overflow = 000000FA55F3F545 X >> 1 = 7D2AF9FAA2CD4F3CCFE8489B9BE1FE5F3A600D4E1E72A7C0041F0B793848FB2F Overflow = 8000000000000000 X >> 13 = 0007D2AF9FAA2CD4F3CCFE8489B9BE1FE5F3A600D4E1E72A7C0041F0B793848F Overflow = B2F8000000000000 X >> 40 = 0000000000FA55F3F5459A9E799FD0913737C3FCBE74C01A9C3CE54F80083E16 Overflow = F27091F65F000000
~To be continued!~
]]>