## “Finite Field Arithmetic.” Chapter 15: Greatest Common Divisor.

This article is part of a series of hands-on tutorials introducing FFA, or the Finite Field Arithmetic library. FFA differs from the typical “Open Sores” abomination, in that — rather than trusting the author blindly with their lives — prospective users are expected to read and fully understand every single line. In exactly the same manner that you would understand and pack your own parachute. The reader will assemble and test a working FFA with his own hands, and at the same time grasp the purpose of each moving part therein.

You will need:

Add the above vpatches and seals to your V-set, and press to ffa_ch15_gcd.kv.vpatch.

You should end up with the same directory structure as previously.

As of Chapter 15, the versions of FFACalc and FFA are 254 and 254, respectively.

Now compile ffacalc:

```
cd ffacalc
gprbuild
```

But do not run it quite yet.

First, the mail bag!

Reader diana_coman has given me to know that she has read and signed Chapters 7 and 8:

Now, let’s eat the meat of this Chapter.

| Op | Description | # Ins | # Outs | Notes |
|----|-------------|-------|--------|-------|
| R* | Multiply top and second item and place only the junior half of the product on the stack. | 2 | 1 | The “Low-Multiply” from Ch. 14B. |
| MS | Square the second item modulo the top item and place the result on the stack. | 2 | 1 | Conventional modular squaring. |

R* (i.e. Right-Multiply) is simply the “Low Mul” helper routine from the previous chapter, now made available directly from FFACalc. It is arithmetically-equivalent to an ordinary * multiplication followed by dropping the senior half of the product, but gives a twofold saving of CPU time by avoiding the calculation of the senior half to begin with.
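The arithmetic equivalence can be seen in a few lines of illustrative Python (a sketch, not FFA code — `low_mul` is a hypothetical stand-in for the low-multiply routine):

```python
# The "low multiply" keeps only the junior half of the product,
# i.e. the product modulo 2^bitness. FFA computes this without
# ever forming the senior half; here we merely model the result.
def low_mul(a, b, bitness):
    return (a * b) & ((1 << bitness) - 1)

a, b, bitness = 0xDEAD, 0xBEEF, 16
full = a * b
# Junior half of the full product equals the product mod 2^bitness:
assert low_mul(a, b, bitness) == full % (1 << bitness)
```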

MS (i.e. Modular-Square) is directly analogous to M* (Modular Multiply) as seen in Chapter 13.

Both of these new FFACalc operators simply expose previously-discussed routines, and therefore do not merit further discussion.

Two typos in the comments of Chapter 14B were found, on lines 258 and 261 of:

```
  -- (1) Ns := X >> Jm
  FZ_Quiet_ShiftRight(N => X, ShiftedN => Xs, Count => Bar.J);

  -- (2) Z := X * Bm
  FZ_Multiply_Unbuffered(X => Bar.B, Y => Xs, XY => Z);
```

They have been corrected:

```
  -- (1) Xs := X >> Jm
  FZ_Quiet_ShiftRight(N => X, ShiftedN => Xs, Count => Bar.J);

  -- (2) Z := Xs * Bm
  FZ_Multiply_Unbuffered(X => Bar.B, Y => Xs, XY => Z);
```

Finally, let’s proceed to the main subject of Chapter 15: Greatest Common Divisor. We have a new FFACalc operator:

| Op | Description | # Ins | # Outs | Notes |
|----|-------------|-------|--------|-------|
| G | Find the Greatest Common Divisor of the top and second item, and place it on the stack. | 2 | 1 | GCD(0,0) is conventionally defined as 0. |

This is the Greatest Common Divisor operator, as originally defined by Euclid. Several constructive uses for it will become apparent in subsequent chapters.

There did not, to my knowledge, previously exist in the public literature a constant-spacetime algorithm for GCD. However, it is not difficult to produce one. As a starting point, we will use Stein’s Algorithm, also known as binary GCD (refer to Vol. 2 of D. Knuth’s The Art of Computer Programming, where a detailed analysis is found.)

It should be noted that there are several other classic algorithms for GCD (e.g. Lehmer’s method). Unfortunately, none of them are suitable for a constant-time implementation, as every single one of them relies on periodically examining some number of bits in the working registers and performing a variant set of operations (i.e. branching) depending on their value. This leaks, via the timing side-channel, information about the numbers being operated on — and is therefore absolutely prohibited in FFA.

Additionally, all known GCD algorithms run in quadratic time on their worst-case input. And per the definition of constant-time, all FFA algorithms must always run in the worst-case time. Hence, the simplest practical algorithm is best, as it will have the smallest constant factor in its run-time (as well as being the easiest to Fit-in-Head!) Therefore, we will stick with Stein’s binary GCD as the basis for our method.

Let’s begin by looking at a commonly-studied recursive variant of this algorithm:

Algorithm 1: Stein’s Recursive GCD.

For positive integers A and B:

1. If B > A:
return Stein(B, A)
2. If B = 0:
return A
3. If A and B are both even:
return 2 × Stein(A / 2, B / 2)
4. If only A is even:
return Stein(A / 2, B)
5. If only B is even:
return Stein(A, B / 2)
6. Else:
return Stein((A - B) / 2, B)
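The six steps above transcribe directly into Python (an illustrative sketch only — the data-dependent branching below is exactly what FFA forbids):

```python
import math

# Direct transcription of Algorithm 1: Stein's recursive GCD.
def stein(a, b):
    if b > a:
        return stein(b, a)                 # Step 1: force A >= B
    if b == 0:
        return a                           # Step 2: terminate recursion
    if a % 2 == 0 and b % 2 == 0:
        return 2 * stein(a // 2, b // 2)   # Step 3: common factor of two
    if a % 2 == 0:
        return stein(a // 2, b)            # Step 4: non-common factor in A
    if b % 2 == 0:
        return stein(a, b // 2)            # Step 5: non-common factor in B
    return stein((a - b) // 2, b)          # Step 6: odd - odd is even

# Agrees with the standard library's GCD on a small exhaustive range:
for x in range(1, 60):
    for y in range(1, 60):
        assert stein(x, y) == math.gcd(x, y)
```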

#### Chapter 15 Exercise #1:

Prove that Algorithm 1 computes GCD(A, B) in a finite number of steps.

As it stands, this algorithm is not suitable for constant-time implementation. However, it illustrates the equivalences that we will use to construct one which is.

Let’s illuminate how Stein’s Algorithm works:

Step 1 forces the relationship A ≥ B to hold at the start of each recursive call.

Step 2 terminates the recursion when B is found to equal zero, and there is nothing further to do but to return the value of A — which will be equal to the sought GCD.

Step 3 determines whether the current values of A and B have a common factor of two (i.e. both are even numbers), and accumulates this common factor for later re-introduction into the computed GCD.

Steps 4 and 5 eliminate a possible non-common factor of two in the current value of either A or B.

Finally, Step 6 makes use of the elementary fact that a difference between two odd numbers (A and B are both known to be odd at this point) is always even, in order to remove a known non-common (with B) factor of two from the quantity A - B, and then proceeds to the next level of the recursion (similarly to the well-known “kindergarten” variant of Euclid’s original GCD — where only such subtractions take place.)

The re-introduction of the shared power-of-two factor accumulated in Step 3 happens as a result of the unwinding of the recursion.

It is important to note that Steps 3, 4, and 5 are the reason why Stein’s Algorithm converges in quadratic (i.e. O(N²), where N is the total number of bits being operated on) time, rather than the O(max(A, B)) convergence time of “Euclid’s subtractive GCD”. These steps shave a single bit of length from A, B, or both whenever the respective quantities are even (i.e. have a zero for their junior-most bit.)

Recall the unsurprising Lemma 3 of Ch. 14A-Bis: if a small number is being subtracted from a much-larger one, it will take very many subtractions to shorten the quantity being subtracted from by even a single bit — you are liable to die of old age waiting for Euclid’s subtractive GCD to converge if the difference between the inputs is large.

The eventual result is that one of the quantities will equal zero, at which point Step 2 terminates the recursion. The zero always ends up in B, while the sought GCD ends up in A.

Now, let’s transform Algorithm 1 into a similar but iterative one.

Observe that a constant-time algorithm must always execute exactly the same — from an algebraic point of view — computations, regardless of the particular inputs. Therefore, the subtractive step must be performed during every iteration of the loop, and likewise the divisions-by-two must also take place during every iteration. Fortunately, the following equivalence holds for all integers A, B, and not merely when |A - B| is odd:

GCD(A, B) = GCD(|A - B|, Min(A, B))
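The identity can be spot-checked numerically (illustrative Python, not part of FFA):

```python
import math

# GCD(A, B) = GCD(|A - B|, Min(A, B)) holds for ALL non-negative
# integers, not merely when |A - B| is odd:
for a in range(0, 50):
    for b in range(0, 50):
        assert math.gcd(a, b) == math.gcd(abs(a - b), min(a, b))
```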

Therefore we can dispense with the division-by-two in Step 6, and likewise with the “shortcuts” in Steps 1-5, and get the following modified algorithm:

Algorithm 2: Iterative Quasi-Stein GCD.

For positive integers A and B:

1. Twos := 0
2. Iterate until B = 0:
3.    Ae, Be := IsEven(A), IsEven(B)
4.    A      := A >> Ae
5.    B      := B >> Be
6.    Twos   := Twos + (Ae AND Be)
7.    Bnext   := Min(A, B)
8.    Anext   := |A - B|
9.    A, B   := Anext, Bnext
10. A := A × 2^Twos
11. A contains the GCD.
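Algorithm 2, rendered in illustrative Python (still data-dependent in its loop condition, and hence not yet constant-time):

```python
import math

# Iterative quasi-Stein GCD, per Algorithm 2.
def quasi_stein_gcd(a, b):
    twos = 0
    while b != 0:                          # Step 2 (data-dependent!)
        ae, be = 1 - (a & 1), 1 - (b & 1)  # Step 3: evenness of A and B
        a >>= ae                           # Step 4
        b >>= be                           # Step 5
        twos += ae & be                    # Step 6: common power of two
        a, b = abs(a - b), min(a, b)       # Steps 7-9
    return a << twos                       # Step 10: reintroduce the twos

for x in range(0, 80):
    for y in range(0, 80):
        assert quasi_stein_gcd(x, y) == math.gcd(x, y)
```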

Algorithm 2 is not yet constant-time, but the missing ingredient should at this point be quite apparent to the astute reader.

Observe that once B = 0, the value of A at step 10 will not be affected by any number of “redundant” iterations of the loop.

This fact is the key to deriving a constant-time, i.e. always-worst-case version of Algorithm 2. Let’s write it out in a form directly suitable for implementation in FFA:

Algorithm 3: Constant-Time Iterative GCD.

For FZ integers A and B:

1. Twos := 0
2. Iterate Bitness(A) + Bitness(B) times:
3.    Ae   := 1 - (A AND 1)
4.    Be   := 1 - (B AND 1)
5.    A    := A >> Ae
6.    B    := B >> Be
7.    Twos := Twos + (Ae AND Be)
8.    D    := |A - B|, C ← Borrow
9.    B    := {C=0 → B, C=1 → A}
10.    A    := D
11. A := A << Twos
12. A contains the GCD.
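Algorithm 3 can be modelled in Python as follows. (Python itself gives no constant-time guarantee whatsoever; the sketch merely demonstrates that the fixed iteration count suffices and that the “redundant” iterations do not disturb the answer.)

```python
import math

# Model of Algorithm 3: fixed iteration count, selection via the
# borrow of |A - B| instead of a data-dependent branch.
def ct_gcd(a, b, bitness):
    twos = 0
    for _ in range(2 * bitness):      # Bitness(A) + Bitness(B) shots
        ae = 1 - (a & 1)              # evenness of A
        be = 1 - (b & 1)              # evenness of B
        a >>= ae
        b >>= be
        twos += ae & be               # common power of two
        d = abs(a - b)
        c = 1 if a < b else 0         # borrow from the subtraction
        b = a if c == 1 else b        # B := Min(A, B), muxed on borrow
        a = d                         # A := |A - B|
    return a << twos

# Inputs below fit in 6 bits, so 2 * 6 iterations always suffice:
for x in range(0, 64):
    for y in range(0, 64):
        assert ct_gcd(x, y, 6) == math.gcd(x, y)
```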

#### Chapter 15 Exercise #2:

Prove that the number of iterations given in Algorithm 3 always suffices, and that the “redundant” iterations have no effect on the final output.

#### Chapter 15 Exercise #3:

What values of A and B, for a given FZ_Bitness, actually demand the maximum given number of iterations in order to produce the correct GCD?

And now, we will write an Ada program to perform Algorithm 3:

```
-- Find Greatest Common Divisor (GCD) of X and Y.
-- Note that by convention, GCD(0, 0) = 0.
procedure FZ_Greatest_Common_Divisor(X      : in  FZ;
                                     Y      : in  FZ;
                                     Result : out FZ) is

   -- Widths of X, Y, and Result are equal
   subtype Width is Word_Index range X'Range;

   -- Working buffers for GCD computation, initially equal to the inputs
   A : FZ(Width) := X; -- GCD will appear in A in the end
   B : FZ(Width) := Y;

   -- Evenness (negation of lowest bit) of A and B respectively
   Ae, Be : WBool;

   -- Common power-of-2 factor
   Twos : Word := 0;

   -- |A - B|
   D : FZ(Width);

   -- This flag is set iff A < B
   A_lt_B : WBool;

begin

   -- For convergence, requires number of shots equal to 2 * FZ_Bitness:
   for i in 1 .. 2 * FZ_Bitness(X) loop

      -- If A is even, A := A >> 1; otherwise A := A
      Ae := 1 - FZ_OddP(A);
      FZ_ShiftRight(A, A, WBit_Index(Ae));

      -- If B is even, B := B >> 1; otherwise B := B
      Be := 1 - FZ_OddP(B);
      FZ_ShiftRight(B, B, WBit_Index(Be));

      -- If both A and B were even, increment the common power-of-two
      Twos := Twos + (Ae and Be);

      -- D := |A - B|
      FZ_Sub_Abs(X => A, Y => B, Difference => D, Underflow => A_lt_B);

      -- B' := min(A, B)
      FZ_Mux(X => B, Y => A, Result => B, Sel => A_lt_B);

      -- A' := |A - B|
      A := D;

   end loop;

   -- Reintroduce the common power-of-2 factor stored in 'Twos'
   FZ_Quiet_ShiftLeft(N => A, ShiftedN => A, Count => Indices(Twos));

   -- Output final result
   Result := A;

end FZ_Greatest_Common_Divisor;
```

Take note that we have defined GCD(0, 0) = 0. This is technically an abuse of mathematical rigour — the GCD of two zeroes is not a uniquely-determined value. However (unlike division by zero) there are no known down-sides to permitting a computation of GCD(0, 0).

I have found that this choice was made in every currently-extant arithmetron, including popular computer algebra systems (e.g. Octave and Wolfram’s.) (Reader: if you know of one which signals an error when given GCD(0, 0), please let me know which, and the author’s reasoning, if it is known.)
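For instance, the Python standard library’s GCD follows the same convention:

```python
import math

# math.gcd likewise defines GCD(0, 0) = 0 rather than signaling an error:
assert math.gcd(0, 0) == 0
# And GCD(0, N) = N, as expected:
assert math.gcd(0, 12) == 12
```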

Now we will want to test the new GCD routine.

Let’s create a test FFACalc tape for GCD by taking a famous pair of braindamaged RSA moduli from the Phuctor zoo — the stars of the SecLab incident:

```.c08b0693f9ae0854829cd88d6538756df69ff8067d1556678f7e45b17437014374174c4 aca94bf1f83640928832b398f88c935c6a08177c4cbaa8b85002fee95068bd68487f286f e3b814d92d6147b3d90fba606701f72e1f205c3e06dba55f5e180e45c2225a6ca2061d2d 638ef42609c5d8225620107519628b35983e92e0930ff2e2b8a3a0d9da57a4f50aaefe21 c0b02f8a91587f3ea2337df593f2faea40cb0d6359fee2df45b14b4e8f20988c542b81c7 862f74ea3a3761c22f6ecef64efb2014ccdcf13fb251ed3160ee20f392d0a2200db105c4 5bc12badbaa53a00a1371a77e12de455824c10dafd87f9c150f1e3fb622a8bb68134764a 77a939371bbe63ede53591d1b2bf35ff2f15776a2e1670c8c0006973782c52e97ded5ad1 e4cc96cc4bffd73061e14059aa40dbbc89d46ea1e20500a0e5ac7ac374c277e8d745dc45 449505d1c1bafecd9df8aa75096fffe4cd2f164e2a12d35000782dd73a5b58f8064ea4c0 afe2066f31d336fe65c50a9dff8e3db8a379b182e6d440cb8903fad5ed8477bde7ac2c13 1a7cd47d94630e92f98f68b86d6288607d1ef03880ca18f4176caf08869df93e93433a08 20af7e82e5eed7fd39a2480d98c34f5862dd7ceb4f8382a84acad97d1ee8da685d2e4aa5 f26167a3385f0a3412e168162916dd7ec1a864431f649e610d0ed2593d1be2abda31bb48 a66214a3f8e0ba011   .ed9917eb72fe4b283ba43bd98f163dc5331d47ddeef7319d1e339ab2ccbfa912e7a41c0 f02628c858a511578d173a0ab425dfbfe3d50d279649a0487ca1eff34ee220bc13b207f2 382d76d414ec849784dfd4dc86c5b4f1bac60976d737db018abb94e14f4c91cef8db6b6a 49ed7ece31d054281a92224ebad99c9bafd9b4931b3135e0e03ae55559512ba43725070f ff9912831d49a77c2eeefde1b557c6845166d401aedaf73dd7aefe2a6c1f5a90b7a62207 6b97f1fae8597525dcda6886f736b73990e371a5c424e802e6e9b846998a6dce0a8f4e21 97619373f965dda46ee8ee47a84cb2071321c0ec4186502fa03edf4a63437069440d1b78 889f68ededf9356b8b55df65b5dbb358ff0606eeb5d15b4a433d082a35fbcdf95a97561d e0c99b4f207f326c54cd14093f77e2063c782a14a6dfc7e45800cd87e800d2d875995ff0 1d3540292725283edf6abc78c4f5aba7422b563071e2cfb22e0992ccbddf8cf966cf6eee a8ea1561775cc17d88ca73a2cca4bc4151d380987bac526e395b5d01f984c49b5b91cd07 ce437ef9bb5d7a35fb099032f8bc2afcbc8bef0067288337829f1e568717f99d2c0f13a2 3732f711e20defd6f85533c6ac2934b946a256e8472b3a4b24cb30ff2d2c5959846425ce 
e81ef638b4f054850057437bf2eb7bca34e9671253789c9ad24fae937e65a7c4850cec2b c3114cb7a68a78601   G   .f59cc31339d001d37570dc0ccd986f3f5ea737fa9185c15dbc17e6bfef29435c79a7c22 e8616738947cab8711b6a6e7b5704e5283b57892adad3b170c726f34d3a9859b1504e005 ee4b69d4803cd56773c50ab01d6546ce66dcdb2be4a34e15160d8e0eb69184b699246b42 28f6f25bfcc91970fa99ea3123409f6865b161423581a5f9522ef774f09818bfef6c2b1c 51d06218a07dc717ec94bb231b062936bfd8794cb39bdf8dc05cd2c8bd74b1d0acb14d39 bc293deb45fa52de89af30e4bc5688fb8116be7e7ad4332810c04903939ee2a356ea254b 83fb811c76898672d24997d8647f969a8e02aa2f2016cb1e0c8a9afe99760cd37bf2794d 4ea58951f   ={[OK ]}{[SAD ]}_```

Now, run the tape as follows:

`$ cat seclab.tape | ./bin/ffa_calc 4096 2`

… and if successful, the output will be:

`OK`

A serious reader will also want to test any iron being considered for use with FFA, as previously described in Chapter 14B.

The following set of canonical GCD litmus tapes is available for download:

Verify the signature and unpack the archive. Inside, you will find five FFACalc tapes:

• 10k_null_gcd_8192bit.tape
Ten thousand 8192-bit GCD tests with only null inputs.
• 10k_small_gcd_8192bit.1.tape
Ten thousand 8192-bit GCD tests with randomly-generated inputs having a usually-small common factor.
• 10k_small_gcd_8192bit.2.tape
Similar to the above.
• 10k_large_gcd_8192bit.1.tape
Ten thousand 8192-bit GCD tests with randomly-generated inputs having a usually-large common factor.
• 10k_large_gcd_8192bit.2.tape
Similar to the above.

… and a simple shell script, gcd_litmus.sh.

Place all of these items into your ffa/ffacalc directory and execute the litmus script.

After less than half an hour (on reasonable iron), you will have a set of timer outputs (and no error outputs, unless your machine is catastrophically broken.)

Just as previously, with the tapes of Chapter 14B, if you discover that there is any substantial and persistent pattern of difference in the runtimes of the tapes, you have defective iron!

Please leave a comment here if you turn up a machine which fails this litmus! It is, for instance, conceivable that some piece of rubbish masquerading as a CPU performs a shift-by-zero in a different period of time than a shift-by-one; and I was recently quite certain that I had found one such CPU — but it turned out to be a false alarm.

And that is all, as far as GCD is concerned, for now.

~To be continued!~


## A Solid-State HDD for Symbolics “MacIvory” Lisp Machines.

This post concerns the “MacIvory” Model 3 Lisp Machine. It is of interest strictly to bolixologists.

This is a recipe for a working replacement of an ancient SCSI HDD, such as found in the MacIvory, with an inexpensive solid-state disk.

You will need:

1. Download the softs, and verify the signature and hashes in MANIFEST.TXT.

2. Decompress the disk image, and copy it to a 16GB or larger SD card, via unix dd command, e.g.:

`dd if=MacIvory_Virginal_9GB.img of=/dev/sdb`

3. Insert the SD card into the SCSI2SD device, connect the latter to a PC via the USB jack, and run the configurator. Feed it the supplied config, and trigger the upload to the device.

Edit: if your MacIvory’s SCSI cable does not have an end terminator installed, you will need to re-enable the built-in termination in the SCSI2SD config.

4. Remove the original HDD assembly from the MacIvory:

5. Remove the HDD from the plate.

6. Affix the SCSI2SD device to the new adapter, and then to the original steel plate, using the standoffs and nuts:

Do not over-tension the nuts. (Use a tension wrench, if you have one, otherwise simply “know measure”.)

7. Affix the plate, now containing the SCSI2SD device, into the original nest, and connect the cabling, via the 50-to-68-pin adapter (for SCSI) and the Molex-to-floppy (for +5v):

8. Boot!

You can now back up and otherwise manipulate the contents of the MacIvory HDD by connecting the SCSI2SD device’s USB jack to a Linux PC, or by simply removing the SD card.

~Fin~

## Symbolics “MacIvory”: PCB Photographs.

This post concerns the “MacIvory” Model 3 Lisp Machine. It is of interest strictly to bolixologists.

Click on a photo to see detailed version.

Machine chassis:

The Ivory NuBus Board Set (i.e. the LispM itself, the Mac Quadra is otherwise ordinary):

“Ivory” NuBus board, component-side:

## What’s under the labels?

| Bolix Label | Part | Notes |
|-------------|------|-------|
| 115999-B | PLUS20L8 | PAL, 14 inputs, 8 outputs, no registers |
| 116000-B | PAL22V10-10P | PAL, 12 inputs, 10 macrocells, can have registers |
| 116001-B | PAL22V10-10P | PAL, 12 inputs, 10 macrocells, can have registers |
| 116004-A | GAL16V8 | GAL, can have registers |
| 116007-A | PAL16R8-7PC | PAL, 8 inputs, 8 registers |
| 116011-A | GAL16V8 | GAL, can have registers |
| S/N 30328 | CY7C261-55PC | 8K x 8 PROM, probably unit serial number |
| SHF-ARR4.1/116025 | Actel A1010A-1 | Antifuse-programmed CPLD, 1200 gates (equiv. of 12 period PALs) |

## What are the vertical DIP ICs???

“Ivory” NuBus board, bottom-side:

“Ivory” RAM daughterboard, component-side:

## What are the vertical DIP ICs???

“Ivory” RAM daughterboard, bottom-side:

Original 2400dpi scans of the boards (pre-photostitching) are available here:
1, 2, 3, 4, 5, 6 (WARNING: 200-300MB each!)

~To Be Continued~

## “Finite Field Arithmetic” vs MPI.

Let’s compare the CPU cost of modular exponentiation performed on Chapter 14 FFA vs ye olde MPI.

V-press the MPI tree to mpi_second_cut.vpatch (or use diana_coman’s cleaned-up variant, this should not affect the result of the test.)

Now, replace the test_mpi.c example I provided, with the following MPIistic implementation of the Ch.14 example test tape:

```c
#include "mpi.h"
#include <stdlib.h>
#include <stdio.h>

int main(int ac, char **av)
{
    MPI base, exp, mod, res, known;
    int r, cmp;

    r = secmem_init(1000);
    if (r == 0) {
        fprintf(stderr, "eggog\n");
        exit(1);
    }

    base  = mpi_alloc_secure(0);
    exp   = mpi_alloc_secure(0);
    mod   = mpi_alloc_secure(0);
    res   = mpi_alloc_secure(0);
    known = mpi_alloc_secure(0);

    mpi_fromstr(base,
        "0x09C98FCB9C40EC96BB9ECE3FD053D8468614D2C262A6ED7ACC613968CAF8A58FA725"
        "E113CCB16A000BF82170353B71789E23DD4620966EE23191C97F0290606CB7AEF750CF"
        "9EB6716BB474BA28DD3A615D3FC3D19812ABB2735676C7EBD505497A3080BF349F2833"
        "5439FC0567C150006CE796D9F4307B0A5C0617619C0F4C29CB496D164F09E363A8D535"
        "149B3485D4D0F0C8F2395CBC8E067910CDD9A0228AE6E2E37CA22F75EE91944C231899"
        "A1B340DB9968C6AB1EC7DBA911B0AF3CD3AD1AE83573A51A7DAC53E8FFCBF9D985BE9C"
        "C003409ADB8E068CBE71243C4A4EB678D169489FD0818B91C35F9A073DD2BC06173E13"
        "578481C1E98692B79C0A1CBB");

    mpi_fromstr(exp,
        "0x513C8694309772EA44FEA976BE4DD5674F0A798D20385E1F18D71ACA6FD1680D9A09"
        "E9F075BEB7FE8A30911106618A4BC961CC0442E9BB8939D0CB249FEA16ECECB6762ADE"
        "FD1CD660842CDA6CF7EC2EF4C523E61E8F119A1357AF40DAE7E1ECE49F27C23728CD51"
        "86D197802AD03098ACFD8E058BD7F8340FFFB5E8ACB807D7B96D223FB0EC6D6E68C01C"
        "6293BC94B5BD370888C192F9C62C617B1598B8F19914C5F98DBB5B9D06936B5E97BDDE"
        "87FEB0192C9EB0A32D9F7B066D134FB5E9215FF2440340DC33568F5F4441A25F040D81"
        "CF584923A7DF28135F5F30282D344278E60D8DBCB0D36C8641057265406E31896C6BC1"
        "5DB82F0BDAFC5F456C9EB35F");

    mpi_fromstr(mod,
        "0x1F8FF0E6C3FF3A40C18BBAB99D9DB19A6413C734507638D6F44B90848D7B4FE2E87E"
        "4D830B14F03E7FD87F47CCD4A715C51839952C5DB5B3F04E4C9633964881B761E763D7"
        "45B5DF2F8649418CB4B3DCA692535B862FC2C62F11DCEE2AB4191B25B35A6DEDC58547"
        "6DACF952A2F580C061F9DCD6C0CE4F08BF401379EC47CED1DE5A3C87EC98CE4DFFC70E"
        "B65A3C6700C5145107D3DD305BA1E0E0F6A957566A0452BE16E39CF4A17185644FBA8E"
        "D1CA79A3D6E28D29D936F2F953913EFBB94B7AF0CAA81CF3F5EF5C86E2F115424FEEEB"
        "D1DAD7FB371E27B2D727381F441ED5377E3DAB15BCF786E88582D6533AF673FA86047E"
        "BBC1667E6A12DD7102359052");

    mpi_fromstr(known,
        "0x0FCB1DE435944F516B3F05D990C761985348076167AFCF56FEA05BE1DBA81286BA21"
        "39B0D10232577B79A4B9B872244E966EDC256678D1132B206D357D90F1F601B663DFD6"
        "2FE78449A0D1E4B7E7FC55C2B978282D37B70A6D8715451BC114282B3E9555E23612C2"
        "974974FA52ADDC6F2B030F0E98E6C5C6747C23A58FAE4C8730A37426B54EC604EA1EF1"
        "6F24B1FE8B190A993FC28A95960987D1768AC731DEC4E6B334C75B27589F0E99DAC4A7"
        "AC5CA9E7A014C2E0F05A72A18145A9B8958D0F26E179D3854E3C8CD0C3DD8E11DA318F"
        "9BE873175B1168580CEA42593FEC2795FDBDBC349D10B76118E646AA8EE5C2295452C2"
        "E78DC26528F7C4A4D2F90E21");

    mpi_powm(res, base, exp, mod);

    mpi_free(base);
    mpi_free(exp);
    mpi_free(mod);

    cmp = mpi_cmp(res, known);

    if (cmp == 0) {
        fprintf(stdout, "OK\n");
    } else {
        fprintf(stdout, "SAD\n");
    }

    mpi_free(res);
    mpi_free(known);

    fprintf(stdout, "\n");

    secmem_term();

    return 0;
}
```

Build it, and fire the test:

`$ time ./koch`

… on my test iron, will produce the output:

```
OK

real 0m0.639s
user 0m0.636s
sys  0m0.001s
```

Now, let’s feed the equivalent problem to Ch. 14 FFA:

```.09C98FCB9C40EC96BB9ECE3FD053D8468614D2C262A6ED7ACC613968CAF8A58F A725E113CCB16A000BF82170353B71789E23DD4620966EE23191C97F0290606CB 7AEF750CF9EB6716BB474BA28DD3A615D3FC3D19812ABB2735676C7EBD505497A 3080BF349F28335439FC0567C150006CE796D9F4307B0A5C0617619C0F4C29CB4 96D164F09E363A8D535149B3485D4D0F0C8F2395CBC8E067910CDD9A0228AE6E2 E37CA22F75EE91944C231899A1B340DB9968C6AB1EC7DBA911B0AF3CD3AD1AE83 573A51A7DAC53E8FFCBF9D985BE9CC003409ADB8E068CBE71243C4A4EB678D169 489FD0818B91C35F9A073DD2BC06173E13578481C1E98692B79C0A1CBB   .513C8694309772EA44FEA976BE4DD5674F0A798D20385E1F18D71ACA6FD1680D 9A09E9F075BEB7FE8A30911106618A4BC961CC0442E9BB8939D0CB249FEA16ECE CB6762ADEFD1CD660842CDA6CF7EC2EF4C523E61E8F119A1357AF40DAE7E1ECE4 9F27C23728CD5186D197802AD03098ACFD8E058BD7F8340FFFB5E8ACB807D7B96 D223FB0EC6D6E68C01C6293BC94B5BD370888C192F9C62C617B1598B8F19914C5 F98DBB5B9D06936B5E97BDDE87FEB0192C9EB0A32D9F7B066D134FB5E9215FF24 40340DC33568F5F4441A25F040D81CF584923A7DF28135F5F30282D344278E60D 8DBCB0D36C8641057265406E31896C6BC15DB82F0BDAFC5F456C9EB35F   .1F8FF0E6C3FF3A40C18BBAB99D9DB19A6413C734507638D6F44B90848D7B4FE2 E87E4D830B14F03E7FD87F47CCD4A715C51839952C5DB5B3F04E4C9633964881B 761E763D745B5DF2F8649418CB4B3DCA692535B862FC2C62F11DCEE2AB4191B25 B35A6DEDC585476DACF952A2F580C061F9DCD6C0CE4F08BF401379EC47CED1DE5 A3C87EC98CE4DFFC70EB65A3C6700C5145107D3DD305BA1E0E0F6A957566A0452 BE16E39CF4A17185644FBA8ED1CA79A3D6E28D29D936F2F953913EFBB94B7AF0C AA81CF3F5EF5C86E2F115424FEEEBD1DAD7FB371E27B2D727381F441ED5377E3D AB15BCF786E88582D6533AF673FA86047EBBC1667E6A12DD7102359052   MX   .0FCB1DE435944F516B3F05D990C761985348076167AFCF56FEA05BE1DBA81286 BA2139B0D10232577B79A4B9B872244E966EDC256678D1132B206D357D90F1F60 1B663DFD62FE78449A0D1E4B7E7FC55C2B978282D37B70A6D8715451BC114282B 3E9555E23612C2974974FA52ADDC6F2B030F0E98E6C5C6747C23A58FAE4C8730A 37426B54EC604EA1EF16F24B1FE8B190A993FC28A95960987D1768AC731DEC4E6 B334C75B27589F0E99DAC4A7AC5CA9E7A014C2E0F05A72A18145A9B8958D0F26E 
179D3854E3C8CD0C3DD8E11DA318F9BE873175B1168580CEA42593FEC2795FDBD BC349D10B76118E646AA8EE5C2295452C2E78DC26528F7C4A4D2F90E21   ={(do nothing if ok)}{[SAD ]}_```

`$ time cat 2048.tape | ./bin/ffa_calc 2048 32`

… and we get, on same test iron:

```
real 0m0.280s
user 0m0.277s
sys  0m0.002s
```

Turns out, Koch’s pile of shit, despite eschewing constant-time arithmetic, and being implemented in Overflowandcrashlang, loses the footrace when given a full-width modular exponentiation (i.e. one where it cannot cheat by skipping over leading zeroes.)

## “Finite Field Arithmetic.” Chapter 14B: Barrett’s Modular Reduction. (Part 2 of 2.)

This article is part of a series of hands-on tutorials introducing FFA, or the Finite Field Arithmetic library. FFA differs from the typical “Open Sores” abomination, in that — rather than trusting the author blindly with their lives — prospective users are expected to read and fully understand every single line. In exactly the same manner that you would understand and pack your own parachute. The reader will assemble and test a working FFA with his own hands, and at the same time grasp the purpose of each moving part therein.

You will need:

Add the above vpatches and seals to your V-set, and press to ffa_ch14_barrett.kv.vpatch.

You should end up with the same directory structure as previously.

Now compile ffacalc:

```
cd ffacalc
gprbuild
```

But do not run it quite yet.

First, the mail bag!

Reader diana_coman recently observed that the OF_in parameter taken by certain procedures in FZ_Shift is not checked for validity, and — if abused by being given an oversized (i.e. wider than the given shiftness) value — could result in a garbage output.

Her observation is entirely correct. And ideally, the range of OF_in would be constrained via a precondition. Unfortunately, Ada does not permit the use of preconditions in combination with inlining; and FZ_Shift routines are invoked in several costly inner loops, and absolutely must be subject to inlining. Therefore, it is impractical to actually verify the bit-width of OF_in on every invocation. It is, however, the case that these procedures are defined strictly for internal use in FFA, and hence do not constitute a danger to the operator. After giving the matter some thought, I took diana_coman’s suggestion and added comments to warn the reader of the potential rake he could step on if he were to insist on making direct use of FFA’s internal shift routines.

I will take this opportunity to remind the reader that FFA is designed to be “safe if used as prescribed”: if it is invoked via the provided external interface, the promised semantics are guaranteed to apply. The only prohibited operations are ones which over- or under-run the FFACalc stack, demand a division by zero, or attempt to violate other FFACalc rules. (These will bring the program to an orderly stop, and warn the operator.) All other actions will produce arithmetically-correct outputs for the given inputs. However it is impractical on extant iron to make this guarantee for each of the internal components taken separately!

This is why we want sane iron, with inexpensive bounds-checking instructions! But we do not have it yet. Hence, the reader who wishes to make use of FFA internals for some custom purpose of his own, must proceed with extreme caution.

Reader diana_coman also observed that Get_Argument in FFACalc’s command line handler:

``` procedure Get_Argument(Number : in Natural; Result : out String);```

… can be turned into the stricter type:

``` procedure Get_Argument(Number : in Natural; Result : out CmdLineArg);```

Thank you for these nitpicks, diana_coman! And for reading and signing Chapters 3, 4, 5, and 6:

Reader mircea_popescu observed that Chapter 13’s FZ_Measure can be slightly simplified, where:

`Index := W_Mux(Index + 1, Index, W_ZeroP(W));`

… can be safely turned into the equivalent:

`Index := Index + W_NZeroP(W);`

I have included the change in this Chapter. Thank you, reader mircea_popescu!
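The equivalence of the two forms can be spot-checked on a word-level model (illustrative Python; `w_mux`, `w_zerop`, and `w_nzerop` are hypothetical stand-ins for FFA's W_Mux, W_ZeroP, and W_NZeroP):

```python
# Word-level spot check of the FZ_Measure simplification:
#   W_Mux(Index + 1, Index, W_ZeroP(W))  ==  Index + W_NZeroP(W)
def w_zerop(w):
    return 1 if w == 0 else 0

def w_nzerop(w):
    return 1 if w != 0 else 0

def w_mux(x, y, sel):
    # Returns x when sel = 0, y when sel = 1 (constant-time in FFA).
    return y if sel else x

for w in (0, 1, 0xFF, 2**64 - 1):
    for index in range(5):
        assert w_mux(index + 1, index, w_zerop(w)) == index + w_nzerop(w)
```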

Now, let’s eat the meat of this Chapter.

We’ll start with a very minor extension of FFACalc. A Version command has been introduced:

| Op | Description | # Ins | # Outs | Notes |
|----|-------------|-------|--------|-------|
| V | Put the FFACalc and FFA version numbers on the stack. | 0 | 2 | Kelvin Versioning is in use. |

The implementation of this command is quite straightforward:

```
-- Put the FFACalc Program Version on the stack,
-- followed by FFA Program Version.
when 'V' =>
   Push;
   Push;
   -- FFACalc Version:
   FFA_FZ_Clear(Stack(SP - 1));
   FFA_FZ_Set_Head(Stack(SP - 1), Word(FFACalc_K_Version));
   -- FFA Version:
   FFA_FZ_Clear(Stack(SP));
   FFA_FZ_Set_Head(Stack(SP), Word(FFA_K_Version));
```

```
package Version is

   pragma Pure;

   ----------------------------------------------
   -- Current 'deg. Kelvin' Version of FFACalc --
   ----------------------------------------------
   FFACalc_K_Version : constant Natural := 255;
   ----------------------------------------------

end Version;
```

```
-- ...

----------------------------------------------------------------------------
--- Current 'deg. Kelvin' Version of FFA
----------------------------------------------------------------------------

FFA_K_Version : constant Natural := 255;

-- ...
```

The effect: FFACalc and FFA now have independent “Degrees Kelvin” versions — i.e. they are to decrement by one upon every published revision to each respective program. Observe that this constitutes a promise to the reader: no more than 255 changes to either FFACalc or FFA are to be published after this Chapter. In the quite unlikely event where a change is found to be required after a Kelvin version reaches zero degrees, it is expected that the program is to be renamed, Vtronically-reground, and some very pointed questions posed to the maintainer!

Now, let’s proceed to the originally-planned subject of Chapter 14B: the Ada implementation of Barrett’s Modular Reduction.

Don’t even think about proceeding further into this Chapter if you have not fully read and understood the two previous chapters:

Stop now and go back, study! Lest you become a danger to yourself and others.

We will now discuss the Ada implementation of the Algorithm 2 given in Chapter 14A. Please print the Algorithm and the physical bounds proof and refer to these while reading this Chapter.

```
package FZ_Barr is

   pragma Pure;

   -- Precomputed data for Barrett's Modular Reduction
   type Barretoid(ZXMLength       : Indices;
                  BarretoidLength : Indices) is
      record
         ZXM        : FZ(1 .. ZXMLength);       -- Zero-Extended Modulus
         J          : FZBit_Index;              -- Jm
         B          : FZ(1 .. BarretoidLength); -- The Barrettoid itself
         ZSlide     : FZBit_Index;              -- Amount to slide Z
         Degenerate : WBool;                    -- Is it degenerate case?
      end record;

   -- Prepare the precomputed Barrettoid corresponding to a given Modulus
   procedure FZ_Make_Barrettoid(Modulus : in  FZ;
                                Result  : out Barretoid)
     with Pre => Result.B'Length = 2 * Modulus'Length and
                 Result.ZXM'Length = Modulus'Length + 1;

   -- Reduce N using the given precomputed Barrettoid.
   procedure FZ_Barrett_Reduce(X        : in     FZ;
                               Bar      : in     Barretoid;
                               XReduced : in out FZ);
   pragma Inline_Always(FZ_Barrett_Reduce);

end FZ_Barr;
```

In every instance of the pre-computed Barrettoid data structure, we will keep everything which is required for Barrett’s Modular Reduction by a given modulus. In particular, we will retain ZXM: a zero-extended (for convenient use in steps 6 and 8) copy of the modulus itself; the parameter Jm; the Barrettoid proper, Bm; ZSlide, the number of bits we must right-shift Z by to compute Zs; and, finally, the degeneracy indicator, i.e. Dm.

Unsurprisingly, a Barrettoid is computed from a given modulus with the procedure FZ_Make_Barrettoid. All Barrettoids — like other FFA data — will exist as stack-allocations. (Heapism in any form whatsoever is forever banned in FFA.) And the only use of a Barrettoid is to compute Barrett’s Modular Reduction, using FZ_Barrett_Reduce. We will review both procedures in detail, below.

Here is how we create a Barrettoid:

```
-- Prepare the precomputed Barrettoid corresponding to a given Modulus
procedure FZ_Make_Barrettoid(Modulus : in  FZ;
                             Result  : out Barretoid) is

   -- Length of Modulus and Remainder
   Lm : constant Indices := Modulus'Length;

   -- Remainder register, starts as zero
   Remainder : FZ(1 .. Lm) := (others => 0);

   -- Length of Quotient, with an extra Word for top bit (if Degenerate)
   Lq : constant Indices := (2 * Lm) + 1;

   -- Valid indices into Quotient, using the above
   subtype Quotient_Index is Word_Index range 1 .. Lq;

   -- The Quotient we need, i.e. 2^(2 * ModulusBitness) / Modulus
   Quotient : FZ(Quotient_Index);

   -- Permissible 'cuts' for the Slice operation
   subtype Divisor_Cuts is Word_Index range 2 .. Lm;

   -- Current bit of Pseudo-Dividend; high bit is 1, all others 0
   Pb : WBool := 1;

   -- Performs Restoring Division on a given segment
   procedure Slice(Index : Quotient_Index;
                   Cut   : Divisor_Cuts;
                   Bits  : Positive) is
   begin
      declare

         -- Borrow, from comparator
         C : WBool;

         -- Left-Shift Overflow
         LsO : WBool;

         -- Current cut of Remainder register
         Rs : FZ renames Remainder(1 .. Cut);

         -- Current cut of Divisor
         Ds : FZ renames Modulus(1 .. Cut);

         -- Current Word of Quotient being made, starting from the highest
         W : Word := 0;

         -- Current bit of Quotient (inverted)
         nQb : WBool;

      begin

         -- For each bit in the current Pseudo-Dividend Word:
         for b in 1 .. Bits loop

            -- Advance Rs, shifting in the current Pseudo-Dividend bit:
            FZ_ShiftLeft_O_I(N        => Rs,
                             ShiftedN => Rs,
                             Count    => 1,
                             OF_In    => Pb, -- Current Pseudo-Dividend bit
                             Overflow => LsO);

            -- Subtract Divisor-Cut from R-Cut; Underflow goes into C:
            FZ_Sub(X => Rs, Y => Ds, Difference => Rs, Underflow => C);

            -- Negation of current Quotient bit
            nQb := C and W_Not(LsO);

            -- If C=1, the subtraction underflowed, and we must undo it:
            FZ_Add_Gated(X => Rs, Y => Ds, Sum => Rs, Gate => nQb);

            -- Save the bit of Quotient that we have found:
            W := Shift_Left(W, 1) or (1 - nQb); -- i.e. inverse of nQb

         end loop;

         -- We made a complete Word of the Quotient; save it:
         Quotient(Quotient'Last + 1 - Index) := W; -- Indexed from end

      end;
   end Slice;
   pragma Inline_Always(Slice);

   -- Measure of the Modulus
   ModulusMeasure : constant FZBit_Index := FZ_Measure(Modulus);

begin

   -- First, process the high Word of the Pseudo-Dividend:
   Slice(1, 2, 1); -- ... it has just one bit: a 1 at the bottom position

   -- Once we ate the top 1 bit of Pseudo-Dividend, rest of its bits are 0
   Pb := 0;

   -- Process the Modulus-sized segment below the top Word:
   for i in 2 .. Lm - 1 loop

      Slice(i, i + 1, Bitness); -- stay ahead by a word to handle carry

   end loop;

   -- Process remaining Words:
   for i in Lm .. Lq loop

      Slice(i, Lm, Bitness);

   end loop;

   -- Record the Quotient (i.e. the Barrettoid proper, Bm)
   Result.B := Quotient(Result.B'Range);

   -- Record whether we have the Degenerate Case (1 iff Modulus = 1)
   Result.Degenerate := W_NZeroP(Quotient(Lq));

   -- Record a copy of the Modulus, extended with zero Word:
   Result.ZXM(Modulus'Range)   := Modulus;
   Result.ZXM(Result.ZXM'Last) := 0;

   -- Record the parameter Jm:
   Result.J := ModulusMeasure - 1;

   -- Wm - Jm
   Result.ZSlide := FZBit_Index(Bitness * Modulus'Length)
                    - ModulusMeasure + 1;

end FZ_Make_Barrettoid;
```

The process may seem complicated, but it is merely a specialized form of FZ_Mod, with the difference that we are interested in the quotient rather than the remainder, and also wish to compute certain additional parameters corresponding to the given modulus.

Recall that the Barrettoid B_M for a given modulus M was defined as the quantity ⌊2^k / M⌋ (with k = 2 × W_M). In FZ_Make_Barrettoid, we compute it via Knuth’s division. Afterwards we record: the modulus itself; the quotient; whether the given modulus corresponds to the degenerate case M = 1; the parameter J_M; and the parameter ZSlide. After this, the contents of the Barrettoid can be used to perform modular reduction modulo M in constant time.
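The recorded quantities are easy to model outside of Ada. Below is a small Python sketch (hypothetical names; Python's unbounded integers standing in for FZ registers) which computes the same four values that FZ_Make_Barrettoid records, using an ordinary division in place of the constant-time restoring divider shown above:

```python
def measure(n):
    """Index of the seniormost nonzero bit; 0 for n = 0."""
    return n.bit_length()

def make_barrettoid(modulus, wm):
    """Model of FZ_Make_Barrettoid for a wm-bit modulus M:
    Bm = floor(2^(2*wm) / M), the Degenerate flag (1 iff M = 1),
    Jm = Measure(M) - 1, and ZSlide = wm - Jm."""
    assert 0 < modulus < 2**wm
    quotient = 2**(2 * wm) // modulus             # the one-time division
    jm = measure(modulus) - 1
    return {
        'B': quotient % 2**(2 * wm),              # low 2*wm bits of quotient
        'Degenerate': int(quotient >> (2 * wm)),  # the extra top bit
        'J': jm,
        'ZSlide': wm - jm,                        # Wm - Jm
    }

bar = make_barrettoid(11, 8)                      # M = 11 in an 8-bit register
assert bar['B'] == 2**16 // 11 and bar['J'] == 3 and bar['ZSlide'] == 5
```

(In FFA itself the division is performed bit-by-bit, in constant time, via the Slice mechanism; this sketch models only the arithmetic, not the timing discipline.)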

And now, let’s show exactly how:

```
-- Reduce X using the given precomputed Barrettoid.
procedure FZ_Barrett_Reduce(X        : in    FZ;
                            Bar      : in    Barretoid;
                            XReduced : in out FZ) is

   -- Wordness of X, the quantity being reduced
   Xl : constant Indices := X'Length;

   -- Wordness of XReduced (result), one-half of Xl, and same as of Modulus
   Ml : constant Indices := XReduced'Length; -- i.e. # of Words in Wm.

   -- The Modulus we will reduce X by
   Modulus : FZ renames Bar.ZXM(1 .. Ml);

   -- Shifted X
   Xs : FZ(X'Range);

   -- Z := Xs * Bm (has twice the length of X)
   Z : FZ(1 .. 2 * Xl);

   -- Upper 3Wm-bit segment of Z that gets shifted to form Zs
   ZHi : FZ renames Z(Ml + 1 .. Z'Last);

   -- Middle 2Wm-bit segment of Z, that gets multiplied by M to form Q
   Zs : FZ renames Z(Z'First + Ml .. Z'Last - Ml);

   -- Sub-terms of the Zs * M multiplication:
   ZsLo  : FZ renames Zs(Zs'First .. Zs'Last - Ml);
   ZsHi  : FZ renames Zs(Zs'First + Ml .. Zs'Last);
   ZsHiM : FZ(1 .. Ml);

   -- Q := Modulus * Zs, i.e. floor(X / M)*M + E*M
   Q   : FZ(1 .. Xl);
   QHi : FZ renames Q(Q'First + Ml .. Q'Last);

   -- R is made one Word longer than Modulus (see proof re: why)
   Rl : constant Indices := Ml + 1;

   -- Reduction estimate, overshot by 0, 1, or 2 multiple of Modulus
   R : FZ(1 .. Rl);

   -- Scratch for the outputs of the gated subtractions
   S : FZ(1 .. Rl);

   -- Borrow from the gated subtractions
   C : WBool;

   -- Barring cosmic ray, no underflow can take place in (4) and (5)
   NoCarry : WZeroOrDie := 0;

begin
   -- Result is initially zero (and will stay zero if Modulus = 1)
   FZ_Clear(XReduced);

   -- (1) Xs := X >> Jm
   FZ_Quiet_ShiftRight(N => X, ShiftedN => Xs, Count => Bar.J);

   -- (2) Z := Xs * Bm
   FZ_Multiply_Unbuffered(X => Bar.B, Y => Xs, XY => Z);

   -- (3) Zs := Z >> 2Wm - Jm (already thrown lower Wm, so only Wm - Jm now)
   FZ_Quiet_ShiftRight(N => ZHi, ShiftedN => ZHi, Count => Bar.ZSlide);

   -- (4) Q := Zs * M (this is split into three operations, see below)

   -- ... first, Q := ZsLo * M,
   FZ_Multiply_Unbuffered(ZsLo, Modulus, Q);

   -- ... then, compute ZsHiM := ZsHi * M,
   FZ_Low_Multiply_Unbuffered(ZsHi, Modulus, ZsHiM);

   -- ... finally, add ZsHiM to upper half of Q.
   FZ_Add_D(X => QHi, Y => ZsHiM, Overflow => NoCarry);

   -- (5) R := X - Q (we only need Rl-sized segments of X and Q here)
   FZ_Sub(X          => X(1 .. Rl),
          Y          => Q(1 .. Rl),
          Difference => R,
          Underflow  => NoCarry);

   -- (6) S1 := R - M, C1 := Borrow (1st gated subtraction of Modulus)
   FZ_Sub(X => R, Y => Bar.ZXM, Difference => S, Underflow => C);

   -- (7) R := {C1=0 -> S1, C1=1 -> R}
   FZ_Mux(X => S, Y => R, Result => R, Sel => C);

   -- (8) S2 := R - M, C2 := Borrow (2nd gated subtraction of Modulus)
   FZ_Sub(X => R, Y => Bar.ZXM, Difference => S, Underflow => C);

   -- (9) R := {C2=0 -> S2, C2=1 -> R}
   FZ_Mux(X => S, Y => R, Result => R, Sel => C);

   -- (10) RFinal := {DM=0 -> R, DM=1 -> 0} (handle the degenerate case)
   FZ_Mux(X      => R(1 .. Ml),
          Y      => XReduced,
          Result => XReduced,
          Sel    => Bar.Degenerate);
   -- If Modulus = 1, then XReduced is 0.

end FZ_Barrett_Reduce;
```

Notice anything unfamiliar? The astute reader will observe that the above is an exact implementation of the process described in Chapter 14A-Bis; the only new subcomponent is the FZ_Low_Multiply_Unbuffered routine used in Step 4. So let’s learn what it’s made of.

But first, review the elementary multiplication equivalence from Chapter 10:

 XY = X_Lo × Y_Lo + 2^h × (X_Lo × Y_Hi) + 2^h × (X_Hi × Y_Lo) + 2^(2h) × (X_Hi × Y_Hi)

… where h is the bitness of a half-operand.

Suppose, however, that we were only interested in calculating the bottom half of XY. We can then write the following schematic instead:

 XY_Lo = ( X_Lo × Y_Lo + 2^h × (X_Lo × Y_Hi)_Lo + 2^h × (X_Hi × Y_Lo)_Lo ) mod 2^(2h)    (h = the bitness of a half-operand)
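This low-half identity is easy to confirm mechanically; here is a quick Python check (with h being the bitness of a half-operand):

```python
import random

def low_half(x, y, h):
    """Junior half of x*y, from X_Lo*Y_Lo plus the LOW halves of the
    two cross terms, each weighted by 2^h."""
    xlo, xhi = x % 2**h, x >> h
    ylo, yhi = y % 2**h, y >> h
    return (xlo * ylo
            + ((xlo * yhi) % 2**h << h)
            + ((xhi * ylo) % 2**h << h)) % 2**(2 * h)

for _ in range(1000):
    x, y = random.getrandbits(64), random.getrandbits(64)
    assert low_half(x, y, 32) == (x * y) % 2**64
```

The parts thrown away here, i.e. the senior halves of the two cross terms and the whole of X_Hi × Y_Hi, contribute only to bit positions at or above 2h, which is exactly the half we have agreed not to compute.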

Observe that this method is exactly analogous to the mechanism used in Chapter 9’s Mul_Word, where we find the lower half of a Word × Word multiplication by:

```
   -- ........

   -- XL * YL
   LL : constant Word := Mul_HalfWord_Iron(XL, YL);

   -- XL * YH
   LH : constant Word := Mul_HalfWord_Iron(XL, YH);

   -- XH * YL
   HL : constant Word := Mul_HalfWord_Iron(XH, YL);

   -- ........

begin

   -- Get the bottom half of the Product:
   XY_LW := LL + Shift_Left(LH + HL, HalfBitness);

   -- ........
```

So let’s now see how this works for FZ rather than Word:

```
-- "Low Multiplication" computes only the bottom half of the product XY.
-- Presently, it is used solely in Barrett's Modular Reduction.

package body FZ_LoMul is

   -- Low-Only Comba's multiplier. (CAUTION: UNBUFFERED)
   procedure FZ_Low_Mul_Comba(X  : in  FZ;
                              Y  : in  FZ;
                              XY : out FZ) is

      -- Words in each multiplicand (and also in the half-product)
      L : constant Word_Index := X'Length;

      -- 3-word Accumulator
      A2, A1, A0 : Word := 0;

   begin
      -- Compute the lower half of the Product, which is all we want:
      for N in 0 .. L - 1 loop

         -- Compute the Nth (indexed from zero) column of the Half-Product
         declare
            -- The outputs of a Word multiplication
            Lo, Hi : Word;

            -- Carry for the Accumulator addition
            C : WBool;

            -- Sum for Accumulator addition
            Sum : Word;
         begin
            -- For lower half of XY, will go from 0 to N
            -- For upper half of XY, will go from N - L + 1 to L - 1
            for j in 0 .. N loop

               -- Hi:Lo := j-th Word of X * (N - j)-th Word of Y
               Mul_Word(X(X'First + j),
                        Y(Y'First - j + N),
                        Lo, Hi);

               -- Now add Hi:Lo into the Accumulator:

               -- A0 += Lo; C := Carry
               Sum := A0 + Lo;
               C   := W_Carry(A0, Lo, Sum);
               A0  := Sum;

               -- A1 += Hi + C; C := Carry
               Sum := A1 + Hi + C;
               C   := W_Carry(A1, Hi, Sum);
               A1  := Sum;

               -- A2 += C
               A2 := A2 + C;

            end loop;

            -- We now have the Nth (indexed from zero) word of XY
            XY(XY'First + N) := A0;

            -- Right-Shift the Accumulator by one Word width:
            A0 := A1;
            A1 := A2;
            A2 := 0;
         end;

      end loop;

   end FZ_Low_Mul_Comba;


   -- Low-Only Multiplier. (CAUTION: UNBUFFERED)
   procedure Low_Mul(X  : in  FZ;
                     Y  : in  FZ;
                     XY : out FZ) is

      -- L is the wordness of a multiplicand. Guaranteed to be a power of two.
      L : constant Word_Count := X'Length;

      -- K is HALF of the length of a multiplicand.
      K : constant Word_Index := L / 2;

      -- A 'KSeg' is the same length as HALF of a multiplicand.
      subtype KSeg is FZ(1 .. K);

      -- The two K-sized variables of the half-product equation:
      LH, HL : KSeg;

      -- Bottom and Top K-sized halves of the multiplicand X.
      XLo : KSeg renames X(X'First .. X'Last - K);
      XHi : KSeg renames X(X'First + K .. X'Last);

      -- Bottom and Top K-sized halves of the multiplicand Y.
      YLo : KSeg renames Y(Y'First .. Y'Last - K);
      YHi : KSeg renames Y(Y'First + K .. Y'Last);

      -- Top K-sized half of the half-product XY.
      XYHi : KSeg renames XY(XY'First + K .. XY'Last);

      -- Carry from individual term additions.
      C : WBool;
      pragma Unreferenced(C);

   begin
      -- Recurse to FULL-width multiplication: XY := XLo * YLo
      FZ_Multiply_Unbuffered(XLo, YLo, XY);

      -- Recurse to HALF-width multiplication: LH := XLo * YHi
      FZ_Low_Multiply_Unbuffered(XLo, YHi, LH);

      -- Recurse to HALF-width multiplication: HL := XHi * YLo
      FZ_Low_Multiply_Unbuffered(XHi, YLo, HL);

      -- XY += 2^(K * Bitness) * LH
      FZ_Add_D(X => XYHi, Y => LH, Overflow => C);

      -- XY += 2^(K * Bitness) * HL
      FZ_Add_D(X => XYHi, Y => HL, Overflow => C);

   end Low_Mul; -- CAUTION: Inlining prohibited for Low_Mul !


   -- Low-Only Multiplier. (CAUTION: UNBUFFERED)
   procedure FZ_Low_Multiply_Unbuffered(X  : in  FZ;
                                        Y  : in  FZ;
                                        XY : out FZ) is

      -- The length of either multiplicand
      L : constant Word_Count := X'Length;

   begin
      if L <= Low_Mul_Thresh then

         -- Base case:
         FZ_Low_Mul_Comba(X, Y, XY);

      else

         -- Recursive case:
         Low_Mul(X, Y, XY);

      end if;
   end FZ_Low_Multiply_Unbuffered;

end FZ_LoMul;
```

FZ_Low_Mul_Comba, of course, is merely a cut-down Comba from Chapter 9; while the recursion is analogous to the one in Chapter 10's Karatsuba and Chapter 12's Square Karatsuba.
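The shape of the recursion can be modeled in a few lines of Python (an arithmetic sketch only: Python bignums instead of word arrays, and plain truncated multiplication standing in for the cut-down Comba at the base):

```python
import random

def low_mul(x, y, bits, thresh=64):
    """(x * y) mod 2^bits, via one FULL multiply of the junior halves
    and two HALF-width low-multiplies of the cross terms, recursing
    as in FZ_Low_Multiply_Unbuffered."""
    if bits <= thresh:
        return (x * y) % 2**bits          # base case ("Comba" stand-in)
    k = bits // 2
    xlo, xhi = x % 2**k, (x >> k) % 2**k
    ylo, yhi = y % 2**k, (y >> k) % 2**k
    xy = xlo * ylo                        # full product: fits in 2k = bits
    lh = low_mul(xlo, yhi, k, thresh)     # half-width recursion
    hl = low_mul(xhi, ylo, k, thresh)     # half-width recursion
    return (xy + (lh << k) + (hl << k)) % 2**bits

for _ in range(200):
    a, b = random.getrandbits(512), random.getrandbits(512)
    assert low_mul(a, b, 512) == (a * b) % 2**512
```

Observe that, unlike in Karatsuba, the half-width branches are themselves low-multiplies, and the X_Hi × Y_Hi quarter is never computed at all, which is the source of the saving.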

Now, let's see where Barrett's Reduction is put to use. You will recall the conclusion of Chapter 12B, where we discussed the fact that the use of Knuth's division for modular reduction is quite expensive, and constitutes the bulk of the cost of modular exponentiation as presented in Chapters 6 through 13. And now, at last, we can write a fast modular reducer -- one that uses Barrett's method instead of Knuth's division:

```
-- (Barrettronic) Modular Exponent: Result := Base^Exponent mod Modulus
procedure FZ_Mod_Exp(Base     : in  FZ;
                     Exponent : in  FZ;
                     Modulus  : in  FZ;
                     Result   : out FZ) is

   -- Double-width scratch buffer for the modular operations
   D : FZ(1 .. Base'Length * 2);

   -- Working register for the squaring; initially is copy of Base
   B : FZ(Base'Range) := Base;

   -- Register for the Mux operation
   T : FZ(Result'Range);

   -- Buffer register for the Result
   R : FZ(Result'Range);

   -- Space for Barrettoid
   Bar : Barretoid(ZXMLength       => Modulus'Length + 1,
                   BarretoidLength => 2 * B'Length);

begin
   -- First, pre-compute the Barretoid for the given Modulus:
   FZ_Make_Barrettoid(Modulus => Modulus, Result => Bar);

   -- Result := 1
   WBool_To_FZ(1, R);

   -- For each Word of the Exponent:
   for i in Exponent'Range loop

      declare
         -- The current Word of the Exponent
         Wi : Word := Exponent(i);
      begin
         -- For each bit of Wi:
         for j in 1 .. Bitness loop

            -- T := Result * B mod Modulus
            FZ_Multiply_Unbuffered(X => R, Y => B, XY => D);
            FZ_Barrett_Reduce(X => D, Bar => Bar, XReduced => T);

            -- Sel is the current bit of Exponent;
            -- When Sel=0 -> Result := Result;
            -- When Sel=1 -> Result := T
            FZ_Mux(X => R, Y => T, Result => R, Sel => Wi and 1);

            -- Advance to the next bit of Wi (i.e. next bit of Exponent)
            Wi := Shift_Right(Wi, 1);

            -- B := B^2 mod Modulus
            FZ_Square_Unbuffered(X => B, XX => D);
            FZ_Barrett_Reduce(X => D, Bar => Bar, XReduced => B);

         end loop;
      end;

   end loop;

   -- Output the Result:
   Result := R;

end FZ_Mod_Exp;
```

Quite straightforward: we precompute the Barrettoid, and use it for all of the necessary modular multiplications and squarings modulo the given modulus.
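The overall flow, minus the constant-time machinery, can be condensed into a few lines of Python (hypothetical names; note that where this sketch branches, FFA muxes):

```python
import random

def barrett_reduce(x, m, bm, jm, w):
    """Reduce 0 <= x < 2^(2w) modulo m, given the precomputed
    bm = floor(2^(2w) / m) and jm = Measure(m) - 1."""
    zs = ((x >> jm) * bm) >> (2 * w - jm)  # Barrett estimate of x // m
    r = x - zs * m                         # overshoots x mod m by 0, 1, or 2 m
    for _ in range(2):                     # the two "gated" subtractions
        if r >= m:                         # (FFA: branch-free, via FZ_Mux)
            r -= m
    return r

def mod_exp(base, exp, m, w):
    bm, jm = 2**(2 * w) // m, m.bit_length() - 1
    r, b = 1, base % m
    for i in range(w):                     # one bit of Exponent per turn
        t = barrett_reduce(r * b, m, bm, jm, w)
        if (exp >> i) & 1:                 # (FFA: FZ_Mux, never a branch)
            r = t
        b = barrett_reduce(b * b, m, bm, jm, w)
    return r

w = 256
for _ in range(50):
    m = random.getrandbits(w) | 1 | (1 << (w - 1))
    b, e = random.getrandbits(w), random.getrandbits(w)
    assert mod_exp(b, e, m, w) == pow(b, e, m)
```

One Knuth-style division per modulus, and none thereafter: every reduction inside the loop costs only shifts, multiplications, and subtractions.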

The reader may be curious regarding how to properly test this program. And so I will invite him to download a complete package of test tapes, generated using an RNG:

Each modular exponentiation test tape was mechanically-produced on Chapter 13 FFA, and contains a series of modular exponentiations, each followed by an equality comparison with the expected result. E.g., a 2048-bit test:

```.09C98FCB9C40EC96BB9ECE3FD053D8468614D2C262A6ED7ACC613968CAF8A58F A725E113CCB16A000BF82170353B71789E23DD4620966EE23191C97F0290606CB 7AEF750CF9EB6716BB474BA28DD3A615D3FC3D19812ABB2735676C7EBD505497A 3080BF349F28335439FC0567C150006CE796D9F4307B0A5C0617619C0F4C29CB4 96D164F09E363A8D535149B3485D4D0F0C8F2395CBC8E067910CDD9A0228AE6E2 E37CA22F75EE91944C231899A1B340DB9968C6AB1EC7DBA911B0AF3CD3AD1AE83 573A51A7DAC53E8FFCBF9D985BE9CC003409ADB8E068CBE71243C4A4EB678D169 489FD0818B91C35F9A073DD2BC06173E13578481C1E98692B79C0A1CBB   .513C8694309772EA44FEA976BE4DD5674F0A798D20385E1F18D71ACA6FD1680D 9A09E9F075BEB7FE8A30911106618A4BC961CC0442E9BB8939D0CB249FEA16ECE CB6762ADEFD1CD660842CDA6CF7EC2EF4C523E61E8F119A1357AF40DAE7E1ECE4 9F27C23728CD5186D197802AD03098ACFD8E058BD7F8340FFFB5E8ACB807D7B96 D223FB0EC6D6E68C01C6293BC94B5BD370888C192F9C62C617B1598B8F19914C5 F98DBB5B9D06936B5E97BDDE87FEB0192C9EB0A32D9F7B066D134FB5E9215FF24 40340DC33568F5F4441A25F040D81CF584923A7DF28135F5F30282D344278E60D 8DBCB0D36C8641057265406E31896C6BC15DB82F0BDAFC5F456C9EB35F   .1F8FF0E6C3FF3A40C18BBAB99D9DB19A6413C734507638D6F44B90848D7B4FE2 E87E4D830B14F03E7FD87F47CCD4A715C51839952C5DB5B3F04E4C9633964881B 761E763D745B5DF2F8649418CB4B3DCA692535B862FC2C62F11DCEE2AB4191B25 B35A6DEDC585476DACF952A2F580C061F9DCD6C0CE4F08BF401379EC47CED1DE5 A3C87EC98CE4DFFC70EB65A3C6700C5145107D3DD305BA1E0E0F6A957566A0452 BE16E39CF4A17185644FBA8ED1CA79A3D6E28D29D936F2F953913EFBB94B7AF0C AA81CF3F5EF5C86E2F115424FEEEBD1DAD7FB371E27B2D727381F441ED5377E3D AB15BCF786E88582D6533AF673FA86047EBBC1667E6A12DD7102359052   MX   .0FCB1DE435944F516B3F05D990C761985348076167AFCF56FEA05BE1DBA81286 BA2139B0D10232577B79A4B9B872244E966EDC256678D1132B206D357D90F1F60 1B663DFD62FE78449A0D1E4B7E7FC55C2B978282D37B70A6D8715451BC114282B 3E9555E23612C2974974FA52ADDC6F2B030F0E98E6C5C6747C23A58FAE4C8730A 37426B54EC604EA1EF16F24B1FE8B190A993FC28A95960987D1768AC731DEC4E6 B334C75B27589F0E99DAC4A7AC5CA9E7A014C2E0F05A72A18145A9B8958D0F26E 
179D3854E3C8CD0C3DD8E11DA318F9BE873175B1168580CEA42593FEC2795FDBD BC349D10B76118E646AA8EE5C2295452C2E78DC26528F7C4A4D2F90E21   ={(do nothing if ok)}{[SAD ]}_```

One invokes the tapes as follows, e.g. for the 1024-bit, 10,000-shot tape:

`$ time cat 10k_shots_1024bit_ffa_unif_rnd.tape | ./bin/ffa_calc 1024 32`

... and if successful (i.e. all outputs are correct) it will emit only the output of the unix "time" command; e.g. on my test iron:

```
real    7m24.751s
user    7m24.081s
sys     0m0.290s
```

Now, the reader has probably read Dijkstra and recalls that "testing can reveal the presence of bugs, but never their absence." So why bother? The answer is, it is necessary to test your iron.

The test tapes in the signed TAR come in two variants, slid (where there are randomly-sized stretches of leading zeroes in the arguments to modular exponentiation) and uniform -- where there are not. You can use the test tapes as a litmus of whether your iron provides a constant-time iron multiplier and a constant-time barrel shifter. If you find that the "slid" tapes reliably execute faster on your machine than the "unif" tapes of the same respective FFA width, you have sad iron and must enable the workarounds (i.e. Mul_HalfWord_Soft and/or HaveBarrelShifter := False.)

Now let's find out what we actually achieve by using Barrett's Reduction:

Or, for those who prefer the raw numbers to the logarithmic plot,

Cost of one modular exponentiation operation (sec):
FFA Bitness | Ch.13 (Conventional Knuth Mod. Red.) | Ch.14 (Barrettronic Mod. Red.)
1024 | 0.395 | 0.043
2048 | 2.895 | 0.276
4096 | 21.895 | 1.703
8192 | 169.394 | 10.400

It would appear that "the game was worth the candles" -- we now have (AFAIK: the first and only presently-published...) fully constant-time Barrettron. And it is one that (with reasonable effort on the reader's part) fits-in-head.

In the next chapter, 15, we will begin to assemble the necessary ingredients for the generation of cryptographic primes. Please stay tuned!

~To be continued!~

## “Finite Field Arithmetic.” Chapter 14A-Bis: Barrett’s Modular Reduction. (Physical Bounds Proof.)

This article is part of a series of hands-on tutorials introducing FFA, or the Finite Field Arithmetic library. FFA differs from the typical “Open Sores” abomination, in that — rather than trusting the author blindly with their lives — prospective users are expected to read and fully understand every single line. In exactly the same manner that you would understand and pack your own parachute. The reader will assemble and test a working FFA with his own hands, and at the same time grasp the purpose of each moving part therein.

You will need:

• All of the materials from Chapters 1 – 14A
• There is no vpatch in Chapter 14A-Bis.

Given the trickiness of Barrett’s Reduction under the constraints of FFA, and the fact that we want to hard-guarantee, as adults do, the exact amount of physical space required to hold the intermediate results of the computation, it is necessary to extend the proof of Chapter 14A with a physical bounds proof before we can dare to look at the corresponding program code.

We will illuminate said proof with “electrical” illustrations, in the style of Chapter 12A.

Please print out and refer to Algorithm 2 from Chapter 14A when working through this Chapter.

First, three lemmas:

Lemma 1:
For any integers A ≥ B > 0, C = ⌊A / B⌋: Measure(C) ≤ Measure(A) - Measure(B) + 1.

Lemma 2:

For any integers A ≥ 0, B ≥ 0, C = A × B, Measure(C) ≤ Measure(A) + Measure(B).

Lemma 3:

For any integers A ≥ 0, B ≥ 0, C = A + B, Measure(C) ≤ Max(Measure(A), Measure(B)) + 1.

If the correctness of these is not obvious, dear reader, stop reading and prove them. Right now.
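(A Python brute-force, for the impatient. Measure here is the bitness of a number, as in FZ_Measure; Lemma 1 is exercised only with A ≥ B, which is the only way it is put to use below:)

```python
def measure(n):
    """Bitness of n: index of the seniormost nonzero bit; 0 for n = 0."""
    return n.bit_length()

for a in range(300):
    for b in range(300):
        if b > 0 and a >= b:
            # Lemma 1: Measure(A // B) <= Measure(A) - Measure(B) + 1
            assert measure(a // b) <= measure(a) - measure(b) + 1
        # Lemma 2: Measure(A * B) <= Measure(A) + Measure(B)
        assert measure(a * b) <= measure(a) + measure(b)
        # Lemma 3: Measure(A + B) <= Max(Measure(A), Measure(B)) + 1
        assert measure(a + b) <= max(measure(a), measure(b)) + 1
```

(Brute force over a small domain is, of course, no substitute for proving the lemmas; it merely catches a mis-remembered one before it catches you.)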

We will use ⇠dashed arrows⇢ to refer to the variant bitness bounds of hypothetical zero and nonzero segments of registers, for the purpose of min/max width analysis of the intermediate results in our computation. ←Solid arrows→ will refer to invariant bounds (i.e. the physical sizes of particular registers.)

First, let’s describe the physical space occupied by the quantities which must exist prior to performing Barrett’s Reduction of a given X.

We begin with the modulus: 0 < M < 2^(W_M). It lives in a W_M-bit register:

 ←——— W_M ———→
 [     M     ]

Now let’s picture, strictly for illustrative purposes, an M which is a bit less than “half-full”:

 ←———————— W_M ————————→
 [     0      |    M   ]
               ⇠J_M + 1⇢

Recall that we picked the concrete value of J_M (0 ≤ J_M < W_M) to equal one less than the measure of M, i.e.,
J_M = Measure(M) - 1.

Observe that this value is a secret (just like M and X) and therefore it is forbidden in FFA to branch on its bits, index memory by it, etc. The only mechanical thing we will do with it is to shift certain registers (via a constant-time method) by it.
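For illustration, here is one constant-shape way to shift by a secret count, sketched in Python: walk the bits of the count, at each step computing both the shifted and the unshifted value, and selecting with a mux rather than a branch. (FFA's FZ_Quiet_ShiftRight does the equivalent over word arrays; this is only the idea, on a bignum.)

```python
def mux(sel_bit, a, b):
    """a if sel_bit = 1 else b, computed without branching."""
    mask = -sel_bit                     # 0 -> all zeroes, 1 -> all ones
    return (a & mask) | (b & ~mask)

def ct_shift_right(n, count, width):
    """Shift n right by count (0 <= count < width), performing the
    identical sequence of operations for every possible count."""
    for i in range(width.bit_length()):
        bit = (count >> i) & 1
        n = mux(bit, n >> (1 << i), n)  # both alternatives always computed
    return n

for count in range(16):
    assert ct_shift_right(0xABCD, count, 16) == 0xABCD >> count
```

The number of iterations depends only on the public register width, never on the secret count.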

Now, let’s describe the Barrettoid B_M corresponding to a given M.

By Lemma 1, the Measure of B_M = ⌊2^(2W_M) / M⌋ will always be less than or equal to:

(2W_M + 1) - Measure(M) + 1 = 2W_M - J_M + 1

… where the extra, (2W_M + 1)-th, bit can be occupied only in the degenerate case M = 1, and is recorded separately, in the Degenerate flag; for every other M, B_M fits in a 2W_M-bit register.

And so, B_M will look like:

 ←———————————— 2W_M ————————————→
 [    0    |       B_M         ]
  ⇠ J_M ⇢   ⇠  2W_M - J_M    ⇢

Now, of course, we also need the dividend:

0 ≤ X < 2^(2W_M) is the integer to be reduced modulo M, and its register has twice the bitness of M’s:

 ←——— 2W_M ———→
 [      X     ]

Now, let’s describe the space required to carry out Barrett’s Reduction itself, where in the end we obtain X mod M.

The first of the two shifts in the algorithm, where we obtain the term ⌊X / 2^(J_M)⌋:

1: X_s := X >> J_M

 ←———————————— 2W_M ————————————→
 [    0    |       X_s         ]
  ⇠ J_M ⇢   ⇠  2W_M - J_M    ⇢

The first of the two multiplications. This one is 2W_M × 2W_M-bit, and it will give us the intermediate result:
⌊X / 2^(J_M)⌋ × ⌊2^(2W_M) / M⌋.
The second multiplicand is our pre-computed Barrettoid for the current M. And so:

2: Z := X_s × B_M

By Lemma 2:

 ←———————————— 2W_M ————————————→
 [    0    |       X_s         ]
  ⇠ J_M ⇢   ⇠  2W_M - J_M    ⇢

 ×

 ←———————————— 2W_M ————————————→
 [    0    |       B_M         ]
  ⇠ J_M ⇢   ⇠  2W_M - J_M    ⇢

 =

 ←———————————————— 4W_M ————————————————→
 [     0     |           Z             ]
  ⇠ 2J_M ⇢    ⇠     4W_M - 2J_M      ⇢

The second of the two shifts. This one gets us the entire Barrett approximation of the quotient, i.e. the quantity which differs from the true ⌊X / M⌋ by E ∈ {0, 1, 2}:

 ⌊ ⌊2^k / M⌋ × ⌊X / 2^j⌋ / 2^(k - j) ⌋    (here k = 2W_M and j = J_M)

3: Z_s := Z >> (2W_M - J_M)

 ←———————————————— 4W_M ————————————————→
 [         0          |      Z_s       ]
  ⇠  2W_M + J_M     ⇢  ⇠ 2W_M - J_M  ⇢

Observe that we do not need to spend the cost of shifting the whole of Z in order to obtain Z_s, given as the shift will always be of at least W_M + 1 bits (since 0 ≤ J_M < W_M). Ergo, the lower W_M bits of Z are always discarded in this process, and can be simply ignored:

 ←———————————————— 4W_M ————————————————→
 [     0     |           Z             ]
  ⇠ 2J_M ⇢    ⇠     4W_M - 2J_M      ⇢

 ←———————————————— 4W_M ————————————————→
 = [  Z_Hi  |     Z_Mid       |  Z_Lo  ]
   ←— W_M —→ ←——— 2W_M ————→  ←— W_M —→

… so it actually suffices to right-shift only the 3W_M-bit segment consisting of Z_Hi and Z_Mid, by W_M - J_M bits, and afterwards to take the Z_Mid-wide lower segment of the output as the resulting Z_s.

Z_s always fits inside a 2W_M-bit register:

 ←———————————— 2W_M ————————————→
 [    0    |       Z_s         ]
  ⇠ J_M ⇢   ⇠  2W_M - J_M    ⇢

Here, we have something a little more interesting. We have an asymmetric (2W_M × W_M-bit) multiplication, where we want to multiply Z_s, the Barrett approximation of the quotient, by the modulus M:

4: Q := Z_s × M

 ←———————————— 2W_M ————————————→
 [    0    |       Z_s         ]
  ⇠ J_M ⇢   ⇠  2W_M - J_M    ⇢

 ×

 ←———————— W_M ————————→
 [     0      |    M   ]
               ⇠J_M + 1⇢

 =

 ←———————————— 3W_M ————————————→
 [    0    |         Q         ]
  ←— W_M —→ ←—————— 2W_M ——————→

By Lemma 2, a 2W_M-bit × W_M-bit product in the general case must occupy 3W_M bits; while, looking closer, it would appear that, given as we have multiplied the (2W_M - J_M)-bit contents of Z_s by the (J_M + 1)-bit contents of M, we could end up with a (2W_M + 1)-bit result…

However, in Chapter 14A we proved that:

 X mod M = X - M × ⌊ ⌊2^k / M⌋ × ⌊X / 2^j⌋ / 2^(k - j) ⌋ - E × M

And so we know that Q ≤ X, and therefore, regardless of the value of J_M, the term Q will always fit inside 2W_M bits.

 ←——— 2W_M ———→
 [      Q     ]

 ←——— 2W_M ———→
 [      X     ]

If the “secret” of this “miracle” eludes you, please go back to Chapter 14A and walk the proof, until it fits-in-head.
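The claim also submits readily to exhaustive verification on a toy register; e.g. in Python, with W_M = 6:

```python
w = 6                                     # toy register width, in bits
k = 2 * w
for m in range(2, 2**w):
    j = m.bit_length() - 1                # Jm = Measure(M) - 1
    bm = 2**k // m                        # the Barrettoid
    for x in range(2**k):
        zs = ((x >> j) * bm) >> (k - j)   # Barrett estimate of x // m
        q = zs * m
        assert q <= x                     # Q never exceeds X...
        assert x - q < 3 * m              # ...and at most two further
                                          #    subtractions of M remain
```

No (2W_M + 1)-bit Q ever materializes, for any M or X.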

Now, we find the initial estimate of X mod M:

5: R := X - Q

 ←——— 2W_M ———→
 [      X     ]

 −

 ←——— 2W_M ———→
 [      Q     ]

 =

 ←——— 2W_M ———→
 [      R     ]

But do we actually need to perform a 2W_M-bit subtraction here? It turns out that we don’t, because R is at most (W_M + 2)-bit. We know this because we have proven that at most two subtractions of M — a W_M-bit quantity — will be required after this step to produce the final, correct X mod M, a W_M-bit quantity. Therefore, what we actually need to do instead of the above, is:

 ←—————————————— 2W_M ——————————————→
 [    Ignore     |        X         ]
  ←— W_M - L —→   ←——— W_M + L ————→

 −

 ←—————————————— 2W_M ——————————————→
 [    Ignore     |        Q         ]
  ←— W_M - L —→   ←——— W_M + L ————→

 =

 ←——— W_M + L ————→
 [        R       ]

L here is the bitness of our machine Word.

And by Lemma 3, we know that each of the following two gated subtractions can remove a maximum of one bit from the length of R; and from the Chapter 14A proof, we know that the final result will be of the same maximal bitness (i.e. will fit in the same size of FFA register) as the modulus M.

Now, we will do the first of our two gated subtractions. We will subtract a (W_M + L)-bit zero-extended (on the senior end, by one machine Word) M from R, to obtain a difference S1 and a “borrow”, C1.

6: S1 := R - M, C1 := Borrow

 ←——— W_M + L ————→
 [        R       ]

 −

 ←——— W_M + L ————→
 [  0   |    M    ]
  ←— L —→ ←— W_M —→

 =

 ←——— W_M + L ————→
 [       S1       ]    C1 ← Borrow

Now, the “gating” of that subtraction:

7: R := {C1=0 → S1, C1=1 → R}

 ←——— W_M + L ————→            ←——— W_M + L ————→
 [       S1       ] ⇒Mux(C1)⇐ [        R        ]
          ↳ C1=0 ⇘      ⇓      ⇙ C1=1 ↲
               ←——— W_M + L ————→
               [        R       ]

That is, if the subtraction in step 6 did not underflow, as indicated by C1 = 0, then we know that R was greater than or equal to M prior to step 6, and we then keep S1 (the result of the subtraction) and assign it back to R. On the other hand, if it did underflow, as indicated by C1 = 1, then we discard S1, and R stays as it was. Note that this process is to be carried out in constant time, via FZ_Mux.

Now, the second of the two gated subtractions. Exactly like in Step 6:

8: S2 := R - M, C2 := Borrow

 ←——— W_M + L ————→
 [        R       ]

 −

 ←——— W_M + L ————→
 [  0   |    M    ]
  ←— L —→ ←— W_M —→

 =

 ←——— W_M + L ————→
 [       S2       ]    C2 ← Borrow

And, finally, the “gating” of the second (and last) subtraction from Step 8:

9: R := {C2=0 → S2, C2=1 → R}

 ←——— W_M + L ————→            ←——— W_M + L ————→
 [       S2       ] ⇒Mux(C2)⇐ [        R        ]
          ↳ C2=0 ⇘      ⇓      ⇙ C2=1 ↲
               ←——— W_M + L ————→
               [        R       ]

Observe that we are not yet finished with R. The final result R = X mod M could still be one of two possible things:

• If M ≠ 1, the final result will be found in the bottom W_M bits of R.
• If M = 1, then we had the degenerate case, and everything we did in steps 1-9 was simply to occupy constant-time, and has no bearing on the answer to the calculation of X mod M; the final result in this case is to be zero.

Ergo, we need one last Mux:

10: R_Final := {D_M=0 → R, D_M=1 → 0}

First, observe that:

 ←——— W_M + L ————→
 [        R       ]
 = [  0  |  R_F   ]
   ←— L —→ ←— W_M —→

The two gated subtractions will at this point have zeroed any of the L upper bits of R that may have been previously non-zero.

And so, all that remains to do now, is:

 ←—— W_M ——→               ←—— W_M ——→
 [   R_F   ] ⇒Mux(D_M)⇐   [     0    ]
     ↳ D_M=0 ⇘      ⇓      ⇙ D_M=1 ↲
          ←—— W_M ——→
          [ R = X mod M ]

…and that’s it, we have R = X mod M, as was proven already in Chapter 14A.

In the next Chapter, 14B, we will finally walk through the Ada program which implements this Barrettron. Stay tuned!

## Uncrating of Symbolics “MacIvory” Machine

At long last, got hold of one of these:

## “Finite Field Arithmetic.” Chapter 14A: Barrett’s Modular Reduction. (Part 1 of 2)

This article is part of a series of hands-on tutorials introducing FFA, or the Finite Field Arithmetic library. FFA differs from the typical “Open Sores” abomination, in that — rather than trusting the author blindly with their lives — prospective users are expected to read and fully understand every single line. In exactly the same manner that you would understand and pack your own parachute. The reader will assemble and test a working FFA with his own hands, and at the same time grasp the purpose of each moving part therein.

You will need:

• A Keccak-based VTron (for this and all subsequent chapters.)
• All of the materials from Chapters 1 – 13.
• There is no vpatch in Chapter 14A.

On account of the substantial heft of this chapter, I have cut it into two parts, 14A and 14B; you are presently reading 14A, which consists strictly of the exposition of Barrett’s Reduction algorithm and its proof of correctness. 14B will appear here in the course of the next several days; it contains the Ada realization of said algorithm, along with performance measurements.

Replies to the reader mail bag from Chapter 13 will also appear in 14B.

Recall that in Chapter 12B, we determined that Knuth’s Divider accounted for most of the CPU cost of modular exponentiation in FFA.

Each invocation of our modular exponentiator on, e.g., a 4096-bit Base/Exponent/Modulus tuple, requires, in addition to other work, 8192 modular reductions using FZ_Mod; and in each of these, an 8192-bit quantity is divided by the 4096-bit modulus to calculate a remainder. This is not merely expensive, but, more importantly, on account of Knuth’s Divider, it is cubically (i.e. O(N^3), where N is your FFA Bitness) expensive!

Fortunately, there are several published approaches to substantially lightening the cost of modular exponentiation, by deriving a “pre-computed” datum from a given modulus — in such a way that the necessary series of repeated reductions modulo that modulus can be performed “on the cheap”, using a handful of inexpensive (sub-quadratic) operations.

I presently know of two traditional algorithms of this kind which admit a constant-spacetime implementation.

The first is Montgomery’s Reduction. It is quite efficient, but imposes painful constraints on the domains of the modulus and the dividend. Specifically, if an integer to be reduced is not coprime with the modulus, the output of the calculation will be rubbish. And testing for this condition in constant time is quite expensive — and even if it were not, if the test were to reveal a prohibited input, one would be forced to fall back to a different modular reduction method with a different timing! This would destroy the constant time guarantee. Montgomery’s Reduction is wildly popular in single-purpose RSAtrons, but is entirely unsuitable for use in a general-purpose arithmetizer like FFA. And therefore nothing more will be said of it in this series.

The second of these algorithms, and the subject of this chapter, is Barrett’s Reduction. This method, if implemented correctly, will run in constant spacetime and produce correct modular reductions — at the cost of a single double-bitness Knuth division per change of modulus, and a handful of relatively-inexpensive multiplications and subtractions per change of dividend.

Carrying out Barrett’s Reduction in constant spacetime demands constant-spacetime “bitness measure” and shift operations, which were introduced in Chapter 13. (If you have not yet satisfied yourself that you understand chapters 1 – 13, don’t even think about proceeding here! Go back and study!)

Implementing (and grasping) this algorithm is not difficult, but to make it suitable for use in cryptographic arithmetic, a proof of correctness of this particular implementation is absolutely necessary: “a sapper errs only once!” It is very easy to produce a subtly-broken implementation of Barrett’s Reduction which yields rubbish on 0.01% (or less) of the input domain, while giving perfectly-correct outputs on the rest!

Therefore we must have the proof. And the reader who is contemplating any serious use for FFA must not only read and understand this proof, but verify that the offered implementation actually corresponds to the algorithm given in said proof. The same thing has already been said about FFA as a whole. But it applies especially acutely to this chapter! You will not find this proof in Knuth’s AOP, or even in Barrett’s original 1986 paper. As far as I know, the proof in this article is the only public one which completely treats a constant-time implementation of Barrett’s Reduction.

And if you, dear reader, after working carefully through this chapter and the next, have any doubt at all with respect to the correctness of the proof, or of the actual program’s conforming to the proof, do not use the program! And please leave a comment!

Having said all of this, let’s proceed with the algorithm and its correctness proof. The Ada implementation will appear in the next chapter, 14B.

Observe that for integers X ≥ 0, M > 0, X mod M = X - M⌊X / M⌋. Therefore, if we can compute ⌊X / M⌋, we can determine X mod M.

Barrett gave the following equivalence for X / M:

 X / M = (2^k / M) × (X / 2^j) × (1 / 2^(k - j))

… where k and j are integers, the choice of values for which will depend on the concrete implementation, and will be discussed in detail further below.

Barrett’s clever pill consists of the following transformation of the equivalence into pure-integer arithmetic:

 X / M = ( ⌊2^k / M⌋ + (2^k mod M) / M ) × ( ⌊X / 2^j⌋ + (X mod 2^j) / 2^j ) × ( 1 / 2^(k - j) )

For clarity, let’s bring out the remainder subterms:

 R1 = 2^k mod M
 R2 = X mod 2^j

…then, the following equation will hold true:

 X / M = ⌊2^k / M⌋ × ⌊X / 2^j⌋ / 2^(k - j)  +  R2 × ⌊2^k / M⌋ / 2^k  +  R1 × ⌊X / 2^j⌋ / (M × 2^(k - j))  +  R1 × R2 / (M × 2^k)

The first of these terms is Barrett’s approximation of the division X / M. This approximation is inexact, i.e. it will differ from the true value of X / M by some finite quantity, equal to the sum of the remaining three terms (the error terms). If we can put a bound on this quantity, it will be possible to write a constant-time modular reducer.

The quantity ⌊2^k / M⌋ is to be computed only once, per change of the modulus M. We will refer to it later as the “Barrettoid”. This is the item we will want to pre-compute at the beginning of a modular exponentiation, and use for all of the required modular reductions afterwards.

The modular reduction of a particular X by modulus M, if the Barrettoid corresponding to M is available, requires no Knuth division, but can instead be performed using a series of relatively-inexpensive shifts, multiplications, additions, and subtractions.

Now let’s find that error bound, along with the domains of the parameters where it will apply.

Our approach to this will be: if each of the error terms can be guaranteed to be less than 1, the sum of all three terms can be guaranteed to be less than 3, and then a Barrett’s Reduction can be performed in constant-time.
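Before grinding through the three derivations, the reader can let the machine spot-check the claim; e.g. in Python, with exact rationals, over a small exhaustive domain (j chosen as in FFA, one less than the measure of M, which satisfies the constraints derived below):

```python
from fractions import Fraction as F

k = 10
for m in range(1, 65):
    j = m.bit_length() - 1            # guarantees 2^j <= m and j <= k
    r1 = 2**k % m
    for x in range(2**k + 1):         # guarantees x <= 2^k
        r2 = x % 2**j
        t1 = F(r2 * (2**k // m), 2**k)
        t2 = F(r1 * (x >> j), m * 2**(k - j))
        t3 = F(r1 * r2, m * 2**k)
        assert t1 < 1 and t2 < 1 and t3 < 1
        assert 0 <= t1 + t2 + t3 < 3
```

A spot-check on a toy domain proves nothing, of course; the proof follows.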

 R2 × ⌊2^k / M⌋ / 2^k = (X mod 2^j) × ⌊2^k / M⌋ / 2^k < 1

In order for this inequality to hold true, the denominator must exceed the numerator:

(X mod 2^j) × ⌊2^k / M⌋ < 2^k

It is clear that the left-hand side of the inequality, when variables other than X are held constant, reaches its maximum when X mod 2^j = 2^j - 1, while the right-hand side does not depend on X. Therefore, we can rewrite it as:

(2^j - 1)⌊2^k / M⌋ < 2^k

Observe that ⌊2^k / M⌋ × M ≤ 2^k. It is then known that:

(2^j - 1)⌊2^k / M⌋M < M·2^k,

(2^j - 1)2^k < M·2^k,

2^j - 1 < M,

2^j ≤ M
must be true for the inequality to hold.

Now let’s take the same approach with the second error term:

R1⌊X / 2^j⌋ / (M·2^(k-j)) = (2^k mod M)⌊X / 2^j⌋ / (M·2^(k-j)) < 1

In order for this inequality to hold true, the denominator must exceed the numerator:

(2^k mod M)⌊X / 2^j⌋ < M·2^(k-j)

It is clear that the left-hand side of the inequality, when variables other than M are held constant, reaches its maximum when 2^k mod M = M - 1. Therefore, we can rewrite it as:

(M - 1)⌊X / 2^j⌋ < M·2^(k-j)

Observe that ⌊X / 2^j⌋ × 2^j ≤ X. It is then known that:

(M - 1)⌊X / 2^j⌋2^j < M·2^(k-j)·2^j,
(M - 1)X < M·2^k,

… which will hold true for all possible M if and only if:

X ≤ 2^k.

Finally, let’s apply this approach to the third error term:

R1R2 / (M·2^k) = (2^k mod M)(X mod 2^j) / (M·2^k) < 1

In order for this inequality to hold true, the denominator must exceed the numerator:

(2^k mod M)(X mod 2^j) < M·2^k

It is clear that the left-hand side of the inequality, when variables other than M and X are held constant, reaches its maximum when 2^k mod M = M - 1 and X mod 2^j = 2^j - 1. Therefore, we can rewrite it as:

(M - 1)(2^j - 1) < M·2^k

… which will hold true for all possible M if and only if:

2^j - 1 < 2^k,

2^j ≤ 2^k,

j ≤ k.

And so, we now know that if we satisfy the constraints:

2^j ≤ M
X   ≤ 2^k
j   ≤ k

… the error terms will obey the inequality:

0 ≤ R2⌊2^k / M⌋ / 2^k + R1⌊X / 2^j⌋ / (M·2^(k-j)) + R1R2 / (M·2^k) < 3

And from this fact, it follows that:

⌊X / M⌋ = ⌊ ⌊2^k / M⌋⌊X / 2^j⌋ / 2^(k-j) ⌋ + E, where E ∈ {0, 1, 2}

It then is clear that:

X mod M = X - M × ⌊ ⌊2^k / M⌋⌊X / 2^j⌋ / 2^(k-j) ⌋ - E × M

That is, to compute the modular reduction of X modulo M, we calculate Barrett’s approximation, multiply the result by M, subtract this quantity from X, and then finally must subtract M zero, one, or two times, i.e. until the working register is less than M and thereby contains the reduced X.

The algorithm of our Barrettron will then take this form:

Algorithm 1:

To compute the reduction R := X mod M:

1. Xs := X >> j
2. Z  := Xs × ⌊2^k / M⌋
3. Zs := Z >> (k - j)
4. Q  := Zs × M
5. R  := X - Q
6. If R ≥ M:
   R := R - M
7. If R ≥ M:
   R := R - M
8. R is now equal to X mod M.
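Algorithm 1 is easy to sanity-check outside of FFA. The following is a minimal Python sketch, illustrative only: the function and variable names are mine, and the branchy steps 6 and 7 are kept exactly as written above.

```python
def barrett_reduce(x, m, k, j):
    # Precompute the Barrettoid (in real use: once per modulus M):
    b = (2 ** k) // m          # floor(2^k / M)
    xs = x >> j                # Step 1
    z = xs * b                 # Step 2
    zs = z >> (k - j)          # Step 3
    q = zs * m                 # Step 4
    r = x - q                  # Step 5
    if r >= m:                 # Step 6
        r -= m
    if r >= m:                 # Step 7
        r -= m
    return r                   # Step 8

# Constraints from the proof: 2^j <= M, X <= 2^k, j <= k.
m = 0xC0FFEE                   # arbitrary example modulus
j = m.bit_length() - 1         # guarantees 2^j <= M
k = 2 * m.bit_length()         # room for a product of two M-sized factors
assert barrett_reduce(0x1DEAF00DBA5E, m, k, j) == 0x1DEAF00DBA5E % m
```

Per the error bound just derived, at most two trailing subtractions of M are ever needed, which is what makes the constant-time rewrite below possible.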

A sharp reader will have noticed that Algorithm 1 is not directly usable under the constraints of FFA, where conditional branching on operand bits is forbidden, and all registers subject to multiplication or user I/O must have powers-of-two bitnesses. Steps 6 and 7 violate the former constraint; the potentially (k + 1)-bit Barrettoid ⌊2^k / M⌋ (if M were to equal 1) violates the latter.

And so, let’s re-write it to obey these restrictions, and at the same time incorporate concrete values for the parameters k and j to make for correctly-FFA-sized input, output, and intermediate result registers; we will also split the algorithm into separate pre-computation and reduction phases.

Algorithm 2:

For each new modulus M, we will precompute the constants that correspond to it:

1. WM := FZ_Bitness(M)
This is simply the register bitness of the modulus M, and, in ordinary circumstances, will correspond to the currently-set FFAcalc bitness.
0 < M < 2^WM is then the domain of M.

2. KM := 2 × WM
This corresponds to the parameter k in the proof; and it also equals the register bitness available for any X that will be reduced, since we will be using the Barrettron only inside modular exponentiation, where any given X will be the product of two multiplicands each having the same bitness as M.
0 ≤ X < 2^KM is then the domain of X.

3. BM := The lower KM bits of ⌊2^KM / M⌋
This is the Barrettoid of the given M, constrained to a KM-bit register. If M ≠ 1, BM is valid; otherwise we have the degenerate case, and the output of the reduction calculation must be nulled (given that any integer reduced modulo 1 equals zero.)
0 < BM < 2^KM.

4. DM := Top bit of ⌊2^KM / M⌋
If and only if equal to 1, indicates that the modulus M is the degenerate case where M = 1.
0 ≤ DM ≤ 1.

5. JM := Measure(M) - 1
This corresponds to the parameter j in the proof.
0 ≤ JM < WM and 2^JM ≤ M < 2^(JM + 1).

6. SM := KM - JM
This corresponds to the expression k - j in the proof.
Observe that WM < SM ≤ KM.

For each new input X, to compute the reduction R := X mod M:

1. Xs := X >> JM
2. Z  := Xs × BM
3. Zs := Z >> SM
4. Q  := Zs × M
5. R  := X - Q
6. R  := R - M, C := Borrow
7. R  := R + (M × C)
8. R  := R - M, C := Borrow
9. R  := R + (M × C)
10. R  := R - (R × DM)
11. R is now equal to X mod M.

The “reversion” steps in Algorithm 2 (7, 9, and 10) are described arithmetically for clarity, but we will want to carry them out using the much-faster FZ_Mux operator, rather than via arithmetic.
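For illustration, the whole of Algorithm 2 can be sketched in Python, with each trial subtraction reverted via a mux rather than a branch. This is a toy model of the procedure, not FFA’s Ada: all names, and the `mux` helper, are mine.

```python
def mux(a, b, sel):
    # Branch-free select: returns a when sel == 1, b when sel == 0.
    return b + (a - b) * sel

def barrett_reduce_v2(x, m, wm):
    # wm: register bitness of m (WM); x must satisfy 0 <= x < 2^(2*wm).
    km = 2 * wm                   # KM := 2 * WM
    full = (2 ** km) // m         # the full Barrettoid, floor(2^KM / M)
    bm = full % (2 ** km)         # BM: its lower KM bits
    dm = full >> km               # DM: its top bit; 1 only when M = 1
    jm = m.bit_length() - 1       # JM := Measure(M) - 1
    sm = km - jm                  # SM := KM - JM
    r = x - ((((x >> jm) * bm) >> sm) * m)   # Steps 1..5
    for _ in range(2):            # Steps 6..9: two trial subtractions,
        t = r - m                 # each reverted via mux on borrow
        r = mux(r, t, int(t < 0))
    return r - r * dm             # Step 10: null the result when M = 1

assert barrett_reduce_v2(0x1DEAF00DBA5E, 0xC0FFEE, 24) == 0x1DEAF00DBA5E % 0xC0FFEE
```

Note how the degenerate M = 1 case falls out of Step 10: DM = 1 forces the result to zero, without any branch on the value of M.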

In Chapter 14B, we will walk through the Ada implementation of Algorithm 2, and measure the performance gained from rewriting our modular exponentiator to make use of Barrett’s Reduction instead of the classical Knuthian reduction.

Stay tuned!

## “Finite Field Arithmetic.” Chapter 13: “Width-Measure” and “Quiet Shifts.”

This article is part of a series of hands-on tutorials introducing FFA, or the Finite Field Arithmetic library. FFA differs from the typical “Open Sores” abomination, in that — rather than trusting the author blindly with their lives — prospective users are expected to read and fully understand every single line. In exactly the same manner that you would understand and pack your own parachute. The reader will assemble and test a working FFA with his own hands, and at the same time grasp the purpose of each moving part therein.

You will need:

Add the above vpatches and seals to your V-set, and press to ffa_ch13_measure_and_qshifts.kv.vpatch.

You should end up with the same directory structure as previously.

Now compile ffacalc:

```
cd ffacalc
gprbuild
```

But do not run it quite yet.

First, the mail bag!

Reader diana_coman recently found an aesthetically-pleasing equivalence for Chapter 1’s W_Borrow function:

```
-- Find the Borrow, from a subtraction where it is known that A - B == D:
function W_Borrow(A : in Word; B : in Word; D : in Word) return WBool is
begin
   return WBool(Shift_Right(((not A) and B) or (((not A) or B) and D),
                            Bitness - 1));
end W_Borrow;
```

And then diana_coman found that it can be reduced to a much prettier:

```
-- Find the Borrow, from a subtraction where it is known that A - B == D:
function W_Borrow(A : in Word; B : in Word; D : in Word) return WBool is
begin
   return WBool(Shift_Right((B and D) or ((B or D) and (not A)),
                            Bitness - 1));
end W_Borrow;
```
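The equivalence of the two formulas is easy to check by brute force over a toy word size. A Python sketch, using 4-bit Words (all names are mine):

```python
BITS = 4                    # a toy Word size; FFA's Bitness is far larger
MASK = (1 << BITS) - 1

def borrow_v1(a, b, d):
    # The original Chapter 1 formula: top bit of the selector expression.
    return (((~a & b) | ((~a | b) & d)) >> (BITS - 1)) & 1

def borrow_v2(a, b, d):
    # The simplified equivalent.
    return (((b & d) | ((b | d) & ~a)) >> (BITS - 1)) & 1

# Exhaustively check both against the actual borrow, over all 4-bit A, B:
for a in range(MASK + 1):
    for b in range(MASK + 1):
        d = (a - b) & MASK              # D := A - B, in Word arithmetic
        expected = 1 if a < b else 0    # a borrow happens iff A < B
        assert borrow_v1(a, b, d) == expected
        assert borrow_v2(a, b, d) == expected
```

The exhaustive check passes for any word size one cares to wait for; 4 bits suffices to exercise every bit pattern shape.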

Thank you, diana_coman, for this tidbit! And for reading and signing Chapter 1 and Chapter 2:

I would like to thank diana_coman (as I understand: a potential industrial user!) for her attentive reading, and wish her a happy time with the remaining material in FFA!

In this chapter, we will introduce three new operations on FZ integers. As hinted in Chapter 12B, they will be required for fast modular reduction, which we will meet with later on in the series. But first, we must walk through some older material.

The time has come to review the instruction set of ffacalc, as seen in Chapter 12B:

FFACalc Instruction Set (as seen in Chapter 12B)

| Op | Description | # Ins | # Outs | Notes |
|----|-------------|-------|--------|-------|
| **Modes** | | | | |
| ( | Enter Comment Mode | 0 | 0 | All further ops ignored until mode is exited; supports nesting |
| ) | Exit Comment Mode | 0 | 0 | Fatal if not in comment mode |
| [ | Enter Quote Mode | 0 | 0 | All further ops are echoed verbatim until mode is exited; supports nesting |
| ] | Exit Quote Mode | 0 | 0 | Fatal if not in quote mode |
| { | Enter Conditional Branch | 1 | 0 | If input = 1, execute all ops until }; otherwise skip them; supports nesting |
| } | Exit Conditional Branch | 0 | 1 | Produces 1 if branch being exited had been taken, otherwise produces 0 |
| **Stack Motion** | | | | |
| " | Dup | 1 | 2 | Put two copies of the input on stack |
| _ | Drop | 1 | 0 | Discard the item on top of stack |
| ' | Swap | 2 | 2 | Exchange top and second item |
| ` | Over | 2 | 3 | Put copy of second item on stack |
| **Constants** | | | | |
| . | Push Zero | 0 | 1 | Put a brand-new zero on stack |
| 0..9, A..F, a..f | Insert Hexadecimal Digit | 1 | 1 | Insert the given hexadecimal digit as the junior-most nibble into the top item |
| **Predicates** | | | | |
| = | Equals | 2 | 1 | Put 1 on stack if the top and second items are bitwise-equal, otherwise 0 |
| < | Less-Than | 2 | 1 | Put 1 on stack if the second item is less than the top item |
| > | Greater-Than | 2 | 1 | Put 1 on stack if the second item is greater than the top item |
| **Bitwise** | | | | |
| & | Bitwise-AND | 2 | 1 | Compute bitwise-AND of the top and second item, put result on stack |
| \| | Bitwise-OR | 2 | 1 | Compute bitwise-OR of the top and second item, put result on stack |
| ^ | Bitwise-XOR | 2 | 1 | Compute bitwise-XOR of the top and second item, put result on stack |
| ~ | Bitwise-Complement | 1 | 1 | Compute 1s-complement negation of the top item on stack |
| U | Bitwise-MUX | 3 | 1 | If top item’s junior-most Word is nonzero, the second item will be the result; otherwise the third item will. |
| **Arithmetic: Basic** | | | | |
| - | Subtract | 2 | 1 | Subtract top item from second item, put result on stack, and save borrow bit into Flag |
| + | Add | 2 | 1 | Add top and second item, put result on stack, and save carry bit into Flag |
| O | Push Overflow Flag | 0 | 1 | Put Flag, the register containing carry or borrow from the most recent arithmetic op, on the stack |
| **Arithmetic: Division** | | | | |
| \ | Divide with Remainder | 2 | 2 | Divide second item by top item, put quotient and then remainder on stack; division by zero is fatal |
| / | Divide without Remainder | 2 | 1 | Divide second item by top item, put only quotient on stack; division by zero is fatal |
| % | Modulus | 2 | 1 | Divide second item by top item, put only remainder on stack; division by zero is fatal |
| **Arithmetic: Multiplication** | | | | |
| * | Multiply | 2 | 2 | Multiply second item and top item, put the junior half of the result on stack, and then the senior half |
| S | Square | 1 | 2 | Square the top item, put the junior half of the result on stack, and then the senior half |
| M | Modular Multiplication | 3 | 1 | Multiply third item and second item, modulo the top item, put the result on stack (it necessarily fits in one FZ); division by zero is fatal |
| X | Modular Exponentiation | 3 | 1 | Raise third item to the power of the second item, modulo the top item, put the result on stack (it necessarily fits in one FZ); division by zero is fatal |
| **I/O** | | | | |
| # | Print FZ | 1 | 0 | Print top item to console, in hexadecimal representation |
| ? | Random | 0 | 1 | Fill a FZ from the active RNG and put on stack |
| **Control** | | | | |
| Z | Zap | 0 | 0 | Reset ffacalc, clearing stack and all other state |
| Q | Quit | 0 | 0 | Quit ffacalc, after printing stack contents |

The attentive reader should not find any surprises in the above summary.

However, do not get too comfortable with this instruction set, for we are about to make a few quite substantial changes to it.

In particular, you have probably already guessed that the set of printable ASCII characters is not in fact large enough to cleanly represent all cryptographically-necessary operations on integers. And therefore we will be introducing a small number of prefixed ops:

```
-- ...

-- Prefixed Operators
PrevC      : Character := ' ';
HavePrefix : Boolean   := False;

-- Clear the stack and set SP to bottom.
procedure Zap is
begin
   -- .................
   -- .................

   -- Clear prefix
   HavePrefix := False;
   PrevC      := ' ';
end Zap;

-- ...

-- Denote that the given op is a prefix
procedure IsPrefix is
begin
   HavePrefix := True;
end IsPrefix;

-- ...

-- Execute a Normal Op
procedure Op_Normal(C : in Character) is

   -- ...

         --------------
         -- Prefixes --
         --------------

         -- 'Left...' :
         when 'L' =>
            IsPrefix;

         -- 'Right...' :
         when 'R' =>
            IsPrefix;

         -- 'Modular...' :
         when 'M' =>
            IsPrefix;

-- Execute a Prefixed Op
procedure Op_Prefixed(Prefix : in Character;
                      O      : in Character) is
begin

   -- The Prefixed Op:
   case Prefix is

      ---------------------------------------------------------
      -- Left...
      when 'L' =>

         -- Which L-op?
         case O is

            -- ... Shift :
            when 'S' =>
               Want(2);
               declare
                  -- Number of bit positions to shift by:
                  ShiftCount : FZBit_Index
                    := FZBit_Index(FFA_FZ_Get_Head(Stack(SP)));
               begin
                  FFA_FZ_Quiet_ShiftLeft(N        => Stack(SP - 1),
                                         ShiftedN => Stack(SP - 1),
                                         Count    => ShiftCount);
               end;
               Drop;

            -- ... Rotate :
            when 'R' =>
               E("Left-Rotate not yet defined!");

            -- ... Unknown:
            when others =>
               E("Undefined Op: L" & O);

         end case;
      ---------------------------------------------------------
      -- Right...
      when 'R' =>

         -- Which R-op?
         case O is

            -- ... Shift:
            when 'S' =>
               Want(2);
               declare
                  -- Number of bit positions to shift by:
                  ShiftCount : FZBit_Index
                    := FZBit_Index(FFA_FZ_Get_Head(Stack(SP)));
               begin
                  FFA_FZ_Quiet_ShiftRight(N        => Stack(SP - 1),
                                          ShiftedN => Stack(SP - 1),
                                          Count    => ShiftCount);
               end;
               Drop;

            -- ... Rotate:
            when 'R' =>
               E("Right-Rotate not yet defined!");

            -- ... Unknown:
            when others =>
               E("Undefined Op: R" & O);

         end case;
      ---------------------------------------------------------
      -- Modular...
      when 'M' =>

         -- Which M-op?
         case O is

            -- ... Multiplication :
            when '*' =>
               Want(3);
               MustNotZero(Stack(SP));
               FFA_FZ_Modular_Multiply(X       => Stack(SP - 2),
                                       Y       => Stack(SP - 1),
                                       Modulus => Stack(SP),
                                       Product => Stack(SP - 2));
               Drop;
               Drop;

            -- ... Exponentiation :
            when 'X' =>
               Want(3);
               MustNotZero(Stack(SP));
               FFA_FZ_Modular_Exponentiate(Base     => Stack(SP - 2),
                                           Exponent => Stack(SP - 1),
                                           Modulus  => Stack(SP),
                                           Result   => Stack(SP - 2));
               Drop;
               Drop;

            -- ... Unknown:
            when others =>
               E("Undefined Op: M" & O);

         end case;
      ---------------------------------------------------------
      -- ... Unknown: (impossible per mechanics, but must handle case)
      when others =>
         E("Undefined Prefix: " & Prefix);

   end case;

end Op_Prefixed;

-- Process a Symbol
procedure Op(C : in Character) is
begin
   --- ..............
   --- ..............

   --- ... if in a prefixed op:
   elsif HavePrefix then

      -- Drop the prefix-op hammer, until another prefix-op cocks it
      HavePrefix := False;

      -- Dispatch this op, where prefix is the preceding character
      Op_Prefixed(Prefix => PrevC,
                  O      => C);

   else
      -- This is a Normal Op, so proceed with the normal rules.
      Op_Normal(C);
   end if;

end Op;

-- Current Character
C : Character;

begin
   -- Reset the Calculator
   Zap;
   -- Process characters until EOF:
   loop
      if Read_Char(C) then
         -- Execute Op:
         Op(C);
         -- Advance Odometer
         Pos := Pos + 1;
         -- Save the op for use in prefixed ops
         PrevC := C;
      else
         Zap;
         Quit(0); -- if EOF, we're done
      end if;
   end loop;
end;
```

The mechanism is quite simple — certain operation codes now signify prefixes, which permit clean and readable names for certain commands which come in multiple flavours — left vs right shifts, ordinary vs. modular arithmetic procedures, and so on.
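For illustration, the same two-character dispatch can be modeled in a few lines of Python. This is a toy sketch of the control flow only, not the ffacalc source; all names are mine.

```python
PREFIXES = {'L', 'R', 'M'}

def make_dispatcher(normal_ops, prefixed_ops):
    # normal_ops: one-char name -> action; prefixed_ops: two-char name -> action.
    state = {'prev': ' ', 'have_prefix': False}
    def op(c):
        if state['have_prefix']:
            # Drop the prefix hammer: the saved char plus this one name the op.
            state['have_prefix'] = False
            prefixed_ops[state['prev'] + c]()
        elif c in PREFIXES:
            state['have_prefix'] = True
        else:
            normal_ops[c]()
        state['prev'] = c    # save the op for use in prefixed ops
    return op

log = []
op = make_dispatcher({'#': lambda: log.append('print')},
                     {'RS': lambda: log.append('right-shift')})
for c in "RS#":
    op(c)
assert log == ['right-shift', 'print']
```

Observe that the single saved character is the entire state of the scheme: a prefix consumes exactly the op that follows it, and nothing else.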

Also note that, from this point on, the use of undefined printable ASCII characters is a fatal error in ffacalc:

```
--- .......
---------------------------------------------------------
-- Reserved Ops, i.e. ones we have not defined yet:    --
---------------------------------------------------------
when '!' | '@' | '$' | ':' | ';' | ',' |
     'G' | 'H' | 'I' | 'J' | 'K' | 'N' |
     'P' | 'T' | 'V' | 'X' | 'Y' =>

   E("This Operator is not defined yet: " & C);
---------------------------------------------------------
```

Let’s summarize the changes in this chapter with a new instruction set table:

FFACalc Instruction Set (as seen in Chapter 13)

| Op | Description | # Ins | # Outs | Notes |
|----|-------------|-------|--------|-------|
| **Modes** | | | | |
| ( | Enter Comment Mode | 0 | 0 | All further ops ignored until mode is exited; supports nesting |
| ) | Exit Comment Mode | 0 | 0 | Fatal if not in comment mode |
| [ | Enter Quote Mode | 0 | 0 | All further ops are echoed verbatim until mode is exited; supports nesting |
| ] | Exit Quote Mode | 0 | 0 | Fatal if not in quote mode |
| { | Enter Conditional Branch | 1 | 0 | If input = 1, execute all ops until }; otherwise skip them; supports nesting |
| } | Exit Conditional Branch | 0 | 1 | Produces 1 if branch being exited had been taken, otherwise produces 0 |
| **Stack Motion** | | | | |
| " | Dup | 1 | 2 | Put two copies of the input on stack |
| _ | Drop | 1 | 0 | Discard the item on top of stack |
| ' | Swap | 2 | 2 | Exchange top and second item |
| ` | Over | 2 | 3 | Put copy of second item on stack |
| **Constants** | | | | |
| . | Push Zero | 0 | 1 | Put a brand-new zero on stack |
| 0..9, A..F, a..f | Insert Hexadecimal Digit | 1 | 1 | Insert the given hexadecimal digit as the junior-most nibble into the top item |
| **Predicates** | | | | |
| = | Equals | 2 | 1 | Put 1 on stack if the top and second items are bitwise-equal, otherwise 0 |
| < | Less-Than | 2 | 1 | Put 1 on stack if the second item is less than the top item |
| > | Greater-Than | 2 | 1 | Put 1 on stack if the second item is greater than the top item |
| **Bitwise** | | | | |
| & | Bitwise-AND | 2 | 1 | Compute bitwise-AND of the top and second item, put result on stack |
| \| | Bitwise-OR | 2 | 1 | Compute bitwise-OR of the top and second item, put result on stack |
| ^ | Bitwise-XOR | 2 | 1 | Compute bitwise-XOR of the top and second item, put result on stack |
| ~ | Bitwise-Complement | 1 | 1 | Compute 1s-complement negation of the top item on stack |
| U | Bitwise-MUX | 3 | 1 | If top item’s junior-most Word is nonzero, the second item will be the result; otherwise the third item will. |
| W | Width-Measure | 1 | 1 | Find the position of the senior-most bit in the top item that equals 1 (or return 0 if there are none). |
| RS | Right-Shift | 2 | 1 | Shift the second item right by the number of bits given in the top item, modulo the FZ bitness. |
| LS | Left-Shift | 2 | 1 | Shift the second item left by the number of bits given in the top item, modulo the FZ bitness. |
| **Arithmetic: Basic** | | | | |
| - | Subtract | 2 | 1 | Subtract top item from second item, put result on stack, and save borrow bit into Flag |
| + | Add | 2 | 1 | Add top and second item, put result on stack, and save carry bit into Flag |
| O | Push Overflow Flag | 0 | 1 | Put Flag, the register containing carry or borrow from the most recent arithmetic op, on the stack |
| **Arithmetic: Division** | | | | |
| \ | Divide with Remainder | 2 | 2 | Divide second item by top item, put quotient and then remainder on stack; division by zero is fatal |
| / | Divide without Remainder | 2 | 1 | Divide second item by top item, put only quotient on stack; division by zero is fatal |
| % | Modulus | 2 | 1 | Divide second item by top item, put only remainder on stack; division by zero is fatal |
| **Arithmetic: Multiplication** | | | | |
| * | Multiply | 2 | 2 | Multiply second item and top item, put the junior half of the result on stack, and then the senior half |
| S | Square | 1 | 2 | Square the top item, put the junior half of the result on stack, and then the senior half |
| M* | Modular Multiplication | 3 | 1 | Multiply third item and second item, modulo the top item, put the result on stack (it necessarily fits in one FZ); division by zero is fatal |
| MX | Modular Exponentiation | 3 | 1 | Raise third item to the power of the second item, modulo the top item, put the result on stack (it necessarily fits in one FZ); division by zero is fatal |
| **I/O** | | | | |
| # | Print FZ | 1 | 0 | Print top item to console, in hexadecimal representation |
| ? | Random | 0 | 1 | Fill a FZ from the active RNG and put on stack |
| **Control** | | | | |
| Z | Zap | 0 | 0 | Reset ffacalc, clearing stack and all other state |
| Q | Quit | 0 | 0 | Quit ffacalc, after printing stack contents |
| **Not Yet Defined:** | | | | |
| ! | Undefined | 0 | 0 | Prohibited |
| @ | Undefined | 0 | 0 | Prohibited |
| $ | Undefined | 0 | 0 | Prohibited |
| : | Undefined | 0 | 0 | Prohibited |
| ; | Undefined | 0 | 0 | Prohibited |
| , | Undefined | 0 | 0 | Prohibited |
| G | Undefined | 0 | 0 | Prohibited |
| H | Undefined | 0 | 0 | Prohibited |
| I | Undefined | 0 | 0 | Prohibited |
| J | Undefined | 0 | 0 | Prohibited |
| K | Undefined | 0 | 0 | Prohibited |
| N | Undefined | 0 | 0 | Prohibited |
| P | Undefined | 0 | 0 | Prohibited |
| T | Undefined | 0 | 0 | Prohibited |
| V | Undefined | 0 | 0 | Prohibited |
| X | Undefined | 0 | 0 | Prohibited (outside of MX) |
| Y | Undefined | 0 | 0 | Prohibited |

Observe that the modular exponentiation command is now MX, and modular multiplication is now M*. This is a cleaner notation than what was given previously, and permits room for other modular variants of common arithmetic operations, as well as e.g. non-modular exponentiation (in a future chapter.)

Observe also that there are three new operations given: W, RS, and LS. The remainder of this chapter is devoted to illuminating these.

Let’s begin with W, or Width-Measure:

```
-- ...
-- Find the position of eldest nonzero bit, if any exist
when 'W' =>
   Want(1);
   declare
      Measure : Word;
   begin
      -- Find the measure ( 0 if no 1s, or 1 .. FZBitness )
      Measure := FFA_FZ_Measure(Stack(SP));
      -- Put on top of stack
      FFA_FZ_Clear(Stack(SP));
      FFA_FZ_Set_Head(Stack(SP), Measure);
   end;
```

So let’s see how FFA_FZ_Measure works:

```
-- Find the index of eldest nonzero bit ( 0 if none, or 1 .. FZBitness )
function FZ_Measure(N : in FZ) return Word is

   -- The result (default : 0, will remain 0 if N is in fact zero)
   Index : Word := 0;

   -- The currently-examined Word, starting from the junior-most
   Ni : Word;

   -- The most recently-seen nonzero Word, if indeed any exist
   W : Word := 0;

   -- 1 if currently-examined Word is zero, otherwise 0
   NiZ : WBool;

begin

   -- Find, in constant time, eldest non-zero Word:
   for i in 0 .. Indices(N'Length - 1) loop
      Ni    := N(N'First + i);             -- Ni := current Word;
      NiZ   := W_ZeroP(Ni);                -- ... is this Word = 0?
      Index := W_Mux(Word(i), Index, NiZ); -- If NO, save the Index,
      W     := W_Mux(Ni, W, NiZ);          -- ... and save that Word.
   end loop;

   -- Set Index to be the bit-position of the eldest non-zero Word:
   Index := Shift_Left(Index, BitnessLog2); -- Index := Index * Bitness

   -- Find, in constant time, eldest non-zero bit in that Word:
   for b in 1 .. Bitness loop
      -- If W is non-zero, advance the Index...
      Index := W_Mux(Index + 1, Index, W_ZeroP(W));
      -- ... in either case, advance W:
      W := Shift_Right(W, 1);
   end loop;

   -- If N = 0, result will be 0; otherwise: index of the eldest 1 bit.
   return Index;

end FZ_Measure;
```

The key to grasping this FFA routine, just the same as the others, is to think: “How would I do this as a constant-spacetime operation?” Chances are, you would come up with exactly the same algorithm, if you were to sit down with a clean sheet of paper and give it a try.

We want to find the eldest 1-bit of the input N (supposing any exist) by performing the same set of mechanical operations regardless of what the contents of N are.

First, we walk through all of the Words of N, from junior-most to senior-most, while saving the position of the highest Word we have found to be nonzero.

After this, we do the exact same thing for the individual bits in that Word (or, observe, those of a null Word, if no non-zero Words exist in N.)

In the end, we will have found a value Index, which represents the position of the highest bit in N that is equal to 1. If no such bits exist, Index will equal 0 — and we do this without taking any branches or memory-accesses that depend on the contents of N.

Observe also that the process works not only in constant time, but also regardless of the particular array indexing frame of N — this will come in handy when FZ_Measure is later used as an internal subcomponent of other FFA operations.
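The two passes can be transliterated into Python over a list of 8-bit Words, junior-most first. This is an illustrative sketch of the algorithm only, not FFA's Ada; the helper names are mine, and the Python conditional stands in for W_Mux.

```python
BITNESS = 8   # a toy Word size, for illustration

def w_mux(a, b, sel):
    # Models W_Mux: returns a when sel = 0, b when sel = 1.
    return b if sel else a

def fz_measure(words):
    # words: the FZ as a list of Words, junior-most first.
    index, w = 0, 0
    # Pass 1: position of the senior-most nonzero Word (and that Word).
    for i, wi in enumerate(words):
        wiz = 1 if wi == 0 else 0
        index = w_mux(i, index, wiz)
        w = w_mux(wi, w, wiz)
    # Convert the Word position into a bit position:
    index *= BITNESS
    # Pass 2: senior-most 1-bit within that Word.
    for _ in range(BITNESS):
        index = w_mux(index + 1, index, 1 if w == 0 else 0)
        w >>= 1
    return index   # 0 if all Words were zero, else 1-based bit position

assert fz_measure([0x0D, 0xF0, 0xAD, 0xDE]) == 32   # 0xDEADF00D -> 0x20
```

The 0xDEADF00D example reproduces, in miniature, the third ffacalc tape shown below.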

Play with W to get a sense of what it’s good for, e.g.:

```
$ echo ".W#" | ./bin/ffa_calc 256 32
0000000000000000000000000000000000000000000000000000000000000000
```
```
$ echo ".1W#" | ./bin/ffa_calc 256 32
0000000000000000000000000000000000000000000000000000000000000001
```
```
$ echo ".DEADF00DW#" | ./bin/ffa_calc 256 32
0000000000000000000000000000000000000000000000000000000000000020
```
```
$ echo ".~W#" | ./bin/ffa_calc 256 32
0000000000000000000000000000000000000000000000000000000000000100
```
```
$ echo ".~W#" | ./bin/ffa_calc 4096 32
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000001000
```

When you have satisfied yourself that you understand W, we can proceed to the remaining new operations, the Quiet Shifts:

```
---------------------------------------------------------
-- Left...
when 'L' =>

   -- Which L-op?
   case O is

      -- ... Shift :
      when 'S' =>
         Want(2);
         declare
            -- Number of bit positions to shift by:
            ShiftCount : FZBit_Index
              := FZBit_Index(FFA_FZ_Get_Head(Stack(SP)));
         begin
            FFA_FZ_Quiet_ShiftLeft(N        => Stack(SP - 1),
                                   ShiftedN => Stack(SP - 1),
                                   Count    => ShiftCount);
         end;
         Drop;

   -- ................

---------------------------------------------------------
-- Right...
when 'R' =>

   -- Which R-op?
   case O is

      -- ... Shift:
      when 'S' =>
         Want(2);
         declare
            -- Number of bit positions to shift by:
            ShiftCount : FZBit_Index
              := FZBit_Index(FFA_FZ_Get_Head(Stack(SP)));
         begin
            FFA_FZ_Quiet_ShiftRight(N        => Stack(SP - 1),
                                    ShiftedN => Stack(SP - 1),
                                    Count    => ShiftCount);
         end;
         Drop;
```

Let’s dig into the two new mechanisms: FFA_FZ_Quiet_ShiftLeft and FFA_FZ_Quiet_ShiftRight:

```
-- Constant-time arbitrary Right-Shift.
procedure FZ_Quiet_ShiftRight(N        : in     FZ;
                              ShiftedN : in out FZ;
                              Count    : in     FZBit_Index) is

   -- Total number of bit positions to shift by
   C : constant Word := Word(Count);

   -- Number of sub-Word bit positions to shift by
   Bits : constant Natural := Natural(C and (2**BitnessLog2 - 1));

   -- The Bitness of N's Length
   Wb : constant Positive := FZ_Bitness_Log2(N);

   -- Number of whole-Word bitnesses to shift by
   Words : Word := Shift_Right(C, BitnessLog2);

   -- Current 'shiftness level'
   S : Indices := 1;

begin

   -- Subword shift first:
   if HaveBarrelShifter then
      -- If permitted, use iron shifter:
      FZ_ShiftRight(N, ShiftedN, Bits);
   else
      -- Otherwise, use the soft subword shifter:
      FZ_Quiet_ShiftRight_SubW_Soft(N, ShiftedN, Bits);
   end if;

   -- Then whole-Word shift:
   for i in 1 .. Wb loop

      declare

         -- Current bit of Words
         WordsBit : constant WBool := Words and 1;

      begin

         -- Shift at the current shiftness
         for i in ShiftedN'First .. ShiftedN'Last - S loop
            ShiftedN(i) := W_Mux(ShiftedN(i), ShiftedN(i + S), WordsBit);
         end loop;

         -- Fill the emptiness
         for i in ShiftedN'Last - S + 1 .. ShiftedN'Last loop
            ShiftedN(i) := W_Mux(ShiftedN(i), 0, WordsBit);
         end loop;

         -- Go to the next shiftness level
         S     := S * 2;
         Words := Shift_Right(Words, 1);

      end;

   end loop;

end FZ_Quiet_ShiftRight;


-- Constant-time arbitrary Left-Shift.
procedure FZ_Quiet_ShiftLeft(N        : in     FZ;
                             ShiftedN : in out FZ;
                             Count    : in     FZBit_Index) is

   -- Total number of bit positions to shift by
   C : constant Word := Word(Count);

   -- Number of sub-Word bit positions to shift by
   Bits : constant Natural := Natural(C and (2**BitnessLog2 - 1));

   -- The Bitness of N's Length
   Wb : constant Positive := FZ_Bitness_Log2(N);

   -- Number of whole-Word bitnesses to shift by
   Words : Word := Shift_Right(C, BitnessLog2);

   -- Current 'shiftness level'
   S : Indices := 1;

begin

   -- Subword shift first:
   if HaveBarrelShifter then
      -- If permitted, use iron shifter:
      FZ_ShiftLeft(N, ShiftedN, Bits);
   else
      -- Otherwise, use the soft subword shifter:
      FZ_Quiet_ShiftLeft_SubW_Soft(N, ShiftedN, Bits);
   end if;

   -- Then whole-Word shift:
   for i in 1 .. Wb loop

      declare

         -- Current bit of Words
         WordsBit : constant WBool := Words and 1;

      begin

         -- Shift at the current shiftness
         for i in reverse ShiftedN'First + S .. ShiftedN'Last loop
            ShiftedN(i) := W_Mux(ShiftedN(i), ShiftedN(i - S), WordsBit);
         end loop;

         -- Fill the emptiness
         for i in ShiftedN'First .. ShiftedN'First + S - 1 loop
            ShiftedN(i) := W_Mux(ShiftedN(i), 0, WordsBit);
         end loop;

         -- Go to the next shiftness level
         S     := S * 2;
         Words := Shift_Right(Words, 1);

      end;

   end loop;

end FZ_Quiet_ShiftLeft;
```

The leitmotif here is exactly the same as in FZ_Measure: in either the left or right case, we find the amount of whole-Word motion and the amount of sub-Word motion; these feed into separate operations that perform the requisite amount of shifting in constant time.
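The whole-Word stage, in isolation, can be pictured as a ladder of mux layers: the layer at level s moves the register by s Words if and only if the corresponding bit of the shift count is set. A Python sketch of the right-shift case (a toy model; names are mine, and the Python conditional stands in for W_Mux):

```python
def quiet_shift_right_words(words, count):
    # Shift an FZ (list of Words, junior-most first) right by `count`
    # whole Words, visiting the same positions regardless of `count`.
    n = list(words)
    length = len(n)      # a power of two, as FFA register lengths are
    s = 1                # current shiftness level, in Words
    while s < length:
        bit = count & 1  # does this level of the ladder participate?
        for i in range(length - s):
            n[i] = n[i + s] if bit else n[i]
        for i in range(length - s, length):
            n[i] = 0 if bit else n[i]
        s *= 2
        count >>= 1
    return n

assert quiet_shift_right_words([1, 2, 3, 4], 3) == [4, 0, 0, 0]
```

Every Word position is read and written at every level, whatever the count; only the mux selectors differ. This is exactly the property that keeps the shift distance out of the timing side-channel.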

The attentive reader will have noticed the presence of FZ_Bitness_Log2, a new routine. This is a simple mechanism to determine the “bitness” of the Word count of the FZ N:

```
-- Determine the Bitness of the given FZ's Length
function FZ_Bitness_Log2(N : in FZ) return Positive is
   W : Bit_Count := N'Length;
   R : Positive  := 1;
begin
   while W > 1 loop
      W := W / 2;
      R := R + 1;
   end loop;
   return R - 1;
end FZ_Bitness_Log2;
```

And now it is time for… an exercise!

Chapter 13 Exercise #1:

Explain why the branching logic appearing in FZ_Bitness_Log2 does not compromise the constant-time operation of FZ_Quiet_ShiftLeft and FZ_Quiet_ShiftRight.

When you have solved the Exercise, we can move on to the other interesting components of the Quiet Shifts routines: FZ_Quiet_ShiftLeft_SubW_Soft and FZ_Quiet_ShiftRight_SubW_Soft.

On some particularly-sad machine architectures, there is no barrel shifter.

And therefore, FZ_ShiftRight and FZ_ShiftLeft will bleed, via a timing side-channel, the number of positions being shifted. This is absolutely prohibited in FFA, and therefore we must introduce a knob which toggles a set of optional “soft” fallback shifters:

```
-- Whether we have a barrel shifter:
HaveBarrelShifter : constant Boolean := True;
```

These being:

```
-- Constant-time subword shift, for where there is no barrel shifter
procedure FZ_Quiet_ShiftRight_SubW_Soft(N        : in     FZ;
                                        ShiftedN : in out FZ;
                                        Count    : in     WBit_Index) is
   Nw : constant Word  := Word(Count);
   nC : constant WBool := W_ZeroP(Nw); -- 'no carry' for Count == 0 case
   Ni : Word := 0;     -- Current word
   C  : Word := 0;     -- Current carry
   S  : Positive;      -- Current shiftness level
   B  : Word;          -- Quantity of shift (bitwalked over)
   CB : Word;          -- Quantity of carry counter-shift (bitwalked over)
   St : Word;          -- Temporary word shift candidate
   Ct : Word;          -- Temporary carry counter-shift candidate
begin
   for i in reverse N'Range loop
      Ni          := N(i);
      ShiftedN(i) := C;
      C           := W_Mux(Ni, 0, nC);
      S           := 1;
      B           := Nw;
      CB          := Word(Bitness) - B;
      -- For each shift level (of the subword shiftvalue width) :
      for j in 1 .. BitnessLog2 loop
         -- Shift and mux the current word
         St := Shift_Right(Ni, S);
         Ni := W_Mux(Ni, St, B and 1);
         -- Shift and mux the current carry
         Ct := Shift_Left(C, S);
         C  := W_Mux(C, Ct, CB and 1);
         -- Go to the next shiftness level
         S  := S * 2;
         B  := Shift_Right(B, 1);
         CB := Shift_Right(CB, 1);
      end loop;
      -- Slide in the carry from the previous shift
      ShiftedN(i) := ShiftedN(i) or Ni;
   end loop;
end FZ_Quiet_ShiftRight_SubW_Soft;


-- Constant-time subword shift, for where there is no barrel shifter
procedure FZ_Quiet_ShiftLeft_SubW_Soft(N        : in     FZ;
                                       ShiftedN : in out FZ;
                                       Count    : in     WBit_Index) is
   Nw : constant Word  := Word(Count);
   nC : constant WBool := W_ZeroP(Nw); -- 'no carry' for Count == 0 case
   Ni : Word := 0;     -- Current word
   C  : Word := 0;     -- Current carry
   S  : Positive;      -- Current shiftness level
   B  : Word;          -- Quantity of shift (bitwalked over)
   CB : Word;          -- Quantity of carry counter-shift (bitwalked over)
   St : Word;          -- Temporary word shift candidate
   Ct : Word;          -- Temporary carry counter-shift candidate
begin
   for i in N'Range loop
      Ni          := N(i);
      ShiftedN(i) := C;
      C           := W_Mux(Ni, 0, nC);
      S           := 1;
      B           := Nw;
      CB          := Word(Bitness) - B;
      -- For each shift level (of the subword shiftvalue width) :
      for j in 1 .. BitnessLog2 loop
         -- Shift and mux the current word
         St := Shift_Left(Ni, S);
         Ni := W_Mux(Ni, St, B and 1);
         -- Shift and mux the current carry
         Ct := Shift_Right(C, S);
         C  := W_Mux(C, Ct, CB and 1);
         -- Go to the next shiftness level
         S  := S * 2;
         B  := Shift_Right(B, 1);
         CB := Shift_Right(CB, 1);
      end loop;
      -- Slide in the carry from the previous shift
      ShiftedN(i) := ShiftedN(i) or Ni;
   end loop;
end FZ_Quiet_ShiftLeft_SubW_Soft;
```

However, note that HaveBarrelShifter is enabled by default in canonical FFA.

If you suspect that you are running FFA on iron which lacks a barrel shifter, you are responsible for determining whether this is so, and then, if the answer is yes, disabling the toggle:

```
-- Whether we have a barrel shifter:
HaveBarrelShifter : constant Boolean := False;
```

… but in either case you must fully grasp how the fallback mechanism operates — you never know when you will need it, and at any rate: you ought not to make battlefield use of cryptographic code without first having fitted it into your head!
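To see the bit-walking idea in miniature, here is a sketch of the same trick in Python (illustration only, not part of FFA; the 64-bit word width and the arithmetic mux are assumptions of the sketch): the shift count is never branched on, only walked over, bit by bit, with a conditional shift of 1, 2, 4, … applied via mux at every level regardless of the count's value.

```python
WIDTH = 64
MASK = (1 << WIDTH) - 1

def mux(a, b, sel):
    # Branch-free select: returns a when sel == 0, b when sel == 1.
    return (a * (1 - sel) + b * sel) & MASK

def quiet_shift_right(word, count, log2_width=6):
    # Walk the bits of 'count'; at level k, conditionally shift by 2^k.
    # Every level is visited no matter what 'count' is.
    n = word
    s = 1
    b = count
    for _ in range(log2_width):
        shifted = (n >> s) & MASK
        n = mux(n, shifted, b & 1)
        s *= 2
        b >>= 1
    return n
```

The total work is always log2(WIDTH) mux steps, independently of the shift count — which is the whole point of the exercise.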

And now, time for another exercise:

Chapter 13 Exercise #2:

Bit-rotation routines have not yet been given in FFA. Write a set of working right- and left-rotators of arbitrary integers as ffacalc tapes. Ensure that they Do The Right Thing if supplied with deltas which exceed the given FZ Bitness.

In Chapter 14, we will meet with a serious constructive use for FFA_FZ_Measure, FFA_FZ_Quiet_ShiftLeft and FFA_FZ_Quiet_ShiftRight. Stay tuned!

~To be continued!~

## “Finite Field Arithmetic.” Chapter 12B: Karatsuba Redux. (Part 2 of 2)

This article is part of a series of hands-on tutorials introducing FFA, or the Finite Field Arithmetic library. FFA differs from the typical “Open Sores” abomination, in that — rather than trusting the author blindly with their lives — prospective users are expected to read and fully understand every single line. In exactly the same manner that you would understand and pack your own parachute. The reader will assemble and test a working FFA with his own hands, and at the same time grasp the purpose of each moving part therein.

You will need:

Add the above vpatch and seal to your V-set, and press to ffa_ch12_karatsuba_redux.kv.vpatch.

You should end up with the same directory structure as previously.

Now compile ffacalc:

```
cd ffacalc
gprbuild
```

But do not run it quite yet.

This is where we will wrap up the subject of multiplication, and thereafter introduce no major changes to the respective routines, until the concluding chapters of the FFA series — where we will discuss optional platform-specific optimizations.

Is the algorithmic cost of integer squaring the same as that of multiplication?

Intuitively it seems that calculating X^2 for a b-bit X ought to be cheaper than multiplying a b-bit X by a b-bit Y, where X ≠ Y. But just how much cheaper? Let’s find out.

Consider this trivial lemma:

For all integers x, y:

(x + y)^2 - (x - y)^2 = 4xy

From this, it follows that the product of any b-bit X and b-bit Y could be expressed as two squarings, one addition, two subtractions, and a division by 4 (i.e. bit-shift right by two places.)

Therefore a squaring has at least half the cost of a multiplication. This is a lower bound, i.e. it is demonstrably impossible for squaring to be asymptotically cheaper than multiplication by a factor greater than two — since any multiplication can be expressed as a difference of two squares.
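The construction is easily verified mechanically; a throwaway Python illustration (not FFA code) of multiplication expressed as two squarings, one addition, two subtractions, and a right-shift by two places:

```python
def mul_via_squares(x, y):
    # (x + y)^2 - (x - y)^2 = 4xy, so xy is the difference, shifted right twice.
    return ((x + y) ** 2 - (x - y) ** 2) >> 2
```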

However, in physical practice, we will find that squaring costs somewhat more than half of what multiplication costs: there are similar constant (O(1)) and linear (O(b)) expense factors that go into setting the stage for either process.

There is also the fact that (x + y) is a b+1-bit quantity, making for a gnarly impedance mismatch with our “all integers will have power-of-two bit lengths” dictum.

And therefore we will not be rewriting all FFA integer multiplication in terms of difference-of-squares. But we will have a reasonably-optimized squaring routine, given as it is a quite-common operation (in, e.g. modular exponentiation.) Let’s see how close to the theoretical minimum its cost can be brought.

Review the Karatsuba equivalences given in Chapter 12A:

LL = XLo × YLo
HH = XHi × YHi

Dx = |XLo - XHi|
Dy = |YLo - YHi|

DD = Dx × Dy

DDSub = CX XNOR CY

XY = LL + 2^b(LL + HH + (-1)^DDSub × DD) + 2^(2b)HH

Now, let Y = X. Consequently:

LL = XLo × XLo
HH = XHi × XHi

Dx = |XLo - XHi|
DD = Dx^2

… elementarily. And as for:

DDSub = CX XNOR CX

… now permanently equals 1, no matter what, and therefore we no longer need to care about the value of CX, the “borrow” bit from computing Dx. And so we get the following equation for Karatsuba’s squaring:

XX = LL + 2^b(LL + HH - DD) + 2^(2b)HH
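Readers who like to check equations numerically before reading the Ada can model the split in a few lines of Python (illustration only; the half-width b here is an arbitrary choice of the sketch):

```python
def karatsuba_square(x, b):
    # Split the (at most 2b-bit) x into b-bit halves.
    lo = x & ((1 << b) - 1)
    hi = x >> b
    ll = lo * lo
    hh = hi * hi
    dd = abs(lo - hi) ** 2          # DD := Dx^2, always positive
    # XX = LL + 2^b (LL + HH - DD) + 2^(2b) HH
    return ll + ((ll + hh - dd) << b) + (hh << (2 * b))
```

Since LL + HH - DD = 2 × XLo × XHi, this is just the schoolbook (lo + hi·2^b)^2 with the cross term obtained from squarings alone.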

Let’s now make the “tasty sandwich” illustration of this equation, in the style of chapters 10 and 12A: (as before, junior bits of registers are on the left hand side, senior — on right hand) :

```
  [      LL      |      HH      ]       TC := 0
        + LL                            TC += Carry
        + HH                            TC += Carry
        - DD                            TC -= Borrow
                        + TC
  =  XX
```

And just as before, we know that:

LL + HH >= DD

Therefore, at the time of the final carry ripple, it will always remain true that 0 <= TC <= 2, i.e. it will never be necessary to ripple a “borrow” into the senior-most quarter of the result XX.

We can now confidently write a Karatsuba’s Squaring routine. But first, let’s introduce a few necessary parts:

FZ_Sub_D is quite similar to the FZ_Add_D discussed in Chapter 12A; but here, we subtract, rather than add, in-place:

```
-- ...

-- Destructive Sub: X := X - Y; Underflow := Borrow
procedure FZ_Sub_D(X         : in out FZ;
                   Y         : in     FZ;
                   Underflow : out    WBool) is
   Borrow : WBool := 0;
begin
   for i in 0 .. Word_Index(X'Length - 1) loop
      declare
         A : constant Word := X(X'First + i);
         B : constant Word := Y(Y'First + i);
         S : constant Word := A - B - Borrow;
      begin
         X(X'First + i) := S;
         Borrow         := W_Borrow(A, B, S);
      end;
   end loop;
   Underflow := Borrow;
end FZ_Sub_D;
```
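The borrow ripple can be modeled outside of Ada; a Python sketch (illustration only; 64-bit words assumed) over little-endian word arrays, in the spirit of FZ:

```python
WIDTH = 64
MASK = (1 << WIDTH) - 1

def fz_sub_d(x, y):
    # x := x - y, in place, word by word; returns the final borrow
    # (1 on underflow, just as FZ_Sub_D's Underflow output).
    borrow = 0
    for i in range(len(x)):
        d = x[i] - y[i] - borrow
        x[i] = d & MASK              # keep the low WIDTH bits
        borrow = 1 if d < 0 else 0   # borrow out iff the true difference went negative
    return borrow
```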

We will also dispense with the pragma Assert mechanism for enforcing the mandatory nullity of the carry after the final TC ripple-out, in favour of this cleanly-adatronic device:

```
-- ...

-- For when a computed Word is mandatorily expected to equal zero:
subtype WZeroOrDie is Word range 0 .. 0;
```

Any assignment of a value other than zero to a variable of this ranged subtype will trigger a CONSTRAINT_ERROR exception, signaling a catastrophic failure (of your iron! — there is no other way that this can happen) — and bring the program to a full stop.

Likewise, we will use a similar ranged subtype for enforcing the 0 <= TC <= 2 constraint. (And, note, the chapter 12B version of regular Karatsuba multiplication has been brought into conformance with this style.) The pieces in question are identical in the regular and squaring forms of Karatsuba:

```
-- Barring a cosmic ray, 0 <= TC <= 2
subtype TailCarry is Word range 0 .. 2;

-- Tail-Carry accumulator, for the final ripple-out into XXHiHi
TC : TailCarry := 0;

-- Barring a cosmic ray, the tail ripple will NOT overflow.
FinalCarry : WZeroOrDie := 0;
```

And now we can make the entire Karatsuba’s Squaring routine:

```
-- Karatsuba's Squaring. (CAUTION: UNBUFFERED)
procedure Sqr_Karatsuba(X  : in  FZ;
                        XX : out FZ) is

   -- L is the wordness of X. Guaranteed to be a power of two.
   L : constant Word_Count := X'Length;

   -- An 'LSeg' is the same length as X.
   subtype LSeg is FZ(1 .. L);

   -- K is HALF of the length of X.
   K : constant Word_Index := L / 2;

   -- A 'KSeg' is the same length as HALF of X.
   subtype KSeg is FZ(1 .. K);

   -- The three L-sized variables of the product equation, i.e.:
   -- XX = LL + 2^(K*Bitness)(LL + HH - DD) + 2^(2*K*Bitness)HH
   LL, DD, HH : LSeg;

   -- K-sized term Dx, Dx := abs(XLo - XHi) to then get DD := Dx^2
   Dx : KSeg;

   -- IGNORED Subtraction borrow, sign of (XL - XH)
   Cx : WBool; -- given as DD := Dx^2, DD is always positive
   pragma Unreferenced(Cx);

   -- Bottom and Top K-sized halves of X.
   XLo : KSeg renames X( X'First     .. X'Last - K );
   XHi : KSeg renames X( X'First + K .. X'Last     );

   -- L-sized middle segment of the product XX (+/- K from the midpoint).
   XXMid : LSeg renames XX( XX'First + K .. XX'Last - K );

   -- Bottom and Top L-sized halves of the product XX.
   XXLo : LSeg renames XX( XX'First     .. XX'Last - L );
   XXHi : LSeg renames XX( XX'First + L .. XX'Last     );

   -- Topmost K-sized quarter segment of the product XX, or 'tail'
   XXHiHi : KSeg renames XXHi( XXHi'First + K .. XXHi'Last );

   -- Carry from addition of HH and LL terms; borrow from subtraction of DD
   C : WBool;

   -- Barring a cosmic ray, 0 <= TC <= 2
   subtype TailCarry is Word range 0 .. 2;

   -- Tail-Carry accumulator, for the final ripple-out into XXHiHi
   TC : TailCarry := 0;

   -- Barring a cosmic ray, the tail ripple will NOT overflow.
   FinalCarry : WZeroOrDie := 0;

begin

   -- Recurse: LL := XLo^2
   FZ_Square_Unbuffered(XLo, LL);

   -- Recurse: HH := XHi^2
   FZ_Square_Unbuffered(XHi, HH);

   -- Dx := |XLo - XHi| , and we don't care about the borrow
   FZ_Sub_Abs(X => XLo, Y => XHi, Difference => Dx, Underflow => Cx);

   -- Recurse: DD := Dx^2 (always positive, which is why we don't need Cx)
   FZ_Square_Unbuffered(Dx, DD);

   -- XX := LL + 2^(2 * K * Bitness) * HH
   XXLo := LL;
   XXHi := HH;

   -- XX += 2^(K * Bitness) * HH, carry goes into Tail Carry accumulator.
   FZ_Add_D(X => XXMid, Y => HH, Overflow => TC);

   -- XX += 2^(K * Bitness) * LL, ...
   FZ_Add_D(X => XXMid, Y => LL, Overflow => C);

   -- ... add the carry from LL addition into the Tail Carry accumulator.
   TC := TC + C;

   -- XX -= 2^(K * Bitness) * DD
   FZ_Sub_D(X         => XXMid, -- Because DD is always positive,
            Y         => DD,    -- this term is always subtractive.
            Underflow => C);    -- C is the borrow from DD term subtraction

   -- Get final Tail Carry for the ripple by subtracting DD term's borrow
   TC := TC - C;

   -- Ripple the Tail Carry into the tail.
   FZ_Add_D_W(X => XXHiHi, W => TC, Overflow => FinalCarry);

end Sqr_Karatsuba; -- CAUTION: Inlining prohibited for Sqr_Karatsuba !
```

We will not be "chewing apart" Sqr_Karatsuba -- it is closely analogous to the Chapter 12A item, which the reader is presumed to have fully grasped.

Naturally, this Karatsuba will need its own base case logic for the recursive invocations (as we cannot use the one from FZ_Mul) :

```
-- Squaring. (CAUTION: UNBUFFERED)
procedure FZ_Square_Unbuffered(X  : in  FZ;
                               XX : out FZ) is
begin

   if X'Length <= Sqr_Karatsuba_Thresh then

      -- Base case:
      FZ_Mul.FZ_Mul_Comba(X, X, XX);

   else

      -- Recursive case:
      Sqr_Karatsuba(X, XX);

   end if;

end FZ_Square_Unbuffered;
```

... and a proper bufferization wrapper:

```
-- Squaring. Preserves the inputs.
procedure FZ_Square_Buffered(X     : in  FZ;
                             XX_Lo : out FZ;
                             XX_Hi : out FZ) is

   -- Product buffer.
   P : FZ(1 .. 2 * X'Length);

begin

   FZ_Square_Unbuffered(X, P);

   XX_Lo := P(P'First .. P'First + X'Length - 1);
   XX_Hi := P(P'First + X'Length .. P'Last);

end FZ_Square_Buffered;
```

But if you were to build this program, you would observe that the resulting squaring routine has a barely-detectable CPU-economy win over regular Karatsuba, i.e. Mul_Karatsuba(X => X, Y => X, XY => XX). Why?

The obvious answer is: most of the CPU cost of Karatsuba is paid in the base case. And so, here is what we really want:

```
procedure FZ_Square_Unbuffered(X  : in  FZ;
                               XX : out FZ) is
begin

   if X'Length <= Sqr_Karatsuba_Thresh then

      -- Base case:
      FZ_Sqr_Comba(X, XX);

   else

      -- Recursive case:
      Sqr_Karatsuba(X, XX);

   end if;

end FZ_Square_Unbuffered;
```

That is, we would like a specialized FZ_Sqr_Comba base case mechanism, which would take maximal advantage of the fact that we are computing X^2 rather than X × Y with X ≠ Y.

To find an approach to this problem, let's begin by reviewing our ordinary Comba's Multiplication routine:

```
-- Comba's multiplier. (CAUTION: UNBUFFERED)
procedure FZ_Mul_Comba(X  : in  FZ;
                       Y  : in  FZ;
                       XY : out FZ) is

   -- Words in each multiplicand
   L : constant Word_Index := X'Length;

   -- Length of Product, i.e. double the length of either multiplicand
   LP : constant Word_Index := 2 * L;

   -- 3-word Accumulator
   A2, A1, A0 : Word := 0;

   -- Type for referring to a column of XY
   subtype ColXY is Word_Index range 0 .. LP - 1;

   -- Compute the Nth (indexed from zero) column of the Product
   procedure Col(N : in ColXY;
                 U : in ColXY;
                 V : in ColXY) is

      -- The outputs of a Word multiplication
      Lo, Hi : Word;

      -- Carry for the Accumulator addition
      C : WBool;

      -- Sum for Accumulator addition
      Sum : Word;

   begin

      -- For lower half of XY, will go from 0 to N
      -- For upper half of XY, will go from N - L + 1 to L - 1
      for j in U .. V loop

         -- Hi:Lo := j-th Word of X * (N - j)-th Word of Y
         Mul_Word(X(X'First + j), Y(Y'First - j + N), Lo, Hi);

         -- Now add Hi:Lo into the Accumulator:

         -- A0 += Lo; C := Carry
         Sum := A0 + Lo;
         C   := W_Carry(A0, Lo, Sum);
         A0  := Sum;

         -- A1 += Hi + C; C := Carry
         Sum := A1 + Hi + C;
         C   := W_Carry(A1, Hi, Sum);
         A1  := Sum;

         -- A2 += C
         A2 := A2 + C;

      end loop;

      -- We now have the Nth (indexed from zero) word of XY
      XY(XY'First + N) := A0;

      -- Right-Shift the Accumulator by one Word width:
      A0 := A1;
      A1 := A2;
      A2 := 0;

   end Col;
   pragma Inline_Always(Col);

begin

   -- Compute the lower half of the Product:
   for i in 0 .. L - 1 loop

      Col(i, 0, i);

   end loop;

   -- Compute the upper half (sans last Word) of the Product:
   for i in L .. LP - 2 loop

      Col(i, i - L + 1, L - 1);

   end loop;

   -- The very last Word of the Product:
   XY(XY'Last) := A0;

end FZ_Mul_Comba;
```

FZ_Mul_Comba is a quite simple mechanism, and the reader is expected to have understood it in Chapter 9. The product XY is obtained by computing columnwise, starting from the junior-most Word of XY, and proceeding through to the senior-most; carry from each columnar summation is found in the accumulator A2:A1:A0 after the Word-sized right shift, and gets added to the next column.
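The columnar schedule is perhaps easiest to see in a high-level model; a Python sketch (illustration only — the real FZ_Mul_Comba ripples carries through the three-word accumulator A2:A1:A0, whereas this model simply lets Python's bignums hold the running carry):

```python
def comba_mul(x, y, width=64):
    # x, y: equal-length little-endian word arrays; returns the 2L-word product.
    L = len(x)
    mask = (1 << width) - 1
    acc = 0
    out = []
    for n in range(2 * L - 1):
        # Column n sums x[j] * y[n - j] over all in-range j.
        for j in range(max(0, n - L + 1), min(n, L - 1) + 1):
            acc += x[j] * y[n - j]
        out.append(acc & mask)   # the n-th word of the product
        acc >>= width            # carry rides into the next column
    out.append(acc & mask)       # the very last word of the product
    return out
```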

Now STOP! and... time for an exercise:

Chapter 12B Exercise #1:

Is it possible for A2:A1:A0 to overflow? And could this happen in Ch.12B FFA ? If not, why not?

Now let's suppose that we were to invoke the ordinary FZ_Mul_Comba on an X and Y consisting of 16 Words each, to produce a 32-Word XY product. This will not ordinarily happen in FFA, given our setting for the Karatsuba base case transition knob, but the illustration remains valid.

And let's also suppose that Y = X. The astute reader already anticipates that a certain portion of the work performed by ordinary FZ_Mul_Comba on such an input is redundant. So let's find out exactly where, so that we can conceive of a method for eliminating the redundancy.

Let's trace the execution of FZ_Mul_Comba in the above example. In square brackets, we will show which indices of multiplicands X and Y (arrays indexed from 1, in this illustration) are subjected to Mul_Word in a particular instance of the inner loop; N is the current column of the product being calculated.

We will mark in green all instances where an optimized Word x Word squaring ought to be taking place. And we will mark in yellow all cases where a Word x Word multiplication takes place unnecessarily, given as integer multiplication is commutative and we already have access to the result of that particular multiplication.

First, we trace the computation of the first half of the Comba X x X multiplication, i.e.:

```
-- Compute the lower half of the Product:
for i in 0 .. L - 1 loop

   Col(i, 0, i);

end loop;
```

And we get:

```
for i in 0 .. 16 - 1 loop

N=0;  for j in 0 .. 0 loop
      0 [ 1 x 1 ]
```

```
N=1;  for j in 0 .. 1 loop
      0 [ 1 x 2 ]  1 [ 2 x 1 ]

N=2;  for j in 0 .. 2 loop
      0 [ 1 x 3 ]  1 [ 2 x 2 ]  2 [ 3 x 1 ]

N=3;  for j in 0 .. 3 loop
      0 [ 1 x 4 ]  1 [ 2 x 3 ]  2 [ 3 x 2 ]  3 [ 4 x 1 ]

N=4;  for j in 0 .. 4 loop
      0 [ 1 x 5 ]  1 [ 2 x 4 ]  2 [ 3 x 3 ]  3 [ 4 x 2 ]  4 [ 5 x 1 ]

N=5;  for j in 0 .. 5 loop
      0 [ 1 x 6 ]  1 [ 2 x 5 ]  2 [ 3 x 4 ]  3 [ 4 x 3 ]  4 [ 5 x 2 ]  5 [ 6 x 1 ]

N=6;  for j in 0 .. 6 loop
      0 [ 1 x 7 ]  1 [ 2 x 6 ]  2 [ 3 x 5 ]  3 [ 4 x 4 ]  4 [ 5 x 3 ]  5 [ 6 x 2 ]  6 [ 7 x 1 ]

N=7;  for j in 0 .. 7 loop
      0 [ 1 x 8 ]  1 [ 2 x 7 ]  2 [ 3 x 6 ]  3 [ 4 x 5 ]  4 [ 5 x 4 ]  5 [ 6 x 3 ]  6 [ 7 x 2 ]  7 [ 8 x 1 ]

N=8;  for j in 0 .. 8 loop
      0 [ 1 x 9 ]  1 [ 2 x 8 ]  2 [ 3 x 7 ]  3 [ 4 x 6 ]  4 [ 5 x 5 ]  5 [ 6 x 4 ]  6 [ 7 x 3 ]  7 [ 8 x 2 ]
      8 [ 9 x 1 ]

N=9;  for j in 0 .. 9 loop
      0 [ 1 x 10 ]  1 [ 2 x 9 ]  2 [ 3 x 8 ]  3 [ 4 x 7 ]  4 [ 5 x 6 ]  5 [ 6 x 5 ]  6 [ 7 x 4 ]  7 [ 8 x 3 ]
      8 [ 9 x 2 ]  9 [ 10 x 1 ]

N=10; for j in 0 .. 10 loop
      0 [ 1 x 11 ]  1 [ 2 x 10 ]  2 [ 3 x 9 ]  3 [ 4 x 8 ]  4 [ 5 x 7 ]  5 [ 6 x 6 ]  6 [ 7 x 5 ]  7 [ 8 x 4 ]
      8 [ 9 x 3 ]  9 [ 10 x 2 ]  10 [ 11 x 1 ]

N=11; for j in 0 .. 11 loop
      0 [ 1 x 12 ]  1 [ 2 x 11 ]  2 [ 3 x 10 ]  3 [ 4 x 9 ]  4 [ 5 x 8 ]  5 [ 6 x 7 ]  6 [ 7 x 6 ]  7 [ 8 x 5 ]
      8 [ 9 x 4 ]  9 [ 10 x 3 ]  10 [ 11 x 2 ]  11 [ 12 x 1 ]

N=12; for j in 0 .. 12 loop
      0 [ 1 x 13 ]  1 [ 2 x 12 ]  2 [ 3 x 11 ]  3 [ 4 x 10 ]  4 [ 5 x 9 ]  5 [ 6 x 8 ]  6 [ 7 x 7 ]  7 [ 8 x 6 ]
      8 [ 9 x 5 ]  9 [ 10 x 4 ]  10 [ 11 x 3 ]  11 [ 12 x 2 ]  12 [ 13 x 1 ]

N=13; for j in 0 .. 13 loop
      0 [ 1 x 14 ]  1 [ 2 x 13 ]  2 [ 3 x 12 ]  3 [ 4 x 11 ]  4 [ 5 x 10 ]  5 [ 6 x 9 ]  6 [ 7 x 8 ]  7 [ 8 x 7 ]
      8 [ 9 x 6 ]  9 [ 10 x 5 ]  10 [ 11 x 4 ]  11 [ 12 x 3 ]  12 [ 13 x 2 ]  13 [ 14 x 1 ]

N=14; for j in 0 .. 14 loop
      0 [ 1 x 15 ]  1 [ 2 x 14 ]  2 [ 3 x 13 ]  3 [ 4 x 12 ]  4 [ 5 x 11 ]  5 [ 6 x 10 ]  6 [ 7 x 9 ]  7 [ 8 x 8 ]
      8 [ 9 x 7 ]  9 [ 10 x 6 ]  10 [ 11 x 5 ]  11 [ 12 x 4 ]  12 [ 13 x 3 ]  13 [ 14 x 2 ]  14 [ 15 x 1 ]
```

```
N=15; for j in 0 .. 15 loop
      0 [ 1 x 16 ]  1 [ 2 x 15 ]  2 [ 3 x 14 ]  3 [ 4 x 13 ]  4 [ 5 x 12 ]  5 [ 6 x 11 ]  6 [ 7 x 10 ]  7 [ 8 x 9 ]
      8 [ 9 x 8 ]  9 [ 10 x 7 ]  10 [ 11 x 6 ]  11 [ 12 x 5 ]  12 [ 13 x 4 ]  13 [ 14 x 3 ]  14 [ 15 x 2 ]  15 [ 16 x 1 ]
```

Now, we trace the execution of the second half of the X x X computation, i.e.:

```
-- Compute the upper half (sans last Word) of the Product:
for i in L .. LP - 2 loop

   Col(i, i - L + 1, L - 1);

end loop;
```

And we get:

``` for i in 16 .. 2 * 16 - 2 loop```

```
N=16; for j in 1 .. 15 loop
      1 [ 2 x 16 ]  2 [ 3 x 15 ]  3 [ 4 x 14 ]  4 [ 5 x 13 ]  5 [ 6 x 12 ]  6 [ 7 x 11 ]  7 [ 8 x 10 ]  8 [ 9 x 9 ]
      9 [ 10 x 8 ]  10 [ 11 x 7 ]  11 [ 12 x 6 ]  12 [ 13 x 5 ]  13 [ 14 x 4 ]  14 [ 15 x 3 ]  15 [ 16 x 2 ]

N=17; for j in 2 .. 15 loop
      2 [ 3 x 16 ]  3 [ 4 x 15 ]  4 [ 5 x 14 ]  5 [ 6 x 13 ]  6 [ 7 x 12 ]  7 [ 8 x 11 ]  8 [ 9 x 10 ]  9 [ 10 x 9 ]
      10 [ 11 x 8 ]  11 [ 12 x 7 ]  12 [ 13 x 6 ]  13 [ 14 x 5 ]  14 [ 15 x 4 ]  15 [ 16 x 3 ]

N=18; for j in 3 .. 15 loop
      3 [ 4 x 16 ]  4 [ 5 x 15 ]  5 [ 6 x 14 ]  6 [ 7 x 13 ]  7 [ 8 x 12 ]  8 [ 9 x 11 ]  9 [ 10 x 10 ]  10 [ 11 x 9 ]
      11 [ 12 x 8 ]  12 [ 13 x 7 ]  13 [ 14 x 6 ]  14 [ 15 x 5 ]  15 [ 16 x 4 ]

N=19; for j in 4 .. 15 loop
      4 [ 5 x 16 ]  5 [ 6 x 15 ]  6 [ 7 x 14 ]  7 [ 8 x 13 ]  8 [ 9 x 12 ]  9 [ 10 x 11 ]  10 [ 11 x 10 ]  11 [ 12 x 9 ]
      12 [ 13 x 8 ]  13 [ 14 x 7 ]  14 [ 15 x 6 ]  15 [ 16 x 5 ]

N=20; for j in 5 .. 15 loop
      5 [ 6 x 16 ]  6 [ 7 x 15 ]  7 [ 8 x 14 ]  8 [ 9 x 13 ]  9 [ 10 x 12 ]  10 [ 11 x 11 ]  11 [ 12 x 10 ]  12 [ 13 x 9 ]
      13 [ 14 x 8 ]  14 [ 15 x 7 ]  15 [ 16 x 6 ]

N=21; for j in 6 .. 15 loop
      6 [ 7 x 16 ]  7 [ 8 x 15 ]  8 [ 9 x 14 ]  9 [ 10 x 13 ]  10 [ 11 x 12 ]  11 [ 12 x 11 ]  12 [ 13 x 10 ]  13 [ 14 x 9 ]
      14 [ 15 x 8 ]  15 [ 16 x 7 ]

N=22; for j in 7 .. 15 loop
      7 [ 8 x 16 ]  8 [ 9 x 15 ]  9 [ 10 x 14 ]  10 [ 11 x 13 ]  11 [ 12 x 12 ]  12 [ 13 x 11 ]  13 [ 14 x 10 ]  14 [ 15 x 9 ]
      15 [ 16 x 8 ]

N=23; for j in 8 .. 15 loop
      8 [ 9 x 16 ]  9 [ 10 x 15 ]  10 [ 11 x 14 ]  11 [ 12 x 13 ]  12 [ 13 x 12 ]  13 [ 14 x 11 ]  14 [ 15 x 10 ]  15 [ 16 x 9 ]

N=24; for j in 9 .. 15 loop
      9 [ 10 x 16 ]  10 [ 11 x 15 ]  11 [ 12 x 14 ]  12 [ 13 x 13 ]  13 [ 14 x 12 ]  14 [ 15 x 11 ]  15 [ 16 x 10 ]

N=25; for j in 10 .. 15 loop
      10 [ 11 x 16 ]  11 [ 12 x 15 ]  12 [ 13 x 14 ]  13 [ 14 x 13 ]  14 [ 15 x 12 ]  15 [ 16 x 11 ]

N=26; for j in 11 .. 15 loop
      11 [ 12 x 16 ]  12 [ 13 x 15 ]  13 [ 14 x 14 ]  14 [ 15 x 13 ]  15 [ 16 x 12 ]

N=27; for j in 12 .. 15 loop
      12 [ 13 x 16 ]  13 [ 14 x 15 ]  14 [ 15 x 14 ]  15 [ 16 x 13 ]

N=28; for j in 13 .. 15 loop
      13 [ 14 x 16 ]  14 [ 15 x 15 ]  15 [ 16 x 14 ]

N=29; for j in 14 .. 15 loop
      14 [ 15 x 16 ]  15 [ 16 x 15 ]
```

```
N=30; for j in 15 .. 15 loop
      15 [ 16 x 16 ]
```

And, of course, the final Word of the result is obtained by:

```
-- The very last Word of the Product:
XY(XY'Last) := A0;
```

So we find precisely what we expected to find: nearly half of the CPU work performed by an invocation of ordinary FZ_Mul_Comba on two identical multiplicands is avoidable. So now let's take a shot at avoiding it, by writing a new Comba squaring subroutine:

```
-- Square case of Comba's multiplier. (CAUTION: UNBUFFERED)
procedure FZ_Sqr_Comba(X  : in  FZ;
                       XX : out FZ) is

   -- Words in each multiplicand
   L : constant Word_Index := X'Length;

   -- Length of Product, i.e. double the length of X
   LP : constant Word_Index := 2 * L;

   -- 3-word Accumulator
   A2, A1, A0 : Word := 0;

   Lo, Hi : Word; -- Output of WxW multiply/square

   -- Type for referring to a column of XX
   subtype ColXX is Word_Index range 0 .. LP - 1;

   procedure Accum is
      C   : WBool; -- Carry for the Accumulator addition
      Sum : Word;  -- Sum for Accumulator addition
   begin
      -- Now add double-Word Hi:Lo to accumulator A2:A1:A0:
      -- A0 += Lo; C := Carry
      Sum := A0 + Lo;
      C   := W_Carry(A0, Lo, Sum);
      A0  := Sum;
      -- A1 += Hi + C; C := Carry
      Sum := A1 + Hi + C;
      C   := W_Carry(A1, Hi, Sum);
      A1  := Sum;
      -- A2 += C
      A2 := A2 + C;
   end Accum;
   pragma Inline_Always(Accum);

   procedure SymmDigits(N    : in ColXX;
                        From : in ColXX;
                        To   : in ColXX) is
   begin
      for j in From .. To loop
         -- Hi:Lo := j-th * (N - j)-th Word, and then,
         Mul_Word(X(X'First + j), X(X'First - j + N), Lo, Hi);
         Accum;
         Accum; -- Accum += 2 * (Hi:Lo)
      end loop;
   end SymmDigits;
   pragma Inline_Always(SymmDigits);

   procedure SqrDigit(N : in ColXX) is
   begin
      Sqr_Word(X(X'First + N), Lo, Hi); -- Hi:Lo := Square(N-th digit)
      Accum;                            -- Accum += Hi:Lo
   end SqrDigit;
   pragma Inline_Always(SqrDigit);

   procedure HaveDigit(N : in ColXX) is
   begin
      -- Save the Nth (indexed from zero) word of XX:
      XX(XX'First + N) := A0;
      -- Right-Shift the Accumulator by one Word width:
      A0 := A1;
      A1 := A2;
      A2 := 0;
   end HaveDigit;
   pragma Inline_Always(HaveDigit);

   -- Compute the Nth (indexed from zero) column of the Product
   procedure Col(N : in ColXX;
                 U : in ColXX;
                 V : in ColXX) is
   begin
      -- The branch pattern depends only on FZ wordness
      if N mod 2 = 0 then
         -- If we're doing an EVEN-numbered column:
         SymmDigits(N, U, V - 1); -- Stop before V: it is the square case
         SqrDigit(V);             -- Compute the square case at V
      else
         -- If we're doing an ODD-numbered column:
         SymmDigits(N, U, V);     -- All of them are the symmetric case
      end if;
      -- After either even or odd column:
      HaveDigit(N); -- We have the N-th digit of the result.
   end Col;
   pragma Inline_Always(Col);

begin

   -- First col always exists:
   SqrDigit(ColXX'First);
   HaveDigit(ColXX'First);

   -- Compute the lower half of the Product:
   for i in 1 .. L - 1 loop
      Col(i, 0, i / 2);
   end loop;

   -- Compute the upper half (sans last Word) of the Product:
   for i in L .. LP - 2 loop
      Col(i, i - L + 1, i / 2);
   end loop;

   -- The very last Word of the Product:
   XX(XX'Last) := A0; -- Equiv. of XX(XX'First + 2*L - 1) := A0;

end FZ_Sqr_Comba;
```

This variant correctly isolates the Word x Word squarings, and avoids carrying out the symmetrically-redundant Word x Word multiplications.
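The same schedule, with the symmetry exploited, as a Python model (illustration only): off-diagonal word products are computed once and added twice, while the diagonal squares appear only in even-numbered columns.

```python
def comba_sqr(x, width=64):
    # x: little-endian word array; returns the 2L-word square of x.
    L = len(x)
    mask = (1 << width) - 1
    acc = 0
    out = []
    for n in range(2 * L - 1):
        # Symmetric pairs: j < n - j, each counted twice.
        for j in range(max(0, n - L + 1), (n + 1) // 2):
            acc += 2 * x[j] * x[n - j]
        if n % 2 == 0:
            acc += x[n // 2] * x[n // 2]   # the lone diagonal square
        out.append(acc & mask)
        acc >>= width
    out.append(acc & mask)
    return out
```

Observe that the even/odd test depends only on the column number, i.e. on the wordness of the operand, never on its bits.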

Now STOP! and... time for an exercise:

Chapter 12B Exercise #2:

Prove that the conditional branch statement in the Col routine:

```
if N mod 2 = 0 then
   -- If we're doing an EVEN-numbered column:
   SymmDigits(N, U, V - 1); -- Stop before V: it is the square case
   SqrDigit(V);             -- Compute the square case at V
else
   -- If we're doing an ODD-numbered column:
   SymmDigits(N, U, V);     -- All of them are the symmetric case
end if;
-- ...
```

... does not entail a branch on secret bits, i.e. an act that would violate the constant-time guarantee offered by the FFA system.

Done with the exercise? Let's carry on...

Notice anything missing? Of course, it's Sqr_Word -- we haven't defined it yet. So let's define it.

First, take a look at our existing portable Word x Word multiplier, Mul_Word:

```
-- Carry out X*Y mult, return lower word XY_LW and upper word XY_HW.
procedure Mul_Word(X     : in  Word;
                   Y     : in  Word;
                   XY_LW : out Word;
                   XY_HW : out Word) is

   -- Bottom half of multiplicand X
   XL : constant HalfWord := BottomHW(X);

   -- Top half of multiplicand X
   XH : constant HalfWord := TopHW(X);

   -- Bottom half of multiplicand Y
   YL : constant HalfWord := BottomHW(Y);

   -- Top half of multiplicand Y
   YH : constant HalfWord := TopHW(Y);

   -- XL * YL
   LL : constant Word := Mul_HalfWord_Iron(XL, YL);

   -- XL * YH
   LH : constant Word := Mul_HalfWord_Iron(XL, YH);

   -- XH * YL
   HL : constant Word := Mul_HalfWord_Iron(XH, YL);

   -- XH * YH
   HH : constant Word := Mul_HalfWord_Iron(XH, YH);

   -- Carry
   CL : constant Word :=
      TopHW(TopHW(LL) + BottomHW(LH) + BottomHW(HL));

begin

   -- Get the bottom half of the Product:
   XY_LW := LL + Shift_Left(LH + HL, HalfBitness);

   -- Get the top half of the Product:
   XY_HW := HH + TopHW(HL) + TopHW(LH) + CL;

end Mul_Word;
```

And now we will want to make a similar and equally-portable Word-squaring operator:

```
---------------------------------------------------------------------------
-- LET A CURSE FALL FOREVER on the authors of GCC, and on the Ada committee,
-- neither of whom saw it fit to decree a primitive which returns both
-- upper and lower halves of an iron MUL instruction's result. Consequently,
-- portable Mul_Word demands ~four~ MULs (and several additions and shifts);
-- while portable Sqr_Word demands ~three~ MULs (and likewise adds/shifts.)
-- If it were not for their idiocy, BOTH routines would weigh 1 CPU instr.!
---------------------------------------------------------------------------

-- Carry out X*X squaring, return lower word XX_LW and upper word XX_HW.
procedure Sqr_Word(X     : in  Word;
                   XX_LW : out Word;
                   XX_HW : out Word) is

   -- Bottom half of multiplicand X
   XL : constant HalfWord := BottomHW(X);

   -- Top half of multiplicand X
   XH : constant HalfWord := TopHW(X);

   -- XL^2
   LL : constant Word := Mul_HalfWord_Iron(XL, XL);

   -- XL * XH
   LH : constant Word := Mul_HalfWord_Iron(XL, XH);

   -- XH^2
   HH : constant Word := Mul_HalfWord_Iron(XH, XH);

   -- Carry
   CL : constant Word :=
      TopHW(TopHW(LL) + Shift_Left(BottomHW(LH), 1));

begin

   -- Get the bottom half of the Product:
   XX_LW := LL + Shift_Left(LH, HalfBitness + 1);

   -- Get the top half of the Product:
   XX_HW := HH + Shift_Left(TopHW(LH), 1) + CL;

end Sqr_Word;
```

Satisfy yourself that this works, and only then proceed.
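A Python rendition (illustration only; a hypothetical 64-bit word is assumed) of the three-multiplication squaring may help in satisfying yourself: the single cross term XL × XH is computed once and doubled by shifting one position further left.

```python
BITS = 64
HALF = BITS // 2
HMASK = (1 << HALF) - 1
WMASK = (1 << BITS) - 1

def sqr_word(x):
    # Three half-word multiplies instead of the four needed by Mul_Word.
    xl = x & HMASK
    xh = x >> HALF
    ll = xl * xl   # XL^2
    lh = xl * xh   # the cross term, needed only once
    hh = xh * xh   # XH^2
    # x^2 = ll + 2*lh*2^HALF + hh*2^BITS
    full = ll + (lh << (HALF + 1)) + (hh << BITS)
    return full & WMASK, full >> BITS   # (low word, high word)
```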

Now, we'll naturally want to find out what, if anything, all of these new moving parts achieve.

So let's add a squaring operator to our old friend ffacalc:

```
-- ...
---------------------------------------------------------
-- Ch. 12B:
-- Square, give bottom and top halves
when 'S' =>
   Want(1);
   Push;
   FFA_FZ_Square(X     => Stack(SP - 1),
                 XX_Lo => Stack(SP - 1),
                 XX_Hi => Stack(SP));
---------------------------------------------------------
```

Now STOP! and... time for an exercise:

Chapter 12B Exercise #3:

a) Create a ffacalc tape which obtains a random number, squares it using the above mechanism, and verifies that the result is equal to the output of a squaring carried out via ordinary multiplication.

b) Given the fact that your RNG (supposing it is a genuine TRNG!) is not able to produce output in constant time, how would you write the above tape such that you can verify that all of the squarings in fact take place in constant time? (Hint: ffacalc tapes can produce ffacalc tapes...)

Now let's run a Thousand Squares benchmark, on various FFA bitnesses -- including a few quite outrageous ones, by cryptographic standards:

Or, for those who prefer the raw numbers to the logarithmic plot,

Cost of 1000 Squaring Operations (sec):
FFA Bitness Ch.12B Karatsuba (Ordinary) Ch.12B Karatsuba (Optimized Squaring)
4096 0.035 0.024
8192 0.106 0.073
16384 0.328 0.226
32768 0.991 0.686
65536 3.000 2.080

It turns out that we have achieved a one-third reduction, vs. ordinary Karatsuba, in the cost of integer squaring.

It is possible to do better than this: a good chunk of the potential "win" from the optimized squaring, "evaporates" (on x86 and other common iron) into branch prediction friction inside FZ_Sqr_Comba, and into the wasteful Mul_Word and Sqr_Word portable Word x Word multipliers.

In the final chapters of the FFA series, we will consider some cures, including unrolled Comba multiplication and iron-specific inline ASM. These however will forever remain optional components -- as they are inevitably bought with significant cost to portability and clarity.

For now, we will stop here, and see what, if anything, our new optimized squaring method does to the cost of the king of expensive FFA operations: modular exponentiation (as used in e.g. RSA.)

So let's rewrite the Chapter 11 FZ_Mod_Exp to make use of Sqr_Karatsuba:

```
-- Modular Squaring: Product := X*X mod Modulus
procedure FZ_Mod_Sqr(X       : in  FZ;
                     Modulus : in  FZ;
                     Product : out FZ) is

   -- The wordness of both operands is equal:
   L : constant Indices := X'Length;

   -- Double-width register for squaring and modulus operations
   XX : FZ(1 .. L * 2);

   -- To refer to the lower and upper halves of the working register:
   XX_Lo : FZ renames XX(1     .. L);
   XX_Hi : FZ renames XX(L + 1 .. XX'Last);

begin

   -- XX_Lo:XX_Hi := X^2
   FZ_Square_Buffered(X, XX_Lo, XX_Hi);

   -- Product := XX mod M
   FZ_Mod(XX, Modulus, Product);

end FZ_Mod_Sqr;


-- Modular Exponent: Result := Base^Exponent mod Modulus
procedure FZ_Mod_Exp(Base     : in  FZ;
                     Exponent : in  FZ;
                     Modulus  : in  FZ;
                     Result   : out FZ) is

   -- Working register for the squaring; initially is copy of Base
   B : FZ(Base'Range) := Base;

   -- Copy of Exponent, for cycling through its bits
   E : FZ(Exponent'Range) := Exponent;

   -- Register for the Mux operation
   T : FZ(Result'Range);

   -- Buffer register for the Result
   R : FZ(Result'Range);

begin

   -- Result := 1
   WBool_To_FZ(1, R);

   -- For each bit of R width:
   for i in 1 .. FZ_Bitness(R) loop

      -- T := Result * B mod Modulus
      FZ_Mod_Mul(X => R, Y => B, Modulus => Modulus, Product => T);

      -- Sel is the current low bit of E;
      -- When Sel=0 -> Result := Result;
      -- When Sel=1 -> Result := T
      FZ_Mux(X => R, Y => T, Result => R, Sel => FZ_OddP(E));

      -- Advance to the next bit of E
      FZ_ShiftRight(E, E, 1);

      -- B := B^2 mod Modulus
      FZ_Mod_Sqr(X => B, Modulus => Modulus, Product => B);

   end loop;

   -- Output the Result:
   Result := R;

end FZ_Mod_Exp;
```
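The shape of this exponentiator — low-bit-first square-and-multiply, with a mux standing in for the forbidden branch — can be modeled in Python (illustration only; no constant-time property is claimed for this sketch):

```python
def mod_exp(base, exponent, modulus, bits):
    # B is squared each round; R picks up B's contribution whenever the
    # corresponding exponent bit is set (via a mux, not a branch, in FFA).
    b = base % modulus
    r = 1
    for _ in range(bits):
        t = (r * b) % modulus        # T := R * B mod Modulus
        sel = exponent & 1           # Sel := current low bit of E
        r = t if sel == 1 else r     # FZ_Mux(R, T, Sel)
        exponent >>= 1               # advance to the next bit of E
        b = (b * b) % modulus        # B := B^2 mod Modulus
    return r
```

Note that the loop runs for a fixed number of rounds regardless of the exponent's value — in FFA, for every bit of the register width.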

... and then perform the familiar "RSA benchmark", a la Chapter 9:

Or, for those who prefer the raw numbers to the logarithmic plot,

Cost of One Modular Exponentiation Operation (sec):
FFA Bitness Ch.12B Karatsuba (Ordinary) Ch.12B Karatsuba (Optimized Squaring)
1024 0.395 0.395
2048 2.895 2.895
4096 21.920 21.895
8192 169.729 169.394

It would appear that optimized squaring had virtually no effect! ...to the limit of the timer resolution!

It ought to be mentioned that the resolution of the timer used here is quite poor, i.e. it is the customary unix "time" command.

But let's find out what is happening! Recall the wunderwaffe that was promised to the reader in Chapter 12A:

"We will also make use of a simple means of profiling the execution of the FFA routines -- one that is unique in its simplicity, while generally inapplicable to heathen cryptographic libraries on account of their failure to avoid branching on operand bits."

And now recall that FFA operations do not branch on operand bits. This means, among other things, that it is possible to profile the execution of individual routines simply via selective nulling. Naturally, you will not obtain arithmetically-correct outputs from a thusly-mutilated FFA, but you will get a measure of the CPU cost of an individual component; nulling it out, provided that Gnat's optimizer is prevented from discarding anything else, will give an accurate approximation of a computation's cost minus that of the particular component.

So let's perform the necessary vivisections:

```
-- Modular Multiply: Product := X*Y mod Modulus
procedure FZ_Mod_Mul(X       : in  FZ;
                     Y       : in  FZ;
                     Modulus : in  FZ;
                     Product : out FZ) is

   -- The wordness of all three operands is equal:
   L : constant Indices := X'Length;

   -- Double-width register for multiplication and modulus operations
   XY : FZ(1 .. L * 2);

   -- To refer to the lower and upper halves of the working register:
   XY_Lo : FZ renames XY(1     .. L);
   XY_Hi : FZ renames XY(L + 1 .. XY'Last);

begin

   -- XY_Lo:XY_Hi := X * Y
   FZ_Multiply_Buffered(X, Y, XY_Lo, XY_Hi);

   -- Product := XY mod M
   -- FZ_Mod(XY, Modulus, Product);
   Product := XY_Lo;

end FZ_Mod_Mul;


-- Modular Squaring: Product := X*X mod Modulus
procedure FZ_Mod_Sqr(X       : in  FZ;
                     Modulus : in  FZ;
                     Product : out FZ) is

   -- The wordness of both operands is equal:
   L : constant Indices := X'Length;

   -- Double-width register for squaring and modulus operations
   XX : FZ(1 .. L * 2);

   -- To refer to the lower and upper halves of the working register:
   XX_Lo : FZ renames XX(1     .. L);
   XX_Hi : FZ renames XX(L + 1 .. XX'Last);

begin

   -- XX_Lo:XX_Hi := X^2
   FZ_Square_Buffered(X, XX_Lo, XX_Hi);

   -- Product := XX mod M
   -- FZ_Mod(XX, Modulus, Product);
   Product := XX_Lo;

end FZ_Mod_Sqr;
```

Naturally, this program will not produce a correct modular exponentiation output! But it does fool Gnat's optimizer, so that it will not drop the calls to FZ_Mod_Mul and FZ_Mod_Sqr, and so we can get a picture of what the effect of optimized squaring would be if the modular reduction step weren't there to drown out the signal.

And we get:

This experiment confirms that reader apeloyee was indeed correct: modular reduction in the form currently in use (Knuth's integer division method) -- is so outrageously expensive that it dwarfs the cost of all other components of FZ_Mod_Exp !

In Chapter 13, we will begin laying the groundwork for a modular reduction mechanism that is not hobbled by the use of the slow FZ_Mod Knuthian integer-division operation, and instead relies almost entirely on multiplication.

~To be continued!~