“Finite Field Arithmetic.” Chapter 15: Greatest Common Divisor.
This article is part of a series of handson tutorials introducing FFA, or the Finite Field Arithmetic library. FFA differs from the typical "Open Sores" abomination, in that  rather than trusting the author blindly with their lives  prospective users are expected to read and fully understand every single line. In exactly the same manner that you would understand and pack your own parachute. The reader will assemble and test a working FFA with his own hands, and at the same time grasp the purpose of each moving part therein.
 Chapter 1: Genesis.
 Chapter 2: Logical and Bitwise Operations.
 Chapter 3: Shifts.
 Chapter 4: Interlude: FFACalc.
 Chapter 5: "Egyptological" Multiplication and Division.
 Chapter 6: "Geological" RSA.
 Chapter 7: "Turbo Egyptians."
 Chapter 8: Interlude: Randomism.
 Chapter 9: "Exodus from Egypt" with Comba's Algorithm.
 Chapter 10: Introducing Karatsuba's Multiplication.
 Chapter 11: Tuning and Unified API.
 Chapter 12A: Karatsuba Redux. (Part 1 of 2)
 Chapter 12B: Karatsuba Redux. (Part 2 of 2)
 Chapter 13: "WidthMeasure" and "Quiet Shifts."
 Chapter 14A: Barrett's Modular Reduction. (Part 1 of 2)
 Chapter 14ABis: Barrett's Modular Reduction. (Physical Bounds Proof.)
 Chapter 14B: Barrett's Modular Reduction. (Part 2 of 2.)
 Chapter 15: Greatest Common Divisor.
(May 2020) WARNING: The algorithm and implementation given in this Chapter are subtly defective! Please see Chapter 21ABis for the corrected version and its proof.
You will need:
 A Keccakbased VTron (for this and all subsequent chapters.)
 All of the materials from Chapters 1  14B.
 ffa_ch15_gcd.kv.vpatch
 ffa_ch15_gcd.kv.vpatch.asciilifeform.sig
Add the above vpatches and seals to your Vset, and press to ffa_ch15_gcd.kv.vpatch.
You should end up with the same directory structure as previously.
As of Chapter 15, the versions of FFACalc and FFA are 254 and 254, respectively.
Now compile ffacalc:
cd ffacalc gprbuild
But do not run it quite yet.
First, the mail bag!
Reader diana_coman has given me to know that she has read and signed Chapters 7 and 8:
Thank you, reader diana_coman!
Now, let's eat the meat of this Chapter.
We'll start with two very minor extensions of FFACalc:
Op  Description  # Ins  # Outs  Notes 

R*  Multiply top and second item and place only the junior half of the product on the stack.  2  1  The "LowMultiply" from Ch. 14B. 
MS  Square the second item modulo the top item and place the result on the stack.  2  1  Conventional modular squaring. 
R* (i.e. RightMultiply) is simply the "Low Mul" helper routine from the previous chapter, now made available directly from FFACalc. It is arithmeticallyequivalent to an ordinary * multiplication followed by dropping the senior half of the product, but gives a twofold saving of CPU time by avoiding the calculation of the senior half to begin with.
MS (i.e. ModularSquare) is directly analogous to M* (Modular Multiply) as seen in Chapter 13.
Both of these new FFACalc operators simply expose previouslydiscussed routines, and therefore do not merit further discussion.
Two typos in the comments of Chapter 14B were found, on lines 258 and 261 of:
 (1) Ns := X >> Jm FZ_Quiet_ShiftRight(N => X, ShiftedN => Xs, Count => Bar.J);  (2) Z := X * Bm FZ_Multiply_Unbuffered(X => Bar.B, Y => Xs, XY => Z);
They have been corrected:
 (1) Xs := X >> Jm FZ_Quiet_ShiftRight(N => X, ShiftedN => Xs, Count => Bar.J);  (2) Z := Xs * Bm FZ_Multiply_Unbuffered(X => Bar.B, Y => Xs, XY => Z);
Finally, let's proceed to the main subject of Chapter 15: Greatest Common Divisor. We have a new FFACalc operator:
Op  Description  # Ins  # Outs  Notes 

G  Find the Greatest Common Divisor of the top and second item, and place on the stack.  2  1  GCD(0,0) is conventionally defined as 0. 
This is the Greatest Common Divisor operator, as originally defined by Euclid. Several constructive uses for it will become apparent in subsequent chapters.
There did not, to my knowledge, previously exist in the public literature a constantspacetime algorithm for GCD. However, it is not difficult to produce one. As a starting point, we will use Stein's Algorithm, also known as binary GCD (refer to Vol. 2 of D. Knuth's AOP, where a detailed analysis is found.)
It should be noted that there are several other classic algorithms for GCD (e.g. Lehmer's method). Unfortunately, none of them are suitable for a constanttime implementation, as every single one of them relies on periodically examining some number of bits in the working registers and performing a variant set of operations (i.e. branching) depending on their value. This leaks, via the timing sidechannel, information about the numbers being operated on  and is therefore absolutely prohibited in FFA.
Additionally, all known GCD algorithms run in quadratic time on their worstcase input. And per the definition of constanttime, all FFA algorithms must always run in the worstcase time. Hence, the simplest practical algorithm is best, as it will have the smallest constant factor in its runtime (as well as being the easiest to FitinHead !) Hence, we will be sticking with Stein's Binary GCD as the basis for our method.
Let's begin by looking at a commonlystudied recursive variant of this algorithm:
Algorithm 1: Stein's Recursive GCD.
For positive integers A and B:

If B > A:
return Stein(B, A) 
If B = 0:
return A 
If A and B are both even:
return 2 × Stein(A / 2, B / 2) 
If only A is even:
return Stein(A / 2, B) 
If only B is even:
return Stein(A, B / 2) 
Else:
return Stein((A  B) / 2, B)
Chapter 15 Exercise #1:
Prove that Algorithm 1 computes GCD(A, B) in a finite number of steps.
As it stands, this algorithm is not suitable for constanttime implementation. However, it illustrates the equivalences that we will use to construct one which is.
Let's illuminate how Stein's Algorithm works:
Step 1 forces the relationship A ≥ B to hold at the start of each recursive call.
Step 2 terminates the recursion when B is found to equal zero, and there is nothing further to do but to return the value of A  which will be equal to the sought GCD.
Step 3 determines whether the current values of A and B have a common factor of two (i.e. both are even numbers), and accumulates this common factor for later reintroduction into the computed GCD.
Steps 4 and 5 eliminate a possible noncommon factor of two in the current value of either A or B.
Finally, Step 6 makes use of the elementary fact that a difference between two odd numbers (A and B are both known to be odd at this point) is always even, in order to remove a known noncommon (with B) factor of two from the quantity A  B, and then proceeds to the next level of the recursion (similarly to the wellknown "kindergarten" variant of Euclid's original GCD  where only such subtractions take place.)
The reintroduction of the shared poweroftwo factor accumulated in Step 3 happens as a result of the unwinding of the recursion.
It is important to note that Steps 3, 4, and 5 are the reason why Stein's Algorithm converges in quadratic (i.e. O(N^{2}), where N is the total number of bits being operated on) time, rather than the O(max(A, B)) convergence time of "Euclid's subtractive GCD". These steps shave a single bit of length from A, B, or both whenever the respective quantities are even (i.e. have a zero for their juniormost bit.)
Recall the unsurprising Lemma 3 of Ch. 14ABis: if a small number is being subtracted from a muchlarger one, it will take very many subtractions to shorten the quantity being subtracted from by even a single bit  you are liable to die of old age waiting for Euclid's subtractive GCD to converge if the difference between the inputs is large.
The eventual result is that one of the quantities will equal zero, at which point Step 2 terminates the recursion. The zero always ends up in B, while the sought GCD ends up in A.
Now, let's transform Algorithm 1 into a similar but iterative one.
Observe that a constanttime algorithm must always execute exactly the same  from an algebraic point of view  computations, regardless of the particular inputs. Therefore, the subtractive step must be performed during every iteration of the loop, and likewise the divisionsbytwo must also take place during every iteration. Fortunately, the following equivalence holds for all integers A, B, and not merely when A  B is odd:
GCD(A, B) = GCD(A  B, Min(A, B))
Therefore we can dispense with the divisionbytwo in Step 6, and likewise with the "shortcuts" in Steps 15, and get the following modified algorithm:
Algorithm 2: Iterative QuasiStein GCD.
For positive integers A and B:
 Twos := 0
 Iterate until B = 0:
 A_{e}, B_{e} := IsEven(A), IsEven(B)
 A := A >> A_{e}
 B := B >> B_{e}
 Twos := Twos + (A_{e} AND B_{e})
 B_{next} := Min(A, B)
 A_{next} := A  B
 A, B := A_{next}, B_{next}
 A := A × 2^{Twos}
 A contains the GCD.
Algorithm 2 is not yet constanttime, but the missing ingredient should at this point be quite apparent to the astute reader.
Observe that once B = 0, the value of A at step 10 will not be affected by any number of "redundant" iterations of the loop. (Wrong! See Ch. 21ABis!)
This fact is the key to deriving a constanttime, i.e. alwaysworstcase version of Algorithm 2. Let's write it out in a form directly suitable for implementation in FFA:
Algorithm 3: ConstantTime Iterative GCD.
For FZ integers A and B:
 Twos := 0
 Iterate Bitness(A) + Bitness(B) times:
 A_{e} := 1  (A AND 1)
 B_{e} := 1  (B AND 1)
 A := A >> A_{e}
 B := B >> B_{e}
 Twos := Twos + (A_{e} AND B_{e})
 D := A  B, C ← Borrow
 B := {C=0 → B, C=1 → A}
 A := D
 A := A << Twos
 A contains the GCD.
Chapter 15 Exercise #2:
Prove that the number of iterations given in Algorithm 3 always suffices, and that the "redundant" iterations have no effect on the final output. (Wrong! See Ch. 21ABis!)
Chapter 15 Exercise #3:
What values of A and B, for a given FZ_Bitness, actually demand the maximum given number of iterations in order to produce the correct GCD?
And now, we will write an Ada program to perform Algorithm 3:
 Find Greatest Common Divisor (GCD) of X and Y.  Note that by convention, GCD(0, 0) = 0. procedure FZ_Greatest_Common_Divisor(X : in FZ; Y : in FZ; Result : out FZ) is  Widths of X, Y, and Result are equal subtype Width is Word_Index range X'Range;  Working buffers for GCD computation, initially equal to the inputs A : FZ(Width) := X;  GCD will appear in A in the end B : FZ(Width) := Y;  Evenness (negation of lowest bit) of A and B respectively Ae, Be : WBool;  Common powerof2 factor Twos : Word := 0;  A  B D : FZ(Width);  This flag is set iff A < B A_lt_B : WBool; begin  For convergence, requires number of shots equal to 2 * FZ_Bitness: for i in 1 .. 2 * FZ_Bitness(X) loop  If A is even, A := A >> 1; otherwise A := A Ae := 1  FZ_OddP(A); FZ_ShiftRight(A, A, WBit_Index(Ae));  If B is even, B := B >> 1; otherwise B := B Be := 1  FZ_OddP(B); FZ_ShiftRight(B, B, WBit_Index(Be));  If both A and B were even, increment the common poweroftwo Twos := Twos + (Ae and Be);  D := A  B FZ_Sub_Abs(X => A, Y => B, Difference => D, Underflow => A_lt_B);  B' := min(A, B) FZ_Mux(X => B, Y => A, Result => B, Sel => A_lt_B);  A' := A  B A := D; end loop;  Reintroduce the common powerof2 factor stored in 'Twos' FZ_Quiet_ShiftLeft(N => A, ShiftedN => A, Count => Indices(Twos));  Output final result Result := A; end FZ_Greatest_Common_Divisor;
Take note that we have defined GCD(0, 0) = 0. This is technically an abuse of mathematical rigour  the GCD of two zeroes is not a uniquelydetermined value. However (unlike division by zero) there are no known downsides to permitting a computation of GCD(0, 0).
I have found that this choice was made in every currentlyextant arithmetron, including popular computer algebra systems (e.g. Octave and Wolfram's.) (Reader: if you know of one which signals an error when given GCD(0, 0), please let me know which, and the author's reasoning, if it is known.)
Now we will want to test the new GCD routine.
Let's create a test FFACalc tape for GCD by taking a famous pair of braindamaged RSA moduli from the Phuctor zoo  the stars of the SecLab incident:
.c08b0693f9ae0854829cd88d6538756df69ff8067d1556678f7e45b17437014374174c4 aca94bf1f83640928832b398f88c935c6a08177c4cbaa8b85002fee95068bd68487f286f e3b814d92d6147b3d90fba606701f72e1f205c3e06dba55f5e180e45c2225a6ca2061d2d 638ef42609c5d8225620107519628b35983e92e0930ff2e2b8a3a0d9da57a4f50aaefe21 c0b02f8a91587f3ea2337df593f2faea40cb0d6359fee2df45b14b4e8f20988c542b81c7 862f74ea3a3761c22f6ecef64efb2014ccdcf13fb251ed3160ee20f392d0a2200db105c4 5bc12badbaa53a00a1371a77e12de455824c10dafd87f9c150f1e3fb622a8bb68134764a 77a939371bbe63ede53591d1b2bf35ff2f15776a2e1670c8c0006973782c52e97ded5ad1 e4cc96cc4bffd73061e14059aa40dbbc89d46ea1e20500a0e5ac7ac374c277e8d745dc45 449505d1c1bafecd9df8aa75096fffe4cd2f164e2a12d35000782dd73a5b58f8064ea4c0 afe2066f31d336fe65c50a9dff8e3db8a379b182e6d440cb8903fad5ed8477bde7ac2c13 1a7cd47d94630e92f98f68b86d6288607d1ef03880ca18f4176caf08869df93e93433a08 20af7e82e5eed7fd39a2480d98c34f5862dd7ceb4f8382a84acad97d1ee8da685d2e4aa5 f26167a3385f0a3412e168162916dd7ec1a864431f649e610d0ed2593d1be2abda31bb48 a66214a3f8e0ba011 .ed9917eb72fe4b283ba43bd98f163dc5331d47ddeef7319d1e339ab2ccbfa912e7a41c0 f02628c858a511578d173a0ab425dfbfe3d50d279649a0487ca1eff34ee220bc13b207f2 382d76d414ec849784dfd4dc86c5b4f1bac60976d737db018abb94e14f4c91cef8db6b6a 49ed7ece31d054281a92224ebad99c9bafd9b4931b3135e0e03ae55559512ba43725070f ff9912831d49a77c2eeefde1b557c6845166d401aedaf73dd7aefe2a6c1f5a90b7a62207 6b97f1fae8597525dcda6886f736b73990e371a5c424e802e6e9b846998a6dce0a8f4e21 97619373f965dda46ee8ee47a84cb2071321c0ec4186502fa03edf4a63437069440d1b78 889f68ededf9356b8b55df65b5dbb358ff0606eeb5d15b4a433d082a35fbcdf95a97561d e0c99b4f207f326c54cd14093f77e2063c782a14a6dfc7e45800cd87e800d2d875995ff0 1d3540292725283edf6abc78c4f5aba7422b563071e2cfb22e0992ccbddf8cf966cf6eee a8ea1561775cc17d88ca73a2cca4bc4151d380987bac526e395b5d01f984c49b5b91cd07 ce437ef9bb5d7a35fb099032f8bc2afcbc8bef0067288337829f1e568717f99d2c0f13a2 3732f711e20defd6f85533c6ac2934b946a256e8472b3a4b24cb30ff2d2c5959846425ce e81ef638b4f054850057437bf2eb7bca34e9671253789c9ad24fae937e65a7c4850cec2b c3114cb7a68a78601 G .f59cc31339d001d37570dc0ccd986f3f5ea737fa9185c15dbc17e6bfef29435c79a7c22 e8616738947cab8711b6a6e7b5704e5283b57892adad3b170c726f34d3a9859b1504e005 ee4b69d4803cd56773c50ab01d6546ce66dcdb2be4a34e15160d8e0eb69184b699246b42 28f6f25bfcc91970fa99ea3123409f6865b161423581a5f9522ef774f09818bfef6c2b1c 51d06218a07dc717ec94bb231b062936bfd8794cb39bdf8dc05cd2c8bd74b1d0acb14d39 bc293deb45fa52de89af30e4bc5688fb8116be7e7ad4332810c04903939ee2a356ea254b 83fb811c76898672d24997d8647f969a8e02aa2f2016cb1e0c8a9afe99760cd37bf2794d 4ea58951f ={[OK ]}{[SAD ]}_
Now, run the tape as follows:
$ cat seclab.tape  ./bin/ffa_calc 4096 2
... and if successful, the output will be:
OK
A serious reader will also want to test any iron being considered for use with FFA, as previously described in Chapter 14B.
The following set of canonical GCD litmus tapes is available for download:
Verify the signature and unpack the archive. Inside, you will find five FFACalc tapes:
 10k_null_gcd_8192bit.tape
Ten thousand 8192bit GCD tests with only null inputs.  10k_small_gcd_8192bit.1.tape
Ten thousand 8192bit GCD tests with randomlygenerated inputs having a usuallysmall common factor.  10k_small_gcd_8192bit.2.tape
Similar to the above.  10k_large_gcd_8192bit.1.tape
Ten thousand 8192bit GCD tests with randomlygenerated inputs having a usuallylarge common factor.  10k_large_gcd_8192bit.2.tape
Similar to the above.
... and a simple shell script, gcd_litmus.sh.
Place all of these items into your ffa/ffacalc directory and execute the litmus script.
After less than half an hour (on reasonable iron), you will have a set of timer outputs (and no error outputs, unless your machine is catastrophically broken.)
Just as previously, with the tapes of Chapter 14B, if you discover that there is any substantial and persistent pattern of difference in the runtimes of the tapes, you have defective iron!
Please leave a comment here if you turn up a machine which fails this litmus! It is, for instance, conceivable that some piece of rubbish masquerading as a CPU performs a shiftbyzero in a different period of time than a shiftbyone; and I was recently quite certain that I had found one such CPU  but it turned out to be a false alarm.
And that is all, as far as GCD is concerned, for now.
~To be continued!~
In more general settings, you can define the GCD via the partial order a < b \iff a divides b. This is unique up to multiplication with "units" like 1, agrees with the old definition wherever applicable, and satisfies GCD(0,0) = 0. Admittedly this is a posthoc rationalization on my end, but it goes to show that it's not that much of an abuse mathematically. Very nice series by the way.