“Finite Field Arithmetic.” Chapter 7: “Turbo Egyptians.”

This article is part of a series of hands-on tutorials introducing FFA, or the Finite Field Arithmetic library. FFA differs from the typical “Open Sores” abomination, in that — rather than trusting the author blindly with their lives — prospective users are expected to read and fully understand every single line. In exactly the same manner that you would understand and pack your own parachute. The reader will assemble and test a working FFA with his own hands, and at the same time grasp the purpose of each moving part therein.

You will need:

Add the above vpatch and seal to your V-set, and press to ffa_ch7_turbo_egyptians.vpatch.

You should end up with the same directory structure as previously.

Now compile ffacalc:

cd ffacalc

But do not run it quite yet.

No one submitted a solution to the “trivial optimizations” problem posed in Chapter 5, but reader Diana Coman did show symptoms of knowing the magic ingredient which happens to be the subject of this Chapter.

The attentive reader may already have begun to suspect that the “Egyptian” multiplication algorithm (FZ_Mul_Egyptian from Ch. 5) is not the final word on the subject of integer multiplication in FFA. But before we can explore subquadratic multiplication (and — much later — clever methods for speeding up modular reduction) it is necessary to set up the arena for a “fair fight”, by making reasonably well-optimized variants of both of the quadratic-runtime “Egyptian” algos, FZ_Mul_Egyptian and FZ_Mod.

Previously, the primary design objectives of the FFA algorithms were correctness and constant-spacetime operation; followed, secondarily, by simplicity (in the fits-in-head sense.) In this chapter, we add another objective: speed.

At no point will correctness or constant-spacetime (branch-free and offsetting-by-secrets-free) operation be sacrificed under any pretext whatsoever. However, in order to obtain, e.g., an RSAtron that is practically usable on commonplace machines, it will be necessary to sacrifice a certain amount of mechanical simplicity. This Chapter, along with the bulk of the remaining material, will be devoted to this painful — but not uninteresting! — task.
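
To make "branch-free" concrete, here is a minimal sketch of a gated addition on a single machine word. It is not the FFA routine: the name Add_Gated, the choice of Unsigned_64 as the word type, and the demo procedure are assumptions made for illustration. The point is only that the gate bit is turned into a mask, so that no conditional jump depends on it at the source level.

```ada
with Ada.Text_IO;  use Ada.Text_IO;
with Interfaces;   use Interfaces;

procedure Gated_Add_Demo is

   -- Assumption: a 64-bit machine word stands in for an FFA Word
   subtype Word is Unsigned_64;

   -- Hypothetical single-word analogue of a gated add:
   -- yields X + Y when Gate = 1, and X when Gate = 0.
   -- Gate must be 0 or 1.
   function Add_Gated (X, Y, Gate : Word) return Word is
      -- 0 - 1 wraps around to all-ones; 0 - 0 remains all-zeros
      Mask : constant Word := 0 - Gate;
   begin
      -- Y is either passed through whole, or wiped out, by the mask
      return X + (Y and Mask);
   end Add_Gated;

begin
   Put_Line (Word'Image (Add_Gated (5, 7, 1)));  -- gate open:   5 + 7
   Put_Line (Word'Image (Add_Gated (5, 7, 0)));  -- gate closed: 5 alone
end Gated_Add_Demo;
```

Whether the emitted machine code is in fact free of secret-dependent branches depends on the compiler and target, and has to be checked in the disassembly rather than assumed.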

Let’s begin with a new, “turbo” FZ_Mul_Egyptian:


   -- 'Egyptological' multiplier. XY_Lo and XY_Hi hold result of X*Y.
   procedure FZ_Mul_Egyptian(X     : in  FZ;
                             Y     : in  FZ;
                             XY_Lo : out FZ;
                             XY_Hi : out FZ) is
      L : constant Indices := X'Length;
      -- Register holding running product
      XY : FZ(1 .. X'Length + Y'Length);
      -- X-Slide
      XS : FZ(1 .. X'Length + Y'Length);
   begin
      -- Product register begins empty
      XY := (others => 0);
      -- X-Slide initially equals X:
      XS(1            .. X'Length) := X;
      XS(X'Length + 1 .. XS'Last)  := (others => 0);
      -- For each word of Y:
      for i in Y'Range loop
         declare
            -- Current word of Y
            W : Word  := Y(i);
            -- Current cut of XY and XS. Stay ahead by a word to handle carry.
            Cut : constant Indices := L + i;
            XYc : FZ renames XY(1 .. Cut);
            XSc : FZ renames XS(1 .. Cut);
         begin
            for b in 1 .. Bitness loop
               -- If current Y bit is 1, X-Slide Cut is added into XY Cut
               FZ_Add_Gated(X    => XYc, Y => XSc, Sum => XYc,
                            Gate => W and 1);
               -- Crank the next bit of Y into the bottom position of W
               W := Shift_Right(W, 1);
               -- X-Slide := X-Slide * 2
               FZ_ShiftLeft(XSc, XSc, 1);
            end loop;
         end;
      end loop;
      -- Write out the Product's lower and upper FZs:
      XY_Lo := XY(1                .. XY_Lo'Length);
      XY_Hi := XY(XY_Lo'Length + 1 .. XY'Last);
   end FZ_Mul_Egyptian;
   pragma Inline_Always(FZ_Mul_Egyptian);

Observe that the Y-Slide from Chapter 5’s multiplier is gone. Instead, we now walk through the bits of the multiplicand Y without having to shift the entire thing FZ_Bitness(Y) times: each word of Y is loaded into W, starting with the first; and the “egyptology” is then performed Bitness times, once for each bit of W.
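
The word-walking idea can be seen in miniature with ordinary machine integers. The sketch below is an illustration only, under the assumption that a single 8-bit word plays the part of Y and a 16-bit register plays the part of both XY and the X-Slide; the name Mul_Egyptian_8 is hypothetical and the routine is not part of FFA.

```ada
with Ada.Text_IO;  use Ada.Text_IO;
with Interfaces;   use Interfaces;

procedure Bit_Walk_Demo is

   -- Hypothetical miniature of the 'Egyptological' loop for one word of Y
   function Mul_Egyptian_8 (X, Y : Unsigned_8) return Unsigned_16 is
      XY : Unsigned_16 := 0;                -- running product
      XS : Unsigned_16 := Unsigned_16 (X);  -- X-Slide, initially X
      W  : Unsigned_8  := Y;                -- current word of Y
   begin
      for b in 1 .. 8 loop
         -- Bottom bit of W gates the addition of the X-Slide
         XY := XY + (XS and (0 - Unsigned_16 (W and 1)));
         -- Crank the next bit of Y into the bottom position of W
         W  := Shift_Right (W, 1);
         -- X-Slide := X-Slide * 2
         XS := Shift_Left (XS, 1);
      end loop;
      return XY;
   end Mul_Egyptian_8;

begin
   Put_Line (Unsigned_16'Image (Mul_Egyptian_8 (13, 11)));    -- 13 * 11
   Put_Line (Unsigned_16'Image (Mul_Egyptian_8 (255, 255)));  -- largest case
end Bit_Walk_Demo;
```

Only the small register W is shifted right at each step; the operand Y itself is never touched, which is the whole of the first optimization.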

The other optimization is the introduction of the cut concept. Observe that an addition of two FZ integers of identical bitness can produce a result whose intrinsic bitness exceeds that of the greater operand by at most one bit. The consequence of this is that the Chapter 5 FZ_Mul_Egyptian wastes roughly half of its CPU time shifting and adding words that never, at the particular times they are touched, depart from zero.

Thus, the first iteration of the loop is carried out on a cut of length X'Length + 1; the second, X'Length + 2; and so forth; the last iteration is the only one which is performed on the entire XY product-accumulator and the entire XS X-Slide. We “run ahead” of the segment of XY which has been touched, by one word, so as to have a place to which to ripple the carry.
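
Why a cut of L + i words always suffices can be argued in a few lines. The derivation below is a sketch under stated assumptions: B denotes Bitness (bits per word), L denotes X'Length, the words of Y are indexed from 1, and k counts the bits of Y consumed so far. These symbols are introduced here for illustration and do not appear in the FFA source.

```latex
% Assumptions: B = Bitness, L = X'Length, words of Y indexed from 1,
% k = number of bits of Y consumed so far (k <= B*i during iteration i).
\begin{align*}
  \text{entering iteration } i: \quad
    & XS = X \cdot 2^{B(i-1)} \\
  \text{throughout iteration } i: \quad
    & XS \le X \cdot 2^{Bi} < 2^{BL} \cdot 2^{Bi} = 2^{B(L+i)} \\
  \text{throughout iteration } i: \quad
    & XY = X \cdot (Y \bmod 2^{k}) \le X \cdot (2^{Bi} - 1) < 2^{B(L+i)}
\end{align*}
```

Both registers therefore stay below 2^(B(L+i)) for the whole of iteration i, which is to say that every word above the cut is still zero and may be left alone.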

At the expense of a certain amount of “obviousness”, we win a 2x gain in multiplication speed.

And now we will apply exactly the same optimizations to the modulus routine:


   -- Modulus. Permits the asymmetric Dividend and Divisor in FZ_Mod_Exp.
   procedure FZ_Mod(Dividend  : in FZ;
                    Divisor   : in FZ;
                    Remainder : out FZ) is
      -- Length of Divisor and Remainder; <= Dividend'Length
      L : constant Indices := Divisor'Length;
      -- Remainder register, starts as zero
      R : FZ(1 .. L) := (others => 0);
      -- Indices into the words of Dividend
      subtype Dividend_Index is Word_Index range Dividend'Range;
      -- Permissible 'cuts' for the Slice operation
      subtype Divisor_Cuts   is Word_Index range 2 .. Divisor'Length;
      -- Performs Restoring Division on a given segment of Dividend:Divisor
      procedure Slice(Index : Dividend_Index;
                      Cut   : Divisor_Cuts) is
      begin
         declare
            -- Borrow, from comparator
            C   : WBool;
            -- Left-Shift Overflow
            LsO : WBool;
            -- Current cut of Remainder register
            Rs  : FZ renames R(1 .. Cut);
            -- Current cut of Divisor
            Ds  : FZ renames Divisor(1 .. Cut);
            -- Current word of Dividend, starting from the highest
            W   : Word  := Dividend(Dividend'Last + 1 - Index);
         begin
            -- For each bit in the current Dividend word:
            for b in 1 .. Bitness loop
               -- Send top bit of current Dividend word to the bottom of W
               W := Rotate_Left(W, 1);
               -- Advance Rs, shifting in the current Dividend bit
               FZ_ShiftLeft_O_I(N => Rs, ShiftedN => Rs, Count => 1,
                                OF_In => W and 1,
                                Overflow => LsO);
               -- Subtract Divisor-Cut from R-Cut; Underflow goes into C
               FZ_Sub(X => Rs, Y => Ds, Difference => Rs, Underflow => C);
               -- If C=1, subtraction underflowed, and we must undo it:
               FZ_Add_Gated(X => Rs, Y => Ds, Sum => Rs,
                            Gate => C and W_Not(LsO));
            end loop;
         end;
      end Slice;
   begin
      -- Process bottom half of dividend:
      for i in 1 .. L - 1 loop
         Slice(i, i + 1); -- stay ahead by a word to handle carry
      end loop;
      -- Process top half of dividend
      for i in L .. Dividend'Length loop
         Slice(i, L);
      end loop;
      -- Output the Remainder.
      Remainder := R;
   end FZ_Mod;
   pragma Inline_Always(FZ_Mod);

In addition to the two optimizations analogous to those we had applied to FZ_Mul_Egyptian, we also get the ability to apply FZ_Mod to a Dividend which exceeds the bitness of the Divisor. This will allow us to abolish the zero-padding of Modulus in Chapter 6’s FZ_Mod_Mul.
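
The inner loop of Slice, and the asymmetric Dividend, can both be tried out in miniature. What follows is a sketch and not FFA code: it assumes a 16-bit dividend and an 8-bit nonzero divisor, the name Small_Mod is hypothetical, and an ordinary comparison stands in for the borrow that FZ_Sub would report.

```ada
with Ada.Text_IO;  use Ada.Text_IO;
with Interfaces;   use Interfaces;

procedure Restoring_Mod_Demo is

   -- Hypothetical miniature: 16-bit Dividend, 8-bit Divisor and Remainder.
   -- Divisor must be nonzero.
   function Small_Mod (Dividend : Unsigned_16;
                       Divisor  : Unsigned_8) return Unsigned_8 is
      R   : Unsigned_8  := 0;         -- Remainder register, starts as zero
      W   : Unsigned_16 := Dividend;  -- Dividend bits, consumed highest first
      LsO : Unsigned_8;               -- Left-shift overflow (0 or 1)
      C   : Unsigned_8;               -- Borrow from the subtraction (0 or 1)
   begin
      for b in 1 .. 16 loop
         -- Send top bit of W to the bottom of W
         W   := Rotate_Left (W, 1);
         -- The bit which is about to fall off the top of R
         LsO := Shift_Right (R, 7);
         -- Advance R, shifting in the current Dividend bit
         R   := Shift_Left (R, 1) or Unsigned_8 (W and 1);
         -- The subtraction will borrow exactly when R < Divisor
         C   := Boolean'Pos (R < Divisor);
         R   := R - Divisor;
         -- Undo the subtraction only if it borrowed and nothing overflowed
         R   := R + (Divisor and (0 - (C and (LsO xor 1))));
      end loop;
      return R;
   end Small_Mod;

begin
   -- The two lines printed should agree
   Put_Line (Unsigned_8'Image (Small_Mod (50_000, 201)));
   Put_Line (Unsigned_16'Image (50_000 mod 201));
end Restoring_Mod_Demo;
```

A divisor above 128 was picked deliberately, so that the left-shift overflow case (where the borrow must be ignored) is actually exercised.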

Correspondingly, we carefully alter the preconditions of FZ_Mod:


   -- Modulus. Permits the asymmetric Dividend and Divisor in FZ_Mod_Exp.
   procedure FZ_Mod(Dividend  : in FZ;
                    Divisor   : in FZ;
                    Remainder : out FZ);
   pragma Precondition(Dividend'Length >= Divisor'Length and
                         Divisor'Length = Remainder'Length);

The only place in FFA where the asymmetric invocation of FZ_Mod will be permitted, is FZ_Mod_Mul, the modular multiplier procedure, which will now look like this:


   -- Modular Multiply: Product := X*Y mod Modulus
   procedure FZ_Mod_Mul(X        : in  FZ;
                        Y        : in  FZ;
                        Modulus  : in  FZ;
                        Product  : out FZ) is
      -- The wordness of all three operands is equal:
      L     : constant Indices := X'Length;
      -- Double-width register for multiplication and modulus operations
      XY    : FZ(1 .. L * 2);
      -- To refer to the lower and upper halves of the working register:
      XY_Lo : FZ renames XY(1     .. L);
      XY_Hi : FZ renames XY(L + 1 .. XY'Last);
   begin
      -- XY_Lo:XY_Hi := X * Y
      FZ_Mul_Egyptian(X, Y, XY_Lo, XY_Hi);
      -- Product := XY mod M
      FZ_Mod(XY, Modulus, Product);
   end FZ_Mod_Mul;
   pragma Inline_Always(FZ_Mod_Mul);
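
The shape of FZ_Mod_Mul (multiply into a double-width register, then reduce it by a single-width modulus) has a familiar analogue in ordinary machine arithmetic. The sketch below assumes 32-bit operands and a 64-bit working register, and the name Mod_Mul_32 is hypothetical. Note that it leans on the built-in "*" and "mod" operators, which carry no constant-time guarantee; it illustrates the data flow only.

```ada
with Ada.Text_IO;  use Ada.Text_IO;
with Interfaces;   use Interfaces;

procedure Mod_Mul_Demo is

   -- Hypothetical miniature of a modular multiply: X*Y mod Modulus.
   -- Modulus must be nonzero.
   function Mod_Mul_32 (X, Y, Modulus : Unsigned_32) return Unsigned_32 is
      -- Double-width register: a 32x32-bit product always fits in 64 bits
      XY : constant Unsigned_64 := Unsigned_64 (X) * Unsigned_64 (Y);
   begin
      -- Asymmetric reduction: 64-bit dividend, 32-bit divisor and remainder
      return Unsigned_32 (XY mod Unsigned_64 (Modulus));
   end Mod_Mul_32;

begin
   -- 3_000_000_000 * 3 = 9_000_000_000, which overflows 32 bits;
   -- reduced mod 4_000_000_000 it leaves 1_000_000_000.
   Put_Line (Unsigned_32'Image (Mod_Mul_32 (3_000_000_000, 3, 4_000_000_000)));
end Mod_Mul_Demo;
```

No zero-padded copy of the modulus is made at any point: the divisor stays at single width while the dividend is twice as wide.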

And now, on the same machine as in Ch. 6, timed and plotted logarithmically:

[Figure: Cost of Ch.7 'X' Operation, vs FFA Bitness.]

Or, for those who prefer the raw numbers,

FFA Bitness    Ch.6 ‘X’ (sec)    Ch.7 ‘X’ (sec)
        256             0.072             0.040
        512             0.505             0.240
       1024             3.685             1.672
       2048            27.975            12.024
       4096           217.966            90.439
       8192          1720.127           699.979
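
The speedup at each bitness can be read straight off these figures. The small program below is a convenience sketch: the timings are copied from the table above, while the procedure and array names are made up for illustration.

```ada
with Ada.Text_IO;  use Ada.Text_IO;

procedure Speedup_Demo is

   type Timings is array (1 .. 6) of Float;

   -- Figures copied from the timing table
   Bits : constant array (1 .. 6) of Positive :=
     (256, 512, 1024, 2048, 4096, 8192);
   Ch6  : constant Timings := (0.072, 0.505, 3.685, 27.975, 217.966, 1720.127);
   Ch7  : constant Timings := (0.040, 0.240, 1.672, 12.024, 90.439, 699.979);

   package FIO is new Ada.Text_IO.Float_IO (Float);

begin
   for i in Bits'Range loop
      Put (Positive'Image (Bits (i)) & " :");
      -- Ratio of Ch.6 cost to Ch.7 cost at this bitness
      FIO.Put (Ch6 (i) / Ch7 (i), Fore => 3, Aft => 2, Exp => 0);
      New_Line;
   end loop;
end Speedup_Demo;
```

The ratio comes to about 1.8 at 256 bits and climbs steadily to roughly 2.46 at 8192 bits.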

And finally, let’s turn this into an FFACalc tape, and verify the signature of the Chapter 7 material, using itself, just like we did in the previous chapter:

( the seal of ffa_ch7_turbo_egyptians.vpatch : )
( my public rsa exponent : )
( my public rsa modulus : )
( now modularly-exponentiate it : )
( ... and print the output .)

… and play the tape! :

$ time  cat ch7_rsa_tape.txt | ./bin/ffa_calc 2048 32

… and unsurprisingly:

real    0m12.018s 
user    0m12.004s 
sys     0m0.002s

Guess what? The correct answer. (Use the patched GPG from Ch. 6, and compare.)


1. Prove that the new FZ_Mod produces, for all valid inputs, the equivalent answer to the one in the previous chapter.

2. Formulate a hypothesis regarding why the observed speedup of ‘X’ was greater than two-fold.

~To be continued!~

This entry was written by Stanislav, posted on Saturday January 13 2018, filed under Bitcoin, Cold Air, Computation, Cryptography, FFA, Friends, Mathematics, ShouldersGiants, SoftwareSucks.
