“Finite Field Arithmetic.” Chapter 14A-Bis: Barrett’s Modular Reduction. (Physical Bounds Proof.)
This article is part of a series of hands-on tutorials introducing FFA, or the Finite Field Arithmetic library. FFA differs from the typical “Open Sores” abomination, in that — rather than trusting the author blindly with their lives — prospective users are expected to read and fully understand every single line. In exactly the same manner that you would understand and pack your own parachute. The reader will assemble and test a working FFA with his own hands, and at the same time grasp the purpose of each moving part therein.
- Chapter 1: Genesis.
- Chapter 2: Logical and Bitwise Operations.
- Chapter 3: Shifts.
- Chapter 4: Interlude: FFACalc.
- Chapter 5: “Egyptological” Multiplication and Division.
- Chapter 6: “Geological” RSA.
- Chapter 7: “Turbo Egyptians.”
- Chapter 8: Interlude: Randomism.
- Chapter 9: “Exodus from Egypt” with Comba’s Algorithm.
- Chapter 10: Introducing Karatsuba’s Multiplication.
- Chapter 11: Tuning and Unified API.
- Chapter 12A: Karatsuba Redux. (Part 1 of 2)
- Chapter 12B: Karatsuba Redux. (Part 2 of 2)
- Chapter 13: “Width-Measure” and “Quiet Shifts.”
- Chapter 14A: Barrett’s Modular Reduction. (Part 1 of 2)
- Chapter 14A-Bis: Barrett’s Modular Reduction. (Physical Bounds Proof.)
You will need:
- All of the materials from Chapters 1 – 14A
- There is no vpatch in Chapter 14A-Bis.
Given the trickiness of Barrett’s Reduction under the constraints of FFA, and the fact that we want to hard-guarantee, as adults do, the exact amount of physical space required to hold the intermediate results of the computation, it is necessary to extend the proof of Chapter 14A with a physical bounds proof — before we can dare to look at the corresponding program code.
We will illuminate said proof with “electrical” illustrations, in the style of Chapter 12A.
Please print out and refer to Algorithm 2 from Chapter 14A when working through this Chapter.
First, three lemmas:
Lemma 1:
For any integers A ≥ 0, B > 0, C = ⌊A / B⌋, Measure(C) ≤ Measure(A) - Measure(B) + 1.
Lemma 2:
For any integers A ≥ 0, B ≥ 0, C = A × B, Measure(C) ≤ Measure(A) + Measure(B).
Lemma 3:
For any integers A ≥ 0, B ≥ 0, C = A + B, Measure(C) ≤ Max(Measure(A), Measure(B)) + 1.
If the correctness of these is not obvious, dear reader, stop reading and prove them. Right now.
We will use ⇠dashed arrows⇢ to refer to the variant bitness bounds of hypothetical zero and nonzero segments of registers, for the purpose of min/max width analysis of the intermediate results in our computation. ←Solid arrows→ will refer to invariant bounds (i.e. the physical sizes of particular registers.)
First, let’s describe the physical space occupied by the quantities which must exist prior to performing Barrett’s Reduction of a given X.
We begin with the modulus: 0 < M < 2^{WM}. It lives in a W_{M}-bit register:
←W_{M}→ | |
M |
Now let’s picture, strictly for illustrative purposes, an M which is a bit less than “half-full”:
←W_{M}→ | ||
0 | M | |
⇠J_{M}+1⇢ |
Recall that we picked the concrete value of J_{M} (0 ≤ J_{M} < W_{M}) to equal one less than the measure of M, i.e.,
J_{M} = Measure(M) - 1.
This is a valid choice under the inequalities yielded by the Ch. 14A proof.
Observe that this value is a secret (just like M and X) and therefore it is forbidden in FFA to branch on its bits, index memory by it, etc. The only mechanical thing we will do with it is to shift certain registers (via a constant-time method) by it.
Now, let’s describe the Barrettoid B_{M} corresponding to a given M.
By Lemma 1, the Measure of 0 < B_{M} < 2^{KM} will always be less than or equal to:
(2W_{M} + 1) - Measure(M) + 1 = 2W_{M} - J_{M}.
And so, B_{M} will look like:
←2W_{M}→ | ||
0 | B_{M} | |
⇠J_{M}⇢ | ⇠2W_{M} - J_{M}⇢ |
Now, of course, we also need the dividend:
0 ≤ X < 2^{KM} is the integer to be reduced modulo M, and its register has twice the bitness of M:
X | |
←2W_{M}→ |
Now, let’s describe the space required to carry out Barrett’s Reduction itself, where in the end we obtain X mod M.
The first of the two shifts in the algorithm, where we obtain the term ⌊X / 2^{j}⌋:
1: X_{s} = X >> J_{M}
←2W_{M}→ | ||
0 | X_{S} | |
⇠J_{M}⇢ | ⇠2W_{M} - J_{M}⇢ |
The first of the two multiplications. This one is 2W_{M} × 2W_{M}-sized, and it will give us the intermediate result:
⌊X / 2^{j}⌋⌊2^{k} / M⌋.
The second multiplicand is our pre-computed Barrettoid for the current M. And so:
2: Z = X_{s} × B_{M}
By Lemma 2:
←2W_{M}→ | |||||
0 | X_{S} | ||||
⇠J_{M}⇢ | ⇠2W_{M} - J_{M}⇢ | ||||
←2W_{M}→ | |||||
× | 0 | B_{M} | |||
⇠J_{M}⇢ | ⇠2W_{M} - J_{M}⇢ | ||||
←4W_{M}→ | |||||
= | 0 | Z | |||
⇠2J_{M}⇢ | ⇠4W_{M} - 2J_{M}⇢ |
The second of the two shifts. This one gets us the entire Barrett approximation of the quotient, i.e. ⌊X / M⌋ + E :
⌊2^{k} / M⌋⌊X / 2^{j}⌋ |
——————————————— |
2^{k - j} |
3: Z_{s} := Z >> 2W_{M} - J_{M}
←4W_{M}→ | ||
0 | Z_{S} | |
⇠2W_{M} + J_{M}⇢ | ⇠2W_{M} - J_{M}⇢ |
Observe that we do not need to spend the cost of shifting the whole of Z in order to obtain Z_{S}, given as the shift will always be of at least W_{M}+1-bits (since 0 ≤ J_{M} < W_{M}). Ergo, the lower W_{M} bits of Z are always discarded in this process, and can be simply ignored:
←4W_{M}→ | ||||
0 | Z | |||
⇠2J_{M}⇢ | ⇠4W_{M} - 2J_{M}⇢ |
←4W_{M}→ | ||||
= | Z_{Hi} | Z_{Mid} | Z_{Lo} | |
←W_{M}→ | ←2W_{M}→ | ←W_{M}→ |
… so it actually suffices to right-shift only the 3W_{M}-bit segment consisting of Z_{Hi} and Z_{Mid}, by W_{M} - J_{M} bits, and afterwards to take the Z_{Mid}-wide lower segment of the output as the resulting Z_{S}.
Z_{S} always fits inside a 2W_{M}-bit register:
←2W_{M}→ | ||
0 | Z_{S} | |
⇠J_{M}⇢ | ⇠2W_{M} - J_{M}⇢ |
Here, we have something a little more interesting. We have an asymmetric (2W_{M} × W_{M}-sized) multiplication, where we want to multiply Z_{S}, the Barrett approximation of the quotient, by the modulus M:
4: Q := Z_{s} × M
←2W_{M}→ | |||||
0 | Z_{S} | ||||
⇠J_{M}⇢ | ⇠2W_{M} - J_{M}⇢ | ||||
←W_{M}→ | |||||
× | 0 | M | |||
⇠J_{M}+1⇢ | |||||
←3W_{M}→ | |||||
= | 0 | Q | |||
←W_{M}→ | ←2W_{M}→ |
By Lemma 2, a 2W_{M}-bit × W_{M}-bit product in the general case must occupy 3W_{M} bits; while, looking closer, it would appear that, given as we have multiplied the (2W_{M} - J_{M})-bit contents of Z_{S} by the J_{M}+1-bit contents of M, we could end up with a 2W_{M}+1-bit result…
However, in Chapter 14A we proved that:
⌊2^{k} / M⌋⌊X / 2^{j}⌋ | ||||||||
X mod M | = | X | - | M | × | ——————————————— | - | E × M |
2^{k - j} |
And so we know that Q ≤ X, and therefore, regardless of the value of J_{M}, the term Q will always fit inside 2W_{M} bits.
←2W_{M}→ | |
Q | |
←2W_{M}→ | |
X |
If the “secret” of this “miracle” eludes you, please go back to Chapter 14A and walk the proof, until it fits-in-head.
Now, we find the initial estimate of X mod M:
5: R := X - Q
←2W_{M}→ | |
X | |
←2W_{M}→ | |
- | Q |
←2W_{M}→ | |
= | R |
But do we actually need to perform a 2W_{M}-bit subtraction here? It turns out that we don’t, because R is at most W+2-bit. We know this because we have proven that at most two subtractions of M — a W_{M}-bit quantity — will be required after this step to produce the final, correct X mod M, a W_{M}-bit quantity. Therefore, what we actually need to do instead of the above, is:
←2W_{M}→ | ||
Ignore | X | |
←W_{M} - L→ | ←W_{M} + L→ | |
←2W_{M}→ | ||
- | Ignore | Q |
←W_{M} - L→ | ←W_{M} + L→ | |
= | R | |
←W_{M} + L→ |
L here is the bitness of our machine Word.
And by Lemma 3, we know that each of the following two gated subtractions can remove a maximum of one bit from the length of R; and from the Chapter 14A proof, we know that the final result will be of the same maximal bitness (i.e. will fit in the same size of FFA register) as the modulus M.
Now, we will do the first of our two gated subtractions. We will subtract a W_{M} + L zero-extended (on the senior end, by one machine word) M from R, to obtain a difference S_{1} and a “borrow”, C_{1}.
6: S_{1} := R - M, C_{1} := Borrow
←W_{M} + L→ | |||
R | |||
←W_{M} + L→ | |||
- | 0 | M | |
←L→ | ←W_{M}→ | ||
←W_{M} + L→ | |||
= | S_{1} | C_{1} ← Borrow |
Now, the “gating” of that subtraction:
7: R := {C_{1}=0 → S_{1}, C_{1}=1 → R}
←W_{M} + L→ | ←W_{M} + L→ | ||
S_{1} | ⇒Mux(C_{1})⇐ | R | |
↳C_{1}=0⇘ | ⇓ | ⇙C_{1}=1↲ | |
R | |||
←W_{M} + L→ |
That is, if the subtraction in step 6 did not underflow, as indicated by C_{1} = 0, then we know that R was greater than or equal to M prior to step 6, and we then keep S_{1} (the result of the subtraction) and assign it back to R. On the other hand, if it did underflow, as indicated by C_{1} = 1, then we discard S_{1}, and R stays as it was. Note that this process is to be carried out in constant time, via FZ_Mux.
Now, the second of the two gated subtractions. Exactly like in Step 6:
8: S_{2} := R - M, C_{2} := Borrow
←W_{M} + L→ | |||
R | |||
←W_{M} + L→ | |||
- | 0 | M | |
←L→ | ←W_{M}→ | ||
←W_{M} + L→ | |||
= | S_{2} | C_{2} ← Borrow |
And, finally, the “gating” of the second (and last) subtraction from Step 8:
9: R := {C_{2}=0 → S_{2}, C_{2}=1 → R}
←W_{M} + L→ | ←W_{M} + L→ | ||
S_{2} | ⇒Mux(C_{2})⇐ | R | |
↳C_{2}=0⇘ | ⇓ | ⇙C_{2}=1↲ | |
R | |||
←W_{M} + L→ |
Observe that we are not yet finished with R. The final result R = X mod M could still be one of two possible things:
- If M ≠ 1, the final result will be found in the bottom W_{M} bits of R.
- If M = 1, then we had the degenerate case, and everything we did in steps 1-9 was simply to occupy constant-time, and has no bearing on the answer to the calculation of X mod M; the final result in this case is to be zero.
Ergo, we need one last Mux:
10: R_{Final} := {D_{M}=0 → R, D_{M}=1 → 0}
First, observe that:
←W_{M} + L→ | |||
R | |||
= | |||
0 | R_{F} | ||
←L→ | ←W_{M}→ |
The two gated subtractions will at this point have zeroed any of the L upper bits of R that may have been previously non-zero.
And so, all that remains to do now, is:
←W_{M}→ | ←W_{M}→ | ||
R_{F} | ⇒Mux(D_{M})⇐ | 0 | |
↳D_{M}=0⇘ | ⇓ | ⇙D_{M}=1↲ | |
R = X mod M | |||
←W_{M}→ |
…and that’s it, we have R = X mod M, as was proven already in Chapter 14A.
In the next Chapter, 14B, we will finally walk through the Ada program which implements this Barrettron. Stay tuned!