Euclid's algorithm finds the greatest common divisor (GCD) of 2 integers. The algorithm is very old (it is arguably the oldest algorithm known), but it remains a foundation stone of number theory to this day.
This document is aimed at high school mathematics students, but it moves steadily along to a level that may stretch them. I use a little matrix theory within - high school students may need a textbook on the topic to help out here, since below I give no more than a summary of the matrix theory that I need.
[Experts may want to skip the first few 'elementary' sections.]
[Mathematical historians do not believe that Euclid (c.325 B.C. - c.265 B.C.) invented the algorithm that bears his name. Euclid worked as an 'encyclopaedist' at an institute in Alexandria; as such he would have been familiar with the whole of Greek mathematics of his time. Beyond that, little is known about him. However, Euclid appears to have been the first to put the algorithm into its present arrangement.]
[Several other documents on this site are referred to repeatedly within. All the links to
these documents are given here; return here if you want to view the contents.
fastmuli.html - This extends this document to cover the problem
of finding multi-word GCD's and multiplicative inverses. The theory developed below
also applies to fastmuli.html.
[A fast module which is the key item for doing this is presented within.]
speed.html - Introduces the basics of writing
high speed code for Pentium processors.
converge.html - Discusses the rate of convergence
of the Euclidean algorithm.
euclogic.html - Discusses the logical foundations of the Euclidean
algorithm.
matnotes.html - A 'crib' sheet for such matrix theory as is used
below. This must be used with care since it only applies to 2-by-2 matrices with integer
elements.
refs.html - Most of the technical references
used on this site are listed within. This site has a second file of references
on a special topic. If I refer to this second file, a live link will be
provided.
Below I shall generally write e.g. 'Ref' 11 or 'matnotes' (single
quotes) when referring to any of the files above.]
The 32 bit code listed below (i.e. using 32 bit operations on 32 bit inputs) is probably the fastest in the world.
I have also developed (size limited) 64 bit and 128 bit multiplicative inverse finders (both using 64 bit operations) using the principles discussed below, which are probably the fastest available anywhere. These have the additional benefit of small code volumes.
All the items above are for sale.
## ------------ ##
[Some Mathematical Preliminaries :-
In modern mathematics the integers are the whole numbers, and the complete set of them is
Set of Integers =
{..., -2, -1, 0, 1, 2, ...}.
In Euclid's day, the Greek integer set excluded negative values and zero; and the only number systems available to Greek mathematicians were the positive integers and 'fractions' (a.k.a. Rational numbers) with positive integer numerators and positive integer denominators. In so far as the ancient Greek set of integers is used in number theory today, it is known as the 'Positive', the 'Counting' or the 'Natural' integer set; I shall use the first of these.
Set of Positive integers =
{1, 2, 3, ...}.
The nonNegative integer set is also used below; it is the Positive integer set plus the value zero, i.e.
Set of nonNegative integers =
{0, 1, 2, 3, ...}.
Below, I shall need to use '0' from time to time, but all input, intermediate and output values related to Euclid's algorithm are from the Positives.
Wherever I use the word 'integers', the phrase 'Positive integers' is implied.
Beginners to number theory need to be aware of the following :-
In Number Theory, all input values, all intermediate values and all output values for any calculation are Integers.
[Strictly speaking Number Theory includes the Rationals (mentioned above) and modern Number Theory has a branch that includes the use of the calculus (of necessity associated with Real and Complex numbers) - this body of theory is known as 'Analytic Number Theory'.
Neither is relevant to this document.]
This statement is both profound and subtle. It means that familiar schoolroom calculations may not apply. The most obvious problem comes from Integer division.
NOTES on INTEGER DIVISION
On a calculator, any number will divide any other number. What
may not be clear, to users of calculators, is that (most) calculators only
work with floating point numbers, for which division is unrestricted -
excluding calculator limitations anyway. The familiar world of calculator
arithmetic has to be left behind once number theoretical integer divisions
are in play.
For integers, 3 division variants exist: simple division, Ceil() and Floor().
For positive inputs, Floor() returns a value that corresponds to the integer part of the result from a calculator division (e.g. Floor(3/2) = 1). Ceil() returns a value of (Floor() + 1) so long as there is something (a fractional part) after the decimal point of the calculator division, or Floor() if there isn't (e.g. Ceil(3/2) = 2). Simple division is fine so long as the denominator divides the numerator without remainder; otherwise it does not exist.
Some textbooks use the phrase 'even division' to emphasize the situation where one integer divides another without remainder. If even division occurs, then Ceil(), Floor() and simple division all give the same result.
Note that Floor() might return zero, even when the 2 inputs are from the Positives. Note, also, that many textbooks just use the word 'division' when 'Floor()' is intended.
[A more rigorous test for the existence of division is to check whether multiplying the integer part (Floor()) of the (calculator) quotient from the division by the (integer) denominator recovers the (integer) numerator exactly. If, and only if, this is possible, does even integer division exist.]
In summary, we have :-
Floor(3/2) = 1
Ceil(3/2) = 2
3/2 = does not exist!
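These rules can be expressed in C; a minimal sketch (the function names are mine, for illustration only) :-

/*****************/
#include <stdio.h>

// For positive operands, C's '/' truncates, which equals Floor().
unsigned long Floor_div (unsigned long x, unsigned long y)
{ return x / y ; }

// Ceil() = Floor() + 1 if a remainder exists, else Floor().
unsigned long Ceil_div (unsigned long x, unsigned long y)
{ return (x / y) + ((x % y) != 0) ; }

// 'Even division' test - does y divide x without remainder?
int even_div (unsigned long x, unsigned long y)
{ return (x % y) == 0 ; }

int main (void)
{
  printf ("%lu %lu %d\n",
          Floor_div (3, 2),    // 1
          Ceil_div (3, 2),     // 2
          even_div (3, 2)) ;   // 0 - simple division does not exist
  return 0 ;
}
/*****************/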
[This problem with division splits number theory profoundly.
The higher level constructs of Fields and Rings are differentiated by whether
division is universally possible or not. Division is always possible between
any two elements of a Field (with the exception of the 'zero' element),
but may not be in a Ring. The discussion above suggests that, without precise
definition, the integers are definitely not a Field but might form a Ring
(they do). Note that the concept of greatest common divisor (GCD), and hence Euclid's algorithm,
is meaningless in a Field and only meaningful in a Ring.
[The more common Fields are the Real numbers, the Complex Numbers
and the Rational numbers (fractions).]]
Prime numbers are elements from the Positives (excluding '1') evenly divisible only by '1' and themselves.
Prime number set =
{2, 3, 5, 7, 11, ...}.
Composite numbers are the Positives (excluding '1') not in the Prime number set. Every composite number is a product of two or more prime factors, counted with multiplicity (e.g. 4 = 2.2, 12 = 2.2.3).
Composite number set =
{4, 6, 8, 9, 10, ...}.
An important operation in number theory, almost unknown in school-rooms, is the 'mod' operation. The mod operation returns the remainder following a Floor() division of the 2 input values. Examples :-
3 mod 2 = 1
6 mod 2 = 0
7 mod 8 = 7
Formally, we have :-
if m = x mod y
then m = x - y.Floor(x/y)
Note that :-
(i) Floor() always exists, hence 'mod' always exists.
(The symbol 'q' - quotient - always stands for Floor() below)
(ii) The mod value is zero if and only if 'y' (the denominator)
evenly divides 'x' (the numerator).
(iii) The mod value is from the set {0,1, ... ,(y-1)}
Note that the remainder is always less than 'y'.
The 'mod' operation can be seen to map the Positives to a truncated nonNegative integer set :-
Positives => {0,1, ... (y-1)} after mod.
The mod mapping might be reversed by :-
m + y.q = x (if 'q' happens to be known)
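For positive operands, C's '%' operator performs exactly this mod operation; a small sketch of the mapping and its reversal (assuming the quotient 'q' has been kept) :-

/*****************/
#include <assert.h>

int main (void)
{
  unsigned long x = 1547, y = 560 ;
  unsigned long q = x / y ;     // q = Floor(x/y) = 2
  unsigned long m = x % y ;     // m = x mod y = 427
  assert (m + y * q == x) ;     // the reverse mod mapping
  return 0 ;
}
/*****************/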
Definition :- Two given integers with a GCD of '1' are called mutually prime.
This does not mean that either of the given integers is (necessarily) a prime number - just that they have no common divisor other than '1', which is exactly the situation for any pair of distinct prime numbers.
Thus 18 and 35 are mutually prime but neither is a prime number.
The principal importance of mutual primality is that :-
multiplicative inverses exist if and only if the given pair of numbers are mutually prime (more discussion below).
Euclid's 'Elements', book VII, Propositions 1 and 2 which cover his eponymous algorithm, make no mention of multiplicative inverses; but these are my main objective in this document. It is possible to sneak up on multiplicative inverses by contemplating the calculation of the GCD. From this, multiplicative inverses appear almost incidentally.]
## ------------ ##
Several modern developments of Euclid's algorithm exist, see 'Refs' 3, 9, 10, 11 for details.
I am not going to try to compete with the references above. Reference 9, which is not on many people's shelves, develops the algorithm in an unusual way, which I exploit below, and in the document on this site on fast multiplicative inverses for multi-word arguments.
This document starts with an introductory discussion of Euclid's algorithm (to enable beginners to the subject to 'get on board'); then reproduces the presentation of Ref 9 (I assume that a copy will be unavailable); then adjusts the theory to enable a fast software implementation.
The greatest common divisor is just the largest integer that evenly divides a pair of given integers. Thus :-
GCD(6, 8) = 2; GCD(3, 7) = 1;
GCD(18, 35) = 1; GCD(0, 3) = 3
[Note '0' is divisible by all integers, so GCD(0, n) = n.]
For small integers, such as above, the GCD can be written down by inspection. For larger input values we need a procedure to work steadily towards the result - an algorithm.
Definition :- An algorithm is a procedure which, after a finite number of steps, will yield the desired result.
Reference 11, Algorithm E is a fair fit to Euclid's original comments :-
Find GCD(A, B) given 2 integers {A, B} from the Positives :-
(Recall that GCD(0, n) = n. Use this if either A or B might be zero on entry.)
(Ideally A <= B at the start; but if not, the 1st pass will just swap A with B.)

E1: If A evenly divides B, then A is the GCD.
E2: If B mod A = 1, then A and B are relatively prime, hence the GCD is 1.
    Otherwise replace {A, B} by {B mod A, A} and goto E1.

An Example (from Ref. 12) - find the GCD of {560, 1547} :-

E1 fails; then since 1547 = 2.560 + 427, E2 output = (427, 560)
E1 fails; then since  560 = 1.427 + 133, E2 output = (133, 427)
E1 fails; then since  427 = 3.133 +  28, E2 output = (28, 133)
E1 fails; then since  133 = 4.28  +  21, E2 output = (21, 28)
E1 fails; then since   28 = 1.21  +   7, E2 output = (7, 21)
E1 succeeds - the GCD is 7.
[For a discussion of the logic behind this algorithm, see
euclogic.html. In brief, the algorithm works! Phew!]
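A direct C transcription of Algorithm E might read as follows (a sketch only - the fast code later in this document abandons this structure) :-

/*****************/
// Algorithm E for Positive inputs A, B.
// (If either input might be zero, apply GCD(0, n) = n first.)
unsigned long gcd_E (unsigned long A, unsigned long B)
{
  for (;;)
  {
    unsigned long r = B % A ;   // one DIV yields both tests
    if (r == 0) return A ;      // E1: A evenly divides B
    if (r == 1) return 1 ;      // E2: A and B are relatively prime
    B = A ;                     // otherwise {A, B} = {B mod A, A}
    A = r ;
  }
}
/*****************/

(Note that if A > B on entry, the first pass computes B % A = B and so just swaps the pair, as mentioned above.)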
The basic algorithm, above, is fine for hand calculations. Anyone needing to write a Euclidean module quickly might also use it. But if the ultimate in speed is essential, or if the input variables {A, B} are, say, several processor words in size, the speed will be poor and projects that use it extensively will be unacceptably slow.
Two of the reasons for the slowness of Algorithm E (when applied to processor word sized arguments) will be given here; an understanding of them will help to suggest ways for speeding it up.
(1) The main reason is that the Pentium 'DIV' operation is going to be used. This operation is the slowest operation that a Pentium can perform - by a wide margin! Reference books give an execution time in the order of 39 clock ticks for a 32 bit DIV and 71 for a 64 bit DIV. Furthermore, in a simple-minded reading of Algorithm E, it will be called twice. For computer word sized arguments, in line E1 the test for 'even division' (in C code), would have to be done as :-
if ((B % A) == 0) return A ; // In 'C' B % A = remainder after Floor() division = mod
At the op-code level, the '%' operator calls DIV which calculates both quotient and remainder. A similar call is required in E2. Only the dumbest of compilers will call DIV twice in Algorithm E, but even once carries a heavy time penalty. On top of that, each DIV call makes only a small step towards Algorithm E completion - on average, many DIV calls will be needed. Below, see 'converge' for a discussion of the rate of convergence of Algorithm E. This includes an estimate of the number of calls to DIV likely to be needed.
(2) A secondary reason for the slowness is that in step E2, the working variables are swapped round at each pass. Towards the end of document 'speed', I mention that moving data around, in this fashion, is a waste of time.
[The 'swapping' can be eliminated by using a 2-stage process :-
(1) Find B mod A.
(2) If work completed, exit
(3) Find A mod B
(4) If work completed, exit.
(5) Goto (1)
This may seem to be a trivial adjustment, but when the work is done in x86 assembler (see below), the shortage of registers available to hold variables in 32 bit assembler makes its benefit significant.]
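In C, the swap-free arrangement might look like this (a GCD-only sketch, just to show the shape of the control flow) :-

/*****************/
// u and v stay in fixed variables and are reduced alternately.
// Precondition: v != 0 on entry.
unsigned long gcd_2stage (unsigned long u, unsigned long v)
{
  for (;;)
  {
    u = u % v ;              // (1) find u mod v
    if (u == 0) return v ;   // (2) work complete?
    v = v % u ;              // (3) find v mod u
    if (v == 0) return u ;   // (4) work complete?
  }                          // (5) repeat
}
/*****************/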
In order to guide Euclidean algorithm development, 3 items of information
are needed :-
(1) A 'feel' for the typical performance of the algorithm
(discussed next).
(2) An understanding of the vagaries of Pentium processors.
(see 'speed')
(Substitute your own target processor)
(3) A theoretical framework for the algorithm to prevent
errors in logic.
(see the method of Ref. 9 below)
A detailed summary of the convergence of the Euclidean algorithm can be found in converge.html. In brief, the average number of iterations required approx= (sum of the bitsizes of the inputs) / 3.5, with significant scatter.
Hence for 2# 32 bit inputs :-
Number of iterations = 64/3.5 approx= 18.3
(for convenience, let the number of iterations be 20)
Letting 'q' be the quotient at the next modding step, in approximately 17 out of 20 of these iterations q <= 8; for the remaining 3, q might be large (because of the 'fat tail' of the probability distribution).
Contemplation of this behaviour hints that using 2 modding techniques might be beneficial :-
(a) When 'q' is small, subtraction will be hard to beat.
(b) When 'q' is large, some other technique will be needed.
As a first step towards a faster Euclidean algorithm, I need some solid theoretical foundations, introduced next.
In this section I present the theory leading to a method for obtaining fast multi-word multiplicative inverses - see 'fastmuli' for details. The theory is based on a presentation in 'Ref' 9 section 4.3, which relates to the extended Euclidean algorithm.
[An extended Euclidean algorithm is a Euclidean algorithm extended to calculate multiplicative inverses as well as GCD's.]
[After reading this section, you would be justified in doubting whether there will be any speed payoff from it. Just stick with it.]
The principal problem with what follows relates to the organization of various intermediate values - no new number theory concepts are needed. To bring some order to these values, 'Ref' 9 introduces matrix methods. I am conscious that matrix methods may defeat some readers, but their organizational benefits outweigh their drawbacks.
When only the GCD is sought, the use of matrices is superfluous; when multiplicative inverses are required, their use is very beneficial. (See matnotes.html for a cribsheet of the matrix formulae required below.)
Reference 9 presents a method of building up a 2-by-2 integer matrix M[], stage by stage. This matrix has a determinant which toggles between +1 and -1 at the end of each stage - hence is always invertible. At the end of the calculation the required multiplicative inverse is one of the entries of M[] (more discussion below).
If {u, v} are the input numbers, then
D = GCD(u, v) is found, plus 'a', 'b' values
such that a.u + b.v = D
(the 'a' and 'b' values are elements of M[], up to sign).
[Ideally, but not necessarily, u > v at the start.
Also, the u=0 and/or v=0 cases are excluded.]
Initialize :-
    M[] = {1, 0, 0, 1}      // to the 'Identity' matrix
    n = 0                   // n is a pass counter
Iteration block :-
    while (v not equal 0)
        q = Floor(u/v)          // repeated subtraction might be used here
        M = M * {q, 1, 1, 0}    // if q=1 then post multiply by {1, 1, 1, 0}
        {u, v} = {v, u - q.v}
        n = n + 1
Finalize :-
    d = u
    a = (-1)^n * M22
    b = (-1)^(n+1) * M12
END
Above, I have tried to be faithful to Ref. 9.
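For readers without Ref. 9 to hand, here is a straightforward C rendering of the scheme (my own sketch, not Ref. 9's code; M[] is held as 4 scalars, and signed arithmetic is used so that the finalize step can attach the (-1)^n signs) :-

/*****************/
#include <stdio.h>

// Builds M[] by post multiplying by {q, 1, 1, 0} each pass.
// On exit, a*u + b*v = D = GCD(u, v).
void euclid_ref9 (long u, long v, long *D, long *a, long *b)
{
  long m11 = 1, m12 = 0, m21 = 0, m22 = 1 ;    // M[] = Identity
  long n = 0 ;                                 // pass counter
  while (v != 0)
  {
    long q = u / v, t ;                        // q = Floor(u/v)
    t = m11 ; m11 = m11 * q + m12 ; m12 = t ;  // M = M * {q, 1, 1, 0}
    t = m21 ; m21 = m21 * q + m22 ; m22 = t ;
    t = v ; v = u - q * v ; u = t ;            // {u, v} = {v, u - q.v}
    n = n + 1 ;
  }
  *D = u ;
  *a = (n & 1) ? -m22 : m22 ;                  // a = (-1)^n * M22
  *b = (n & 1) ? m12 : -m12 ;                  // b = (-1)^(n+1) * M12
}

int main (void)
{
  long D, a, b ;
  euclid_ref9 (1547, 560, &D, &a, &b) ;
  printf ("%ld %ld %ld\n", D, a, b) ;   // 7 21 -58: 21*1547 - 58*560 = 7
  return 0 ;
}
/*****************/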
It is more convenient, from here on, to rename the working variables in the column matrix to :-
iter_n[] = {u_n, u_(n+1)}
where the subscript 'n' indicates the values associated with pass counter 'n', i.e. the matrix contents at the end of each iteration.
Initially we have iter_0[] = {u, v}
and M_0[] = {1, 0, 0, 1} = Identity[]
At the end of each iteration the following equation must be satisfied :-
iter_0[] = M_n[] * iter_n[]
by adjusting the contents of M[] and iter[].
The objective in Ref 9's implementation is to arrive at an Mn[] which satisfies the following matrix equation :-
iter_0[] = M_n[] * {D, 0}
where 'D' = GCD(u, v)
The algorithm proceeds by building up Mn[] while simultaneously reducing itern[], all by matrix operations.
Consider a single Euclidean step :-
[u_0 > u_1 is presumed at the start;
and iter_0[] = {u_0, u_1}
will be replaced, stage by stage, by
{u_1, u_2}, {u_2, u_3} ..., {GCD, 0}]
For some intermediate 'nth' pass :-
If q_n = Floor(u_n / u_(n+1))
and u_(n+2) = the remainder = (u_n - q_n.u_(n+1)),
then we have :- u_n = q_n.u_(n+1) + u_(n+2)
u_(n+1) = u_(n+1)
[This last equation is dumb but is needed next.]
These equations can be written in matrix form as :-
{u_n, u_(n+1)} = {q_n, 1, 1, 0} * {u_(n+1), u_(n+2)}
or equivalently
iter_n[] = {q_n, 1, 1, 0} * iter_(n+1)[]
At pass counter 'n' we will have :-
iter_0[] = M_n[] * iter_n[]
We can use this equation to update M[] :-
iter_0[] = M_n[] * {q_n, 1, 1, 0} * iter_(n+1)[]
Hence M_(n+1)[] = M_n[] * {q_n, 1, 1, 0}
Then increment n, and repeat the above
until the u_(n+1) value in iter[] is zero.
After 'n' moves (just as for the Example above),
iter[] will go to {D, 0},
where 'D' is the GCD sought;
simultaneously M[] = {M11, M12, M21, M22}
(with det = +-1) will be found.
(det toggles between +-1 with each pass
since det {q, 1, 1, 0} = -1.)
Reworking the Example above, in Ref 9 style, might help to firm things up :-

Pass   end_values[]    {q, remainder}   M[] at end          determinant of M[]
1      {1547, 560}     {2, 427}         {2, 1, 1, 0}        -1
2      { 560, 427}     {1, 133}         {3, 2, 1, 1}        +1
3      { 427, 133}     {3, 28}          {11, 3, 4, 1}       -1
4      { 133, 28}      {4, 21}          {47, 11, 17, 4}     +1
5      {  28, 21}      {1, 7}           {58, 47, 21, 17}    -1
6      {  21, 7}       {3, 0}           {221, 58, 80, 21}   +1
7      {   7, 0}       END

(M[] is being post multiplied at each pass by {q, 1, 1, 0})
Since the determinant of M[] toggles between +1 (even pass numbers) and -1 (odd pass numbers), we have det = (-1)^(pass number), and in software only the least significant bit of the pass number 'n' needs to be tested to find the determinant.
At the end we have :-
{1547, 560} = {221, 58, 80, 21} * {7, 0}.
Hence the inverse of M[]
= M^-1[] = {21, -58, -80, 221}.
Pre multiplying both sides by the inverse gives :-
{21, -58, -80, 221} * {1547, 560} =
Identity[] * {7, 0} = {7, 0}
Expanding these as 2 simultaneous equations :-
21*1547 - 58*560 = 7 ---- (1)
and
-80*1547 + 221*560 = 0 -- (2)
******
Consider equation (2) first. This equation is a solution of the kind guaranteed by Minkowski's theorem :-
Minkowski's Theorem
Let k > 1 be an integer.
For every pair of integers, {u, v},
there exists a pair of integers, {a, b}, with
0 < abs(a) + abs(b) <= 2 * Ceil (SQRT (k))
such that a.u + b.v = 0 (mod k).
['k' provides an upper bound to the a and b solution values. This upper bound just indicates that values of similar size to the inputs can be found to satisfy the equation. Evaluating for equation (2) gives k > 22651 and SQRT(k) approx= 150.5.]
Some authors refer to solutions to Minkowski's Theorem as being so difficult to obtain that 'exhaustive searching' may be the fastest method. Yet here we have the Ref 9 technique generating solutions almost gratis.
******
Equation (1) is more relevant to the main objective of this document - finding multiplicative inverses rapidly.
++++++
Notes on Multiplicative Inverses
[Caution - the value '0' never has a multiplicative inverse, and is excluded from the theory.]
Definition :-
The multiplicative inverse of a given Positive integer 'a' relative
to a given integer 'n' is a Positive integer 'b' such that
a.b = 1 (mod n) (n an integer >= 2)
The condition for the existence of a 'b'
to satisfy this equation is
GCD(a, n) = 1
(the definition of mutual primality)
The proof of this can be obtained
by applying the 'reverse mod mapping'
to the defining equation above :-
for some a, b and n :- a.b - f.n = 1
(where f = appropriate Floor() value)
If now GCD(a, n) = D, say,
with D not equal to 1;
then since D evenly divides both 'a' and 'n',
D must evenly divide the right-hand side
of the equation as well; but this it does not do.
The only way that it will, is if GCD(a, n) = 1.
Some examples. If 'n' is a composite number, say 9, then only the values mutually prime to '9', i.e. {1, 2, 4, 5, 7, 8}, in the truncated mod n set, have inverses. '1' and '8' are their own inverses; '2' is the inverse of '5', '5' the inverse of '2', '4' the inverse of '7' and '7' the inverse of '4'.
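These inverses are easily confirmed by brute force (a throwaway sketch) :-

/*****************/
#include <stdio.h>

int main (void)
{
  unsigned long n = 9, a, b ;
  for (a = 1 ; a < n ; a++)
    for (b = 1 ; b < n ; b++)
      if ((a * b) % n == 1)
        printf ("%lu.%lu = 1 (mod %lu)\n", a, b, n) ;
  return 0 ;
}
// Output pairs: 1.1, 2.5, 4.7, 5.2, 7.4, 8.8 - the values 3 and 6
// (not mutually prime to 9) never appear.
/*****************/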
If 'n' is a prime number then all elements of set {1,2, ... (n-1)} are mutually prime to 'n' (by the definition of primality) and have inverses. For this case, it is not necessary to calculate the GCD.
Clearly, if a 'b' can be found that satisfies the defining equation,
then (b+n) will also satisfy it. The value of 'b' that is required as the
inverse is the value b (mod n), which will be in
the set {1, 2, ... (n-1)}.
A notation for inverses is :-
a^-1 = b (mod n).
++++++
Repeating equation (1) above :-
21*1547 - 58*560 = 7
Suppose we are trying to find the inverse
of 1547 with respect to 560, then
we are sunk, since GCD (1547, 560) = 7
i.e. the test for the existence of an inverse fails.
If, by chance, we were looking for a number 'b'
such that b * 1547 = 7 (mod 560)
then, applying the 'mod 560' operation
to equation (1) gives
21 * 1547 = 7 (mod 560).
Hence b = 21 is the solution.
Such a solution is not much use in number theory;
but multiplicative inverses appear in all
manner of problems.
Having found that a multiplicative inverse does not exist for our example
problem, we can convert it to a problem that does have a solution by dividing
out the GCD.
Dividing {1547, 560} by 7 gives {221, 90}. Reworking the Ref.
9 method with these start values gives :-
Pass   end_values[]   {q, remainder}   M[] at end           determinant of M[]
1      {221, 90}      {2, 41}          {  2, 1, 1, 0}       -1
2      { 90, 41}      {2, 8}           {  5, 2, 2, 1}       +1
3      { 41, 8}       {5, 1}           { 27, 5, 11, 2}      -1
4      {  8, 1}       {8, 0}           {221, 27, 90, 11}    +1
5      {  1, 0}       END
At the end we have :-
{221, 90} = {221, 27, 90, 11} * {1, 0}.
Hence M^-1[] = {11, -27, -90, 221}.
[Note that 2 of the elements of M[] are the start values.]
Extracting equation (1) again :-
11*221 - 27*90 = 1
Now applying a 'mod 90' operation on this equation gives
11*221 = 1 (mod 90).
Hence 11 is the multiplicative inverse of 221 (mod 90).
If we apply a 'mod 221' operation onto this same equation we get
-27*90 = 1 (mod 221).
Hence -27 is the multiplicative inverse of 90 (mod 221).
[Note that -27 maps to (221-27) = 194 under (mod 221),
and 194 is the preferred value. This technique for handling negative values
would also apply if the determinant had happened to be negative, causing
all the terms in M-1[] to be negated.]
At first encounter, the method sketched out, above, does not offer much prospect for a high speed implementation. Even with machine word sized arguments, the method seems unpromising; with multi-word sized arguments the prospect seems worse.
In order to form a high speed module, a firm grasp of the theory above in conjunction with a knowledge of the techniques needed to extract the best performance from Pentium processors is needed.
[I assume, below, that you are familiar with the basic principles of writing high speed software on Pentium processors. 'speed' introduces this.]
The problems mentioned above that need solving are :-
(i) The use of the slow DIV operation in conjunction with a
slow progress towards completion.
(ii) The swapping round of variables.
(iii) Stage by stage, M[] is updated by a post multiplication by
{q, 1, 1, 0}.
The post multiplication of
M[] = {M11, M12, M21, M22} by {q, 1, 1, 0} results in :-

         col. 1         col. 2
row 1    M11.q + M12    M11
row 2    M21.q + M22    M21
In words, this matrix operation is :-
a multiplication of the 1st column by 'q',
adding this product to the 2nd column,
then swapping columns.
The products call MUL (a minimum of 5 clock ticks)
and the sums call ADD (1 clock tick).
The column swaps require 2 # XCHG (each taking 2 clock ticks).
This workload is small compared with the 39 clock ticks required
by the DIV to find 'q', but not negligible.
This gives a naive total of 57 clock ticks per pass, but the
overhead of MOVing local variables on/off chip will increase this. On average
approximately 19 passes will be required (full 32 bit word random inputs
- see 'converge'), giving a crude time estimate of 1100 to perhaps 1800
clock ticks. This is poor.
############
This section might be skipped. It gives a (pedantic?) theoretical justification used to ease problems (i) and (ii) above.
I tackle the column swap problem first. Solution - get rid of it!
If the end column swap (above) is dropped we get M[] =

         col. 1    col. 2
row 1    M11       M11.q + M12
row 2    M21       M21.q + M22

Dropping the swap counters the column swap that was (surreptitiously) taking place in Ref. 9 - and leaves the determinant always +1, at the price of losing track of column positions. The latter problem is trivially solved by keeping track of the 'pass counter'. Then, at the end, the counter indicates where the matrix elements can be found.
Having column swapped M[], we need to row swap the right hand column
vector to keep the equations in order.
(See 'matnotes', specifically on the matrix I called RCS (short for row-column-swapper).
In the presentation above, the sequence of right hand column vectors was :-
{u_0, u_1}, {u_1, u_2}, {u_2, u_3}, {u_3, u_4} ..., {GCD, 0}
If we row swap this vector alternately, (to counteract the column swapping of M[]) we get :-
{u_0, u_1}, {u_2, u_1}, {u_2, u_3}, {u_4, u_3} ..., {GCD, 0}
From above, the basic transition equations were :-
u_n = q_n*u_(n+1) + u_(n+2)
and u_(n+1) = u_(n+1)
For the first stage (with n=0), these become :-
u_0 = q_0*u_1 + u_2
and u_1 = u_1
then using the row swapped right hand column vector, and mapping into matrix format gives :-
{u_0, u_1} = {1, q_0, 0, 1} * {u_2, u_1}
Updating M[] with this new q-matrix gives :-
{M11, M12, M21, M22} * {1, q_0, 0, 1}
= {M11, M11*q_0 + M12, M21, M21*q_0 + M22}
The new M[] matrix is precisely the column swapped M[] matrix sought above.
For the next stage (with n=1), the transition equations become :-
u_1 = q_1*u_2 + u_3
and u_2 = u_2
then using the row swapped left hand column vector and mapping into matrix format gives :-
{u_2, u_1} = {1, 0, q_1, 1} * {u_2, u_3}
Updating M[] with this new q-matrix gives :-
{M11, M12, M21, M22} * {1, 0, q_1, 1}
= {M11 + M12*q_1, M12, M21 + M22*q_1, M22}
############
In summary, in order to avoid swapping columns around, the following
2 step process has to be adopted :-
Step 1 :- Find 1st 'q' (= Floor(u/v)) value. Then multiply the
left hand column of M[] by q and add it to the right hand column.
Step 2 :- Find 2nd 'q' (= Floor(v/u)) value. Then multiply the
right hand column of M[] by q and add it to the left hand column.
Go to step 1 and Repeat as needed.
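In C, the 2 step process takes the following shape (my sketch, using plain Floor division; the subtraction and shift-sub refinements discussed next replace the divisions in the real module). Which column of M[] holds which final element depends on the parity of the pass counter 'n', exactly as discussed above :-

/*****************/
// Swap-free extended Euclid. Preconditions: u > v > 0.
// Returns the pass count 'n'; M[] is left unswapped.
unsigned long ext_2step (unsigned long u, unsigned long v,
                         unsigned long M[4])  // {m11, m12, m21, m22}
{
  unsigned long m11 = 1, m12 = 0, m21 = 0, m22 = 1, n = 0, q ;
  for (;;)
  {
    q = u / v ;           // Step 1: q = Floor(u/v)
    m12 += m11 * q ;      // left column * q, added to right column
    m22 += m21 * q ;
    u -= q * v ;
    n++ ;
    if (u == 0) break ;   // GCD is v
    q = v / u ;           // Step 2: q = Floor(v/u)
    m11 += m12 * q ;      // right column * q, added to left column
    m21 += m22 * q ;
    v -= q * u ;
    n++ ;
    if (v == 0) break ;   // GCD is u
  }
  M[0] = m11 ; M[1] = m12 ; M[2] = m21 ; M[3] = m22 ;
  return n ;
}
/*****************/

(Run on the {221, 90} example above, ext_2step returns n = 4 with M[] = {221, 27, 90, 11}, in agreement with the table for that example.)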
So, what to do about the DIV needed to find 'q'; and then the MUL by 'q' needed to update the M[] matrix in the way discussed above?
The best policy is to avoid using either of them as far as possible!
The DIV can be replaced by repeated subtractions, and after each subtraction
a column add can be used to update M[], since each subtraction implies
a quotient of '1'. In this way, both the slow DIV and MUL ops vanish.
As discussed above, on average q will be <= 8
in about 17 cases out of 20; and 4 subs and 2*4 adds (for the columns)
+ some looping overhead will take far less time to complete than a DIV
+ 2 MUL's + assorted MOV's.
[Actually, the column adds will occur in parallel on all recent Pentiums - they are data independent adds. See 'speed'.]
In the listing below, the situation is more complicated; after a set number of data subtractions, if the reduction is not complete, a switch is made to a division technique. Although the speed difference of the two techniques is significant, each has a part to play in arriving at an optimal module.
The software listing below has a few further embellishments :-
When I first wrote the module, I naively set out with an intention
of using no DIV's at all. However, in testing I found that 1 in several
thousand test cases ran slowly (some very slowly!) because the occurrence
of large 'q' values is not rare. If q > 10, say, the benefit of using SUB's
and columnar adds is lost.
[A particularly malignant case occurs when the two input values are nearly identical. After the first difference, one parameter is then many bits in size different from the other. From there, many thousands or even millions of SUBs are required for the lesser to reduce the larger. At that point the processor gives a good impression of having entered an infinite loop!]
I solved this problem by introducing a local variable, 'nloop'
(set by '#define INV_NLOOP 7' below). This local variable is reset at the start
of each reduction stage, then decremented after each columnar add.
If it counts down to zero, then a jump is made to
a division section. There, a check on the bit size difference of the current
input values is made. If this exceeds another
#define, 'INV_LOOP_SHIFT 2' (i.e. more than a further 4-7 SUBs would be needed), then a DIV is done to
finish the stage off; if not, SUBing is resumed.
Later, I experimented with replacing the DIV's by
a shift-sub technique, and this resulted in about 10% saving of time; hence
the listing below uses this technique. 10% is not a decisive speed gain for the module below,
but when the multi-word case is considered, the speed gain is considerable.
In this shift-sub technique, the lesser of the 2 given values is shifted
left until both values have equal most significant bit indices. (the 2
relevant column elements of M[] are also shifted left by an identical amount)
An attempt at a SUB is then made. If it succeeds
(no borrow occurs) then a shifted M[] column add is done. If it fails (a
borrow occurs) then the error is corrected by ADDing the lesser value back
and no shifted M[] column add is done.
Next the lesser value plus the 2 shifted column
elements are right shifted by 1 bit, and the process repeated until the
shifted values have returned to their starting values.
This technique might be thought of as reducing the
larger value from the left hand end; whereas the SUB section reduces it
from the right.
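Stripped of the M[] column bookkeeping and register details, the technique can be sketched in portable C (illustrative only; 'bit_index' is my stand-in for the BSR instruction) :-

/*****************/
// Reduce u modulo v by binary 'restoring division': align the
// top bits, then conditionally subtract and shift right, 1 bit
// at a time. In litl_Inv() each successful subtract is paired
// with a shifted M[] column add (omitted here for clarity).
// Preconditions: u >= v > 0.
static int bit_index (unsigned long x)   // index of the top set bit
{
  int i = -1 ;
  while (x) { x >>= 1 ; i++ ; }          // BSR does this in one op
  return i ;
}

unsigned long shift_sub_reduce (unsigned long u, unsigned long v)
{
  int shift = bit_index (u) - bit_index (v) ;
  unsigned long w = v << shift ;         // top bits now aligned
  while (shift-- >= 0)
  {
    if (u >= w) u -= w ;                 // sub succeeds (no borrow)
    w >>= 1 ;                            // bring w down 1 bit
  }
  return u ;                             // u is now u mod v
}
/*****************/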
The shift-subbing method only pays off when 'q' is representable in fewer than a handful of bits (beyond that DIV is a better choice). As the probability tables in 'converge' show, that is very likely to be the case. When larger 'q' values do occur, the shift-sub technique will take longer than using DIV, but these cases are sufficiently rare that on average a little time is saved.
In 'fastmuli' I extend this document to cover the multi-word inputs case. The use of DIV in that case, to handle larger 'q's, is not simple, whereas shift subbing is straightforward; this is another good reason for introducing the shift-sub technique here.
/*****************
 To obtain litl_Inv() from this document use WordPad and load
 this file. Scan down the file to this section, then cut out
 all surplus text. Then 'Save as' the remainder as a text file
 with any convenient name.
 [litl_Inv() is in a 'preformatted' plain text section with no
 HTML tags.]
******************/
/*********************/
// for the dot-h file :-
// The next 2 define's control module flow and
// have an effect on speed.
//#define INV_LOOP_SHIFT 2
//#define INV_NLOOP 7
//
//#define ULONG unsigned long
//
// for the dot-cpp file :-
ULONG litl_Inv (ULONG u, ULONG v, ULONG& m22, ULONG& m12)
//
// Conditions :-
// (1) GCD(u, v) MUST be '1' on entry.
// (2) u != 0, v != 0.
// (3) u greater than v on entry.
// (4) 'n' (the exponent in determinant = (-1)^n) is the
//     returned value.
//
// On exit a.u + b.v = 1 (u,v IP's, a,b OP's)
// with a = (-1)^n.m22; b = (-1)^(n+1).m12
// (If inverse of v mod u is required this is 'b'
//  (or (u-b) if b -ve))
// (If inverse of u mod v is required this is 'a'
//  (or (v-a) if a -ve))
//
// Notes :-
// (1) eax used to hold 'u', and edx 'v' throughout as
//     they are reduced in turn.
// (2) Array M[] held in ecx(m11), esi(m12),
//     ebx(m21), edi(m22)
//
// Caution :-
// On my development system esi & edi are used by the
// compiler to hold register variables. Ordinarily these
// registers need preserving in an asm{} block by
// PUSHing at the start and POPing at the end.
// This is NOT done below largely because development
// systems vary on this point e.g. 64 bit assembler.
// You will need to find out what registers need
// preserving on your system - then adjust the code to suit.
// Failure to get this right will crash your executable!
// In testing the module below, I cheated - I disabled
// register variables in the compiler command line.
//
// No DIV's or MUL's used in this version
// - shift-sub technique used.
//
// This subroutine may be used by anyone.
// I have tested it vigorously but I guarantee nothing.
// If you do use it, please include a reference to my
// web site in your documentation :-
//   www.kahnet.co.uk
// Comments and suggestions can be sent to
//   enquiries@kahnet.co.uk
{
  ULONG m22in, m12in, m11in, m21in, n=0, shift ;
  int nloop ;

  asm {
    mov edx,v
    xor esi,esi
    xor ecx,ecx
    inc esi
    mov eax,u
    mov edi,ecx
    mov ebx,esi              // m[] now {ecx=0, esi=1,
                             // ebx=1, edi=0}
Sub_uv:
    mov nloop,INV_NLOOP
Sub_uv_loop:
    sub eax,edx
    add ecx,esi              // doing column add
    add ebx,edi
    cmp eax,edx              // another sub possible?
    jb Exit_uv
    dec nloop
    jge Sub_uv_loop
    jmp ShiftSub_uv          // after INV_NLOOP subs switch
                             // to ShiftSub technique
Exit_uv:
    inc n
    or eax,eax               // sets zero flag if zero
    je Exit
Sub_vu:
    mov nloop,INV_NLOOP
Sub_vu_loop:
    sub edx,eax
    add esi,ecx              // doing column add
    add edi,ebx
    cmp edx,eax              // another sub possible?
    jb Exit_vu
    dec nloop
    jge Sub_vu_loop
    jmp ShiftSub_vu          // after INV_NLOOP subs switch
                             // to ShiftSub technique
Exit_vu:
    inc n
    or edx,edx
    je Exit
    jmp Sub_uv
/************************/
ShiftSub_uv:                 // In here u (eax) greater than v (edx)
    push ecx
    push ebx
    bsr ecx,eax              // get most significant bit indices of
                             // eax, edx into ecx, ebx
    bsr ebx,edx
    sub ecx,ebx              // ecx now holds difference in bit indices
    pop ebx
    cmp ecx,INV_LOOP_SHIFT   // check size of shift
    jg SS_cont_uv            // if difference large enough proceed
                             // to ShiftSub section
                             // NOTE jg = SIGNED 'jump if greater'.
                             // cmp could be -ve.
    pop ecx
    jmp Sub_uv               // else return to subbing
SS_cont_uv:
    mov shift,ecx            // shift-left edx,esi,edi here
    shl edx,cl               // now most significant bits eax,
                             // edx are aligned
    shl esi,cl
    shl edi,cl
    pop ecx
Shiftsub_dosub_uv:
    sub eax,edx              // try for a shifted-sub
    jnc Shiftsub_coladd_uv   // if edx greater than eax carry
                             // (borrow here) flag set - sub fails
                             // jnc = jump if no carry
    add eax,edx              // reverse the sub above
    jmp Shiftsub_no_coladd_uv  // and skip over
                             // column adding next
Shiftsub_coladd_uv:
    add ecx,esi              // column add by shifted esi, edi
    add ebx,edi
Shiftsub_no_coladd_uv:
    shr edx,1                // shift-right edx,esi,edi by 1 bit
    shr esi,1
    shr edi,1
    dec shift                // when shift reduced to zero,
                             // section finished
    jnz Shiftsub_dosub_uv    // not zero - try another shifted sub
    cmp eax,edx
    jb Exit_uv
    jmp Sub_uv               // eax greater than or equal edx
                             // - 1 more sub needed
/********************/
ShiftSub_vu:                 // In here v (edx) greater than u (eax)
    push esi
    push edi
    bsr esi,edx              // get most significant bit indices
                             // of edx, eax into esi, edi
    bsr edi,eax
    sub esi,edi              // esi now holds difference
                             // in bit indices
    cmp esi,INV_LOOP_SHIFT
    jg SS_cont_vu            // if difference large enough
                             // proceed to ShiftSub section
    pop edi
    pop esi
    jmp Sub_vu               // difference small - return to subbing
SS_cont_vu:                  // shift-left eax,ebx,edi(holds ecx) here
    mov shift,esi
    mov edi,ecx              // take copy of ecx
    mov ecx,esi              // load ecx with shift needed
    shl eax,cl
    shl ebx,cl
    shl edi,cl               // shifting ecx in effect
    mov ecx,edi              // put shifted copy back
    pop edi                  // reload edi, esi
    pop esi
Shiftsub_dosub_vu:
    sub edx,eax              // try for a shifted-sub
    jnc Shiftsub_coladd_vu   // if edx less than eax carry
                             // (borrow here) flag set - sub fails
                             // jnc = jump if no carry
    add edx,eax              // reverse the sub above
    jmp Shiftsub_no_coladd_vu  // skip over column adding next
Shiftsub_coladd_vu:
    add esi,ecx              // column add by shifted ecx, ebx
    add edi,ebx
Shiftsub_no_coladd_vu:
    shr eax,1                // shift-right eax,ecx,ebx by 1 bit
    shr ecx,1
    shr ebx,1
    dec shift                // when shift reduced to zero,
                             // section finished
    jnz Shiftsub_dosub_vu
    cmp edx,eax
    jb Exit_vu
    jmp Sub_vu               // eax greater than or equal edx
                             // - 1 more sub needed
/*************************/
Exit:                        // save results to local variables
    mov m22in,edi
    mov m12in,esi
    mov m11in,ecx
    mov m21in,ebx
  } // end of asm block - back to C code

  // Column swaps have been suppressed above. Reinstate here.
  //
  // Also ULONG& m22, ULONG& m12 are reference input parameters
  // which are the most useful members of m[].
  // In my implementation m11 and m21 are class variables
  // which can be recovered using another class method if
  // the whole of m[] is needed.
  // My assembler does not recognize references which leads
  // to a messy use of local variables which are then passed
  // over to the IP parameter references in the next section.
  // Trim all this waffle to suit your own needs/software
  // capabilities.
  if ((n & 1) == 0)
  { // if n even do col. swap
    m11 = m12in ;
    m21 = m22in ;
    m12 = m11in ;
    m22 = m21in ;
  }
  else
  {
    m11 = m11in ;
    m21 = m21in ;
    m12 = m12in ;
    m22 = m22in ;
  }
  return n ;
}
/********************/
(a) The listing litl_Inv() is dedicated to finding word sized multiplicative inverses in finite fields. For this application the GCD is always '1'; hence it is not checked within. It is simple to adapt the code to handle situations where GCD() may not be '1'.
[The GCD is available within the code - if 'v' goes to zero it is 'u'; if 'u' goes to zero it
is 'v'. If the GCD is not '1', no multiplicative inverse exists, of course.
In this way the module could easily be converted into a GCD finder.
The listing can also easily be converted to a Minkowski Theorem solver. For this problem the other 2 entries of M[] need to be output.]
Also, the input condition that (u > v) is straightforward to apply.
(b) The big question, at the end of the day, is "Is it fast?". A problem with timing GCD modules is that the speed depends on the input values (see the comments in 'converge').
The only way to get typical timings is to use a large 32 bit
prime for 'u' and random 32 bit numbers for 'v' as inputs.
(The Sophie Germain prime mentioned in 'converge', and associated generator were used.)
Then take, say, 16k of timings to average out the scatter. This
was done to obtain the following results.
For the listing litl_Inv(), the timing results were :-
Average ticks = 880,
Minimum ticks = 599,
Maximum ticks = 1167
(all Pentium III, hot cache timings
- see 'speed' for the explanation).
(Inputting Fibonacci numbers
F47 = 0xB11924E1 and F46 = 0x6D73E55F
gives ticks = 508.
Contemplation of the Fibonacci case, where the q-sequence is {1,1,...(45 times)}, leads to the conclusions that :-
(a) Cold cache problems will be absent.
(b) Pipeline crashes will be absent
(the branch predictor system will be 100% successful).
(c) The shift-sub section will never be used.
Hence the minimum hot cache time for 1 stage approx= 508/45 = 11 ticks.)
[Extending the module above to 64 bit assembler gives run times approximately double the values above, since the number
of passes (for 64 bit inputs) will double. In fact I find that the extra registers available with 64 bit code allows a reduction of local variables (to nil!) with a resultant small speed gain. The Pentium III cannot do 64 bit arithmetic, but on an Athlon (also hot cache) the 64 bit version runs in about 1320 ticks.
In 64 bit assembler there is enough CPU register space to find a 128 bit inverse 'on chip' i.e. without resorting to stack space (slow in comparison). This takes a little more than 4 times that for the 32 bit version.]
Compare these to a minimalist Euclidean module in "C" :-
/*****************/
ULONG GCD (ULONG u, ULONG v)
{
  if ((v == 0L) || (u == v)) return u ;
  return GCD (v, u%v) ;
}
/*****************/
GCD() gave timing results :-
Average ticks = 1127,
Minimum ticks = 514,
Maximum ticks = 1818.
[Note the wider scatter from GCD() - the use of DIV (in u%v) will occasionally be fast and occasionally be slow. But its average speed is slower.]
[The Fibonacci pair needed 2411 ticks to run. Note that this is the 'worst case scenario' for GCD(); but the 'best case scenario' for litl_Inv().]
Hence the listing litl_Inv() runs in about 78% of the time needed by GCD(), on average. But this does not compare like with like. GCD() cannot easily be converted to a multiplicative inverse finder, and it would take maybe 3-5 times as long to run if it were.
I also compared it to an old dedicated multiplicative inverse finder which gave an average time of 19430 ticks, i.e. about 22 times slower. This module uses a binary algorithm from Articles 14.61 & 14.64 of ref. 5. This module was written for multi-word inputs (i.e. far from optimal for word sized inputs); also I did not tune it much. So once again the comparison is not very relevant. Nevertheless, I doubt if I could get it better than 3-5 times slower, no matter how much effort I put into it.
These considerations make litl_Inv() probably the fastest small inverse finder available.
When I wrote the Ref. 5 module, finding multiplicative inverses was only a 'spear carrier' in my work. Latterly, finding multi-word inverses has become more significant. The slowness of the Ref. 5 module started me investigating alternative techniques.
The germ of the idea for the listing litl_Inv() comes from 'Ref' 7 pg. 75, where a so called 'blended' GCD algorithm is discussed. The ideas expressed there inspired me to apply them to an extended Euclidean algorithm. Note that the listing only makes sense when a firm theoretical foundation is blended with knowledge of the strengths and weaknesses of Pentium processors.
The listing above was written as a 'proof of concept' module. I have little need for a word sized multiplicative inverse finder; but I needed to know that the principles used within it were likely to be fruitful when applied to the multi-word problem. These hopes were fulfilled. The multi-word extension of litl_Inv() (see fastmuli.html) is in the order of 100 times faster than the standard methods.
One application for litl_Inv() suggests itself - finding the inverses in the key setup stage of IDEA type block ciphers. The time to find these inverses dominates key setup time, and is irritating for heavy users (e.g. users maintaining IDEA secured databases). The ensuing loss of operator productivity can undermine a product's credibility. This problem can be eased considerably by inserting litl_Inv() into IDEA code.
Also, since litl_Inv() can be extended to 64 and 128 bit inverses, it opens the way for 128, 256 and 512 bit IDEA type block ciphers largely free from unpleasantly hesitant performance.
[The running times for these larger modules can be estimated by scaling up the figures above in a linear fashion, so long as a 64 bit processor is to be used. (On a 32 bit processor the lack of register space means stack space would have to be used on the larger modules, with a moderate time penalty from the resulting data MOV's.)]
There is a theoretical problem that must be overcome to put together larger IDEA type block ciphers, but several simple solutions to this problem suggest themselves.
In Ref. 4 (pg 319), well known author Bruce Schneier, says of IDEA :-
"In my opinion, it [IDEA]
is the best and most secure block algorithm
available to the public at this time."
(Ref 4 was published in 1996).
However, today, partial breaks into IDEA have been announced at cryptographic conferences (which almost certainly means that complete breaks have been made in private). Hence it can no longer be considered for 'new build' projects, and existing projects that use it need to consider their future with some urgency. Moving to a larger IDEA type block cipher provides one option for the future.