An Introduction to Euclid's algorithm
including code for a fast word sized
multiplicative inverse finder
for Pentium processors

Euclid's algorithm finds the greatest common divisor (GCD) of 2 integers. This algorithm is very old, but is a foundation stone of number theory to this day.
(It is arguably the oldest algorithm known)

The target audience of this document is high school mathematics students, but the discussion moves steadily along to a level that may stretch them.
  I use a little matrix theory within - high school students may need a textbook on the topic to help out here. Below, I give no more than a summary of the matrix theory that I need.

[Experts may want to skip the first few 'elementary' sections.]

[Mathematical historians do not believe that Euclid (c.325 B.C. - c.265 B.C.) invented the algorithm that bears his name. Euclid worked as an 'encyclopaedist' in some institute in Alexandria; as such he would have been familiar with the whole of Greek mathematics at that time. Beyond that, little is known about him. However, Euclid appears to have been the first to put the algorithm into its present arrangement.]

[Several other documents on this site are referred to repeatedly within. All the links to these documents are given here; return here if you want to view the contents.
fastmuli.html - This extends this document to cover the problem of finding multi-word GCD's and multiplicative inverses. The theory developed below also applies to fastmuli.html.
[A fast module which is the key item for doing this is presented within.]
speed.html - Introduces the basics of writing high speed code for Pentium processors.
converge.html - Discusses the rate of convergence of the Euclidean algorithm.
euclogic.html - Discusses the logical foundations of the Euclidean algorithm.
matnotes.html - A 'crib' sheet for such matrix theory as is used below. This must be used with care since it only applies to 2-by-2 matrices with integer elements.
refs.html - Most of the technical references used on this site are listed within. This site has a second file of references on a special topic. If I refer to this second file, a live link will be provided.
  Below I shall generally write e.g. 'Ref' 11 or 'matnotes' (single quotes) when referring to any of the files above.]

The 32 bit code (i.e. using 32 bit operations operating on 32 bit inputs) listed below is probably the fastest in the world.
  I have also developed (size limited) 64 bit and 128 bit multiplicative inverse finders (both using 64 bit operations) using the principles discussed below, which are probably the fastest available anywhere. These have the additional benefit of small code volumes.
  All the items above are for sale.

## ------------ ##

[Some Mathematical Preliminaries :-
In modern mathematics the integers are whole numbers, and the complete set of them is

Set of Integers =
{-infinity, ... ,-2,-1,0,1,2, ... +infinity}.

In Euclid's day, the Greek integer set excluded negative values and zero; and the only number systems available to Greek mathematicians were positive integers and 'fractions' (a.k.a. Rational numbers) with positive integer numerators and positive integer denominators. In so far as the ancient Greek set of integers is used in number theory today, it is known as either the 'Positive' or the 'Countable' or the 'Natural' integer set; I shall use the first of these.

Set of Positive integers =
{1,2, ... +infinity}.

The nonNegative integer set is also used below; it is the Positive integer set plus the value zero. i.e.

Set of nonNegative integers =
{0,1,2, ... +infinity}.

Below, I shall need to use '0' from time to time, but all input, intermediate and output values, related to Euclid's algorithm, are from the Positives.
Wherever I use the word 'integers' the phrase 'Positive integers' is implied.
Beginners to number theory need to be aware of the following :-

In Number Theory, all input values, all intermediate values and all output values for any calculation are Integers.

[Strictly speaking Number Theory includes the Rationals (mentioned above) and modern Number Theory has a branch that includes the use of the calculus (of necessity associated with Real and Complex numbers) - this body of theory is known as 'Analytic Number Theory'.
Neither is relevant to this document.]

This statement is both profound and subtle. It means that familiar schoolroom calculations may not apply. The most obvious problem comes from Integer division.

NOTES on INTEGER DIVISION
  On a calculator, any number will divide any other number. What may not be clear, to users of calculators, is that (most) calculators only work with floating point numbers, for which division is unrestricted - excluding calculator limitations anyway. The familiar world of calculator arithmetic has to be left behind once number theoretical integer divisions are in play.

For integers, 3 division variants exist :-
simple division, Ceil() and Floor().

For positive inputs, Floor() returns a value
that corresponds to the integer part of the result
from a calculator division (e.g. Floor(3/2) = 1).

Ceil() returns a value of (Floor() + 1)
so long as there is something (a fractional part)
after the decimal point of the calculator division,
or Floor() if there isn't (e.g. Ceil(3/2) = 2).

Simple division is fine so long as the
denominator divides the numerator
without remainder;
otherwise it does not exist.

  Some textbooks use the phrase 'even division' to emphasize the situation where one integer divides another without remainder. If even division occurs, then Ceil(), Floor() and division, all give the same result.
  Note that Floor() might return zero, even when the 2 inputs are from the Positives.
  Note, also, that many textbooks just use the word 'division' when 'Floor()' is intended.

[A more rigorous test for the existence of division is to check whether multiplying the integer part (Floor()) of the (calculator) quotient from the division by the (integer) denominator recovers the (integer) numerator exactly. If, and only if, this is possible, does even integer division exist.]

In summary, we have :-

Floor(3/2) = 1
Ceil(3/2) = 2
3/2 = does not exist!

[This problem with division splits number theory profoundly. The higher level constructs of Fields and Rings are differentiated by whether division is universally possible or not. Division is always possible between any two elements of a Field (with the exception of the 'zero' element), but may not be in a Ring. The discussion above suggests that, without precise definition, the integers are definitely not a Field but might form a Ring, (they do). Note that the concept of greatest common divisor (GCD), and hence Euclid's algorithm is meaningless in a Field and is only meaningful in a Ring.
  [The more common Fields are the Real numbers, the Complex Numbers and the Rational numbers (fractions).]]

  Prime numbers are elements from the Positives (excluding '1') evenly divisible only by '1' and themselves.

Prime number set =
{2,3,5,7,11, ... +infinity}.

  Composite numbers are the Positives (excluding '1') not in the Prime number set. Composite numbers are evenly divisible by two or more members of the Prime number set.

Composite number set =
{4,6,8,9,10 ... +infinity}.

An important operation in number theory, almost unknown in school-rooms, is the 'mod' operation. The mod operation returns the remainder following a Floor() division of the 2 input values. Examples :-

3 mod 2 = 1
6 mod 2 = 0
7 mod 8 = 7

Formally, we have :-

if m = x mod y
then m = x - y.Floor(x/y)

Note that :-
  (i) Floor() always exists, hence 'mod' always exists.
(The symbol 'q' - quotient - always stands for Floor() below)
  (ii) The mod value is zero if and only if 'y' (the denominator) evenly divides 'x' (the numerator).
  (iii) The mod value is from the set {0,1, ... ,(y-1)};
i.e. the remainder is always less than 'y'.

The 'mod' operation can be seen to map the Positives to a truncated nonNegative integer set :-

Positives => {0,1, ... (y-1)} after mod.

The mod mapping might be reversed by :-

m + y.q = x      (if 'q' happens to be known)

Definition :- Two given integers with a GCD of '1' are called mutually prime.

This does not mean that either of the given integers is (necessarily) a prime number - just that they have no common divisor other than '1', exactly as any pair of distinct prime numbers do.
Thus 18 and 35 are mutually prime but neither is a prime number.
The principal importance of mutual primality is that :-

multiplicative inverses exist if and only if the given pair of numbers are mutually prime (more discussion below).

Euclid's 'Elements', book VII, Propositions 1 and 2 which cover his eponymous algorithm, make no mention of multiplicative inverses; but these are my main objective in this document. It is possible to sneak up on multiplicative inverses by contemplating the calculation of the GCD. From this, multiplicative inverses appear almost incidentally.]

## ------------ ##

Several modern developments of Euclid's algorithm exist, see 'Refs' 3, 9, 10, 11 for details.

I am not going to try to compete with the references above. Reference 9, which is not on many people's shelves, develops the algorithm in an unusual way, which I exploit below, and in the document on this site on fast multiplicative inverses for multi-word arguments.

This document starts with an introductory discussion of Euclid's algorithm (to enable beginners to the subject to 'get on board'); then reproduces the presentation of Ref 9 (I assume that a copy will be unavailable); then adjusts the theory to enable a fast software implementation.

The greatest common divisor is just the largest integer that evenly divides a pair of given integers. Thus :-

GCD(6, 8) = 2; GCD(3, 7) = 1;
GCD(18, 35) = 1; GCD(0, 3) = 3

[Note '0' is divisible by all integers, hence GCD(0, n) = n.]

For small integers, such as above, the GCD can be written down by inspection. For larger input values we need a procedure to work steadily towards the result - an algorithm.

Definition :- An algorithm is a procedure which, after a finite number of steps, will yield the desired result.

The Basic Euclidean Algorithm

Reference 11, Algorithm E is a fair fit to Euclid's original comments :-

Find GCD(A, B) given 2 integers {A, B} from the Positives :-
  (Recall that GCD (0, n) = n. 
          Use this if either A or B might be zero on entry)
  (Ideally  A <= B at the start; but if not,
          the 1st pass will just swap A with B.)

   E1: If A evenly divides B, then A is the GCD.

   E2: If B mod A = 1, then A and B are relatively prime,
          hence the GCD is 1.
      Otherwise replace  {A, B}  by  {B mod A, A}
          and goto E1.

   An Example (from Ref. 12)
   Find the GCD of {560, 1547} :-
     E1 fails; then since 1547 = 2.560 + 427, E2 output = (427, 560)
     E1 fails; then since  560 = 1.427 + 133, E2 output = (133, 427)
     E1 fails; then since  427 = 3.133 +  28, E2 output = (28,  133)
     E1 fails; then since  133 = 4.28  +  21, E2 output = (21,   28)
     E1 fails; then since   28 = 1.21  +   7, E2 output = (7,    21)
     E1 succeeds - the GCD is 7.

[For a discussion of the logic behind this algorithm, see
euclogic.html. In brief, the algorithm works! Phew!]

The basic algorithm, above, is fine for hand calculations. Anyone needing to write a Euclidean module quickly, might also use it. But if the ultimate in speed is essential, or if the input variables {A, B} are, say, several processor words in size, the speed will be poor and projects that use it extensively will be unacceptably slow.

Two of the reasons for the slowness of Algorithm E (when applied to processor word sized arguments) will be given here; an understanding of them will help to suggest ways for speeding it up.

(1) The main reason is that the Pentium 'DIV' operation is going to be used. This operation is the slowest operation that a Pentium can perform - by a wide margin! Reference books give an execution time in the order of 39 clock ticks for a 32 bit DIV and 71 for a 64 bit DIV. Furthermore, in a simple-minded reading of Algorithm E, it will be called twice. For computer word sized arguments, in line E1 the test for 'even division' (in C code), would have to be done as :-

 if ((B % A) == 0) return A ;
   // In 'C' B % A = remainder after Floor() division = mod

At the op-code level, the '%' operator calls DIV which calculates both quotient and remainder. A similar call is required in E2. Only the dumbest of compilers will call DIV twice in Algorithm E, but even once carries a heavy time penalty. On top of that, each DIV call makes only a small step towards Algorithm E completion - on average, many DIV calls will be needed. Below, see 'converge' for a discussion of the rate of convergence of Algorithm E. This includes an estimate of the number of calls to DIV likely to be needed.

(2) A secondary reason for the slowness is that in step E2, the working variables are swapped round at each pass. Towards the end of document 'speed', I mention that moving data around, in this fashion, is a waste of time.

[The 'swapping' can be eliminated by using a 2-stage process :-
  (1) Find B mod A.
  (2) If work completed, exit
  (3) Find A mod B
  (4) If work completed, exit.
  (5) Goto (1)
This may seem to be a trivial adjustment, but when the work is done in x86 assembler (see below) the lack of registers to hold variables with 32 bit assembler makes its benefit significant.]

The Rate of Convergence of the Euclidean algorithm

In order to guide Euclidean algorithm development, 3 items of information are needed :-
   (1) A 'feel' for the typical performance of the algorithm (discussed next).
   (2) An understanding of the vagaries of Pentium processors. (see 'speed')
(Substitute your own target processor)
   (3) A theoretical framework for the algorithm to prevent errors in logic.
(see the method of Ref. 9 below)

A detailed summary of the convergence
of the Euclidean algorithm
can be found in converge.html.
In brief, the average number of iterations required
approx= Sum of bitsize of inputs / 3.5.
(with significant scatter)

Hence for 2# 32 bit inputs,
Number of iterations = 64/3.5 approx= 18.3

('q' below is the quotient produced by each modding step)
(for convenience let the # iterations be 20)
in approx 17 out of 20 of these cases q <= 8,
and for the remaining 3, q might be large
(because of the 'fat tail' of the probability distribution).

Contemplation of this behaviour hints that using 2 modding techniques might be beneficial :-

(a) When 'q' is small, subtraction will be hard to beat.

(e.g. on an x86 processor with a slow DIV)

(b) When 'q' is large, some other technique will be needed.

[If q > 20, say, subtraction will be slow
and this event will not be rare.]
[The obvious technique is to use DIV,
but below I use a 'shift sub' method.
[This has little to offer when GCD's are being found,
but is beneficial when finding multiplicative inverses.]]

As a first step towards a faster Euclidean algorithm, I need some solid theoretical foundations, introduced next.

The Presentation of Reference 9

In this section I present the theory leading to a method for obtaining fast multi-word multiplicative inverses - see 'fastmuli' for details. The theory is based on a presentation in 'Ref' 9 section 4.3, which relates to the extended Euclidean algorithm.

[An extended Euclidean algorithm is a Euclidean algorithm extended to calculate multiplicative inverses as well as GCD's.]

[After reading this section, you would be justified in doubting whether there will be any speed payoff from it. Just stick with it.]

The principal problem with what follows relates to the organization of various intermediate values - no new number theory concepts are needed. To bring some order to these values, 'Ref' 9 introduces matrix methods. I am conscious that matrix methods may defeat some readers, but their organizational benefits are greater than their drawbacks.

When only the GCD is sought
the use of matrices is superfluous.
When multiplicative inverses are required,
their use is very beneficial.

(see matnotes.html for a cribsheet
of the matrix formulae required below.)

Pseudo code for the Extended Euclidean Algorithm from 'Ref' 9

Reference 9 presents a method of building up a 2-by-2 integer matrix M[], stage by stage. This matrix has a determinant which toggles between +1 and -1 at the end of each stage - hence is always invertible. At the end of the calculation the required multiplicative inverse is one of the entries of M[] (more discussion below).

If {u, v} are the input numbers
D = GCD (u, v); plus a, b values are found
such that a.u + b.v = D
(the 'a' and 'b' values are elements of M[])

[Ideally, but not necessarily, u > v at start.
Also, the u=0 and/or v=0 cases are excluded.]

   Initialize :-
   M[] = {1,0,0,1}   // to the 'Identity' matrix
   n = 0             // n is a pass counter

   Iteration block :-
   while  (v not equal 0)
      q = Floor(u/v)        // repeated subtraction might be used here
      M = M * {q, 1, 1, 0}  // If q=1 then post multiply by {1,1,1,0}
      {u, v} = {v, u - q.v}
      n = n + 1

   Finalize :-
   d = u
   a = (-1)^n * M22
   b = (-1)^(n+1) * M12
   END

A discussion of the pseudo code above

Above, I have tried to be faithful to Ref. 9.
  It is more convenient, from here down, to change the name of the working variables in the column matrix to :-
    iter_n[] = {u_n, u_(n+1)};
where the subscript 'n' indicates the values associated with pass counter 'n'; i.e. the matrix contents at the end of each iteration.

Initially we have iter_0[] = {u, v}
and M_0[] = {1, 0, 0, 1} = Identity[]

At the end of each iteration, the contents of M[] and iter[] will have been adjusted so that the following equation is satisfied :-

iter_0[] = M_n[] * iter_n[]

The objective in Ref 9's implementation is to arrive at an M_n[] which satisfies the following matrix equation :-

iter_0[] = M_n[] * {D, 0}
where 'D' = GCD (u, v)

The algorithm proceeds by building up M_n[] while simultaneously reducing iter_n[], all by matrix operations.

Consider a single Euclidean step :-
[u_0 > u_1 is presumed at the start;
and iter_0[] = {u_0, u_1}
will be replaced, stage by stage, by
{u_1, u_2}, {u_2, u_3} ..., {GCD, 0}]

For some intermediate 'nth' pass :-

If q_n = Floor (u_n / u_(n+1))
and u_(n+2) = the remainder = (u_n - q_n.u_(n+1)),

then we have :- u_n = q_n.u_(n+1) + u_(n+2)
and u_(n+1) = u_(n+1)

[This last equation is dumb but is needed next.]

These equations can be written in matrix form as
{u_n, u_(n+1)} = {q_n, 1, 1, 0} * {u_(n+1), u_(n+2)}
or equivalently
iter_n[] = {q_n, 1, 1, 0} * iter_(n+1)[]

At pass counter 'n' we will have :-
    iter_0[] = M_n[] * iter_n[]

We can use this equation to update M[] :-
iter_0[] = M_n[] * {q_n, 1, 1, 0} * iter_(n+1)[]
Hence M_(n+1)[] = M_n[] * {q_n, 1, 1, 0}

Then increment n, and repeat the above
until the second entry of iter[] is zero.

After 'n' moves, (just as for the Example above)
iter[] will go to {D, 0},
where 'D' is the GCD sought;
simultaneously M[] = {M11, M12, M21, M22}
(with det = +-1) will be found.

(det toggles between +-1 with each pass
since det {q,1,1,0} = -1).

Reworking the Example above, in Ref 9 style,
                     might help to firm things up :-

Pass end_values[] {q, remainder}   M[] at end   determinant of M[]
 1    {1547, 560}    {2, 427}   {2,    1,  1,  0}    -1
 2    { 560, 427}    {1, 133}   {3,    2,  1,  1}    +1
 3    { 427, 133}    {3,  28}   {11,   3,  4,  1}    -1
 4    { 133,  28}    {4,  21}   {47,  11, 17,  4}    +1
 5    {  28,  21}    {1,   7}   {58,  47, 21, 17}    -1
 6    {  21,   7}    {3,   0}   {221, 58, 80, 21}    +1
 7    {   7,   0}  END

 (M[] is being post multiplied at each pass by {q, 1, 1, 0})

Since the determinant of M[] toggles between +1 (even pass numbers) and -1 (odd pass numbers), we have det = (-1)^(pass number); in software, only the least significant bit of the pass counter 'n' needs to be tested to find the determinant.

At the end we have :-

{1547, 560} = {221, 58, 80, 21} * {7, 0}.
Hence the inverse of M[]
= M^(-1)[] = {21, -58, -80, 221}.

Pre multiplying both sides by the inverse gives :-

{21, -58, -80, 221} * {1547, 560} =
Identity[] * {7, 0} = {7, 0}

Expanding these as 2 simultaneous equations :-

21*1547 - 58*560 = 7 ---- (1)
and
-80*1547 + 221*560 = 0 -- (2)

******

Thinking about equation (2) first: it provides a solution of the kind guaranteed by Minkowski's theorem :-

Minkowski's Theorem

Let k > 1 be an integer.
For every pair of integers, {u, v},
there exists a pair of integers, {a, b}, with
0 < abs(a) + abs(b) <= 2 * Ceil (SQRT (k))
such that a.u + b.v = 0 (mod k).

['k' provides an upper bound to the a and b solution values. This upper bound just indicates that values of similar size to the inputs can be found to satisfy the equation. Evaluating 'k' gives k>22651 and SQRT(k)=150.5 for equation (2).]

Some authors refer to solutions to Minkowski's Theorem as being so difficult to obtain that 'exhaustive searching' may be the fastest method. Yet here we have the Ref 9 technique generating solutions almost gratis.

******

Equation (1) is more relevant to the main objective of this document - finding multiplicative inverses rapidly.

++++++

Notes on Multiplicative Inverses

[Caution - the value '0' never has a multiplicative inverse, and is excluded from the theory.]

Definition :-
The multiplicative inverse of a given Positive integer 'a' relative to a given integer 'n' is a Positive integer 'b' such that

a.b = 1 (mod n) (n an integer >= 2)

The condition for the existence of a 'b'
to satisfy this equation is
GCD(a, n) = 1
(the definition of mutual primality)

The proof of this can be obtained
by applying the 'reverse mod mapping'
to the defining equation above :-
for some a, b and n :- a.b - f.n = 1
(where f = appropriate Floor() value)

If now GCD(a, n) = D, say,
with D not equal to 1,
then since D evenly divides both 'a' and 'n',
D must evenly divide the right-hand side
of the equation as well; but D cannot evenly divide '1'.
Hence an inverse can only exist when GCD(a, n) = 1.

Some examples.

If 'n' is a composite number, say 9,
then only the values mutually prime to '9'
i.e. {1, 2, 4, 5, 7, 8},
in the truncated mod n set, have inverses.

'1' and '8' are their own inverses,
'2' is the inverse of '5',
'5' the inverse of '2',
'4' the inverse of '7' and
'7' the inverse of '4'.

If 'n' is a prime number then all elements of set {1,2, ... (n-1)} are mutually prime to 'n' (by the definition of primality) and have inverses. For this case, it is not necessary to calculate the GCD.

Clearly, if a 'b' can be found that satisfies the defining equation, then (b+n) will also satisfy it. The value of 'b' that is required as the inverse, is the value b (mod n) which will be in
    the set {1, 2, ... (n-1)}.

A notation for inverses is :-
a^(-1) = b (mod n).

++++++

Repeating equation (1) above :-
21*1547 - 58*560 = 7

Suppose we are trying to find the inverse
of 1547 with respect to 560, then
we are sunk, since GCD (1547, 560) = 7
i.e. the test for the existence of an inverse fails.

If, by chance, we were looking for a number 'b'
such that b * 1547 = 7 (mod 560)
then, applying the 'mod 560' operation
to equation (1) gives
21 * 1547 = 7 (mod 560).
Hence b = 21 is the solution.
Such a solution is not much use in number theory;
but multiplicative inverses appear in all
manner of problems.

Having found that a multiplicative inverse does not exist for our example problem, we can convert it to a problem that does have a solution by dividing out the GCD.
  Dividing {1547, 560} by 7 gives {221, 80}. Reworking the Ref. 9 method with these start values gives :-

Pass  end_values[] {q, remainder}   M[] at end   determinant of M[]
 1     {221, 80}      {2, 61}    {  2,  1,  1,  0}      -1
 2     { 80, 61}      {1, 19}    {  3,  2,  1,  1}      +1
 3     { 61, 19}      {3,  4}    { 11,  3,  4,  1}      -1
 4     { 19,  4}      {4,  3}    { 47, 11, 17,  4}      +1
 5     {  4,  3}      {1,  1}    { 58, 47, 21, 17}      -1
 6     {  3,  1}      {3,  0}    {221, 58, 80, 21}      +1
 7     {  1,  0}  END

[Note that the quotient sequence is identical to that of the first Example - dividing both inputs by their GCD leaves every quotient, and hence every M[], unchanged.]

At the end we have :-

{221, 80} = {221, 58, 80, 21} * {1, 0}.
Hence M^(-1)[] = {21, -58, -80, 221}.

[Note that 2 of the elements of M[] are the start values.]

Extracting equation (1) again :-

21*221 - 58*80 = 1

Now applying a 'mod 80' operation on this equation gives

21*221 = 1 (mod 80).

Hence 21 is the multiplicative inverse of 221 (mod 80).

If we apply a 'mod 221' operation onto this same equation we get

-58*80 = 1 (mod 221).

Hence -58 is the multiplicative inverse of 80 (mod 221).

[Note that -58 maps to (221-58) = 163 under (mod 221), and 163 is the preferred value. This technique for handling negative values would also apply if the determinant had happened to be negative, causing all the terms in M^(-1)[] to be negated.]

Speeding up the Ref 9 method

At first encounter, the method sketched out, above, does not offer much prospect for a high speed implementation. Even with machine word sized arguments, the method seems unpromising; with multi-word sized arguments the prospect seems worse.

In order to form a high speed module, a firm grasp of the theory above in conjunction with a knowledge of the techniques needed to extract the best performance from Pentium processors is needed.

[I assume, below, that you are familiar with the basic principles of writing high speed software on Pentium processors. 'speed' introduces this.]

The Ref 9 method applied to word sized arguments

The problems mentioned above that need solving are :-
  (i) The use of the slow DIV operation in conjunction with a slow progress towards completion.
  (ii) The swapping round of variables.
  (iii) Stage by stage, M[] is updated by a post multiplication by
    {q, 1, 1, 0}.

The post multiplication of
M[] = {M11,M12,M21,M22} by {q, 1, 1, 0} results in :-

            col. 1         col. 2
row 1   M11.q + M12    M11
row 2   M21.q + M22    M21

In words, this matrix operation is :-
a multiplication of the 1st column by 'q',
adding this product to the 2nd column,
then swapping columns.

The products call MUL (a minimum of 5 clock ticks)
  and the sums call ADD (1 clock tick).
The column swaps require 2 # XCHG (each taking 2 clock ticks).
  This workload is small compared with the 39 clock ticks required by the DIV to find 'q', but not negligible.
  This gives a naive total of 57 clock ticks per pass, but the overhead of MOVing local variables on/off chip will increase this. On average approximately 19 passes will be required (full 32 bit word random inputs - see 'converge'), giving a crude time estimate of 1100 to perhaps 1800 clock ticks. This is poor.

############

This section might be skipped. It gives a (pedantic?) theoretical justification used to ease problems (i) and (ii) above.

I tackle the column swap problem first. Solution - get rid of it!
If the end column swap (above) is dropped we get M[] =

            col. 1    col. 2
row 1   M11       M11.q + M12
row 2   M21       M21.q + M22

Dropping this swap counters the swap that was (surreptitiously) taking place in Ref. 9 - and leaves the determinant always +1, at the price of losing track of column positions. The latter problem is trivially solved by keeping track of the 'pass counter'. Then, at the end, the counter indicates where the matrix elements can be found.

Having column swapped M[], we need to row swap the right hand column vector to keep the equations in order.
(See 'matnotes', specifically the matrix I called RCS (short for row-column-swapper).)

In the presentation above, the sequence of right hand column vectors was :-

{u_0, u_1}, {u_1, u_2}, {u_2, u_3}, {u_3, u_4} ..., {GCD, 0}

If we row swap this vector alternately, (to counteract the column swapping of M[]) we get :-

{u_0, u_1}, {u_2, u_1}, {u_2, u_3}, {u_4, u_3} ..., {GCD, 0}

From above, the basic transition equations were :-

u_n = q_n*u_(n+1) + u_(n+2)
and u_(n+1) = u_(n+1)

For the first stage (with n=0), these become :-

u_0 = q_0*u_1 + u_2
and u_1 = u_1

    then using the row swapped right hand column vector, and mapping into matrix format gives :-

{u_0, u_1} = {1, q_0, 0, 1} * {u_2, u_1}

Updating M[] with this new q-matrix gives :-

{M11, M12, M21, M22} * {1, q_0, 0, 1}
= {M11, M11*q_0 + M12, M21, M21*q_0 + M22}

The new M[] matrix is precisely the column swapped M[] matrix sought above.

For the next stage (with n=1), the transition equations become :-

u_1 = q_1*u_2 + u_3
and u_2 = u_2

    then using the row swapped left hand column vector and mapping into matrix format gives :-

{u_2, u_1} = {1, 0, q_1, 1} * {u_2, u_3}

Updating M[] with this new q-matrix gives :-

{M11, M12, M21, M22} * {1, 0, q_1, 1}
= {M11 + M12*q_1, M12, M21 + M22*q_1, M22}

############

In summary, in order to avoid swapping columns around, the following 2 step process has to be adopted :-
  Step 1 :- Find the 1st 'q' (= Floor(u/v)) value. Then multiply the left hand column of M[] by q and add it to the right hand column.
  Step 2 :- Find the 2nd 'q' (= Floor(v/u)) value. Then multiply the right hand column of M[] by q and add it to the left hand column.
  Go to step 1 and Repeat as needed.

So, what to do about the DIV needed to find 'q'; and then the MUL by 'q' needed to update the M[] matrix in the way discussed above?

The best policy is to avoid using either of them as far as possible! The DIV can be replaced by repeated subtractions, and after each subtraction a column add can be used to update M[], since each subtraction implies a quotient of '1'. In this way, both the slow DIV and MUL ops vanish.
    As discussed above, on average q will be <= 8 in about 17 cases out of 20; and 4 subs and 2*4 adds (for the columns) + some looping overhead will take far less time to complete than a DIV + 2 MUL's + assorted MOV's.

[Actually, the column adds will occur in parallel on all recent Pentiums - they are data independent adds. See 'speed'.]

In the listing below, the situation is more complicated; after a set number of data subtractions, if the reduction is not complete, a switch is made to a division technique. Although the speed difference of the two techniques is significant, each has a part to play in arriving at an optimal module.

The software listing below has a few further embellishments :-
When I first wrote the module, I naively set out with an intention of using no DIV's at all. However, in testing I found that 1 in several thousand test cases ran slowly (some very slowly!) because the occurrence of large 'q' values is not rare. If q > 10, say, the benefit of using SUB's and columnar adds is lost.

[A particularly malignant case occurs when the two input values are nearly identical. After the first difference, one parameter is then many bits in size different from the other. From there, many thousands or even millions of SUBs are required for the lesser to reduce the larger. At that point the processor gives a good impression of having entered an infinite loop!]

I solved this problem by introducing a local variable, 'nloop'
(set to a #define INV_NLOOP=7 below). This local variable is reset at the start of each reduction stage, then decremented after each columnar add.
    If it counts down to zero, then a jump is made to a division section. There, a check on the bit size difference of the current input values is made. If this exceeds another
#define INV_DIFF_MAX=2, (i.e. more than a further 4-7 SUBs are needed) then a DIV is done to finish the stage off; if not, SUBing is resumed.
    Later, I experimented with replacing the DIV's by a shift-sub technique, and this resulted in about 10% saving of time; hence the listing below uses this technique. 10% is not a decisive speed gain for the module below, but when the multi-word case is considered, the speed gain is considerable.

In this shift-sub technique, the lesser of the 2 given values is shifted left until both values have equal most significant bit indices. (the 2 relevant column elements of M[] are also shifted left by an identical amount)
    An attempt at a SUB is then made. If it succeeds (no borrow occurs) then a shifted M[] column add is done. If it fails (a borrow occurs) then the error is corrected by ADDing the lesser value back and no shifted M[] column add is done.
    Next the lesser value and the 2 shifted column elements are right-shifted by 1 bit, and the process is repeated until the shifted values have returned to their starting values.
    This technique might be thought of as reducing the larger value from the left hand end; whereas the SUB section reduces it from the right.

The shift-subbing method only pays off when 'q' is representable in fewer than a handful of bits (beyond that DIV is a better choice). As the probability tables in 'converge' show, that is very likely to be the case. When larger 'q' values do occur, the shift-sub technique will take longer than using DIV, but these cases are sufficiently rare that on average a little time is saved.
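A minimal sketch of this left-hand-end reduction, in portable C++. The name shift_sub_reduce is mine, and the M[] column bookkeeping is omitted; the listing also detects a failed trial subtraction via the borrow flag rather than a compare :-

```cpp
// Sketch of the shift-sub reduction described above: v is shifted left
// until its most significant bit lines up with u's, then trial
// subtractions are made while shifting back right. (The matching M[]
// column shifts are omitted here.) shift_sub_reduce is an illustrative
// name, not from the listing below.
unsigned long shift_sub_reduce(unsigned long u, unsigned long v)
{
    // Requires u >= v > 0; returns u mod v.
    int shift = 0;
    while (v <= (u >> 1)) {  // align most significant bit indices
        v <<= 1;
        ++shift;
    }
    while (shift-- >= 0) {
        if (u >= v)          // trial subtraction succeeds?
            u -= v;          //  (the asm tests the borrow flag instead)
        v >>= 1;             // step v back toward its starting value
    }
    return u;
}
```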

In 'fastmuli' I extend this document to cover the multi-word inputs case. The use of DIV in that case, to handle larger 'q's, is not simple, whereas shift subbing is straightforward; this is another good reason for introducing the shift-sub technique here.

/***************** 
To obtain litl_Inv() from this document use WordPad and
load this file. Scan down the file to this section, then
cut out all surplus text. Then 'Save as' the remainder as
a text file with any convenient name.
[litl_Inv() is in a 'preformatted' plain text section
with no HTML tags.]
******************/ 
/*********************/ 
// for the dot-h file :- 
// The next 2 define's control module flow and 
//  have an effect on speed. 
//#define  INV_LOOP_SHIFT  2 
//#define  INV_NLOOP       7 
//
//#define  ULONG  unsigned long 
//
// for the dot-cpp file :- 
ULONG litl_Inv (ULONG u, ULONG v, ULONG& m22, ULONG& m12) 
 // 
 // Conditions :- 
 // (1) GCD(u, v) MUST be '1' on entry. 
 // (2) u != 0, v != 0. 
 // (3) u greater than v on entry. 
 // (4) 'n' (the exponent in determinant = (-1)^n) is the
 //       returned value. 
 // 
 // On exit    a.u + b.v = 1    (u,v IP's, a,b OP's) 
 // with a = (-1)^n.m22;   b = (-1)^(n+1).m12 
 // (If inverse of v mod u is required this is 'b'
 //   (or (u-b) if b -ve)) 
 // (If inverse of u mod v is required this is 'a'
 //   (or (v-a) if a -ve)) 
 // 
 // Notes :- 
 // (1) eax used to hold 'u', and edx 'v' throughout as
 //     they are reduced in turn.
 // (2) Array M[] held in ecx(m11), esi(m12),
 //                       ebx(m21), edi(m22) 
 // 
 // Caution :- 
 //  On my development system esi & edi are used by the
 //   compiler to hold register variables. Ordinarily these
 //   registers need preserving in an asm{} block by 
 //   PUSHing at the start and POPing at the end. 
 //   This is NOT done below largely because development
 //    systems vary on this point e.g. 64 bit assembler. 
 //   You will need to find out what registers need
 //    preserving on your system - then adjust the code to suit. 
 //   Failure to get this right will crash your executable! 
 //  In testing the module below, I cheated - I disabled
 //   register variables in the compiler command line. 
 // 
 //  No DIV's or MUL's used in this version
 //   - shift-sub technique used. 
 // 
 // This subroutine may be used by anyone. 
 // I have tested it rigorously but I guarantee nothing. 
 // If you do use it, please include a reference to my
 //  web site in your documentation :- 
 //               www.kahnet.co.uk 
 // Comments and suggestions can be sent to
 //            enquiries@kahnet.co.uk 
{ 
  ULONG m22in, m12in, m11in, m21in, n=0, shift ; 
  int nloop ; 
  asm { 
    mov edx,v 
    xor esi,esi

    xor ecx,ecx

    inc esi

    mov eax,u 
    mov edi,ecx 

    mov ebx,esi        // m[] now {ecx=0, esi=1,
                       //          ebx=1, edi=0} 
Sub_uv: 
    mov nloop,INV_NLOOP 
Sub_uv_loop: 
    sub eax,edx

    add ecx,esi        // doing column add 
    add ebx,edi 

    cmp eax,edx        // another sub possible?

    jb Exit_uv

    dec nloop 

    jge Sub_uv_loop 

    jmp ShiftSub_uv    // after INV_NLOOP subs switch
                       //  to ShiftSub technique 
Exit_uv: 
    inc n

    or eax,eax         // sets zero flag if zero 

    je Exit 
Sub_vu: 
    mov nloop,INV_NLOOP
Sub_vu_loop: 
    sub edx,eax 

    add esi,ecx        // doing column add 
    add edi,ebx 

    cmp edx,eax        // another sub possible?

    jb Exit_vu

    dec nloop 

    jge Sub_vu_loop 

    jmp ShiftSub_vu    // after INV_NLOOP subs switch
                       //  to ShiftSub technique 
Exit_vu: 
    inc n

    or edx,edx 

    je Exit

    jmp Sub_uv 
     /************************/ 
ShiftSub_uv:
      // In here u (eax) greater than v (edx) 
    push ecx 
    push ebx

    bsr ecx,eax        // get most significant bit indices of
                       //  eax, edx into ecx, ebx 
    bsr ebx,edx

    sub ecx,ebx        // ecx now holds difference in bit indices 
    pop ebx

    cmp ecx,INV_LOOP_SHIFT  // check size of shift

    jg SS_cont_uv      // if difference large enough proceed
                       //  to ShiftSub section 
                       // NOTE jg = SIGNED 'jump if greater'.
                       //  cmp could be -ve. 
    pop ecx

    jmp Sub_uv         // else return to subbing 
SS_cont_uv: 
    mov shift,ecx      // shift-left edx,esi,edi here

    shl edx,cl         // now most significant bits eax,
                       //  edx are aligned 
    shl esi,cl

    shl edi,cl

    pop ecx 
Shiftsub_dosub_uv: 
    sub eax,edx        // try for a shifted-sub

    jnc Shiftsub_coladd_uv  // if edx greater than eax carry
      // (borrow here) flag set - sub fails
      // jnc = jump if no carry 
    add eax,edx        // reverse the sub above 
    jmp Shiftsub_no_coladd_uv  // and skip over
                               //  column adding next 
Shiftsub_coladd_uv: 
    add ecx,esi        // column add by shifted esi, edi 
    add ebx,edi

Shiftsub_no_coladd_uv: 
    shr edx,1          // shift-right edx,esi,edi by 1 bit

    shr esi,1

    shr edi,1

    dec shift          // when shift reduced to zero,
                       //  section finished 
    jnz Shiftsub_dosub_uv // not zero - try another shifted sub

    cmp eax,edx 

    jb Exit_uv 

    jmp Sub_uv         // eax greater than or equal edx
                       //  - 1 more sub needed 
/********************/

ShiftSub_vu:           // In here v (edx) greater than u (eax) 
    push esi 
    push edi 

    bsr esi,edx        // get most significant bit indices
                       //  of edx, eax into esi, edi
    bsr edi,eax

    sub esi,edi        // esi now holds difference
                       //  in bit indices 
    cmp esi,INV_LOOP_SHIFT

    jg SS_cont_vu      // if difference large enough
                       //  proceed to ShiftSub section 
    pop edi 
    pop esi

    jmp Sub_vu         // difference small - return to subbing 
SS_cont_vu:            // shift-left eax,ebx,edi(holds ecx) here 
    mov shift,esi 
    mov edi,ecx        // take copy of ecx

    mov ecx,esi        // load ecx with shift needed

    shl eax,cl

    shl ebx,cl

    shl edi,cl         // shifting ecx in effect

    mov ecx,edi        // put shifted copy back

    pop edi            // reload edi, esi 
    pop esi 
Shiftsub_dosub_vu: 
    sub edx,eax        // try for a shifted-sub

    jnc Shiftsub_coladd_vu  // if edx less than eax carry 
      // (borrow here) flag set - sub fails
      // jnc = jump if no carry
    add edx,eax        // reverse the sub above

    jmp Shiftsub_no_coladd_vu  // skip over column adding next
Shiftsub_coladd_vu: 
    add esi,ecx        // column add by shifted ecx, ebx 
    add edi,ebx 
Shiftsub_no_coladd_vu: 
    shr eax,1          // shift-right eax,ecx,ebx by 1 bit

    shr ecx,1

    shr ebx,1

    dec shift          // when shift reduced to zero,
                       //  section finished 
    jnz Shiftsub_dosub_vu

    cmp edx,eax 

    jb Exit_vu

    jmp Sub_vu         // edx greater than or equal eax
                       //  - 1 more sub needed 
    /*************************/ 
Exit:                  // save results to local variables 
    mov m22in,edi
    mov m12in,esi

    mov m11in,ecx 
    mov m21in,ebx 
  }                    // end of asm block - back to C code 
   // Column swaps have been suppressed above. Reinstate here. 
   // 
   // Also ULONG& m22, ULONG& m12 are reference input parameters
   //  which are the most useful members of m[]. 
   // In my implementation m11 and m21 are class variables
   //  which can be recovered using another class method if
   //  the whole of m[] is needed. 
   // My assembler does not recognize references which leads
   //  to a messy use of local variables which are then passed
   //  over to the IP parameter references in the next section. 
   // Trim all this waffle to suit your own needs/software
   //  capabilities.

  if ((n & 1) == 0) {   // if n even do col. swap 
    m11 = m12in ; 
    m21 = m22in ; 
    m12 = m11in ; 
    m22 = m21in ; 
  } else { 
    m11 = m11in ; 
    m21 = m21in ; 
    m12 = m12in ; 
    m22 = m22in ; 
  } 
  return n ; 
} 
/********************/ 
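For testing a module like this, a plain signed extended Euclidean routine makes a convenient oracle. The sketch below (the names ext_gcd and inv_mod are mine, not from the listing) computes GCD(u, v) and the Bezout coefficients directly with signed arithmetic, rather than keeping the unsigned M[] plus the (-1)^n bookkeeping; its outputs can be checked against litl_Inv() via a.u + b.v = 1 :-

```cpp
// A plain signed extended Euclidean routine, usable as a test oracle
// for litl_Inv(): returns g = GCD(u, v) with a*u + b*v == g.
// ext_gcd and inv_mod are illustrative names, not from the listing.
long long ext_gcd(long long u, long long v, long long& a, long long& b)
{
    a = 1; b = 0;              // Bezout coefficients of the current u
    long long a1 = 0, b1 = 1;  // Bezout coefficients of the current v
    while (v != 0) {
        long long q = u / v;
        long long t;
        t = u - q * v;  u = v;  v = t;    // Euclidean step
        t = a - q * a1; a = a1; a1 = t;   // update coefficients in step
        t = b - q * b1; b = b1; b1 = t;
    }
    return u;  // = GCD; a, b satisfy a*u_in + b*v_in == GCD
}

// Inverse of v modulo u (u > v, GCD == 1), recovered as the comments
// in litl_Inv() describe: take 'b', adding u if b is negative.
long long inv_mod(long long v, long long u)
{
    long long a, b;
    ext_gcd(u, v, a, b);
    return b < 0 ? b + u : b;
}
```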

Notes on the Listing above

(a) The listing litl_Inv() is dedicated to finding word sized multiplicative inverses in finite fields. For this application the GCD is always '1'; hence it is not checked within. It is simple to adapt the code to handle situations where the GCD may not be '1'.

[The GCD is available within the code - if 'v' goes to zero it is 'u'; if 'u' goes to zero it is 'v'. If the GCD is not '1', no multiplicative inverse exists, of course.
In this way the module could easily be converted into a GCD finder.

The listing can also easily be converted to a Minkowski Theorem solver. For this problem the other 2 entries of M[] need to be output.]

Also, the input condition that (u > v) is straightforward to enforce :- swap the inputs if necessary.

(b) The big question, at the end of the day, is "Is it fast?". A problem with timing GCD modules, is that the speed depends on the input values. (see the comments in 'converge').
  The only way to get typical timings is to use a large 32 bit prime for 'u' and random 32 bit numbers for 'v' as inputs.
(The Sophie Germain prime mentioned in 'converge', and associated generator were used.)
  Then take, say, 16k of timings to average out the scatter. This was done to obtain the following results.

For the listing litl_Inv(), the timing results were :-

Average ticks = 880,
Minimum ticks = 599,
Maximum ticks = 1167

(all Pentium III, hot cache timings
- see 'speed' for the explanation).

(Inputting Fibonacci numbers
F47 = 0xB11924E1 and F46 = 0x6D73E55F
gives ticks = 508.

Contemplation of the Fibonacci case, where the q-sequence is 44 1's followed by a final 2 (45 stages in all), leads to the conclusions that :-

  (a) Cold cache problems will be absent.
  (b) Pipeline crashes will be absent
(the branch predictor system will be 100% successful).
  (c) The shift-sub section will never be used.
Hence the minimum hot cache time for 1 stage approx= 508/45 = 11 ticks.)
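The Fibonacci claim is easy to check directly. The sketch below (count_unit_quotients is an illustrative name) runs the ordinary Euclidean recursion on the pair above and counts the stages and the unit quotients. 45 stages, all but the last with q = 1, confirms that the SUB path is taken at every stage and the shift-sub section is never entered :-

```cpp
// Count the stages and unit quotients in the Euclidean reduction of
// (u, v). For consecutive Fibonacci numbers - Euclid's worst case -
// every quotient is 1 except the final one (which is 2).
// count_unit_quotients is an illustrative name, not from the listing.
int count_unit_quotients(unsigned long u, unsigned long v, int& stages)
{
    int ones = 0;
    stages = 0;
    while (v != 0) {
        unsigned long q = u / v, r = u % v;
        if (q == 1) ++ones;
        ++stages;
        u = v; v = r;  // next stage
    }
    return ones;
}
```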

[Extending the module above to 64 bit assembler gives run times approximately double the values above, since the number of passes (for 64 bit inputs) will double. In fact I find that the extra registers available with 64 bit code allow a reduction of local variables (to nil!) with a resultant small speed gain. The Pentium III cannot do 64 bit arithmetic, but on an Athlon (also hot cache) the 64 bit version runs in about 1320 ticks.
In 64 bit assembler there is enough CPU register space to find a 128 bit inverse 'on chip' i.e. without resorting to stack space (slow in comparison). This takes a little more than 4 times that for the 32 bit version.]

Compare these to a minimalist Euclidean module in "C" :-

/*****************/
ULONG GCD (ULONG u, ULONG v)
{
  if ((v == 0L) || (u == v)) return u ;

  return GCD (v, u%v) ;
}
/*****************/

GCD() gave timing results :-

Average ticks = 1127,
Minimum ticks = 514,
Maximum ticks = 1818.

[Note the wider scatter from GCD() - the use of DIV (in u%v) will occasionally be fast and occasionally be slow. But its average speed is slower.]

[The Fibonacci pair needed 2411 ticks to run. Note that this is the 'worst case scenario' for GCD(); but the 'best case scenario' for litl_Inv().]

Hence the listing litl_Inv() runs in about 78% of the time needed by GCD(), on average. But this does not compare like with like. GCD() cannot easily be converted to a multiplicative inverse finder, and it would take maybe 3-5 times as long to run if it were.

I also compared it to an old dedicated multiplicative inverse finder, which gave an average time of 19430 ticks i.e. about 22 times slower. This module uses a binary algorithm from Articles 14.61 & 14.64 of ref. 5. It was written for multi-word inputs (i.e. far from optimal for word sized inputs); also I did not tune it much. So once again the comparison is not strictly like for like. Nevertheless, I doubt if I could get it better than 3-5 times slower, no matter how much effort I put into it.

These considerations make litl_Inv() probably the fastest small inverse finder available.

When I wrote the Ref. 5 module, finding multiplicative inverses was only a 'spear carrier' in my work. Latterly, finding multi-word inverses has become more significant. The slowness of the Ref. 5 module started me investigating alternative techniques.

The germ of the idea for the listing litl_Inv() comes from Ref. 7, pg. 75, where a so-called 'blended' GCD algorithm is discussed. The ideas expressed there inspired me to apply them to an extended Euclidean algorithm. Note that the listing only makes sense when a firm theoretical foundation is blended with knowledge of the strengths and weaknesses of Pentium processors.

The listing above was written as a 'proof of concept' module. I have little need for a word sized multiplicative inverse finder; but I needed to know that the principles used within it were likely to be fruitful when applied to the multi-word problem. These hopes were fulfilled. The multi-word extension of litl_Inv() (see fastmuli.html) is in the order of 100 times faster than the standard methods.

One application for litl_Inv() suggests itself - finding the inverses in the key setup stage of IDEA type block ciphers. The time to find these inverses dominates key setup time, and is irritating for heavy users (e.g. users maintaining IDEA secured databases). The ensuing loss of operator productivity from this can undermine a product's credibility. This problem can be eased considerably by inserting litl_Inv() into IDEA code.
  Also, since litl_Inv() can be extended to 64 and 128 bit inverses, it opens the way for 128, 256 and 512 bit IDEA type block ciphers largely free from unpleasantly hesitant performance.
[The running times for these larger modules can be estimated by scaling up the figures above in a linear fashion, so long as a 64 bit processor is to be used. (On a 32 bit processor the lack of register space means stack space would have to be used on the larger modules, with a moderate time penalty from the resulting data MOV's.)]
  There is a theoretical problem that must be overcome to put together larger IDEA type block ciphers, but several simple solutions to this problem suggest themselves.
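To make the IDEA connection concrete :- IDEA's decryption key schedule needs the multiplicative inverse of each 16 bit multiplication subkey modulo 2^16 + 1 = 65537 (a prime, so every nonzero residue has an inverse; in IDEA the subkey value 0 stands in for 2^16). The sketch below (idea_mul_inv is my name) does this with a plain extended Euclid; a tuned routine such as litl_Inv() would take its place in production code :-

```cpp
// Inverse of a 16 bit IDEA multiplication subkey modulo 2^16 + 1.
// In IDEA the value 0 encodes 2^16; the result 2^16 maps back to 0.
// idea_mul_inv is an illustrative name, not from any IDEA source.
unsigned idea_mul_inv(unsigned x)
{
    const unsigned P = 65537;          // 2^16 + 1, prime
    if (x == 0) x = 65536;             // IDEA's encoding of 2^16
    // Plain extended Euclid, kept signed for clarity.
    long long a = 0, b = 1, u = P, v = x;
    while (v != 0) {
        long long q = u / v, t;
        t = u - q * v; u = v; v = t;
        t = a - q * b; a = b; b = t;
    }
    long long inv = (a < 0) ? a + P : a;  // a is the inverse of x mod P
    return (unsigned)(inv & 0xFFFF);      // 65536 maps back to 0
}
```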
  In Ref. 4 (pg 319), well known author Bruce Schneier, says of IDEA :-

"In my opinion, it [IDEA]
is the best and most secure block algorithm
available to the public at this time."
(Ref 4 was published in 1996).

However, today, partial breaks into IDEA have been announced at cryptographic conferences (which almost certainly means that complete breaks have been made in private). Hence it can no longer be considered for 'new build' projects and existing projects that use it need to consider their future with some urgency. Moving to a larger IDEA type block cipher provides one option for the future.


