The World's Fastest 'bignum'
Multiplicative Inverse Finder.

[This document targets x86 Processors,
but is applicable to any processor.]

['Word' in this document means the largest register size accessible to a piece of code. For 32-bit code this is 32-bits; for 64-bit code 64-bits.
Many software writers use 'word' to be 16 bits then use double word for 32-bits, double double word for 64-bits etc.
Also I shall use 'bignum' or multiword when an input needs several words to hold it. Knuth ('Ref 11') uses the phrase 'multi-precision'.]

I am principally interested in multiplicative inverses for inputs around 256-2048 bits in size and this document is aimed at this range.

[This document discusses a 32-bit code multiword multiplicative inverse finder. At the bottom of this document I present some timing results for a 64-bit code version, which I believe to be the fastest available anywhere.

The 64-bit version is available for sale - see sales.html for details.]

[Theoreticians searching for new methods to tackle the multiplicative inverse problem are going to be disappointed. I have just combined some elementary theory with a fair knowledge of optimizing Pentium execution speed and seasoned it with a little lateral thinking.]

[Several other documents on this site are referred to repeatedly within. All the links to these documents are given here; return here if you want to view the contents.
eucnotes.html - This introduces many of the ideas and nomenclature used below. It will be difficult to follow this document without having read it.
[A fast 32-bit sized multiplicative inverse finder is presented within. This listing can be easily extended to cover 128 bit inverses using 64-bit code. I have 64-bit code modules to do this, available for sale.]
speed.html - Introduces the basics of writing high speed code for Pentium processors.
converge.html - Discusses the rate of convergence of the Euclidean algorithm.
matnotes.html - A 'crib' sheet for such matrix theory as is used below. This must be used with care since some of it only applies to 2-by-2 matrices with integer elements.
refs.html - Most of the technical references used on this site are listed within. This site has a second file of references on a special topic. If I refer to this second file, a live link will be provided.
Below I shall generally write e.g. 'ref 11' or 'eucnotes' (single quotes) when referring to any of the files above.]

In 'eucnotes' (following the listing of 'Algorithm E') I mention that the 2 principle problems causing the slowness of Euclidean algorithm implementations are :-
(a) The use of the slow Pentium opcode 'DIV' (or the even slower multiword division).
(b) Moving data around (including swapping variables).

    Back in the era of hand calculation, the slowness of dividing large integers was a bottleneck for performing Euclidean algorithm calculations. (Using modern processors, large integer division is still unpleasantly slow.) This observation led to 'binary' algorithms being developed. These replace divisions by shifts and subtractions which, while much faster, can still be a drag on processor speed.
[A SHRD (needed for multiword right shifts) takes 4-6 clock ticks on an AMD Athlon. Not serious in isolation, but a different story if tens of 1000's are used.]
    Typical of many textbooks, the binary extended Euclidean algorithm can be found as Algorithm 'Y' of Ref. 11 (see the listing due to M. Penk in section 4.5.2 exercise 39) or Articles 14.61; 14.64 of 'ref. 5'. I shall use the label 'algorithm 14.61' below, when I refer to it.
[Algorithm 14.61 uses fewer auxiliary arrays.]
    Algorithm 14.61 solves the multiword problem by manipulating the whole of 6# arrays (2# of them the input arrays (possibly copies) and 4# auxiliary arrays), all the same size as the largest input array. This leads to much moving around of data. Hence the binary extended algorithms take the edge off problem (a), but offer no relief to problem (b), indeed may be the 'worst case scenario' for problem (b).
    Problem (b) is a serious problem for Pentium processors - they have a tiny amount of on-chip register space. With 32-bit code, only 8# 32-bit registers are available; with 64-bit code 16# 64-bit registers are available. Clearly, there is no possibility of holding the 6# working arrays 'on chip' when, say, 512 bit inputs are in play. This means that all 6 arrays will have to be moved on chip, shifted and/or added, then moved off chip again. In short, considerable processor time is going to be spent with Algorithm 14.61 shunting data around. Thus, it's speed is not going to be dazzling. It is fairly simple to implement (i.e. cheap), but if your project uses large sized multiplicative inverses heavily, while speed is important, it will be inadequate.
    [My implementation of algorithm 14.61 runs over 1000 times slower than the code below! This module was just crudely lashed up using modules from other work. Speed was not important when I wrote it. Nevertheless, this illustrates the sort of poor code that can be produced when speed testing (see 'speed') is not carried out and the speed of good class products is unavailable for comparison (provided below). If you have written a multiplicative inverse module why not time it with the input data supplied and see how it compares? Who knows, you might be startled!]
    MOVes can never be eliminated completely, but stripping out as many of them as possible is an obvious 'speed up' technique.
    [Note that if the input data can be held in 1 word (32-bit code) or 2 words (64-bit code) there will be enough Pentium registers to hold all 6# arrays. In such circumstances problem (b) disappears! and algorithm 14.61 will be more competitive.]

Other approaches are in the literature on this subject. One is given in 'ref 11' Algorithm 'L' (due to D.H. Lehmer, 1938). This algorithm uses the leading words of the inputs to find quotients then MUL's to adjust the leading words. It is too complicated to do justice to here, but while DIV is used (only a word sized DIV rather than a multiword division) the array movements are much reduced. Speed gains of 'several' from this technique have been reported.
Lehmer's insight is not entirely 'Pentium friendly'; but was a major development because it points the way towards faster Extended Euclidean algorithms.

[Lehmer's technique of tackling Euclid's algorithm in 'word sized chunks' from the head of the data arrays has led to other variants being suggested :-
A method can be found in the Exercises to 'ref 11' Art 4.5.2 - Exercise 38 (due to R.W. Gosper). This improves on the binary algorithm speed and is claimed to give a speed gain of up to 4. The method I introduce below does much better than this.

Another variation is to use the 'tail' of the data arrays e.g. :-
J. Sorensen ("Two fast GCD Algorithms"; Journal of Algorithms; vol. 16 (1994) pgs 110-144). He calls the algorithms discussed - "A right shift k-ary algorithm" and "A left shift k-ary algorithm". He suggests using small 'k' values (several bits only) and the use of lookup tables. (His method requires tables of Minkowski Theorem solutions which, he suggests, need pre computation.) Updating his concept for large (32 or 64-bit) k values and using the capability of modules like litl_Inv() in 'eucnotes' to find Minkowski solutions quickly, is on my list of future projects.]

The technique I use below extends these ideas. In outline my technique works as follows :-

    (a) Extract a portion of data from the head of the 2# input arrays (2 or 3 computer words).
    (b) Strip bits from the head of these portions in a similar manner to litl_Inv() in 'eucnotes' (very fast). Hopefully, something approaching a computer word of 'stripping' should be possible.
    (c) Using the results of (b), update (reduce) the large input arrays and build up the other 4# auxiliary arrays. This step needs a little theory (given below) to keep the process correctly on track.
    (d) Iterate (return to (a)) until complete.

For my test case (see bottom of this document) when 32-bit assembler was used the reduction step (c) took about double the time taken to execute step (b); but reduces to about half the time when 64-bit assembler was used. The other 2 steps carry a negligible time penalty.

The method litl_Inv() in 'eucnotes' proceeds by subtracting the lesser word sized argument from the larger, over and over until the larger has become the smaller. This has to be done in a 2 stage process :-
    (i) Start by using a simple subtraction.
    (ii) If (i) is not completed after a small number of attempts, check whether a switch to a section able to handle large q-values needs to be made. If so, do this; if not return to (i) for completion.
    All of this reduction work is done while simultaneously building up a word sized 2-by-2 array m[]. The multiplicative inverse sought can be extracted from m[], when the input arrays can be reduced no further.
    If this logic is extended to the multiword sized input problem, it becomes necessary to build up an array M[], say, whose elements end up of similar size to the starting input arrays.
    Doing this as in algorithm L, requires a large number of array adds and subtractions; such an approach would have a speed comparable to the algorithm 14.61 method above, hence not worth pursuing.
    From above, a more fruitful approach would be to somehow find a word sized array m[] from the head of the input arrays and then use this to update both M[] and the input arrays (i.e. reduce them) - then repeat this process (iterate) until the input arrays can be reduced no further.
This is the approach used below.

[ ++++++++++++
I am not going to repeat the theory and symbolism given in 'eucnotes', here; but a summary might be useful :-
[nlw = number of words needed to hold the large unsigned integers in here.]

[Note that my target application for this work was to find the inverse of a member of the multiplicative group in a simple Finite Field. For such problems the possibility of non mutual primality does not arise. The code will need adjustment if your application is different.]

Given 2 mutually prime large integers held in input arrays, say U and V (both nlw in size), and an objective of obtaining a 2-by-2 matrix M[] (with nlw sized elements) such that :-

{U, V} = M[] * {1, 0} (or {0, 1})

If this can be done, then the multiplicative inverse will be one of the elements of M^-1[]. Since the determinant of M[] will be either +1 or -1, M^-1[] will exist, be easy to find, and the desired multiplicative inverse will be an element and can be copied out.
Suppose that a word sized 2-by-2 array m[] could be found that will reduce the size of multiword integers U and V i.e.

{U, V} = m[] * {U_next, V_next} ------------ Eqn. A
such that
U_next < U and V_next < V
then some progress will have been made towards the objective.

The skeleton of a module to find M[] would then be :-

Step 1 :-

Initialize M[] to {1, 0, 0, 1} (the Identity array)
At this point we have the obvious starting relationship :-

{U, V} = M[] * {U, V}

Step 2:-

Find an m[] satisfying Eqn A - from the head of the current {U, V} arrays.

Step 3:-

Insert m[] into the starting equation by substitution :-

{U, V} = M[] * {U, V} = M[] * m[] * {U_next, V_next}
then let M_next[] = M[] * m[]
so we have {U, V} = M_next[] * {U_next, V_next}
and {U_next, V_next} = m^-1[] * {U, V}

(from Eqn. A by pre multiplying
both sides by m^-1[])

Step 4:-

If {U_next, V_next} is {0, 1} (or {1, 0}) -- EXIT
else go to step 2.

In other words, if an m[] can be found, then we can update M[] by post multiplication by m[] and update the input arrays by pre multiplication by m^-1[]. These will be multiword sized array times word multiply operations using predominantly MUL and ADD processor op-codes, which will be much faster than a multitude of array adds and shifts.
[I mention in 'speed' that, for an AMD Athlon, a MUL op has a latency of 5 clock ticks and a throughput of 2 clock ticks. In all the books listed in 'refs' it is an under-current that a MUL is very slow and will wreck the speed of any algorithm that uses it.

This was true 30 years ago!

However, today the opposite is true. The amount of arithmetic done by a MUL op is immense, and the time taken to do it is tiny.
This means that if an algorithm making heavy use of MUL is in competition with an algorithm that doesn't, the former will almost certainly win out on speed.]
++++++++++++ ]

Everything above depends on finding a word sized 2-by-2 array m[] with the properties required in Eqn. A. In fact, more conditions need to be met by m[] :-
    (a) The size of the elements of m[] need to be as close to a full word as possible. Clearly, the larger these elements are, the faster M[] will be built up and the faster the input arrays will be reduced. In this way the number of iterations will be minimized, and so will the overall module run time.
    (b) The calculation of m[] needs to be theoretically sound; in particular the determinant of m[] must be either +1 or -1. Any other value will result in m[] and M[] being non-invertible. This consideration means that the 'ref 9' presentation reproduced in 'eucnotes' (used in litl_Inv()) has, once again, to be the guiding light in calculating m[].
    (c) The calculation of m[] has to be fast. This means the module should ideally be written in assembler, so as to extract the best possible performance from a Pentium processor.

If we were to start doodling with pencil and paper to experiment with methods for finding m[], the first method to be tried might be :-
    (i) Extract (into a 32 (or 64) bit word) the most significant 32 (or 64) bits from the larger of the input arrays by left shifting so that the most significant bit is set.
    (ii) Ditto for the smaller input array, [keeping the shift equal to that from (i)]. This may mean that some of the most significant bits, here, are zero.
    (iii) Use a litl_Inv() clone on the single words from (i) and (ii) to calculate m[].

The result from (iii) might then be checked with an m[] derived by using the whole of the input arrays. At that point an ugly reality would present itself - the 2 m[]'s will not (in general) be equal. The reason for the discrepancy is that the 'borrows', from the bits of the input arrays that are excluded in (iii), are ignored; but are included in the check calculation. If in (iii) the derivation of m[] were stopped after 12-16 bits of precision had been accumulated, a usable m[] might result; but this does not meet condition (a), above.

Taking on board the lessons from above, we can adjust things to :-
    (1) Extract into a pair of words the most significant 64 (or 128) bits from the larger of the input arrays.
    (2) Ditto for the smaller input array, while keeping the left shift for this pair of words equal to that from (1). Again, this may mean that some of the most significant bits are zero.
    (3) Use a litl_Inv() clone on the double words from (1) and (2) to calculate an m[] with word sized entries).

Here, in essence, the 2nd 32 (or 64) bit word is used as a 'buffer' to stop the neglected borrows from damaging the most significant word. This is the technique used in strip_30() listed below.
[The name strip_30() comes from the objective of 'stripping' 30 bits from the input arrays in each step while building up M[] by a similar amount. On occasion, an early exit is the safest course, and less than 30 bits will be stripped; but this outcome is unusual. For my 64-bit version, I call this module strip_62() but below I shall keep referring to strip_30() for simplicity.]

The section above gives an overview of the method below; this section discusses some adjustments found necessary during development to make it all work :-
(Below, the arrays u[2] and v[2] are the pair of 32 (or 64) bit words that need to be filled from the input arrays by simultaneously copying and shifting so that the most significant bit of u[] (or v[]) is set.)
    (a) I found it necessary to prevent either u[0] or v[0] being reduced to zero. If either value is reduced to 2 bits or less, the method 'bails out' (a variable 'enough' enforces this). Also, if a 'zero flag' is generated when subtracting u[0] from v[0] (or vice versa), it is possible that some (possibly many) bits of u[1] (or v[1]) also become zero. But the accuracy of these values is steadily degraded as the calculation proceeds and so the subtraction that raises the zero flag, may be a subtraction too many. The only conservative decision, in this event, is to bail out.
    (b) From (a) it will be clear that strip_30() is biased towards protecting the least significant bits of u[0] and v[0]. This conservative philosophy, means that on occasion strip_30() will make no progress (e.g. if u[0] were equal to v[0] on entry). If this occurs the array m[] that is returned would be an identity array. This, in turn, means that the post multiplication of M[] by m[] leaves M[] unchanged, and the pre multiplication of the input arrays by m^-1[] leaves the input arrays also unchanged.
    In my implementation strip_30() is buried within a while{} loop (controlled by input array bit size) so the chance of an infinite loop being entered is small but not zero. A set of conditions, that I have successfully used to protect strip_30() from this malignant possibility, are given below. These, or similar conditions, are not optional.

I am not going to discuss strip_30() in any depth here. The theory behind litl_Inv() also applies here, and this can be found in 'eucnotes'. However, a few adjustments have been made :-
    (i) I found that allowing the determinant of m[] to toggle between +1 and -1 (as in litl_Inv()), introduced pointless complexity into several subroutines. By introducing minor adjustments, strip_30() only returns an m[] with a determinant of +1.
[A rearrangement of working variables achieves this - see the "method to avoid column swapping" section in 'eucnotes' just below a '############' page divider.]
    (ii) The restriction in litl_Inv() that the first input be larger than the second, is inconvenient here - the stripped U and V values will change erratically as regards to which is the greater. I have eliminated it by using an early test and jump.
    (iii) The allocation of registers to elements of m[] has been altered (only to make it easier to watch m[] develop in the debugger).

The changes to the theory from the adjustments above is both obvious and self documenting.

Although strip_30() is targeted for use in finding inverses in simple (mod p) Finite Fields it could be transformed into a fast GCD finder in a straightforward way.]
After the listing of strip_30() I shall give pseudo code for the sort of module that is needed to drive and protect it. Finally, I present some timings for the module.

/******************
To obtain strip_30() from this document use WordPad and
load this file. Scan down the file to this section, then
cut out all surplus text. Then 'Save as' the remainder as
a text file with any convenient name.
[strip_30() is in a 'preformatted' plain text section
with no HTML tags.]
******************/
/*****************/
void
strip_30 (ULONG *u, ULONG *v)
 // The inputs to this routine are expected to be :- 
 //  (a) 2*32 bits long i.e. declared in the calling routine as :- 
 //        ULONG u[2], v[2] ; 
 //  (b) The greater of u[] or v[] is left justified so that the 
 //        most significant bit of the greater is set on entry. 
 //  (c) In order to achieve (b), BOTH u[] and v[] MUST be 
 //        left shifted by an EQUAL AMOUNT. 
 // 
 // Caution :- 
 //  On my development system esi; edi are used by the compiler
 //  to hold register variables. Ordinarily these registers need
 //  preserving in an asm{} block by 
 //        PUSHing at the start and POPing at the end. 
 //   This is NOT done below largely because development systems
 //   vary on this point. (e.g. 64-bit assembler)
 //   You will need to find out what registers need preserving
 //    within an asm{} block on your system - then adjust the code
 //    to suit. 
 //   Failure to get this right will crash your executable.
 //   In testing the module below, I cheated - I disabled register
 //    variables in the compiler command line. 
 // 
 // The next is for the relevant dot-h file :- 
 // #define  ULONG  unsigned long 
 // 
 // [The 4 #defines, below, all affect the speed of the module - 
 //  none of them very heavily. 
 //   The speed of the module is dependent on the input values; so
 //    the only way to find optimal values for the includes is to
 //    find the total time for, say, 64k random inputs 
 //    (see more comments below) - and take the average. 
 //   This was done to set the values below.] 
 // [The 2 #defines, next, are similar in function to the values
 //   in litl_Inv(). 
 //  I found that values of 3 and 1 were the fastest, but the
 //   difference was less than 1%, hence not worth changing.] 
 // #define  INV_NLOOP       7 
 // #define  INV_LOOP_SHIFT  2 
 // 
 // The next are used by my calling routine :- 
 //  (see pseudo code further down in this document) 
 // #define  MAX_DIV_SHIFT   24 
 // #define  SUB_SHIFT_GAP  (0x0000FFFF) 
 // 
 // Register allocations used within :- 
 //  (1) eax, ebx are data holders 
 //  (2) Array m[] held in ebx(m11), ecx(m21),
 //                        esi(m12), edi(m22) 
 // This subroutine may be used by anyone. 
 // I have tested it vigorously but I guarantee nothing. 
 // If you do use it, please include a reference to my web
 //  site in your documentation :- 
 //               www.kahnet.co.uk 
 // Comments and suggestions would be welcomed at 
 //            enquiries@kahnet.co.uk 
 // 
{ 
  ULONG u0, u1, v0, v1, enough ; 
  ULONG m22in, m12in, m11in, m21in, shift ; 
  int nloop ; 
  asm { 
    mov esi,u 
    mov ecx,v 

    mov enough,3 
    mov ebx,1 

     // just moving u[], v[] onto stack where access is easier 
    mov eax,[ecx+4]  // v1 
    mov edx,[esi+4]  // u1 

    mov v1,eax 
    mov u1,edx 

    mov eax,[ecx]    // v0 
    mov edx,[esi]    // u0 

    mov v0,eax 
    mov u0,edx 

    xor ecx,ecx 
    xor esi,esi 

    mov edi,ebx      // m[] now Identity = {1,0,0,1} 
    cmp edx,eax      // cmp u0 to v0 

    jae Sub_UV 

    mov eax,edx      // load eax with u0 

    jmp Sub_VU 
Sub_UV: // u[] greater than v[] here 
        // AND v[0] not = 0 
        //  (tested by 'enough' below and calling routine)
        // AND eax loaded with v[0] 
    mov nloop,INV_NLOOP 
    mov edx,v1 
Sub_UV_loop: 
    sub u1,edx 

    sbb u0,eax 

    je  Exit      // 1st Point 'X' - see comments below 

    jb  Exit_UV_uflow 

    add esi,ebx   // add left col m[] to right 
    add edi,ecx 

    cmp u0,eax 

    jbe  Exit_UV  // u[0] less than or equal eax 

    dec nloop 

    jge Sub_UV_loop

    jmp ShiftSub_UV 
Exit_UV_uflow: 
     // sub has caused borrow - add back to cancel 
    add u1,edx 

    adc u0,eax 
Exit_UV: 
    mov eax,u0 

    cmp v0,eax 

    je  Exit   // too risky to proceed 

    cmp eax,enough 

    jbe Exit 
Sub_VU: // v[] greater than u[] here 
        //  AND u[0] not = 0
        //    (tested by 'enough' below and calling routine)
        //  AND eax loaded with u[0] 
    mov nloop,INV_NLOOP 
    mov edx,u1 
Sub_VU_loop: 
    sub v1,edx 

    sbb v0,eax
 
    je  Exit      // 2nd Point 'X' - see comments below 

    jb  Exit_VU_uflow 

    add ebx,esi  // add right col to left 
    add ecx,edi
 
    cmp v0,eax 

    jbe  Exit_VU 

    dec nloop 

    jge Sub_VU_loop 

    jmp ShiftSub_VU 
Exit_VU_uflow: 
     // sub has caused borrow - add back to cancel 
    add v1,edx
 
    adc v0,eax 
Exit_VU: 
    mov eax,v0 

    cmp u0,eax 

    je  Exit 

    cmp eax,enough 

    jbe Exit 

    jmp Sub_UV 
     /************************/ 

ShiftSub_UV: 
  // arrives with v1 in edx, v0 in eax 
  // AND u[] greater than v[]
  // AND v0 not = 0 
    mov edx,u0 
    push edi 

    bsr edi,eax      // v0 
    push esi 

    bsr esi,edx      // u0 
    mov edx,v1       // eax:edx back to v0:v1 

    sub esi,edi      // esi now the bit-size difference
                     //  of u0 and v0 

    cmp esi,INV_LOOP_SHIFT 

    jg SS_cont_UV 

    pop esi 
    pop edi 

    jmp Sub_UV  // continue by subtraction
SS_cont_UV: 
    mov shift,esi 
    mov edi,ecx 

    mov ecx,esi 
     // shift left eax:edx so msbit of eax aligned
     //  with msbit of u0 
    shld eax,edx,cl 

    shl edx,cl 
     // and left shift left col of m[] by same 
     //  (edi holding ecx copy) 
    shl edi,cl 

    shl ebx,cl 
    mov ecx,edi 

    pop esi 
    pop edi 
SS_dosub_UV: 
    sub u1,edx 

    sbb u0,eax 

    ja SS_coladd_UV     // subtraction succeeds
                        //  - jump to col add 
     // if zero or -ve add back 
    add u1,edx 

    adc u0,eax 

    jmp SS_no_coladd_UV 

SS_coladd_UV: 
    add esi,ebx 
    add edi,ecx 

SS_no_coladd_UV: 
    shrd edx,eax,1 

    shr eax,1 

    shr ecx,1 

    shr ebx,1 

    dec shift 

    jnz SS_dosub_UV 

    jmp Sub_UV 
/********************/ 

ShiftSub_VU: 
  // arrives with v1 in edx, v0 in eax 
  // AND v[] greater than u[] 
  // AND u0 not = 0 
    mov edx,v0 
    push ebx 

    bsr ebx,eax      // u0 
    push ecx 

    bsr ecx,edx      // v0 
    mov edx,u1       // eax:edx back to u0:u1 

    sub ecx,ebx      // ecx now the bit-size difference
                     //  of v0 and u0 
    cmp ecx,INV_LOOP_SHIFT 

    jg SS_cont_VU 

    pop ecx 
    pop ebx 

    jmp Sub_VU 
SS_cont_VU: 
    mov shift,ecx 
     // shift left eax:edx so msbit of
     //  eax aligned with msbit of v0 
    shld eax,edx,cl 

    shl edx,cl 
     // and left shift right col of m[] by same 
    shl esi,cl 

    shl edi,cl 

    pop ecx 
    pop ebx 
SS_dosub_VU: 
    sub v1,edx 

    sbb v0,eax 

    ja SS_coladd_VU     // subtraction succeeds
                        // - jump to col add.
                        // If zero or -ve add back 
    add v1,edx 

    adc v0,eax 

    jmp SS_no_coladd_VU 

SS_coladd_VU: 
    add ebx,esi 
    add ecx,edi 

SS_no_coladd_VU: 
    shrd edx,eax,1 

    shr eax,1 

    shr esi,1
 
    shr edi,1 

    dec shift 

    jnz SS_dosub_VU 

    jmp Sub_VU 
    /*******************/ 

Exit: 
    mov m12in,esi    // transfer register variables to locals 
    mov m22in,edi 

    mov m11in,ebx 
    mov m21in,ecx 
  } 
   // My assembler does not recognise class variables
   //  which leads to a messy use of local variables here,
   //  which are then passed over to the class variables
   //  in the high level section next. 
   // Trim all this waffle to suit your own
   //  needs/software capabilities. 
  m11 = m11in ;      // transfer m[] contents to class variables 
  m21 = m21in ; 
  m12 = m12in ; 
  m22 = m22in ; 
  return ; 
} 
/*********************/

In the discussion next, I shall assume :-
    (a) U and V (upper case) are the large data arrays in a routine I call get_Inv() which calls strip_30().
    (b) u[] and v[] (lower case) are the 2-ULONG arrays handed to strip_30().
    (c) U > V - reverse the logic if not.

As will be clear from the comments above, strip_30() might return without making any progress. This event will be rare; but the calling routine must filter out these cases - otherwise the calling routine will enter an infinite loop.
    Actually these malignant cases are best handled by the calling routine anyway. Hence get_Inv() MUST protect strip_30().
    The critical points for this problem are labeled 'X' in the listing above. After a preamble setting up various registers, strip_30() starts work by subtracting v[] from u[] (or vice versa). If this first subtraction results in the zero flag being set, the 'je Exit' step will be taken (jump to Exit label if preceding subtraction generated a result equal to zero). The most obvious cause of this is if u[0] = v[0] on entry. However, if u[0]+1 = v[0] AND u[1] - v[1] generates a borrow, it will also occur.
    This problem occurs (in general) when many bits at the head of U and V are identical.
    Thinking about program flow as a whole, if

a subtraction of (U - V)
were made within get_Inv(),
an m[] of {1,1,0,1}
(or {1,0,1,1} if V > U)
is generated.

Using such an m[] to update M[] only requires 2# array adds. No word-times-array multiplies followed by array adds, are required. Simultaneously, many of the most significant bits of the larger of U or V will be removed. This is a large step forward for the calculation as a whole, for a trivial time penalty - a very beneficial step.

Once the benefit of this step is identified, it is natural to try to make more use of it. This can be done by testing whether 'some' of the most significant bits of u[0] and v[0] are equal.

This can be done by :-

if ((u[0] OR SUB_SHIFT_GAP)==(v[0] OR SUB_SHIFT_GAP))
...

The value of SUB_SHIFT_GAP is #defined as 0x0000FFFF in the strip_30() header.
Hence, this test is TRUE if the 1st 16 bits of u[0] and v[0] are identical.
SUB_SHIFT_GAP was set after experimenting with 64k timings (using random inputs) to optimize it (details below).

You might think that the conditional statement, above, will sort the strip_30() early exit problem - but not quite.

For example, if u[0] = 0x00010000
and v[0] = 0x0000FFFF
the test above would produce FALSE,
then all that is then needed is for
(u[1] - v[1]) to generate a borrow,
for an infinite loop to follow.

To close this loophole I set a variable
called 'diff' to abs(u[0] - v[0])
and then do a high level subtract move, as discussed above, when 'diff' is tiny.

------------
A pseudo code listing of the multiplicative
  inverse finder, get_Inv() (this drives strip30()).

get_Inv() needs several 'spear carrying' modules to
make it work; I am not going to list these modules,
since anyone seriously interested in using this 
document, will have their own generic modules on hand
to build them. 
get_Inv() is a public method in a class I call
FieldInv. The constructor of this class is called with
details of the Field prime as parameters, and the
number of bits in the prime are found there.

Below I use curly brackets {## ... ##} to delimit a
piece of work not detailed. In general a subroutine
(or several of them) will be needed; but these
are always elementary.

/********************/ 
ULONG *                 // Returns a ptr to the inverse 
FieldInv::get_Inv (ULONG *pmember) 
  // pmember[] MUST be in set {1,2...(p-1)} 
  //  (i.e. a multiplicative group member) 
{ 
  Local variables of interest :- 
   (1) ULONG comp=0 ; // comp even - prime[] > member[];
                      //  else odd 
                      // Note (prime[] == member[])
                      //  cannot occur anywhere in 
                      //  the calculation - this is a
                      //  consequence of mutual primality. 
   (2) ULONG diff ;   // abs(u[0] - v[0]) 
   (3) ULONG u[3], v[3] ; // The leading 2 words of the data
                      // arrays. For my modules, it is easier
                      //  to set the 1st 2 ULONG's 
                      //  (the parts used by strip_30())
                      //  when 3 ULONG's of space
                      //  are available.
                      //  Otherwise '2' will be adequate.
                      // u[] is always filled from the prime[]
                      // and v[] from the member[] 

  {## Find the most significant bit of the member ##}
    //  - the msb of the prime is set in the constructor

    // I have a method which will find the inverse of a
    //  word sized (<= 32 bits) member of a multiword Field.
    // I call this here. Optional.

  {## Set variables deltabits to the bit-difference of
     prime[] and member[], 
   and maxnbits to the bit-size of the larger value
     = prime[] bits on entry. ##}

  {## Take copies of prime[] and member[] into U and V. ##}
    // These are the 'working' data arrays 'reduced' below. 

  {## Set M[] to Identity. ##}

  while (maxnbits > 32) { 
    if (deltabits >= MAX_DIV_SHIFT) {
        // MAX_DIV_SHIFT=24 #defined in strip_30() 
        // If deltabits large, a full array division is
        //  the best move.
        // (I use Algorithm D of 'ref 11', section 4.3.1)

      do_big_div_move (comp) ;
        // Does an array division
        //  + updates the data arrays and M[] 
    } else { 
      {## set up u[2] and v[2] so that the larger
         is left justified. ##}
        // I have a method which does a copy with a
        //  simultaneous left shift on an array 
        //  - I use this here.
        // (Might be done in high level 'C' with little
        //   loss of time.)
        // (If maxnbits < 64, any non-active bits
        //   of u[], v[] must be zeroed.) 

      diff = abs(u[0] - v[0]) ;

      if (((u[0] | SUB_SHIFT_GAP)
                 == (v[0] | SUB_SHIFT_GAP))
           || (diff < 3)) { 
        {## do the subtraction move discussed above. ##}
      } else { 
        strip_30 (u, v) ;   // find m[]

        {## Use m[] to reduce the data arrays and update M[] ##}
      } 
    } 
    {## reset the data array nlw and bit sizes;
         reset maxnbits, comp, deltabits. ##}
  } 

    // Here both U and V have been reduced to <= 32 bits. 

  if the data arrays do NOT contain {1, 0} or {0, 1} 
     use litl_Inv() to find the last m[] 
     Update M[] using the last m[] 
  } 
  return the Inverse ptr (an element of M[] 
    - see 'eucnotes' for details) 
} 
/********************/

Timings for get_Inv()

Timings for get_Inv() were obtained using a technique similar to that for litl_Inv(). The speed of calculation for any Euclidean type of module is input data dependent; and the only way to get a representative timing is to average the results from a large number (64k was used below) of random inputs. As for litl_Inv(), I used a Sophie Germain prime and an associated generator to find random multiplicative group elements.

All the timings given below were 'hot cache' timings - see 'speed' for the explanation of this phrase.

[As discussed in 'eucnotes', repeated multiplication of an element of the multiplicative group by a generator will 'generate' the whole of the multiplicative group. (It is easy to find a generator for a Sophie Germain prime.) The universe will not survive long enough to generate the multiplicative group of a large prime in this way, but I hoped that the elements generated would be adequately random within the group. This hope was not quite realized. Merely by changing the starting element and doing another run, the average speed changed. Nevertheless, the variations were only the odd % and sufficiently small for this document.]

For my own purposes, primes of about 512 bits (16# 32-bit words) are typical. Hence :-

//  A 512 bit Sophie Germain Prime = 
  ULONG P[16] = {0x815A0082, 0x2AE60442, 0x2D881BCD, 0x44BB190F, 
            0x59A47996, 0x232C38DE, 0x0DF3595F, 0x483C0550, 
            0x15246861, 0x57BF61D7, 0x69EF7AD1, 0x1C1636A1, 
            0x79EE6B75, 0x762C67CB, 0x39BB4D68, 0x54B0BA0F} ; 
//  A generator of the multiplicative group for the prime above :- 
  ULONG gen[16] = {0x464B5C29, 0x0E9633FE, 0x0F8B55EB, 0x3CC93EDC, 
             0x48392C18, 0x6F3221D8, 0x17043627, 0x58EE14F4, 
             0x44E7529F, 0x4FCB0064, 0x64C3226C, 0x3B04317A, 
             0x303B4A62, 0x4D3E23D1, 0x15D5670F, 0x34FD06F3} ;

Timing Results

Once I got get_Inv() functionally correct, I started taking some timings. Initially it took a little over 80,000 clock ticks on my Pentium P3. strip_30() accounted for about 20,000 of these. Hence the updating of M[] and the input arrays was taking about 3/4 of the total run time. The initial module used a general purpose word-times-array module that was not written specifically to accompany get_Inv(). I managed to reduce this run time by about 25% by writing a custom double (word-times-array) module with both calculations proceeding simultaneously.

This gave the following final timings for get_Inv() :-

Processor	Timing (clock ticks)
Pentium P3	59170
AMD Athlon	49980
AMD A8	33760

On the P3, the final ratio of the running times for strip_30() versus that for updating the various arrays was about 20:40 thousand clock ticks. Extending these values to, say a 1024 bit prime (double the size of the case above) indicates :-
(a) The time spent in strip_30() should double.
(b) The time spent updating the arrays should quadruple (follow an n² law).

Hence the estimated (P3) time would be 40+160 (thousand clock ticks) = 200,000 ticks. Very obviously, as the prime size increases, the run time will become more and more dominated by the 'updating array' part of the work; and this focuses attention on this side of the problem.

The easiest way to improve this situation was to write a 64-bit version. When this was done, the total time spent in a strip_62() module was (roughly) unchanged (each call produces an m[] array with double sized elements, but requires double the time to find them). However the 'updating array' part of the work comes down by a factor of 4 (only half as many 'updates' being required, and these proceed roughly twice as quickly (array word sizes being halved)). Hence the P3 time should come down to
(20 + 40/4) = 30k ticks i.e. a speed gain of 2.
The results below show that this is roughly correct.

[The P3 is, of course, incapable of 64-bit arithmetic, but the guesstimate remains valid for processors that can.

The rule of thumb, just discussed, can be used to estimate run times for other problem sizes by using the data in the table below as an anchor point.]

64-bit code version timings

Processor	Timing (clock ticks)
AMD Athlon	27880
AMD A8	16614

I believe that these timings are the fastest in the world.
If you know differently, let me know.
The software that produced these timings
is available for sale.
See sales.html for details.

go to site Home page

The World's Fastest 'bignum' Multiplicative Inverse Finder.

Timings for get_Inv()

Timing Results

64-bit code version timings

The World's Fastest 'bignum'
Multiplicative Inverse Finder.