Notes on Schönhage's algorithm
for polynomial GCD's

[This document is based on Aho, Hopcroft and Ullman's book 'Ref 13' (link below).
['Ref 3' (Section 9.4.3) attributes the algorithm of Ref. 13 Section 8.9, discussed below, to Schönhage; but, below, I shall call it Aho's algorithm (mainly because Aho is easier to type! but also because everything below is derived from Ref. 13).]
[Both 'Ref 13' and 'Ref 3' give an adaptation of the algorithm below for use with integers.]

Any reader of this document, hoping to find a method to speed up the finding of polynomial 'greatest common divisors' (GCD) is likely to be disappointed. Such a reader might want to skip to the conclusions section at the bottom of this document; after which they will probably want to search elsewhere.

[Several other documents on this site are referred to repeatedly within. All the links to these documents are given here; return here if you want to view the contents.
eucnotes.html - This introduces some of the ideas and nomenclature related to the Euclidean algorithm which overlaps this document.
[A 32 bit sized multiplicative inverse finder is presented within which I believe is the fastest in the world. This listing can be easily extended to cover 128 bit inverses using 64 bit code. (I have 64 bit code modules to do this, available for sale.)]
fastmuli.html - This extends 'eucnotes' (which relates to processor word sized arguments) to the multi-word problem. The 64 bit version of this code I believe to be the fastest multi-word inverse finder in the world. This binary is for sale.
euclogic.html - This introduces the logical foundations of the Euclidean algorithm when applied to integers. When polynomials are in play (as below) only parts of this document are relevant.
speed.html - Introduces the basics of writing high speed code for Pentium processors.
matnotes.html - a 'crib' sheet for such matrix theory as used within.
[Caution :- 'matnotes' applies to 2-by-2 matrices with integer elements; below the matrices have elements of polynomials with integer coefficients, and some of the results do not apply. However comments relating to matrix multiplication do apply.]
refs.html - The technical references used on this site.
Below I shall generally write e.g. 'Ref 11' or 'eucnotes' (single quotes) when referring to any of the files above.]

The division of polynomials is only possible (without restriction - i.e. all polynomials divide all other polynomials of greater degree) when the polynomial coefficients are from a Field. Fields are arithmetical constructs which support all 4 basic arithmetical operations {+,-,*,/} without restriction. The common Fields of interest are the Real Numbers, the Complex Numbers, the Rational Numbers (called fractions in a school room) and Finite Fields.
    [Note that division by 'zero' is a problem for all these Fields - a problem solved by excluding zero from the division operation.]
    Both Real and Complex numbers cease to satisfy the Field axioms when implemented on a computer. Round off errors intrude, although these may be small enough to be insignificant.
    The Rationals are not pestered by round off, but the growth of numerator and denominator terms can be explosive as calculations proceed, leading to overflow problems.
    Finite Fields are the only Fields that can be implemented on a computer without worries about either overflow or loss of accuracy from the vagaries of computer number representation.

Finite Fields come in many flavors; the simplest uses integer arithmetic modulo a prime number (P below). Only Finite Fields of this type are considered in this document.
The modulo (or mod) operation can be considered the remainder after division;

thus 13 mod 7 = 6.

Clearly, the only possible outcome from any calculation modulo a prime is a member of set {0, 1, ... (P-1)} (these numbers are known as the members of the Field; and their number (the order of the Field) is finite - hence Finite Field). Thus multiplication, addition and subtraction proceed in a staightforward way.

Division is more subtle. This depends upon the ability to find a multiplicative inverse i.e. a solution to the equation

(a * b) = 1 (mod P).
Then if some calculation requires
the division by 'b',
we can replace it by multiplication by 'a';
since a = (1/b) (mod P).

This definition of division in a Finite Field does not correspond to the commonplace division of numbers as taught in a school room. Nevertheless it does satisfy the axiom that multiplication following division returns you to the start values.
[see 'eucnotes' and 'fastmuli' for an in depth discussion of multiplicative inverses, when they exist, and how to find them.]

Aho's algorithm is recursive. This makes it difficult to debug - you will probably have to resort to the old fashioned method of writing a log file of algorithm progress to do it. This means that it will take several times longer to get bug free than you might imagine.
(I provide a worked example below as a debugging aid)

My target task for this work was to find polynomial GCD's for degree 256 polynomials with coefficients from a 512 bit sized Finite Field. To make any progress with Aho's algorithm requires vigorously tested software to handle both the arithmetic of Finite Fields and the arithmetic of polynomials with Finite Field coefficients. These 2 suites might take a handful of man years to write. If you do not have them it would be unwise to proceed.
Despite the previous paragraph, difficulties with Aho's algorithm are largely 'flow' problems - and this being the case tiny Field primes can be used to get it working. Below I use tiny P-values for debugging work confident that moving to 512 bit primes will cause no further problems. I supply multiplication and addition tables, below, which make hand calculation (just about) possible.

All number theory textbooks give polynomial GCD algorithms; Algorithm 3.2.1 of 'Ref 10' is typical :-

Standard algorithm to find the GCD of 2 polynomials over a Field.

Given 2 polynomials A, B over a Field K, their GCD can be found :-

   1. [Finished?] If B=0, then output A as the answer,
                   and terminate the algorithm.

   2. [Euclidean Step] Divide A by B to find Q the 'quotient'
                   and R the 'remainder'
      Equivalently, we can write  A = B.Q + R 
                  (with DEG(R) less than DEG(B)).

      Copy B into A; R into B; go to step 1.

[The theoretical justification for this algorithm is analogous to the discussion of Euclid's algorithm for integers in 'euclogic'.]
    Note that the division step in line 2 will require the finding of a Field multiplicative inverse, and the best way to speed up Algorithm 3.2.1, is to speed this up.
[see 'fastmuli' for guidance.]
    Other parts of this algorithm that can yield increased speed are the polynomial quotient finder and the polynomial multiplier (to find B.Q). The copying operations, of the last line, can be removed by using pointers, then swapping them.
    You will need a standard module, such as above, to provide accuracy and speed checks for comparison to Aho's method.

In 'eucnotes' I include a discussion of the rate of convergence of Euclid's algorithm for integers. The rate of convergence for polynomials over a Field is much simpler - for Fields of significant size (100's of bits) Q will almost always be of degree 1, and the degree of A and B will almost always reduce by 2, leapfrogging each other, in each pass.
Put another way, usually the degree of R will be 1 less than B, which will be 1 less than A. This simple rate of convergence makes the extrapolation of run times straightforward.
Below, I shall use tiny Field primes as an aid to understanding Aho's algorithm; for these, Q's of degree 2 or more will be more frequent.

The fact that Q has a degree of 1 (nearly always) provides the germ of the idea behind Aho's algorithm. To find such a quotient only 3 terms from the numerator and 2 terms from the denominator are needed (assuming they differ by 1 in degree) - provided the evaluation of the Remainder R can be held off.
    If we set up a recursive subroutine that splits the input polynomial into 2 and passes the most significant half down to the next recursion level, eventually the number of polynomial terms will be reduced sufficiently to still find Q without error, simultaneously with finding a truncated R.
The missing part of R cannot be ignored, and when the subroutine returns to a higher recursion level, information has to be passed back to enable it to be recovered.
    Aho's method uses a variant of the 2-by-2 matrix technique discussed in 'eucnotes' to do this. (In this case the matrix entries are polynomials rather than integers).
    The next higher recursion level (which operates with longer input polynomial portions) uses the returned 2-by-2 matrix to obtain a longer portion of R; use this to find another quotient; incorporate this into the 2-by-2 matrix and return it to a yet higher recursion level ...
    When the recursive section is exited, the 2-by-2 matrix can be applied to the input polynomials resulting in polynomials which are at least 3/4 (probably less) of their initial degrees, but leave the GCD unchanged. Significant progress having been made, the recursive section can be entered again ...
    The 'master of ceremonies' routine Aho calls GCD() and the recursive routine HGCD() - for half GCD. Since most people will not have a copy of Aho's book to hand I reproduce them next. There are potential problems with Aho's listings which I cover immediately after them. Following that I give a C++ style pseudo code outline for the same routines which remove these problems.

Aho's Algorithm

[Aho calls the language used below 'pidgin ALGOL'! Mercy!]

[a0, a1, b0, b1, c, c0, c1, d, e, q, f, g0, h0, g1, h1 are all polynomials.
R and S are 2-by-2 matrices of polynomials.]

[Procedure GCD() returns a polynomial that is the GCD of the input polynomials;
Procedure HGCD() returns a 2-by-2 matrix of polynomials (R)
used by GCD() to reduce the degree of the input polynomials.]

[Below, column arrays are written in rows delineated by curly brackets.
This is for ease of writing matrices in HTML.
Thus {b0, b1} indicates a column array with 2 rows,
b0 at the top and b1 at the bottom.
Similarly square arrays are written as rows, copying the top row 1st,
and the bottom row 2nd.
Thus an identity array is written as {1, 0, 0, 1}
Further notes on matrices can be found in 'matnotes']

Procedure GCD (a0, a1)

(G1) if a1 divides a0 then return a1
       else
          begin
(G2)         R = HGCD (a0, a1)
(G3)         {b0, b1} = R * {a0, a1}
(G4)         if b1 divides b0 then return b1

             else
                begin
(G5)               c = b0 modulo b1
(G6)               return GCD (b1, c)
                end
          end

    |-------------------------------------|

Procedure HGCD (a0, a1)
(DEG(a1) has to be less than DEG(a0) here)

(H1) if DEG(a1) less than or equal DEG(a0)/2 return {1,0,0,1}
     else
        begin
(H2)       let a0 = b0.x^m + c0  
               // where m = Floor (DEG(a0)/2)
               // AND DEG(c0) is less than m
(H3)       let a1 = b1.x^m + c1
               // where DEG(c1) is less than m
               // COMMENTS :-
               //   b0, and b1 are the leading terms of
               //   a0 and a1.
               //   We have DEG(b0) = Ceil(DEG(a0)/2) and
               //    DEG(b0) - DEG(b1) = DEG(a0) - DEG(a1)
(H4)       R = HGCD (b0, b1)
(H5)       {d, e} = R * {a0, a1}
(H6)       f = d modulo e
               // COMMENT :-
               //   e and f are are successive terms in
               //   the remainder sequence, of degree at most
               //   Ceil((3.m)/2), that is 3/4 DEG(a0).
(H7)       let e = g0.x^Floor(m/2) + h0 
               // where DEG(h0) is less than Floor(m/2)
(H8)       let f = g1.x^Floor(m/2) + h1 
               // where DEG(h1) is less than Floor(m/2)
               // COMMENT :-
               //   g0 and g1 are each of degree (m+1) at most
(H9)       S = HGCD (g0, g1)
(H10)      q = Floor(d/e)
(H11)      return  (S * {0, 1, 1, -q} * R)
        end
      |-------------------------------------|

A point that might surprise the unwary are the following facts about polynomials with Finite Field coefficients :-

[If 2 polynomials of DEG m and n are multiplied together, the product will have DEG (m + n).]

Given 2 of polynomials of DEG m and n;
with a product, say, Prod1;
and a second pair of DEG p and q
(with m + n = p + q)
with a product, say, Prod2.

If Prod1 is added to Prod2,
the end result might have a degree anywhere
between 'empty' and (m + n) = (p + q).

The problem is with the addition :-
if 2 coefficients of the same order
happen to add up to the Field prime (P),
then after the 'mod' operation
the final coefficient will be zero.
This effect might occur in lines
G3, G5, H5 or H6 in the listings above.
Of particular concern is line H5,
where 'e' just might go to 'empty'.
If this were to occur,
lines H6 and H10 have no sensible outcomes,
and your code would hit a brick wall.
The chances of a malignant outcome depend
inversely upon the size of
'P' - if P is large (100's of bits)
it will only occur once in a blue moon;
if P is tiny it will occur regularly.

However, for a logically rigorous module, the listings above need defensive measures to be inserted when polynomials with Finite Field coefficients are being considered.

In the pseudo code given next, a class I call Fp_Poly is used. Each instance of this class holds a polynomial with Finite Field coefficients.
The salient features of the class are :-
    (a) The coefficients are held in memory slots large enough to hold 32 contiguous coefficients. This feature helps to speed up polynomial copying (it is an example of 'pooling' - see 'speed').
    (b) An 'int' parameter cur_deg holds the current degree of the polynomial at any instant. This parameter has to be checked and reset after any operation that might change it. It takes a value of -1 when the polynomial is empty; but the coefficients are not deleted if this should occur - this is too slow. If more coefficients are needed than are available, the dynamic memory allocator 'new' is used. This is slow, (at least in older versions of 'Windows') so it is important to initialize polynomial sizes on the conservative side to avoid this possibility (see 'speed' again).
[Line G3 is a 'crunch point' for this. Each product polynomial (during the matrix product) needs to have about 50% more coefficients than the input polynomials. When these products are added, a swathe of the most significant terms go to zero - if the algorithm is working correctly! This observation also provides a speed up technique - don't evaluate the high order product terms that will end up zero.]
    (c) A method part_Copyin (source, m) is used to split the polynomials. Only the terms from order 'm' up to 'cur_deg' are copied with the destination cur_deg set to source->cur_deg - m.

The class I wrote to implement Aho's method I called HGCD, and the methods corresponding to GCD() and HGCD() I called get_GCD() and get_HGCD(); the latter being private and the former public. get_HGCD() being recursive and needing several polynomials to hold intermediate values, the following scheme was adopted to get the best speed :-
    (a) 2 recursion depth counters (class variables) were used - Hdepth (for get_HGCD()) and Gdepth (for get_GCD()). These are incremented on entry, and decremented on exit from their respective subroutines. The temporary polynomials are set up at constructor time in arrays and then accessed using Hdepth and Gdepth. This requires a 'maximum input degree parameter' being passed to the constructor so that the maximum recursion depth can be found.
[In the example below they are useful for indicating the recursion state, as work proceeds.]
Other points :-
    (b) Rather than 'return' a 2_by_2 polynomial matrix from HGCD() (this is a clumsy use of the stack), get_HGCD() gets passed a pointer to a 2_by_2 matrix which the called routine fills in. If this matrix is an Identity matrix, there is no point in filling it in and the return value from get_HGCD() is IsIDENTITY. (#defined as TRUE in the dot-h file) or IsNotIDENTITY (#defined as FALSE) is used as a shortcut.
    (c) FpX_2x2 is a structure holding a 2_by_2 matrix of polynomials.
    (d) Variable names are unchanged from above as far as possible, but they are pointers.

The following is an outline of get_GCD() and get_HGCD() :-

   /****************/
void
HGCD::get_GCD (Fp_Poly *a0, Fp_Poly *a1, Fp_Poly *GCD, BOOL firstpass)
  // Here a0, a1 are the non-empty 'input' polys, and GCD the required output.
  // 'firstpass' must always be TRUE on the first call; subsequently FALSE.
{
   Fp_Poly *b0, *b1 ;
   BOOL isI ;       // an indicator that a 2_by_2 matrix
                    //  is the Identity array.
   if (firstpass) Gdepth = Hdepth = 0 ;
   if (a1->cur_deg > a0->cur_deg) swap a0 and a1 ;
   if cur_degs are equal set a1 = a1 Mod a0 ;

   while (cur_degs differ by 2 or more AND a1 is not 'empty') {
      set a0 = a0 Mod a1 ;
      swap a0, a1 ;
   }
    // above assumes the Mod() call is faster than calling get_HGCD()
    //  when the difference in degree's is greater than 1.
   if a1 is 'empty', copy a0 into GCD; make it monic; return ;
   if a1 divides a0, copy a1 into GCD; make it monic; return ;

   isI = get_HGCD (a0, a1, &R) ; // R is a class FpX_2x2 structure
                                 //  (holds the final array)
   if (!isI) {
      b0, b1 set from class array of temporaries using Gdepth
      set [b0, b1] = R * [a0, a1] ;
      check whether either b0 or b1 is empty.
      If either is empty {
         copy the other to GCD;
         make it monic;
         return ;
      }
      check b1->cur_deg >= b0->cur_deg; if not swap
        (reversing order here - swapping
          back below)
   } else {
      b1 = a0; b0 = a1;   // no progress from 1st get_HGCD() call 
                          //  - Mod() call in do-while
                          //   (next) prevents repeat.
   }
    // here b1->cur_deg >= b0->cur_deg
   do {
      swap b0, b1 // now b0->cur_deg >= b1->cur_deg
      set b0 = b0 Mod b1   // now b0->cur_deg less than b1->cur_deg
   } while ((abs(b0->cur_deg - b1->cur_deg) >= 2) AND (b0->cur_deg > 0)) ;
     // here b0->cur_deg < b1->cur_deg
   if (b0->cur_deg less than or equal 0)
      copy b1 to GCD;
      make it monic;
      return;
       // Note if b0 is reduced to a 'constant' (cur_deg = 0)
       // work is finished.
   }
   Gdepth++ ;
   get_GCD (b1, b0, OP, FALSE) ; // degree's reversed after above.
   Gdepth-- ;
   return ;
}
   /*****************/
HGCD::get_HGCD (Fp_Poly *a0, Fp_Poly *a1, FpX_2x2 *R_or_S)
{
   int m, n ;
   BOOL retflag, isI ;
   ULONG flag ;
   Fp_Poly *bg0, *bg1, *d, *e, *f, *q ;
   FpX_2x2 *S ;

   if ((a1->cur_deg*2) less than or equal a0->cur_deg) {
      return IsIDENTITY ;
   }
   m = Floor (a0->cur_deg, 2) ;
   n = Floor (m, 2) ;      // using n = Floor(m/2)
                           //  for lines H7, H8 of listing
   get bg0, bg1 from pool of temporaries using Hdepth
   bg0->part_Copyin (a0, m) ; // copy most significant half 
                              // of a0, a1 into bg0, bg1
   bg1->part_Copyin (a1, m) ;
   if ((bg1->cur_deg * 2) less than or equal bg0->cur_deg) { 
       // short circuit early return from get_HGCD()
      isI = TRUE ;
   } else {
      Hdepth++ ;
      isI = get_HGCD (bg0, bg1, R_or_S) ;
      Hdepth-- ;
      get f, q from pool of temporaries using Hdepth
      if (isI) {
         d = bg0 ; // just copying pointers
         e = bg1 ;
      } else {
          // R_or_S is set AND not Identity here
         get d, e from pool of temporaries using Hdepth
         set [d, e] = R_or_S * [a0, a1]
         if e is empty {
            return IsNotIDENTITY 
             // cannot proceed - 
             //  return to higher recursion level
         }
         set flag, q to *d/*e
         set *f to remainder.
          // My method returns a ULONG into variable flag.
          // It is ALLOK if 'q' is not zero,
          //      !ALLOK if q is zero)
         if (flag == ALLOK) { // ie q non-zero
            retflag = IsNotIDENTITY ;
         } else {
            retflag = isI ;
         }
         if (isI) {
            if (flag == ALLOK) { 
                // flag is TRUE when q non-zero
               R_or_S set to {0, 1, 1, -q}
            } else {
               R_or_S set to {0, 1, 1, 0}
            }
         } else {
            if (flag == ALLOK) {
               set R_or_S to {0, 1, 1, -q} * R_or_S ;
            } else {
                // if q zero, R_or_S needs row swap
                // (premult by {0, 1, 1, 0})
               row swap R_or_S
            }
         }
          // Here I check whether next call to get_HGCD()
          //  is going to exit early
         if (early exit from get_HGCD() will not be taken) {
              // saving copying + a call to get_HGCD()
            retflag = IsNotIDENTITY ;
            bg0->part_Copyin (e, n) ;
            bg1->part_Copyin (f, n) ;
              // f = d Mod e; hence 
              //  bg0->cur_deg greater than bg1->cur_deg
            get S from pool of temporaries using Hdepth
            Hdepth++ ;
            get_HGCD (bg0, bg1, S) ;
              // retflag cannot be IsIDENTITY here
            Hdepth-- ;
            set R_or_S = S * R_or_S
         }
         return retflag ;
      }
/**************/

The modules above are tricky to understand. Hopefully a worked example will be of assistance.

To keep things simple, I shall use P=13 as the Field prime. A Field prime this small enables full multiplication and addition lookup tables to be easily displayed. These are :-

MULT	1	2	3	4	5	6	7	8	9	10	11	12
1	1	2	3	4	5	6	7	8	9	10	11	12
2	2	4	6	8	10	12	1	3	5	7	9	11
3	3	6	9	12	2	5	8	11	1	4	7	10
4	4	8	12	3	7	11	2	6	10	1	5	9
5	5	10	2	7	12	4	9	1	6	11	3	8
6	6	12	5	11	4	10	3	9	2	8	1	7
7	7	1	8	2	9	3	10	4	11	5	12	6
8	8	3	11	6	1	9	4	12	7	2	10	5
9	9	5	1	10	6	2	11	7	3	12	8	4
10	10	7	4	1	11	8	5	2	12	9	6	3
11	11	9	7	5	3	1	12	10	8	6	4	2
12	12	11	10	9	8	7	6	5	4	3	2	1

ADD	0	1	2	3	4	5	6	7	8	9	10	11	12
0	0	1	2	3	4	5	6	7	8	9	10	11	12
1	1	2	3	4	5	6	7	8	9	10	11	12	0
2	2	3	4	5	6	7	8	9	10	11	12	0	1
3	3	4	5	6	7	8	9	10	11	12	0	1	2
4	4	5	6	7	8	9	10	11	12	0	1	2	3
5	5	6	7	8	9	10	11	12	0	1	2	3	4
6	6	7	8	9	10	11	12	0	1	2	3	4	5
7	7	8	9	10	11	12	0	1	2	3	4	5	6
8	8	9	10	11	12	0	1	2	3	4	5	6	7
9	9	10	11	12	0	1	2	3	4	5	6	7	8
10	10	11	12	0	1	2	3	4	5	6	7	8	9
11	11	12	0	1	2	3	4	5	6	7	8	9	10
12	12	0	1	2	3	4	5	6	7	8	9	10	11

[Tables, like the above, are known as Cayley Tables, and they can be used to demonstrate many theorems in Group Theory, but that is a distraction here. See 'Ref 14' for more on Group Theory.
(A. Cayley 1821-1895)]

To use the tables above to find e.g. (8 * 7) :-
(i) move down the rows of the 'MULT' table to row '8'.
(ii) move across to the column headed by '7'
(iii) and the result, at the intersection, is '4'.
Alternatively we could calculate :-
8 * 7 = 56 and 56 mod 13 = 56 - (4*13) = 4.

Also available from the 'MULT' table are
multiplicative inverses.
Suppose the inverse of '8' is wanted.
(i) move down the rows of the 'MULT' table to row '8'
(ii) move along this row to the entry '1'
(iii) move up that column to the column header '5'.
Hence 5 is the multiplicative inverse of 8 (mod 13).
To check this (8 * 5) = 40,
and 40 mod 13 = 40 - 3*13 = 1.

The additive table is used in a similar way.

Subtraction can be done by adding the negative -
hence (3 - 7) can be found by finding
-7 = (P-7) = (13-7) = 6;
then adding 3 + 6 = 9.

The paragraph above can be extended to the addition, subtraction multiplication and division of polynomials. Addition and subtraction are done by adding or subtracting like powers of 'x'.

Consider 3.x + 7 times 9.x² + 7.x + 5.
Finding 3*9 = 1; 3*7 = 8; 3*5 = 2 we get
3.x*(9.x² + 7.x + 5)
= 1.x³ + 8.x² + 2.x.

Similarly 7*(9.x² + 7.x + 5)
= 11.x² + 10.x + 9.

Adding these partial results gives :-
x³ + (8+11).x² + (2+10).x + 9
= x³ + 6.x² + 12.x + 9

Division is not much more complicated :-
Consider dividing x³ + 6.x² + 12.x + 9 by 3.x + 7.

The multiplicative inverse of the highest power of
the denominator will be required.
This term is 3 with an inverse of '9'.
Hence to get the first term of the quotient
we divide 3.x into x³ giving 9.x².

Multiplying 3.x + 7 by 9.x²
gives 1.x³ + 11.x².

Subtracting this from the numerator
gives a reduced numerator of
0.x³ + (6-11).x² + 12.x + 9
= 8.x² + 12.x + 9.

Dividing 8.x² by 3.x
gives the next term of the quotient
of (9*8).x = 7.x.

Multiplying this by 3.x + 7
gives 8.x² + 10.x.

Subtracting this from the reduced numerator
leaves 2.x + 9.
Dividing 2.x by 3.x gives (9*2) = 5.

Hence the final quotient is
9.x² + 7.x + 5.
and the final remainder is 2.x + 9 - 5*(3.x + 7) = 0.

------------

Having demonstrated the principles of Finite Field arithmetic, I shall now present a demonstration of Aho's algorithm :-
The following apply to the displayed calculation :-

(a) Polynomials are displayed without an appendage of 'xⁿ', except for the first entry.
Thus x² + 3.x + 5 is displayed as 1.x², 3, 5.

(b) In get_HGCD() the work is done in 3 stages :-
   stage 1 :- a 1st recursive call to get_HGCD().
   stage 2 :- calculation of a quotient term.
   stage 3 :- a 2nd recursive call to get_HGCD().
    These are the definition of stage 1, 2 and 3 used below.

(c) The FpX_2x2 are printed as e.g. R.p11 = .... The numbers after the lower case 'p' indicate the row - then - column position of the polynomial.

                #----------------#
ALL POLY COEFFICIENTS BELOW ARE IN HEXADECIMAL.

Two mutually prime polynomials over P=13 :-
   IP1 = 1.x^12, 1, 1, 1, 9, 3, C, B, 0, 5, A, 8, 7.
   IP2 = 1.x^12, 1, 9, 6, B, B, 6, 9, 5, 2, 8, C, B.

Multiplying both by factor (=GCD sought) = 1.x, 8.
 IP1*factor = 1.x^13, 9, 9, 9, 4, A, A, 3, A, 5, B, A, 6, 4.
 IP2*factor = 1.x^13, 9, 4, 0, 7, 8, 3, 5, C, 3, B, B, 3, A.

            +++++++++

### Entry of get_GCD()
Gdepth=0 (after 'firstpass' block).
a0 = 8.x^11, 4, 3, B, 6, 2, 2, B, 0, 1, A, 6.
a1 = C.x^10, 8, 0, 8, B, 9, 8, 3, 6, A, 6.
###
     # Hdepth=0 START of get_HGCD() # After splitting,
             working polys are :-
   bg0 = 8.x^6, 4, 3, B, 6, 2, 2.
   bg1 = C.x^5, 8, 0, 8, B, 9.
   ---Stage 1 - entering get_HGCD().

        # Hdepth=1 START of get_HGCD() # After splitting,
             working polys are :-
      bg0 = 8.x^3, 4, 3, B.
      bg1 = C.x^2, 8, 0.
      ---Stage 1 - entering get_HGCD().
           # Hdepth=2 START of get_HGCD() # After splitting,
             working polys are :-
         bg0 = 8.x^2, 4, 3.
         bg1 = C.x, 8.
         ---Stage 1 - entering get_HGCD(). Returned array is Identity.
         ---Stage 2 - After division q = 5.x, A. Remainder = 1.
         After incorporation of q, output array =
         R.p11 = Empty. R.p12 = 1.
         R.p21 = 1.     R.p22 = 8.x, 3.
         ---Stage 3 returns Identity, Returned array unchanged.
           # Hdepth=2 END

         ---Stage 2 (cont) - calculation of q.
         Hdepth = 1. After use of returned array :-
         d = C.x^5, 8, 0, 8, B, 9.
         e = 1.x^4, A, 1, 3, 3.
         After division q = C.x, 5. Remainder = 3.x^3, 6, C, 7.
         After incorporation of q, output array =
         R.p11 = 1.      R.p12 = 8.x, 3.
         R.p21 = 1.x, 8. R.p22 = 8.x^2, 2, C.
         ---Stage 3 - 2nd get_HGCD() call.
         Polys for next get_HGCD()
         bg0 = 1.x^3, A, 1, 3.
         bg1 = 3.x^2, 6, C.

              # Hdepth=2 START of get_HGCD() # After splitting,
                   working polys are :-
            bg0 = 1.x^2, A, 1.
            bg1 = 3.x, 6.
            ---Stage 1 - entering get_HGCD(). Returned array is Identity.
            ---Stage 2 - After division q = 9.x, 7. Remainder = B.
            After incorporation of q, output array =
            R.p11 = Empty. R.p12 = 1.
            R.p21 = 1.     R.p22 = 4.x, 6.
            ---Stage 3 returns Identity, Returned array unchanged.
              # Hdepth=2 END
         Hdepth=1 (cont); Final returned array :-
         R.p11 = 1.x, 8.      R.p12 = 8.x^2, 2, C.
         R.p21 = 4.x^2, C, A. R.p22 = 6.x^3, 4, 3, A.
           # Hdepth=1 END

      ---Stage 2 - calculation of q.
      Hdepth = 0 (cont). After use of returned array :-
      d = 3.x^8, 6, B, 6, 4, C, 2, A, 3.
      e = 3.x^7, 2, 2, 1, 7, 8, 4, 3.
      After division q = 1.x, A. Remainder = 2.x^6, B, 0, C, 9, 6, C.
      After incorporation of q, output array =
      R.p11 = 4.x^2, C, A.    R.p12 = 6.x^3, 4, 3, A.
      R.p21 = 9.x^3, 0, 1, C. R.p22 = 7.x^4, 1, 4, 1, 3.
      ---Stage 3 - 2nd get_HGCD() call.
      Polys for next get_HGCD()
      bg0 = 3.x^5, 2, 2, 1, 7, 8.
      bg1 = 2.x^4, B, 0, C, 9.

           # Hdepth=1 START of get_HGCD() # After splitting,
                working polys are :-
         bg0 = 3.x^3, 2, 2, 1.
         bg1 = 2.x^2, B, 0.
         ---Stage 1 - entering get_HGCD().

              # Hdepth=2 START of get_HGCD() # After splitting,
                   working polys are :-
            bg0 = 3.x^2, 2, 2.
            bg1 = 2.x, B.
            ---Stage 1 - entering get_HGCD(). Returned array is Identity.
            ---Stage 2 - After division q = 8.x, 9. Remainder = 7.
            After incorporation of q, output array =
            R.p11 = Empty. R.p12 = 1.
            R.p21 = 1.     R.p22 = 5.x, 4.
            ---Stage 3 returns Identity, Returned array unchanged.
              # Hdepth=2 END

         ---Stage 2 - calculation of q.
         Hdepth = 1 (cont). After use of returned array :-
         d = 2.x^4, B, 0, C, 9.
         e = 7.x^3, 9, 9, 5.
         After division q = 4.x, 2. Remainder = B.x^2, 0, C.
         After incorporation of q, output array =
         R.p11 = 1.      R.p12 = 5.x, 4.
         R.p21 = 9.x, B. R.p22 = 6.x^2, 0, 6.
         ---Stage 3 returns Identity, Returned array unchanged.
           # Hdepth=1 END

      Hdepth=0 (cont); Final returned array :-
      R.p11 = 6.x^4, A, 9, B, 6.    R.p12 = 9.x^5, 7, 4, C, 9, 9.
      R.p21 = 2.x^5, 0, 5, 3, 7, 0. R.p22 = 3.x^6, 6, 3, A, 9, C, B.
        # Hdepth=0 END

   After return to get_GCD(), IP polys reduced to :-.
   b0 = 7.x^5, 9, 9, 9, A, C.
   b1 = B.x^4, A, 3, 3, 1.
     ++

   ### Entry of get_GCD() (after Mod() call) Gdepth=1.
   a0 = B.x^4, A, 3, 3, 1.
   a1 = C.x^3, 1, 8, 8.
   ###
     # Hdepth=0 START of get_HGCD() # After splitting,
          working polys are :-
     bg0 = B.x^2, A, 3.
     bg1 = C.x, 1.
     ---Stage 1 - entering get_HGCD(). Returned array is Identity.
     ---Stage 2 - calculation of q.
     After division q = 2.x, 5. Remainder = B.
     After incorporation of q, output array =
     R.p11 = Empty. R.p12 = 1.
     R.p21 = 1.     R.p22 = B.x, 8.
     ---Stage 3 returns Identity, Returned array unchanged.
       # Hdepth=0 END
  After return to get_GCD(), IP polys reduced to :-.
  b0 = C.x^3, 1, 8, 8.
  b1 = 8.x^2, C, 0.
    ++

  ### Entry of get_GCD() Gdepth=2.
  a0 = 8.x^2, C, 0.
  a1 = 1.x, 8.
  ### in get_GCD()[a1 divides a0];
   a1 = GCD = 1.x, 8.
                #----------------#

The question now is "How fast is it?"
My test conditions were :-
    (a) The Field prime used was the 512 bit Sophie Germain prime given at the bottom of 'fastmuli'.
    (b) The GCD is always of degree 1. This was achieved (as above) by finding two mutually prime polynomials of suitable degree, then multiplying them by a degree 1 polynomial (the GCD).
    (c) All timings are 'hot cache', on a Pentium 3 and in clock ticks. (see 'speed' for explanations)

Timings for the 2 algorithms

IP Poly degree	Standard (e.g. 3.2.1)	Aho's method	Speed ratio
63	0x14E78B6	0x41C631A	3.15
127	0x4927467	0xF2A89B0	3.31
252	0x10D0811A	0x383B0769	3.34

I have to say that my implementation of algorithm 3.2.1 of Ref 10, has received a lot of tuning and is close to being as fast as I can make it. My implementation of Aho's method has not received anything like the same effort. In particular 3 significant speedup ideas suggest themselves :-
    (a) As mentioned above, The array products in lines G3 and H5 result in many terms that become zero being calculated. This observation leads to an obvious response - don't calculate them. If this were done, about 40-50% of the Field multiplies in G3 and H5 could be skipped. [check out the returned array and the reduced input polynomials in the worked example to see this effect.]
    (b) Observation of get_HGCD() on the debugger (and the worked example above) indicates that a common outcome, in the deepest recursion level, is stage 1 - return Identity; stage 2 - find 'q'; stage 3 - return Identity. If get_HGCD() were modified to test for this possibility and then provide the appropriate response, a fair amount of bureaucratic work would be bypassed.
    (c) A good deal of time is wasted in calls to part_Copyin (source, m). In 'speed' I make the point that all forms of copying are likely to be time wasters, and this method is no exception. It would be possible to strip out this copying and replace it with indices into the Fp_Poly coefficients. Unfortunately, the rest of Fp_Poly is not set up to do this; to put this in place would require a significant rewrite.
    The 3 problem areas above (plus some lesser ones) amount to a significant drag on the speed of the Aho implementation - possibly a factor of 2. But I don't think a factor of 3.3 can be overcome. Hence, I have stopped development.

The timings in the table above should not be taken as 'the last word' on speed. As I write this, I am still involved with tuning my Fp_Poly class for speed, and as improvements come along, so the table entries will drop.

CONCLUSIONS

For my work, at least, Aho's method is of no value. The method is slow for polynomials of a size that concern me. Not only is it slow, but it does not converge towards the 'standard method' even at the highest degrees that interest me (see the Speed Ratio column above).
When allied to the fact that it is hard to debug, I consider it a doubtful prospect for any project.

[go to site Home page]

Notes on Schönhage's algorithm for polynomial GCD's

Aho's Algorithm

Timings for the 2 algorithms

CONCLUSIONS

Notes on Schönhage's algorithm
for polynomial GCD's