[By 'original form' I mean the form that Toom and Cook originally presented it. Recent developments have moved the method along; these developments differ in the choice of 'evaluation points' (explained below).]
One purpose of this document is to try to stop software writers wasting time on the Toom Cook algorithm! It's speed is poor on an x86 processor except when the input integers need several 100's of words to hold them.
[Use the link to 'msummary' (given just below) to view timings.]
20-30 yrs ago the method could deliver fair speed because it reduced the count of calls to the slow MUL opcode. Latterly, the latency of the MUL opcode on recent AMD processors has dropped to a dazzling 5 clock ticks, which means that the speed benefit from reducing calls to MUL is small. Simultaneously, Toom Cook has a heavy administrative work load, which text books simply ignore!
Unfortunately all the standard textbooks on this topic were written decades ago when hardware performance was very different; and today they badly mislead software writers on the topic of speed.
There is only one way to be sure of obtaining high speed modules today, and that is to conduct timing experiments and allow them to be your one true guide. The days of estimating algorithm speed by counting MUL opcodes are long gone.
[Some documents on this site are referred to repeatedly below. All the links to these documents are given here; return here if you want to view the contents.
zuras.html - Discusses the implementation of a Toom Cook variation developed in papers written by Mr Dan Zuras (1993). Zuras's method is the fastest I have implemented for a range of input sizes. However, one of the algorithms below is faster when input sizes are very large.
msummary.html - Presents an overview and timing results for the schoolroom method and all the Toom Cook methods I have implemented.
laglists.html - Lists Lagrangian coefficients used in stage [C] (see below) of the 'original' Toom Cook method to recover the multiword product.
[The assumptions built into these tables are covered below. Take care using these tables until these assumptions are fully understood.]
speed.html - Discusses the basics of writing high speed code for x86 processors. Some jargon is introduced in 'speed', which is used below.
refs.html - Lists the technical references used in this document.
Below I shall write e.g. 'Ref' 11 or 'laglists' (single quotes) when referring to these documents.]
The method is developed using polynomial theory. This does not rule out the multiplication of multiword integers - the standing problem in this document. Below, I cover the simple task of mapping multiword integers into (and back out from) polynomials, which makes the theory directly applicable to multiword integers.
--------
--------
The Toom Cook method is most easily explained by starting with the final product polynomial, then working backwards.
[From here down I assume that the input polynomials are of equal degree. This implies that the output polynomial degree (double the input degree) will be even. Call the degree of the input polynomials IPdeg and the output polynomial OPdeg.
This assumption is not part of Toom Cook; just a simplification for this document.
(The number of terms (IPterms and OPterms) is useful below.
Clearly, IPterms = IPdeg + 1, and OPterms = OPdeg + 1.)]
Consider a scenario where we are plotting polynomial curves on a sheet of graph paper.
To get a smooth curve implies that evaluations using 'x' values from the Reals will be needed, but only integer 'x' values are used in this document.
[An alternative to using integer 'x' values is to use 'fractions' as they are known in the schoolroom e.g. x-values of ¼, ½ and so on.
(Mathematicians call fractions (with integer numerators and integer denominators) 'Rational' numbers and the set of all of them is given the symbol Q).
This option is used in D. Zuras's papers and delivers some benefit (see 'zuras'). Indeed, this is the principal difference between 'zuras' and this document.]
It will be fairly obvious that a unique polynomial of degree '1' (a straight line) can be fitted through 2 points on the paper; a unique polynomial of degree '2' (a parabola) can be fitted through 3 points on the paper; etc, etc.
Hence in order to find the required product polynomial of degree OPdeg we are going to need OPterms independent points, that it must pass through, to do it. Even if these points could be found (fairly simple - see below) how would we find the product polynomial from them anyway?
The last part can be done by using Lagrangian interpolation polynomials (other methods can be used - e.g. see 'Ref 11' section 4.3.3 where a method based on difference tables is presented). One and only one polynomial of degree OPdeg passes through OPterms different points; hence the recovered polynomial will be the required unique product polynomial.
(Lagrangian polynomials can be used to interpolate (i.e. find a 'y' value at some intermediate 'x') when a set of points on a curve are known. This interpolation capability of Lagrangian polynomials is not of use in Toom Cook - only the polynomial that passes through the specified points is needed. Despite the fact that interpolation is not done, some textbooks call this the 'interpolation stage'.)
So, how do we find the OPterms points on the graph needed to precisely determine the output polynomial? Simple: -
If we were to plot two polynomials on graph paper,
points that the product polynomial must pass through
can be found by: -
drawing a vertical line at some 'x' value,
'evaluate' the 'y' values of the 2 input polynomials at this 'x'
(call them y1 and y2);
then the product curve must pass through
y = (y1.y2) at the same 'x'.
So all that has to be done to get OPterms 'y' values that the product polynomial has to pass through is to 'evaluate' both input polynomials at OPterms different 'x' values; then multiply the pairs.
[Note that the statement above implies that the y's of the product polynomial will be roughly double the size (in words) of the input polynomial evaluated values.
These, in turn, will be 1 or 2 ... words larger than the size of the input polynomial coefficients - this growth resulting from the 'evaluations'.
Hence, if the input coefficients are Bn in size, the product polynomial coefficients will be B2n+1 or B2n+2 ... in size.
This means that when the final multiword integer is recovered, (by substituting x = 2n.64 into the product polynomial) the coefficients will overlap (by (n+1) or (n+2) ... words) and a staggered overlap type of addition will be needed to recover it.]
In principle any 'x' values could be chosen - giving an infinite number of choices for the set of x-values. More practically we need the size of the x-values to be small so as to limit the growth of the 'evaluated' 'y' values and hence the product polynomial 'y' values. This forces the choice of x-values to be clustered near to x=0.
This consideration realistically reduces the (integer) x-value candidate sets to: -
{xset_1} = {0, 1, ... OPdeg}
(in steps of '1')
[The Lagrangian polynomials depend on the xset being used.
The values presented in 'laglists'
only apply when xset_1 is being used.
]
{xset_2} = {-IPdeg, ... 0, ... +IPdeg}
(in steps of '1')
[Other xset's are discussed in the literature.
Notably 'Ref' 11 (Knuth) uses xset_1 in his Algorithm T.
He also considers xset_2, an xset of {1,2,4,8, ...}
and an xset of {12,22,32 ...}
in 'Answers' to Q.4 of section 4.3.3.
He also considers an iterative scheme for this work.]
There is merit with both xset_1 and xset_2.
The benefits of xset_1 are: -
(i) It will be obvious that all the input polynomial evaluations will be >= 0 for this xset; hence so will the y-values of the output polynomial.
(ii) It can also be proven that all the coefficients of the output polynomial
(obtained by processing the y's from (i) using the Lagrangian polynomials) will also be >= 0.
The associated Lagrangian polynomial coefficients have a regular "+-+-..." fluctuation in sign (see 'laglists'), so this is not obvious.
[Below, it will become clear, that a small speed advantage can be gained by negating some of the Lagrangian polynomials in order that a common denominator can be used. The values in 'laglists' include this negation and (ii) is only true after this is done.]
Both (i) and (ii) simplify the software task slightly. For (ii) we might calculate the positive and negative sums separately, then subtract them, confident that no 'borrow' will be generated. (The regular fluctuation in sign can be exploited in the software.)
(This point may seem trivial; but handling carries and borrows can have a surprisingly large corrosive effect on software speed. Any scheme that removes this problem is to be welcomed.)
The benefits of xset_2 are: -
(i) The Lagrangian polynomial coefficients are a little smaller. This means that a higher input polynomial degree can be selected before the associated Lagrangian polynomial coefficients need 2# 64-bit words to hold them. When this happens the workload (and run times) step upwards, of course.
[Examination of 'laglists' (all based on xset_1) shows that coefficients for IPdeg=8 (or OPdeg=16) the Lagrangian coefficients are all in 1# 64-bit word. For IPdeg=9 (OPdeg=18), some of the coefficients need 2# 64-bit words to hold them.
On a similar topic, overflow into 2 or more 64-bit words during 'evaluations' only takes place when IPdeg is 14 (or OPdeg is 28) or greater.
[This can be found by considering xset = {0,1,...,OPdeg} in conjunction with the highest degree term (xIPdeg) in the corresponding input polynomial.
(Note: - 264 is 1.84e19 in floating point format.)
Hence OPdegIPdeg will be the largest factor during evaluations.
When OPdeg=26 this is 2.48e18 (i.e. < 1# 64-bit word of growth);
when OPdeg=28 this is 1.82e20 (i.e. > 1# 64-bit words of growth).]
Note that it is the Lagrangian polynomials that are more significant than the evaluations when considering data (coefficient) size growth.
If a limiting IPdeg of 8 is used, OPdegIPdeg=4.29e9 i.e. contained in less than half of a 64-bit word. This means that the products in the multiplication stage can always be contained in (2.n + 1) 64 bit words. Put another way, although a space of (2.n + 2) will need to be available to hold the product, the 1st word will always be zero.]
[A trivial point here. It takes a processor precisely the same amount of time to multiply, say, (2 by 3) or (0x1234567887654321 by 0x4321123456788765). You and I might wince at having to calculate the second of these products, but the processor is indifferent. We have to suppress value judgements gained from years doing mental arithmetic when planning high speed computer arithmetic. This means that xset_2 has no merit whatsoever (from the viewpoint of speed) until, as OPdeg grows, the Lagrangian coefficients corresponding to xset_1 grow into 2 words.]
Hence, it is fairly simple to find a set of points that the output polynomial has to pass through. That leaves the problem of finding the polynomial that passes through the points: -
The problem of finding the unique polynomial of degree OPdeg that passes through OPterms points appears tough at first encounter; solution - break the problem down to something easier.
Consider the simpler problem of finding a polynomial that passes through OPterms points, all but 1 of which have values of zero. (Call such a polynomial 'elemental')
In this case there are still OPterms points to pass through; hence the degree of the elemental polynomial will have to be OPdeg as before. This elemental polynomial can be written down (in factored form, at least) directly (see below).
By extension OPterms elemental polynomials can be written down merely by varying which one of the OPterms y-values is non-zero in turn. If we then add all these 'elemental' polynomials together, the sum polynomial will now pass through all OPterms points simultaneously and have a degree of OPdeg - i.e. just the unique polynomial sought.
So the key step is to find an elemental polynomial of degree OPdeg that passes through OPdeg zeroes and 1 non-zero y-value.
Let the xset be {x0, x1, ... xOPdeg} with corresponding polynomial y-values {y0, y1, ... yOPdeg}.
Let the i'th 'y' value be non zero, with all the others zero.
Then the following product of degree 1 polynomials: -
(x-x0) ... (x-xi-1)(x-xi+1) ... (x-xOPdeg) --- POLY A
will be zero at every x-value in xset except x = xi (there being no (x-xi) factor in POLY A).
POLY A is nearly one of the elemental polynomials sought. One problem remains - POLY A will not, in general, pass through the value yi. In order to make it do so, it has to be scaled.
Consider POLY A evaluated at xi obtained by substituting x = xi into it, resulting in a value di say: -
di = (xi-x0) ... (xi-xi-1)(xi-xi+1) ... (xi-xOPdeg)
(= y value of POLY A at x=xi.)
The value of 'di' can be just about any value, either positive or negative but not zero (provided the xset values contain no repeats). Note that POLY A is evenly divisible by this value (at all the integer xset points) since POLY A is zero at all xset values not equal to xi and leaves POLY A with a value of '1' at xi itself.
[Recall zero is evenly divisible by any (non-zero) integer with a quotient of zero.]
After division by di, POLY A can be multiplied by the output polynomial yi to end up with an OPdeg degree elemental polynomial that passes through the points: -
(POLY A * yi / di) passes through {{x0, 0},{x1, 0} ... {xi, yi} ... {x(OPdeg-1), 0},{xOPdeg, 0}}.
Clearly there will be OPterms di's, and they have the following properties: -
(a) They are 'smooth' numbers. In number theory smooth numbers have many small factors. If the xsets are simple sequences with steps of '1' a factor of '2' will enter di for every 2nd entry of xset; a factor of '3' for every 3rd entry and so on.
(b) The size of the di's grows as OPdeg grows. Eventually they overflow into 2# 64-bit words. When this happens the time taken to carry out the division steps up.
(c) Restricting the di's to 1# 64-bit word and assuming the output polynomial coefficients are B2.n+1 in size, then OPterms * (2.n+1) calls to the DIV opcode will be needed - and DIV is the slowest opcode on an x86 processor by a wide margin.
For greater speed it would be beneficial if the number of calls to DIV could be lowered. This can be done by replacing the di's by the lowest common multiple (LCM) of all of them (call this value 'D') as follows.
D = LCM{abs(d0), abs(d1) ... abs(dOPdeg)}
[The lowest common multiple of the abs(di)'s can be found (if a GCD() module is available - see elsewhere on this site for the world's fastest GCD() algorithm) by iteratively using: -
Let m = LCM (abs(a), abs(b)) = (abs(a) * abs(b)) / GCD(abs(a), abs(b))
then LCM (abs(c), abs(a), abs(b)) = LCM (abs(c), m) etc.
Note that 'D' will not be much larger than the largest di because of the smoothness of the di's. Put another way the GCD() denominator will be significant, which forces down the LCM value.]
Having found 'D', it can be used as follows: -
(I) The Lagrangian polynomial coefficients need to be scaled up by (D / abs(di)) and simultaneously negated if di is negative.
(II) The division by 'D' can now be held off until the final 'staggered sum' has been completed when recovering the multiword integer output; and then the number of calls to DIV will be roughly halved. This is a small speed gain without penalty, which is to be welcomed.
Note that when Toom Cook is used to multiply polynomials from Z[x], this use of the LCM 'D' is of only moderate benefit. But it is very beneficial when multiplying polynomials from Fp[x] since then only 1 'multiplicative inverse' (of 'D' itself) needs to be found.
(See elsewhere on this site to find the world's fastest multiplicative inverse algorithm derived. Finding the inverse of a 1-word number can be found significantly faster than that for a full multiword sized Field member.)
In 'laglists' this use of the LCM is built in i.e. the 'denominator' quoted is 'D' and the Lagrangian polynomials have been scaled up to match. Simultaneously the Lagrangian polynomials with negative di's have been negated.
[Lagrangian polynomials are sometimes proposed for the task of interpolation between regularly tabulated data points. This is a well known difficult numerical problem. The problem is known as "Runge's phenomenon", which manifests itself by excessive swings in the interpolation curve, particularly towards the end of the tabulated interval. Lagrangian polynomials are not immune to this effect. Vigorous testing and sensitivity testing (to see what happens when the tabulated values are changed by small amounts) are necessary to check that all is well with any application.]
Toom Cook proceeds in 3 stages: -
(worded for multiword integer products)
[A] Split the multiword inputs into IPterms parts
considered as polynomial coefficients,
and select an 'xset' (with OPterms entries).
Then 'evaluate' both input polynomials at each x-value in turn.
[B] Multiply the 2 evaluation sets from [A] term by term
- the product polynomial must pass through all these evaluation products simultaneously.
[C] Form the product polynomial from the values from [B]
using Lagrangian polynomials (associated with the xset in [A]).
Finally recover the multiword integer product
from the product polynomial coefficients.
[The use of symbols [A], [B] or [C], below, refers to these 3 stages.]
A better feel for Toom Cook can be obtained by working through an example: -
(Polynomials (rather than multi-word integers) are used throughout this 'example' and all numbers are in decimal.)
Working with 2# degree 2 (3 term) input polynomials to produce a degree 4 (5 term) output polynomial, the Lagrangian polynomials required (copied from 'laglists') are: -
Evaluation points (xset) = {0, 1, 2, 3, 4} Lagrangian Polynomial coefficients for this xset: - +1, -10, +35, -50, +24 -4, +36, -104, +96, 0 +6, -48, +114, -72, 0 -4, +28, -56, +32, 0 +1, -6, +11, -6, 0 Common Denom 'D' is +24
The following points from this table can be observed: -
(a) The sign of any non-zero entry = (-1)(i+j).
where i = row number, j = column number (both counting from '0').
(b) Every column (except the last) sums to zero.
(c) The top entry of the last column = '+D' with the rest '0'.
These points are true for any table whenever a common denominator along with xset_1 is being used (as above). These points provide useful checks in software development.
----------
As an example of deriving these polynomials, consider row 2: -
This polynomial is based on x=1; for which 'POLY A' is: -
x.(x-2).(x-3).(x-4) = x.(x3 - 9.x2 + 26.x - 24)
and d1 = (1-0).(1-2).(1-3).(1-4) = -6
The full d_set is {+24, -6, +4, -6, +24} with a 'D' (= LCM(abs(di's))) of +24.
Since D/d1 = -4 this POLY A needs to be scaled up by '4' and negated, to give the row 2 entry.
----------
Let the 2# degree-2 polynomials be 'g' and 'h', and the product polynomial be 'r' with
g = 3.x2 + 5.x + 7 and h = 4.x2 + 3.x + 2.
Toom Cook stage [A]
This involves evaluating g and h at {0, 1, 2, 3, 4} in turn: -
g(0) = 7, g(1) = 15, g(2) = 29, g(3) = 49, g(4) = 75.
h(0) = 2, h(1) = 9, h(2) = 24, h(3) = 47, h(4) = 78.
Toom Cook stage [B]
This involves multiplying the evaluations from A: -
y(0) = g(0)*h(0) = 7 * 2 = 14. Similarly
y(1) = 135, y(2) = 696, y(3) = 2303, y(4) = 5850
Toom Cook stage [C]
This involves recovering the coefficients of 'r' using the Lagrangian polynomial coefficient table above. The coefficients of like order are found by working down the columns of this table: -
r[0] = 1.14 - 4.135 + 6.696 - 4.2303 + 1.5850 = 288.
Similarly, r[1] = 696, r[2] = 1176, r[3] = 744, r[4] = 336.
These values have to be divided by D=24 to get the final result: -
r[0] = 12, r[1] = 29, r[2] = 49, r[3] = 31, r[4] = 14.
[For those who prefer their equations in matrix format: -
If A[] represents the 5-by-5 array of Lagrangian coefficients above, and AT represents it's transpose, then we have: -
r[] = (1/D) * AT[] * y[]
for r[] and y[] arranged as column arrays.]
As a check, by long multiplication (schoolroom method): - 3 5 7 4 3 2 * --------- 6 10 14 9 15 21 12 20 28 + ------------------ 12 29 49 31 14
It works!
(1) The fact that the Lagrangian polynomials in 'laglists' are in 'row' order in the sense that each row holds one of the Lagrangian polynomials in high-to-low order, has to be taken on board.
The fact that when calculating the product polynomial coefficients (by accumulating like powers of 'x' from each of the Lagrangian polynomials) means that it is necessary to work down the columns, also has to be taken on board.
These arbitrary layout arrangements were set when the 'laglists' tables were calculated. Failure to grasp the arrangements will result in nonsense being generated.
(2) The denominator 'D' will always evenly divide the product polynomial coefficients (i.e. without remainder) whatever the yset used to multiply the Lagrangian polynomials happens to be. This is because each elemental Lagrangian polynomial is evenly divisible by 'D' at each x-value in xset_1 (discussed above).
(3) Contemplation of the calculation of the constant term of the product polynomial shows that the 'constant' yset element is multiplied by 'D' then divided by 'D'! This is, of course, a total waste of time! Code layout may be simplified by this, but for speed, the constant term needs to be handled separately.
When multiword integer products are being found, the following avoids this problem: -
At this point it will be clear that Toom Cook theory leads to a whole family of modules. I have implemented just 2 of them: -
(i) A method I shall call "TC_4" that splits the input arrays into 4 terms (i.e. a degree 3 polynomial) and generates a degree 6 output polynomial (7 terms). As such it is directly comparable to the module discussed in 'zuras' (Zuras_4). The Zuras_4 module is faster by several %.
(ii) A method I shall call "TC_8" that splits the input arrays into 8 terms (i.e. a degree 7 polynomial) and generates a degree 14 output polynomial (15 terms). The papers by Mr. D. Zuras contain no method comparable to this. TC_8 is the fastest method I have tested when inputs are very large.
[The prospect of developing an 8 term zuras-type module is not attractive (contemplate his flowcharts in documents ZURAS_A or ZURAS_B (references in 'zuras')). If it could be done it would surely outstrip TC_8.]
Other sized TC modules appear to have little to offer. There are theoretical reasons why a TC_5 module would be only slightly faster than TC_4. Hence jumping straight to a TC_8 should give the best speed gain while not requiring large scale software changes. A TC_10 module (or larger) runs into the 2-word Lagrangian polynomial coefficient and denominator problem, which will erode speed.
As a matter of interest, it is fairly simple to write other module sizes. By using a suitable set of '#define' constants in the code, little more than adjusting these is needed. The Lagrangian polynomial coefficients need replacement, of course, but this is little more than cut-and-paste.
There are many ways to evaluate polynomials and new methods are being developed all the time.
[The problem of polynomial evaluation is important in Number Theory beyond the confines of Toom Cook; e.g. it is a key component of the fastest known deterministic factorization method - see 'Ref 3' for a starting point.]
'Ref 3' has 3 methods for doing it; one of which applies to the case when the x-values are in an arithmetic progression - as in Toom Cook with xset_1 or xset_2. This method finds the first few evaluations using a direct technique then uses these results to assist with subsequent ones. As such it comes into its own when the number of evaluations is much greater than the polynomial degree. For Toom Cook a polynomial of degree 'n' has to be evaluated (2.n + 1) times - i.e. not an obvious candidate for such methods.
'Ref 11' also gives a method with a similar drawback.
These methods just might provide some benefit for Toom Cook but checking this proposition requires a non-trivial amount of effort, which I have not done.
Any student of numerical methods will have learnt Horner's method, which reduces the number of multiplications to a minimum (unless previous evaluations can be exploited).
[Dating from 1819, but known to Newton 150 yrs earlier according to 'Ref 11'].
This method is so well known that further discussion is not needed here.
From 'speed' it will be obvious that line movements (code or data) can exert a malignant effect on run times. A corollary to this statement is that if a method of doing some task that puts less demand on the line movement hardware (the caches) can be found, it is worth exploring, even if it flies in the face of all textbook advice. I do not intend to promote this procedure as a panacea - it isn't. Nevertheless, when drawing up a shortlist of candidates when hunting for speed, including a 'line friendly' module sometimes pays off. The 'line friendliest' method that I could think of was just to evaluate the polynomial term-by-term - the dumbest method known!
Both the dumb method and Horner's method were implemented in assembler.
When I timed Horner's method against the dumb method I found the dumb method about twice as fast as Horner's method over a wide range of input conditions! This is another example of keeping an open mind and conducting experiments when speed is important.
This is the method used to obtain the results in 'msummary'.
From 'msummary' it will be clear that modules derived from this document only become competitive for speed when the input size, in 64-bit words, exceeds several hundred. This means that the top level module will have to hand off the several smaller stage [B] products to whatever method is fastest for their size. As further splitting takes place, different methods will be optimal at different stages.
The only way to get the best speed performance in this scenario is to conduct timing experiments. [See 'msummary' for details]
This section is algorithmically straightforward. This section was also written in assembler.
In my implementation I evaluate the positive Lagrangian terms separate from the negative - then subtract the 2. It is probable that amalgamating these 2 steps might save a handful of % in time, but the increase in complexity of the module to do it is significant. Since I have little need for the very large products that might benefit from this, I have not done it.