[This document is based on papers authored by
Mr. Dan Zuras (1993, 1994) (full references below).]
One purpose of this document is to try to stop software writers wasting time on the Toom Cook algorithm! Its speed is uncompetitive on an x86 processor, except when the input integers need arrays of several hundred words to hold them. If you have no need to multiply arrays that large, go to the site document 'msummary' (link below) to view timings for more suitable methods.
[Some documents on this site are referred to repeatedly below. All the links to these documents are given here; return here if you want to view the contents.
tc_notes.html - Provides an introduction to the original Toom Cook method. Certain principles introduced in 'tc_notes' are assumed familiar below.
[I have written 2 methods based on the ideas in 'tc_notes' - TC_4 and TC_8 - which split the input data arrays into 4 parts and 8 parts respectively. These labels are used below. TC_4 is directly comparable to the method discussed below; it is about 10% slower but simpler to implement.]
msummary.html - Presents timing comparisons for the schoolroom method and all the Toom Cook methods that I have implemented.
speed.html - Discusses the basics of writing high speed code for x86 processors. Some jargon is introduced in 'speed', which is used below.
refs.html - Lists the technical references used in this document.
Two papers written by Mr. D. Zuras provide all the theory for this document - I refer to these below as ZURAS_A and ZURAS_B.
These are: -
ZURAS_A - "On Squaring and Multiplying Large Integers" (1993)
ZURAS_B - "More on Squaring and Multiplying Large Integers" (1994)
Both these documents can be found on the web by entering their titles as a search string. Below I merely quote from these documents; you will need to read the originals for a better understanding.
(Below I shall refer to the algorithms covered in the ZURAS papers as Zuras's methods. The papers do not mention who originated these algorithms, so this label may be misleading.)
Below I shall write e.g. 'Ref' 11 or 'ZURAS_B' (single quotes) when referring to the documents above.]
[The ZURAS papers are an exception to the generalisation that technical documents on high speed code are little more than worthless. Zuras concentrates on his own experimental results - speed estimates are rarely mentioned! If only more authors would follow his lead!
The machine he used was a HP 9000/720; a machine known only to historians today. Nevertheless he uses clock ticks as his measure of time, which preserves relevance down the decades even though the clock rate was only 50 MHz. He describes this machine as using a 32-bit RISC architecture (i.e. most opcodes taking 1 clock tick) with a MUL taking 2 ticks (more if data movement time is included). This proportion is similar to recent x86 processors; nevertheless the conclusions he reaches have to be treated with a little caution.
(So long as relative speeds are being considered the 32/64-bit processor debate is not important. Once absolute speeds are required 64-bit processors win out by a factor approaching 4 relative to a 32-bit processor for the multiword multiplication task. All results in this document apply to 64-bit processors running 64-bit software.)
He provides a list of elemental modules written in assembler, which his higher level C-code modules call as needed. This strategy is still the best for the achievement of high speed. It is nearly impossible to write everything in assembler, and nearly impossible to avoid using some assembler written modules called by 'C' modules. The software planning problem comes down to deciding which of the 2 methods to use with any specific module.
Both papers concentrate on the squaring problem, but the methods are easily extended to the multiplication problem.]
Both ZURAS papers discuss several squaring methods including schoolroom, Karatsuba, Toom Cook and Schönhage Strassen. The schoolroom and Karatsuba methods are straightforward and covered in all textbooks (e.g. 'Ref' 11) ('msummary' gives an outline) so need no further comment.
His comments on Schönhage Strassen (an FFT-type method known as a Number Theoretic Transform (NTT) method) are of interest. (This method is recommended by many textbooks.) He implements and obtains timings for 3 variants which he calls FFT-3, FFT-4, FFT-5. None of these methods beat Toom Cook up to his input array size limit of 100,000 words. He is ambiguous about their performance beyond this limit saying: -
"... it is possible that the FFT-k methods
will eventually win out for some w >> 100,000."
('w' the input array size).
I personally have no application for inputs this large; so Schönhage Strassen was immediately removed from my candidate list! His conclusion may be invalid on an x86 processor, but I, for one, am not prepared to commit months of effort either to finding his conclusions correct or, if they are wrong, to obtaining small gains.
Both ZURAS papers discuss Toom Cook methods in their original form (see 'tc_notes' for my take on this) using evaluation points at integer x-values symmetrical about '0', and in a form using evaluation points from the Rationals.
The 'original form' Toom Cook methods he considers are 3 and 4-way splitting methods that he finds uncompetitive. I find similar results. In 'tc_notes' I find an 8-way splitting method to be the best when very large input array sizes are being multiplied.
'Zuras's variant' of Toom Cook uses Rational evaluation points; he finds this gives a speed gain, as do I.
The set of Rationals he uses is: -
{1, inf, 0, 2, 1/2, -2, -1/2, 3, 1/3, -3, -1/3, 3/2, 2/3, ...}
(he does not specify which members of this set are used in the algorithms presented.)
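As an aside on how Rational points are used without leaving integer arithmetic: for a split polynomial p(x) of degree d, the scaled value b^d * p(a/b) is always an integer, and it is this scaled value that the algorithm works with. A minimal sketch (the function name is mine, not taken from the papers; Python integers stand in for multiword values):

```python
def eval_at_rational(coeffs, a, b):
    """Compute b**d * p(a/b) for a polynomial p with integer
    coefficients, coeffs = [p0, p1, ..., pd] (lowest power first).
    The result is an exact integer - no fractions are needed."""
    d = len(coeffs) - 1
    return sum(c * a**k * b**(d - k) for k, c in enumerate(coeffs))

# 2-way split p(x) = 7 + 5x, evaluated at x = 1/2:
# 2*p(1/2) = 2*7 + 5 = 19, an integer as promised.
assert eval_at_rational([7, 5], 1, 2) == 19
```

The negative points such as -2 and -1/2 are handled the same way, with a negative 'a'.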
[The only difficulty with this set is the use of 'inf'; the point at x = infinity. This is not as complicated as it seems on first encounter.
(In the example, next, I use Zuras's squaring problem and the Karatsuba method, just for simplicity.)
The idea is that the Karatsuba solution is equivalent to finding the 'C' polynomial coefficients in the following equation: -
(A1.x + A2)^2 = C2.x^2 + C1.x + C0
If we divide both sides by x^2 and let x approach infinity,
we find A1^2 = C2
which is obvious by equating like powers of 'x'.
In short, the use of 'inf' is little more
than a piece of mathematical shorthand.]
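The whole Karatsuba example, 'inf' included, can be run in a few lines. This is only a sketch - Python integers stand in for multiword arrays, and the function and parameter names are mine:

```python
def karatsuba_square(n, half_bits):
    """Square n by splitting it as A1*x + A2 with x = 2**half_bits.
    The three coefficients come from evaluations at x = 0, 1 and 'inf'."""
    x = 1 << half_bits
    A1, A2 = divmod(n, x)
    C2 = A1 * A1             # evaluation at 'inf': the leading coefficient
    C0 = A2 * A2             # evaluation at 0
    S1 = (A1 + A2) ** 2      # evaluation at 1: equals C2 + C1 + C0
    C1 = S1 - C2 - C0
    return C2 * x * x + C1 * x + C0

assert karatsuba_square(123456789, 16) == 123456789 ** 2
```

Three squarings of half-size operands replace one full-size squaring - the standard Karatsuba saving.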
Using this set of Rationals, 2 algorithms are developed - a 3-way splitting method I shall call Zuras_3, and a 4-way splitting method I shall call Zuras_4.
In his experimental results he finds Zuras_4 faster than Zuras_3, but not by much. Guided by these results I implemented Zuras_4 (not Zuras_3). Signal flow graphs are presented for both algorithms in both ZURAS_A and ZURAS_B; that for Zuras_4 is something like twice as complicated as that for Zuras_3. Considering that the speeds of the two are very similar, Zuras_3 could be a better bet for projects under pressure of time.
The flow graph for Zuras_4 is self explanatory.
Indeed, the whole method is straightforward, if bulky and intricate. It probably represents 2-4 months of software effort to implement. The intricacy of the flow graph implies that very detailed planning is essential.
The following points may provide some assistance: -
[You will need to be looking at Figure 4 of ZURAS_A to follow the next comments. The equivalent figure in ZURAS_B is too small to be usable.
(Below 'Figure 4' will always imply 'Figure 4 of ZURAS_A'.)]
(a) Figure 4 applies to the squaring problem; when multiplication is considered, the S4 and S2 terms might be negative. (In either case, most of the 't2n'...'t6n' variables might be either positive or negative - I use single bits in sign indicator variables to hold '+' or '-', while holding the 'magnitude' in unsigned arrays.)
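The sign-indicator scheme in (a) can be modelled as follows. This is a Python sketch with an integer standing in for the unsigned magnitude array; the function name and the 0/1 sign encoding are my own choices:

```python
def signed_add(sign_a, mag_a, sign_b, mag_b):
    """Add two sign/magnitude values.  sign is 0 for '+', 1 for '-';
    mag is a non-negative integer (modelling an unsigned array).
    Returns the (sign, magnitude) of the sum."""
    if sign_a == sign_b:              # like signs: magnitudes add
        return sign_a, mag_a + mag_b
    if mag_a >= mag_b:                # unlike signs: subtract the smaller
        return sign_a, mag_a - mag_b  # magnitude from the larger; the
    return sign_b, mag_b - mag_a      # larger operand's sign wins

assert signed_add(0, 10, 1, 3) == (0, 7)   # (+10) + (-3) = +7
assert signed_add(1, 10, 0, 3) == (1, 7)   # (-10) + (+3) = -7
```

The point of the scheme is that the magnitude arrays themselves never go negative, so the unsigned multiword add/sub assembler modules can be used unchanged.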
(b) When the flowchart is analysed from the standpoint of reducing line movements, some of the 't' variables will be seen as mere burdens on the line movement hardware (the caches) e.g. t45 and t40. In general, where a 'tnn' variable is only used once (i.e. has a single line exiting it) consideration should be given to deleting it.
It is a point made repeatedly on this site that line movements can have a corrosive effect on speed. Considering Figure 4 from this viewpoint, the summing junctions at the end of the flow 'arrows' below the S-product boxes will be seen as a stress test of the processor's front-side components, including RAM, bus and caches! Considering further that Zuras_4 is only competitive when the input arrays are some hundreds of words (i.e. some dozens of lines) in size (a fact demonstrated in 'msummary'), each summing junction will be making heavy demands of the front-side components.
[The work in each summing junction moves through the arrays from high-to-low addresses in a regular 'stride pattern'. Early x86 prefetch systems could not work 'backwards' like this, but all later models can. This hardware development reduces cache stress significantly here.]
The best way to reduce line movements is to eliminate all the t2n-t6n variables and find the C5,...C1 values directly from the S5,...S1 values.
This can be done by using the following matrix product: -
                                                          |C0|
        |C1|   | -360   -360  +12   -4  -160   +2    +6|  |C6|
        |C2|   |-1530   +360  +12  +12     0   -3    -3|  |S1|
360  *  |C3| = |+1530  +1530  -27   -7  +680   -7   -27|  *  |S2|
        |C4|   | +360  -1530   -3   -3     0  +12   +12|  |S3|
        |C5|   | -360   -360   +6   +2  -160   -4   +12|  |S4|
                                                          |S5|
(These coefficients were obtained by working backwards from the Figure 4 flowchart.)
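In code the whole product is nothing more than an integer matrix-vector multiply. A minimal sketch, with the coefficient rows copied from the matrix above (the driver producing the C0, C6 and S-values is not shown, and the function name is mine):

```python
# Rows of the product 360*[C1..C5] = M * [C0, C6, S1, S2, S3, S4, S5].
M = [
    [ -360,  -360, +12,  -4, -160,  +2,   +6],   # -> 360*C1
    [-1530,  +360, +12, +12,    0,  -3,   -3],   # -> 360*C2
    [+1530, +1530, -27,  -7, +680,  -7,  -27],   # -> 360*C3
    [ +360, -1530,  -3,  -3,    0, +12,  +12],   # -> 360*C4
    [ -360,  -360,  +6,  +2, -160,  -4,  +12],   # -> 360*C5
]

def interpolate(v):
    """v = [C0, C6, S1, S2, S3, S4, S5]; returns [360*C1, ..., 360*C5]."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]
```

In a real module each coefficient multiply is a small-constant-times-multiword-array operation, and each '+' an add/sub of arrays, so the implementation notes that follow matter a great deal.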
[Some implementation notes: -
(A) The factor of '360' has to be divided out at the end to get the required {C1,...C5} set. 6 of the coefficients in the table are also ±360, and it is foolish to multiply these by 360 and then divide by 360 at the end. The way I avoid this is to add/sub all the terms that need no division directly into the output array (a Zuras_4 input pointer parameter). All the others are multiplied then added/subbed into an auxiliary array; the total is then divided by 360; finally the quotient is added/subbed into the output array.
An objective with this procedure is to minimize the use of the x86 DIV, which is very slow. This requires the auxiliary array to be as small as possible, and the division to be held off until all the add/subs are completed (these C1...C5 terms overlap significantly in the auxiliary array sum - see 'tc_notes' - and doing a division on individual terms would roughly double the division time).
(B) Some of the other matrix products are repeated. I use further auxiliary arrays to hold these, to prevent recomputation.]
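The division in note (A) runs through the auxiliary array one word at a time, chaining the remainder - one DIV per word, which is why deferring it to a single pass matters. A Python model (the names are mine; a list of 64-bit words, lowest word first, stands in for the array):

```python
def div_small(words, d):
    """Divide a multiword magnitude by a single-word constant d.
    Works from the high word down, chaining the remainder into the
    next word - one divmod (one x86 DIV) per word.
    words is lowest-word-first, base 2**64.  Returns (quotient, remainder)."""
    base = 1 << 64
    rem = 0
    quot = [0] * len(words)
    for i in reversed(range(len(words))):
        quot[i], rem = divmod(rem * base + words[i], d)
    return quot, rem

# A value known to be an exact multiple of 360 divides with remainder 0.
n = 360 * 0xDEADBEEF_DEADBEEF_DEADBEEF
q, r = div_small([n & (2**64 - 1), n >> 64], 360)
assert r == 0 and q[0] + (q[1] << 64) == 0xDEADBEEF_DEADBEEF_DEADBEEF
```

The cost is one DIV per word of the auxiliary array, regardless of how many C-terms were summed into it - exactly the saving note (A) is after.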
There is only 1 way to verify the utility of this scheme - write it and time it.
The timing results in clock ticks were as follows: -
#(Input        |              AMD A8
#64-bit words) | Figure 4 method   Modified 'end game'
  0x800        |  0x3212B4          0x2F2E3E
  0x2000       |  0x17B09B4         0x16F8A93
  0x8000       |  0xB4DCEB5         0xAC93FB8
---------------------------------------------------------------
#(Input        |              Intel i3
#64-bit words) | Figure 4 method   Modified 'end game'
  0x800        |  0x35E73C          0x35777C
  0x2000       |  0x1AE3260         0x1A93664
  0x8000       |  0xCB4D09C         0xC83D020

[The operation of the software in these tests was: -
 when #words=0x800  - 1 Zuras_4 split, followed by some Karatsuba splits,
                      followed by the schoolroom method.
 when #words=0x2000 - 2 Zuras_4 splits etc.
 when #words=0x8000 - 3 Zuras_4 splits etc.]
These results indicate that the speed gain is positive but small - gains less than 5-10% I consider generally not worth bothering with. However, readers about to start writing their own Zuras_4 module need to appreciate that incorporating the modified 'end game' perhaps halves the software writing effort.
In 'msummary' I find that TC_8 is the fastest module that I have written, when the input array sizes are very large.
Since Zuras_4 is faster than TC_4, it is reasonable to expect that a Zuras_8 module would be faster than my TC_8 module, probably by more than 10%. Deriving a flow graph (in the style of Figure 4) for a Zuras_8 module would require considerable effort (implementation would be tough as well). The modified 'end game' technique would surely be the only practical way to implement a Zuras_8.
If a TC_4 module has been written, writing a TC_8 module borders on a trivial additional step. Only the Lagrangian coefficients need changing (little more than cut-and-paste), along with a few '#define'd constants controlling loop counters, array sizes etc. For someone who has written TC_4 (i.e. very familiar with it) this might take less than a week! Weighing everything up, a Zuras_8 module is an unattractive prospect.