[Some documents on this site are referred to repeatedly below. All the links to these documents are given here; return here if you want to view the contents.
tc_notes.html - Provides an introduction to the Toom Cook method in its original form. Two algorithms of this form are tested below.
zuras.html - Discusses some more recent Toom Cook methods (and others) produced in 2 papers by Mr. Dan Zuras. One of these is tested below.
speed.html - Discusses the basics of writing high speed code for x86 processors. Some jargon is introduced in 'speed' which is used below.
refs.html - Lists the technical references used in this document.
Below I shall write e.g. 'zuras' or 'speed' (single quotes) when referring to the documents above.]
Multiword integer multiplication is a cornerstone of many software packages that have a 'number theory' component e.g. cryptographic packages. Oddly, there is little in the way of experimental performance data to guide projects.
[An exception to this is Dan Zuras's work (see 'zuras').]
When I set out to write this and associated documents I had the objective of 'sorting the subject out'. This is a hopeless objective. There are so many algorithms and variations of algorithms that the task is huge. Not only that, if 2 programmers were given a flowchart for some algorithm and told to implement it, the execution times of the implementations might easily differ by tens of percent - results depend significantly on experience and skill levels, and writing high-speed code is not a common skill.
I have now adopted a more sensible objective - the provision of timings for a handful of Toom Cook methods that are, in the view of most experts, front runners. Perhaps someone might be inspired to collate other results to provide a global summary.
Although this document is all about experimental results, 2 methods - the schoolroom and the Karatsuba methods - are simple enough to need only a brief introduction and this is included below. The other methods are discussed in dedicated documents 'zuras' and 'tc_notes'.
Timings below are always in processor clock ticks (see 'speed'). Doing this means that processor clock speed gains, from hardware improvements, are taken out of consideration. This leaves results relevant over a longer term. Even when this is done, there are variations between processor types.
The speed of multiplication algorithms is dominated by 2 competing influences - the 'number crunching' time (the MUL calls themselves) and the administrative time (the surrounding MOVs, ADDs etc.): -
[From here down I shall use 'IPn64' to indicate the number of 64-bit words needed to hold the input arrays.]
In general, when IPn64 is small, the simplest methods are fastest (the administrative time must be as small as possible). As array sizes grow so more complex methods become faster than simpler ones (more administrative time can be tolerated so long as number crunching time savings are even greater). Yet further growth in IPn64 enables yet more complex methods to be the fastest ...
[This workload breakdown explains the historical evolution of this subject since the early days of x86 processors.
A reference book tells me that a 286 processor took 21/24 ticks to perform a (16 bit) MUL opcode. Today an AMD A8 has a latency of 5 ticks (register operands) and a throughput of 2 ticks for a 64 bit MUL.
This means that 'number crunching' times have shrunk dramatically. Simultaneously, the MOVs and ADDs (and a few MULs) that make up the administrative workload have only been reduced on the back of increased register size. This, in turn, means that the advantage of the more complex methods has shrunk to not far from the point of disappearing.]
Furthermore, the more complex the method, the longer it takes to write - some of the methods discussed below might take 3 months or more to write, even if a suitable suite of assembler-written subroutines is available. If these need writing as well, a doubling or trebling of writing time is possible.
From the project management viewpoint, the above points raise serious problems of money and skilled manpower allocation and project completion dates. A conclusion from the above is that, for the more complex methods discussed below, speed gains will be moderate and the expense of obtaining them will be unpleasantly high.
One fact that eases these problems: when timing results are plotted against IPn64 for all the methods discussed below, the curves intersect one another at very fine angles. If one defines an insignificant speed benefit as, say, a 10% speed gain, then there is little point in implementing some more complex method unless your project needs to handle input sizes much larger than the IPn64 of the intersection point.
On top of the fine intersection angles, when IPn64 is steadily incremented, algorithm run time increases in a slightly jerky manner. This is caused mainly by data line movement counts increasing somewhat erratically, by variations of flow paths within methods, and by variations in the internal 'state' of the processor look-up tables at any instant. These points mean that the phrase 'intersection zone' might be more appropriate than 'intersection point'.
My regular test machine is a 64-bit AMD A8-5500 with 6 GB of RAM with a 64-bit Windows 8.1 operating system. I sometimes use an Intel i3 machine (64-bit Windows 7) for comparison. The i3 takes 30% - 50% less time when IPn64 is small; but the A8 is faster when IPn64 is large (see below).
The development system is Visual Studio (2008).
All results below use 64-bit code and 64-bit assembler. This point is important because, for the type of work discussed in this document, a speed gain factor of 2-4 is available merely by making best use of the capabilities of the host processor. This factor is of the same order as that available from the 'cunning plans' discussed on this site.
[This point is also important if you are attempting to sell to the general public without restriction. There are many 32-bit OS's out there and they are not going away any time soon. If you intend to provide the best possible performance for your clients you will have to develop both 32-bit and 64-bit versions, find the OS word size of the host machine 'on the fly', then use this to select between code versions. Alternatively, just write a 32-bit version as this will run on a 64-bit OS with only a tiny loss of performance (relative to a 32-bit OS).]
All the results below are taken in 'hot cache' conditions (see 'speed' for the meaning of this). Such results are minimum times, unlikely to be found when a module is embedded in a larger project. Such timings have the benefits of experimental repeatability along with ease of measurement. Cold cache timings can be much larger (even 10 times larger). Warm cache times (typical of use in large projects) will be adjacent to hot cache times.
In ordinary development work I take 4 timings in rapid succession; but below I use 16 timings because some of the timings are very large and then Windows task switching becomes more likely to interfere with measurements - the only way to subdue this is to increase the number of timings and use the minimum.
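The minimum-of-N idea can be sketched as follows (a hedged illustration: time_min, NTIMES and demo_work are names invented here, and the portable clock() stands in for the RDTSC tick counter described in 'speed'):

```c
#include <time.h>

/* Take NTIMES back-to-back measurements and keep the minimum.
   Interference (e.g. a task switch) can only inflate a timing,
   never shrink it, so the minimum is the cleanest estimate.
   clock() stands in for the RDTSC tick counter of 'speed'. */
#define NTIMES 16

static volatile long demo_calls = 0;            /* demo workload only */
static void demo_work(void) { demo_calls++; }

static clock_t time_min(void (*fn)(void))
{
    clock_t best = 0;
    for (int i = 0; i < NTIMES; i++) {
        clock_t t0 = clock();
        fn();
        clock_t dt = clock() - t0;
        if (i == 0 || dt < best)
            best = dt;
    }
    return best;
}
```

Taking the minimum, rather than the mean, is what subdues the task-switching outliers mentioned above.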
4 further effects need consideration: -
[On a technical note, all timings are taken in Visual Studio's 'Release' mode (as opposed to 'Debug' mode).
[Typically, Debug mode increases code volume by 10-30%, which increases timings (increases code line movements as well as conditional jumps).]
On my system, even in Release mode a '/Zi' flag (= add symbolic debug info) is left set in the compiler command line (check your compiler 'Properties'). I have deleted this flag for all timings in this document to remove any trace of debug padding.]
I shall discuss algorithms in order of increasing complexity.
This algorithm gets its name from the way it mimics long multiplication as taught in a schoolroom. In this method products of individual digits are found one by one, working from the right, arranged in a rhombus skewed to the right at the top, and finally summed down the columns, making allowance for carries. In software it is possible to 'add in' the rows, but summing down the columns is slightly faster since line movements to the output array are then reduced - column sums can be done instead into 3 registers (2 for products, 1 for carries).
Writing n-by-n modules along these lines is very simple, if tedious. I have written straight line (i.e. loop free) modules up to IPn64 = 16 for both multiplication and squaring.
Beyond this I use a module that can find the products for a single rhombus column of any length and sum them, initially into registers then, when complete, into the output array. This is used in conjunction with modules that can evaluate the right/left hand 16 columns of the rhombus. This compound module can find schoolroom products up to any IPn64.
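To make the column-summing idea concrete, here is a minimal C sketch. This is my own illustration, not the assembler modules described above: the name schoolroom_mul is invented, and the compiler's unsigned __int128 type stands in for the MUL/ADC opcode sequences.

```c
#include <stdint.h>

/* Column-wise schoolroom multiplication (hypothetical sketch).
   g[], h[]: inputs of n 64-bit words (little-endian); out[]: 2n words.
   Each rhombus column is summed into a 192-bit accumulator
   (lo, hi: product words; carry: column carries) so that only one
   store per column reaches the output array. */
static void schoolroom_mul(const uint64_t *g, const uint64_t *h,
                           uint64_t *out, int n)
{
    uint64_t lo = 0, hi = 0, carry = 0;
    for (int col = 0; col < 2 * n - 1; col++) {
        int i_lo = (col < n) ? 0 : col - n + 1;
        int i_hi = (col < n) ? col : n - 1;
        for (int i = i_lo; i <= i_hi; i++) {
            unsigned __int128 p = (unsigned __int128)g[i] * h[col - i];
            unsigned __int128 s = (unsigned __int128)lo + (uint64_t)p;
            lo = (uint64_t)s;
            s = (unsigned __int128)hi + (uint64_t)(p >> 64)
                                      + (uint64_t)(s >> 64);
            hi = (uint64_t)s;
            carry += (uint64_t)(s >> 64);
        }
        out[col] = lo;            /* the single store for this column */
        lo = hi; hi = carry; carry = 0;
    }
    out[2 * n - 1] = lo;
}
```

The 3 accumulator registers (2 for products, 1 for carries) mirror the line-movement saving described above: the output array is written once per column rather than once per row addition.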
For the A8 the speed of these modules is close to: - 5 * IPn64^2 ticks.
If IPn64 is small, the constant is a little above '5' because of subroutine calling overhead and the shortness of the rhombus columns; if large the constant drops off to about 4.8.
[This undercutting of the latency time is small because the administrative workload between MUL calls is large enough to restrict it.]
Testing a 24-by-24 sized schoolroom product: -
A8 time = 0xB53 (2899 decimal) ticks
hence ticks/MUL = 2899 / 24^2 = 5.03
i3 time = 0x780 (1920 decimal) ticks
hence ticks/MUL = 1920 / 24^2 = 3.33
[I have been unable to find the i3 MUL latency time on the Intel website, but it is probably no more than '3' and may be '2'.]
Clearly, the i3 is superior with the schoolroom method, and it would be reasonable to suppose that this superiority extends to all methods. Below I demonstrate that it does not.
[Just to give some context to these figures: - the speed of light is about 3*10^8 m/sec. For a PC clock speed of 3 GHz (3*10^9 ticks/sec) a photon would travel 0.1m per tick. Hence finding a 320-by-320-bit product (5-by-5 64-bit words, so 25 MULs at 5 ticks/MUL = 125 ticks) would take the same time as a photon takes to travel 12.5m!
This blistering speed demonstrates that schoolroom methods will be impossible to beat for speed so long as the input array sizes are small. However, run time is going to go up very close to the square of IPn64. If a method can be found whose run time grows with a smaller exponent than '2' (it can!), then the 2 timing curves will intersect and beyond the intersection, the method will be faster than the school room method.]
The next point is obvious, but needs to be made clear. The more sophisticated methods split the inputs into parts (either 2, 4 or 8 parts in this document), then call another routine (possibly themselves or another method) to find the products of these parts. Which method is called depends on the answer to - "What is the fastest way to multiply the parts?" The parts may be split again and again ... eventually the parts will be small enough for the answer to this question to be "the schoolroom method". In other words, the schoolroom method underpins all the more sophisticated methods. This in turn means that any hint of poor performance in implementing the schoolroom method is going to be passed on to the performance of the more sophisticated methods.
[The schoolroom method plays a valuable role in debugging the more sophisticated methods. It is so simple that the method should work correctly 'straight out of the box'. If it doesn't, errors can be easily spotted and rectified. This means that it can act as a 'gold standard' for accuracy. So long as no upper limit on IPn64 is built in (as is achieved by the technique above) it can be used in a go/no-go test.]
The points above force attention onto the question - "What algorithm timing curve intersects the schoolroom timing curve at the lowest IPn64; and what is the value of this intersection IPn64?"
The Karatsuba method can be developed as the simplest possible Toom Cook method.
(See the ZURAS papers in 'zuras'. Karatsuba's paper (1962) predates Toom's (1963) and Cook's (1966), so presumably he was unaware of the connection.)
In the search for the method with the lowest possible intersection point with the schoolroom method, and given that the administrative overhead for the Toom Cook methods grows sharply with increasing complexity, the obvious place to start looking is the simplest possible Toom Cook method.
The development of the Karatsuba method can be found in any numerical algorithm textbook (e.g. 'Ref' 11 pg 295) so I shall just give an algorithmic summary: -
Given 2 input arrays, say g[] and h[], split them into a most significant half (MS) and a least significant half (LS): -

    +----------+----------+
    |   MS_g   |   LS_g   |
    +----------+----------+

    +----------+----------+
    |   MS_h   |   LS_h   |
    +----------+----------+

[Data movement should be avoided here - just set 2 pointers to relevant addresses.
(For simplicity I assume that the IPn64 of both g[] and h[] are even numbers here. There are several ways to handle the even/odd length problem - I recommend using different subroutines; this keeps things simple and works well.)
Division at a word boundary is essential for best speed.]

(a) Multiply MS_g by MS_h (call this MS_prod).

(b) Multiply LS_g by LS_h (call this LS_prod).

[For the next step, I use an Abs_Diff() module (finds the absolute difference) that returns an int (diffsign) which indicates which of the input arrays was greater. This is used to find Abs_Diff(MS, LS).
(Note that Abs_Diff() will 'nearly always' only need to compare the most significant word of MS_g and LS_g to determine which way round the subtraction needs to be done to get the absolute value and set the return flag.)
Temporary spaces are needed for both differences.]

(c) Form: -
    (i)  diffsign_g = Abs_Diff (MS_g, LS_g, Diff_g, n64)
         (Puts absolute difference of MS_g[n64] and LS_g[n64] into Diff_g[n64])
    (ii) diffsign_h = Abs_Diff (MS_h, LS_h, Diff_h, n64)

(d) Form Diff_g * Diff_h -> Diff_prod
    [Another temporary space for Diff_prod needed.]
    [If diffsign_g != diffsign_h, add Diff_prod below.
     If diffsign_g == diffsign_h, subtract Diff_prod below.]

(e) Add the results from above as: -

    +----------+----------+----------+----------+
    |       MS_prod       |       LS_prod       |
    +----------+----------+----------+----------+
               |       +MS_prod      |
               +---------------------+
               |       +LS_prod      |
               +---------------------+
               |      +-Diff_prod    |
               +---------------------+
              ^ carries or borrow possible here

(f) Incorporate carries/borrow (cannot ripple out).
#----------------------------#
As an example consider 3521 * 4122 (= 14513562) [all numbers in decimal here.]

MS_prod = 35 * 41 = 1435
LS_prod = 21 * 22 = 0462
Diff_g  = abs (35 - 21) = 14 (with MS_g > LS_g)
Diff_h  = abs (41 - 22) = 19 (with MS_h > LS_h)
hence Diff_prod = 14 * 19 = 0266
(since both have MS > LS, this needs subtracting)

hence result formed as: -

         14350462
           1435
           0462
Sum    = 14540162
          -0266
Result = 14513562

Reversing the MS & LS halves of 1st term above gives another example: -

2135 * 4122 (= 08800470)

MS_prod = 21 * 41 = 0861
LS_prod = 35 * 22 = 0770
Diff_g  = abs (21 - 35) = 14 (with MS_g < LS_g)
Diff_h  = abs (41 - 22) = 19 (with MS_h > LS_h)
hence Diff_prod = 14 * 19 = 0266
(since diffsign_g != diffsign_h this needs to be added)

hence result formed as: -

         08610770
           0861
           0770
Sum    = 08773870
          +0266
Result = 08800470
#----------------------------#
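For illustration, steps (a)-(f) can be sketched in C as follows. This is my own hedged sketch, not the author's modules: all names (mw_mul, mw_add, mw_sub, abs_diff, karatsuba_1) are invented, the scratch arrays are fixed-size (so the input length n must be even and at most 64 words here), and unsigned __int128 stands in for assembler MUL/ADC sequences.

```c
#include <stdint.h>
#include <string.h>

/* plain schoolroom product: out[2n] = g[n] * h[n] */
static void mw_mul(const uint64_t *g, const uint64_t *h,
                   uint64_t *out, int n)
{
    memset(out, 0, 2 * (size_t)n * sizeof(uint64_t));
    for (int i = 0; i < n; i++) {
        uint64_t c = 0;
        for (int j = 0; j < n; j++) {
            unsigned __int128 t = (unsigned __int128)g[i] * h[j]
                                + out[i + j] + c;
            out[i + j] = (uint64_t)t;
            c = (uint64_t)(t >> 64);
        }
        out[i + n] = c;
    }
}

/* a += b over n words; returns carry out */
static uint64_t mw_add(uint64_t *a, const uint64_t *b, int n)
{
    unsigned __int128 t = 0;
    for (int i = 0; i < n; i++) {
        t += (unsigned __int128)a[i] + b[i];
        a[i] = (uint64_t)t;
        t >>= 64;
    }
    return (uint64_t)t;
}

/* a -= b over n words; returns borrow out */
static uint64_t mw_sub(uint64_t *a, const uint64_t *b, int n)
{
    uint64_t borrow = 0;
    for (int i = 0; i < n; i++) {
        unsigned __int128 t = (unsigned __int128)a[i] - b[i] - borrow;
        a[i] = (uint64_t)t;
        borrow = (uint64_t)(t >> 64) & 1;
    }
    return borrow;
}

/* d = |x - y| over n words; returns 0 if x >= y, else 1 (diffsign) */
static int abs_diff(const uint64_t *x, const uint64_t *y,
                    uint64_t *d, int n)
{
    int i = n - 1;
    while (i > 0 && x[i] == y[i]) i--;   /* 'nearly always' one compare */
    if (x[i] >= y[i]) {
        memcpy(d, x, n * sizeof(uint64_t));
        mw_sub(d, y, n);
        return 0;
    }
    memcpy(d, y, n * sizeof(uint64_t));
    mw_sub(d, x, n);
    return 1;
}

/* out[2n] = g[n] * h[n] via one Karatsuba split; n even, n <= 64 */
static void karatsuba_1(const uint64_t *g, const uint64_t *h,
                        uint64_t *out, int n)
{
    int m = n / 2;
    uint64_t dg[32], dh[32], dp[64], mid[65];
    const uint64_t *lsg = g, *msg = g + m;
    const uint64_t *lsh = h, *msh = h + m;

    mw_mul(msg, msh, out + n, m);               /* (a) MS_prod   */
    mw_mul(lsg, lsh, out, m);                   /* (b) LS_prod   */
    int sg = abs_diff(msg, lsg, dg, m);         /* (c)(i)        */
    int sh = abs_diff(msh, lsh, dh, m);         /* (c)(ii)       */
    mw_mul(dg, dh, dp, m);                      /* (d) Diff_prod */

    /* (e) middle term = MS_prod + LS_prod +/- Diff_prod */
    memcpy(mid, out + n, n * sizeof(uint64_t)); /* MS_prod copy  */
    mid[n] = mw_add(mid, out, n);               /* + LS_prod     */
    if (sg != sh)
        mid[n] += mw_add(mid, dp, n);           /* + Diff_prod   */
    else
        mid[n] -= mw_sub(mid, dp, n);           /* - Diff_prod   */

    /* (f) add middle term at word offset m; cannot ripple out */
    uint64_t c = mw_add(out + m, mid, n + 1);
    for (int i = m + n + 1; c && i < 2 * n; i++)
        c = (++out[i] == 0) ? 1 : 0;
}
```

Note that, as the text recommends, the split itself is just pointer arithmetic (lsg, msg, lsh, msh); only the differences and the middle term need temporary space.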
If we think about Karatsuba from the speed viewpoint, the 'number crunching' part is now 3# half-sized products - a 25% reduction in schoolroom MUL calls. Simultaneously, the administrative workload is going up by the following items: -
(a) 2# half-IPn64 sized Abs_Diff() calls.
(b) 3# IPn64 sized add/subs to recover the output.
(c) Carry/borrow adjustments.
Items (a) and (b) grow linearly with IPn64. Item (c) will 'nearly always' require adjustment to a single word - trivial. This means that, as IPn64 grows large, Karatsuba's execution time will drift towards 75% of the schoolroom execution time. If a second Karatsuba call is made on the split parts of the first, another reduction towards 75% is possible ... For multiple levels of Karatsuba, the overall speed gain can rise to factors of 'several'.
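The compounding can be put into numbers. The sketch below (pure arithmetic, my own illustration; predicted_gain is an invented name) computes the predicted MUL-count gain (4/3)^k for k levels of splitting. It ignores the administrative workload, which is exactly why measured gains fall short of these figures.

```c
/* Each Karatsuba level replaces 4 half-sized products by 3, so k
   levels of splitting predict a MUL-count gain of (4/3)^k.
   Administrative workload is ignored here. */
static double predicted_gain(int levels)
{
    double gain = 1.0;
    for (int k = 0; k < levels; k++)
        gain *= 4.0 / 3.0;
    return gain;
}
```

Six levels already predict a factor of about 5.6 - the 'several' mentioned above.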
The IPn64 beyond which a single stage of Karatsuba (calling schoolroom on the split parts) is faster than the schoolroom method applied to the whole can only be found by experiment: -
TIMINGS FOR THE SCHOOLROOM (SR) METHOD vs THE KARATSUBA METHOD
[IPn64 in decimal, timings in 'hex']

            AMD A8         ||      INTEL i3
IPn64    SR    KARATSUBA   ||   SR    KARATSUBA
12      2A8      354       ||  1BC      2E4
14      393      437       ||  258      390
16      4BC      4CD       ||  314      448
18      64E      5F3       ||  420      50C
20      7CA      745       ||  524      5F8
22      9BB      85B       ||  648      710
24      B4F      A26       ||  784      81C
26      CEB      B88       ||  8D0      938
28      F1F      D0D       ||  A38      A60
30     1185      E7E       ||  BC0      B98
32     1455     108B       ||  D44      CE8

CONCLUSIONS FROM TABLE
(1) For the AMD A8, the cross-over point for the 2 methods is IPn64 = 16.
(2) For the Intel i3, the cross-over point for the 2 methods is IPn64 = 30.
(3) For both methods, the timings are small so that 'jitter' is apparent. Nevertheless, underlying trends are clear.
The fact that the different processors perform differently is a minor irritation. It would be possible for your executable to identify the manufacturer of the host processor 'on the fly' by using opcode CPUID, then use the result to branch to the appropriate code; but this is a step too far for me. I use a fixed '#define' constant KSTART to control the cross-over point.
In my implementation, the Karatsuba subroutine can handle any value of IPn64 (assuming dynamic memory is available). It carries on repeatedly splitting the input arrays and calling itself until the parts are <= KSTART, at which point the schoolroom method is called. Then as the subroutine returns, the partial products are assembled stage by stage.
The performance of my Karatsuba subroutine as KSTART varies is: -
TIMINGS FOR THE KARATSUBA METHOD when IPn64 is large
[IPn64 and timings in 'hex']

For KSTART = 16
           AMD A8    ||   Intel i3
IPn64
0x40         34--    ||     2E--
0x100       1F4--    ||    1EB--
0x400       121---   ||    127---
0x1000      9D3---   ||    AA6---
0x4000      5AB----  ||    60F----
0x10000     329----- ||    36E-----

For KSTART = 24
           AMD A8    ||   Intel i3
IPn64
0x40         35--    ||     2F--
0x100       1F1--    ||    1EC--
0x400       122---   ||    127---
0x1000      9CB---   ||    AA9---
0x4000      5A6----  ||    60F----
0x10000     32B----- ||    36D-----

For KSTART = 32
           AMD A8    ||   Intel i3
IPn64
0x40         3F--    ||     30--
0x100       26---    ||    1F4--
0x400       15B---   ||    12B---
0x1000      BB2---   ||    ACD---
0x4000      6C1----  ||    626----
0x10000     3C8----- ||    37A-----

CONCLUSIONS FROM TABLE
(1) The A8 is slower than the i3 when IPn64 is small. However, when IPn64 is large, it is a little quicker. Why should this be? I am not sure; the following is my guess. When IPn64 is small, the work tests the processor's ability to execute a 'MUL'. From above, the i3 is superior at this. When IPn64 is large, data line movements are large and then the work tests cache performance. At this the A8 is superior. AMD have replaced Intel's northbridge/southbridge architecture with their own technology, and this appears to be superior.
(2) The i3 is insensitive to KSTART values. The A8 performs best when KSTART = 24; and performance falls off for KSTART = 32.
On the basis of the results above I settled on KSTART = 24.
The speed gain for Karatsuba above schoolroom was found to be: -
RATIO OF THE SCHOOLROOM METHOD TIME DIVIDED BY THE KARATSUBA METHOD TIME
[IPn64 in 'hex'] [KSTART fixed at 24]

           AMD A8  ||  Intel i3
IPn64
0x40        1.62   ||    1.14
0x100       2.65   ||    1.70
0x400       4.50   ||    2.81
0x1000      7.63   ||    4.77
0x4000      13.7   ||    8.34
0x10000     23.6   ||    15.1

CONCLUSIONS FROM TABLE
(1) The larger speed gain for the A8 relative to the i3 should be noted. This is caused by a combination of the larger MUL time of the A8 increasing the numerator (= schoolroom time), and superior A8 cache performance reducing the denominator (= Karatsuba time).
(2) The Karatsuba performance is good when IPn64 > 100 (decimal). Also, the schoolroom performance becomes steadily more unacceptable.
(3) Projects that need to handle only IPn64 < 100 might get away with just a schoolroom module. This would save a good deal of software effort and few clients would notice.
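As a rough cross-check on the A8 column: each doubling of IPn64 multiplies schoolroom time by 4 (time grows as IPn64^2) but multiplies Karatsuba time by roughly 3 (3 half-sized calls), so the ratio should grow by about 4/3 per doubling. The sketch below is my own arithmetic, not a measurement; predicted_ratio is an invented name, and the curve is normalised to the measured 1.62 at IPn64 = 0x40. It then tracks the measured A8 column to within roughly 25%, the shortfall being the administrative workload.

```c
/* Predicted schoolroom/Karatsuba time ratio for
   IPn64 = 0x40 * 2^doublings, normalised to the measured 1.62 at
   IPn64 = 0x40 (A8 column).  Each doubling of IPn64 multiplies
   schoolroom time by 4 and Karatsuba time by about 3, so the
   ratio grows by 4/3 per doubling.  My own arithmetic only. */
static double predicted_ratio(int doublings)
{
    double r = 1.62;
    for (int k = 0; k < doublings; k++)
        r *= 4.0 / 3.0;
    return r;
}
```

For IPn64 = 0x10000 (10 doublings from 0x40) this predicts about 28.8 against the measured 23.6.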
Toom Cook methods can be categorized by the number of parts that the input arrays are split into. There are theoretical reasons why merely incrementing the number of parts should deliver ever smaller speed gains. At the same time, to write many Toom Cook splitting modules implies a great deal of effort.
Faced with these problems, I have implemented only methods that split the inputs into 2 (i.e. Karatsuba, above), 4 and 8 parts and skipped other possibilities. Doubling up like this seems a reasonable trade-off between software effort and speed payoff.
Even after a splitting number is selected, there are many variations of the method available. For splitting into 4 parts I have implemented 2 of these variations. The first follows Toom and Cook's original presentation and principles are covered in 'tc_notes'.
The second follows the recommendations of Dan Zuras (see references ZURAS_A and ZURAS_B in 'zuras').
For the 8 part splitting method only 1 method has been implemented - see 'tc_notes' again.
[All theory from here down is in either 'tc_notes' or 'zuras' - consult these for further understanding.]
I have implemented an 'original form Toom Cook' 4-way splitting module I call "TC_4", and also a more recent Toom Cook variant I call "Zuras_4" which follows the signal flow diagrams given in Dan Zuras's documents ZURAS_A and ZURAS_B (see 'zuras' for references). I found Zuras_4 faster than TC_4 by several % across the board, hence I will not discuss TC_4 further.
Zuras_4 is not complicated to implement, but it is bulky. The time taken to implement it will depend on whether you have available 'atomic' assembler modules for multiword add/subtract plus a few others. Even if you already have these available, you could be looking at 2-4 months of work to get it running. Below I demonstrate that Zuras_4 is only significantly faster than Karatsuba when IPn64 is greater than several hundred words. If your project has no use for IPn64's as large as this, do not waste your time with it.
[In 'zuras' I discuss a variation of Zuras_4 with a different 'end-game'. Times below are for this variation, though the speed gains from it are tiny.]
[Below I discuss an 8-way splitting module I call TC_8 which is faster than Zuras_4 when IPn64 is very large. This module calls down to Zuras_4 (when the parts are small enough) or to itself. This means that the speed gains from Zuras_4 are passed directly along to TC_8. This provides a handy (but not transformative) boost to the speed of TC_8. Hence, if you need a multiword multiplication module to handle very large IPn64 (i.e. you need TC_8), you may want to reconsider implementing Zuras_4 beneath it. Since TC_4 is easier to implement, it could serve as an alternative.]
RATIO OF Zuras_4 TO Karatsuba TIMES
[IPn64 in 'hex'] [KSTART 24]

            A8   ||   i3
IPn64
0x200      1.06  ||  1.27
0x400      0.97  ||  1.09
0x600      0.95  ||  1.03
0x800      0.94  ||  1.0
0xA00      0.91  ||  0.96
0xC00      0.89  ||  0.95

EXPERIMENTAL CONDITIONS
For the table above, the Zuras_4 module is called once i.e. makes 1 and only 1 splitting; then the split parts are handed down to Karatsuba.

CONCLUSIONS FROM TABLE
(1) For the A8, the intersection point is IPn64 approx= 0x400; for the i3 approx= 0x800.
(2) The tiny angle at which the times for the 2 modules approach each other should be noted. This means tiny changes to the code will have large effects on intersection points.
(3) I use a '#define' constant called Z4START set to 0x600 to control switching between Karatsuba and Zuras_4. Because of point (2), when IPn64 is much larger than Z4START, multiplication times will be insensitive to Z4START.
For the range of IPn64 in the table above, it will be clear that implementing Zuras_4 is a poor prospect. The gains in speed are small and the software effort large, meaning few projects could justify it. So how do the speed gains develop as IPn64 grows? The following tables answer this.
TIMES OF Zuras_4 AND Karatsuba METHODS AS IPn64 GROWS
[IPn64 & times in 'hex'] [KSTART 24; Z4START 0x600]

Processor = AMD A8
          Karatsuba  ||  Zuras_4
IPn64
0x1000      9A9---   ||   888---
0x2000     1D5----   ||  1718---
0x4000      591----  ||   415----
0x8000     108F----  ||   AD3----
0x10000     318----- ||  1DCF----
0x20000     95D----- ||   4ED-----
0x40000    1BFC----- ||   D76-----

Processor = Intel i3
          Karatsuba  ||  Zuras_4
IPn64
0x1000      A41---   ||   9E5---
0x2000     1EF4---   ||  1A9B---
0x4000      5D6----  ||   4CC----
0x8000     1190----  ||   C8D----
0x10000     34F----- ||   236-----
0x20000     9ED----- ||   5B6-----
0x40000    1E62----- ||   FEE-----

EXPERIMENTAL CONDITIONS
With Z4START fixed at 0x600, when IPn64 is large, Zuras_4 hands down to itself, possibly several times, before handing down to Karatsuba.

RATIO OF Zuras_4 TO Karatsuba TIMES
[IPn64 in 'hex'] [KSTART 24; Z4START 0x600]

            A8   ||   i3
IPn64
0x1000     0.88  ||  0.96
0x2000     0.79  ||  0.86
0x4000     0.73  ||  0.82
0x8000     0.65  ||  0.71
0x10000    0.60  ||  0.67
0x20000    0.53  ||  0.58
0x40000    0.48  ||  0.52

CONCLUSIONS FROM TABLES
(1) Zuras_4 continues to pull away from Karatsuba, but IPn64 has to be very large before significant gains are found.
(2) The A8 continues to edge slowly ahead of the i3.
(3) Jitter from experimental problems is easy to spot.
The 8-way splitting method implemented is an 'original' form Toom Cook method that I call "TC_8" (see 'tc_notes' document for principles).
TC_8 is faster than all the other methods discussed in this document when IPn64 is very large.
Since Zuras_4 is the fastest method to this point, the question is "What is the cross over point at which TC_8 becomes faster than Zuras_4?"
The following timings answer this: -
TIMINGS FOR Zuras_4 AND TC_8 WITH AMD A8 PROCESSOR
[All numbers in 'hex'] [KSTART 24; Z4START 0x600]

           Zuras_4  ||   TC_8
IPn64
0x1000     8AC---   ||   978---
0x1200     AB8---   ||   B0B---
0x1400     CE0---   ||   CC1---
0x1600     F22---   ||   E94---
0x1800    10C7---   ||  1070---
0x1A00    117B---   ||  118A---
0x1C00    1354---   ||  1346---
0x1E00    1540---   ||  1503---
0x2000    1798---   ||  173F---

TIMINGS FOR Zuras_4 AND TC_8 WITH Intel i3 PROCESSOR
[All numbers in 'hex'] [KSTART 24; Z4START 0x600]

           Zuras_4  ||   TC_8
IPn64
0x2800     261----  ||   26C----
0x2A00     28B----  ||   293----
0x2C00     2B7----  ||   2BC----
0x2E00     2E8----  ||   2E7----
0x3000     31F----  ||   311----
0x3200     387----  ||   36A----
0x3400     393----  ||   374----

EXPERIMENTAL CONDITIONS
Above, TC_8 is called once and once only; then the parts are handed down to either Zuras_4 or Karatsuba, whichever is faster.

CONCLUSIONS FROM TABLE
(1) For the A8, the intersection point is IPn64 = 0x1300; for the i3, 0x2E00. This looks like a significant difference, but is caused by the very fine angle at which the 2 sequences cross each other.
(2) Experimental jitter is sufficient to cause the A8 sequences to cross more than once! This indicates that precision is meaningless here.
(3) I use a '#define' constant called TC8START set to 0x1A00 to control switching between TC_8 and Zuras_4.
[Note that at this value, some IPn64 values will cause TC_8 to miss out Zuras_4 altogether and go straight to Karatsuba. This is because TC8START < 8*Z4START.]
Once again, for IPn64 in the range of the tables above, TC_8 offers negligible speed benefit in return for a significant software writing effort. What happens with IPn64 values larger than above?
TIMINGS FOR TC_8
[All numbers in 'hex'] [KSTART 24; Z4START 0x600; TC8START 0x1A00]

            AMD A8    ||  Intel i3
IPn64
0x2000     178C---    ||   1D3F---
0x8000     A399---    ||   BA66---
0x20000    48EB----   ||   501A----
0x80000   1E122----   ||  22179----
0x200000  BC9EC----   ||  C5C09----

RELATIVE SPEED FOR TC_8 vs KARATSUBA AND Zuras_4 ON AMD A8 PROCESSOR
[IPn64 in 'hex'] [KSTART 24; Z4START 0x600; TC8START 0x1A00]

          % Karatsuba  ||  % Zuras_4
IPn64
0x2000       82.6      ||    100.3
0x8000       61.7      ||     93.4
0x20000      48.4      ||     91.9
0x80000      35.4      ||     83.8
0x200000     24.6      ||     75.1

RELATIVE SPEED FOR TC_8 vs KARATSUBA AND Zuras_4 ON Intel i3 PROCESSOR
[IPn64 in 'hex'] [KSTART 24; Z4START 0x600; TC8START 0x1A00]

          % Karatsuba  ||  % Zuras_4
IPn64
0x2000       94.4      ||    109.4
0x8000       66.4      ||     92.4
0x20000      50.5      ||     87.5
0x80000      37.9      ||     81.7
0x200000     24.5      ||     67.5

CONCLUSIONS FROM TABLES
(1) The table just above indicates TC_8 is slower than Zuras_4 when IPn64 is 0x2000. The explanation for this is that after splitting, TC_8 will hand down 8 parts with IPn64 = 0x400 (approx) to Karatsuba - not calling Zuras_4. This is not ideal. A little cunning can eliminate problems like this at transitions.
(2) TC_8 is not much faster than Zuras_4 until IPn64 is very large.
(3) Karatsuba becomes less attractive as IPn64 grows.
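Pulling the '#define' thresholds together, the top-level dispatch described in this document might be structured as below. This is a structural sketch only: mw_mult is an invented name, the four method routines are reduced to stubs that merely record which path was chosen, and the real signatures (input arrays, output array, IPn64) are omitted.

```c
/* crossover thresholds arrived at in this document */
#define KSTART   24
#define Z4START  0x600
#define TC8START 0x1A00

/* stubs: a real build would contain the four multiply modules;
   here each just records its name so the dispatch is visible */
static const char *last_method = "";
static void schoolroom(int n) { (void)n; last_method = "schoolroom"; }
static void karatsuba(int n)  { (void)n; last_method = "karatsuba"; }
static void zuras_4(int n)    { (void)n; last_method = "zuras_4"; }
static void tc_8(int n)       { (void)n; last_method = "tc_8"; }

/* select the fastest method for input size IPn64 = n */
static void mw_mult(int n)
{
    if      (n <= KSTART)   schoolroom(n);
    else if (n <= Z4START)  karatsuba(n);
    else if (n <= TC8START) zuras_4(n);
    else                    tc_8(n);
}
```

In the real modules each method recurses through this same decision when it splits, which is how the gains of the lower layers are passed up to TC_8.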
Multiword integer multiplication methods are numerous and all these methods have many variations. The Toom Cook method is merely one of these. I set out to find its potential for speed and the above is the result. I do not claim to have extracted every last scrap of performance from the method - tweaks and tricks constantly suggest themselves as writing proceeds. Nevertheless, I hope I have provided basic guidance for future projects.