[Some documents on this site are referred to repeatedly below. All the links to these documents are given here; return here if you want to view the contents.
tc_notes.html - Provides an introduction to the Toom Cook method in its original form. Two algorithms of this form are tested below.
zuras.html - Discusses some more recent Toom Cook methods (and others) produced in 2 papers by Mr. Dan Zuras. One of these is tested below.
speed.html - Discusses the basics of writing high speed code for x86 processors. Some jargon is introduced in 'speed' which is used below.
refs.html - Lists the technical references used in this document.
Below I shall write e.g. 'zuras' or 'speed' (single quotes) when referring to the documents above.]
Multiword integer multiplication is a cornerstone of many software packages that have a 'number theory' component e.g. cryptographic packages. Oddly, there is little in the way of experimental performance data to guide projects.
[An exception to this is Dan Zuras's work (see 'zuras').]
When I set out to write this and associated documents I had the objective of 'sorting the subject out'. This is a hopeless objective. There are so many algorithms and variations of algorithms that the task is huge. Not only that, if 2 programmers were given a flowchart for some algorithm and told to implement it, the execution times of the implementations might easily differ by tens of percent - results depend significantly on experience and skill levels, and writing high-speed code is not a common skill.
I have now adopted a more sensible objective - the provision of timings for a handful of Toom Cook methods that are, in the view of most experts, front runners. Perhaps someone might be inspired to collate other results to provide a global summary.
Although this document is all about experimental results, 2 methods - the schoolroom and the Karatsuba methods - are simple enough to need only a brief introduction and this is included below. The other methods are discussed in dedicated documents 'zuras' and 'tc_notes'.
Timings below are always in processor clock ticks (see 'speed'). Doing this means that processor clock speed gains, from hardware improvements, are taken out of consideration. This leaves results relevant over a longer term. Even when this is done, there are variations between processor types.
The speed of multiplication algorithms is dominated by 2 competing influences - the 'number crunching' time (the MUL calls themselves) and the administrative time (the surrounding MOVs, ADDs etc.): -
[From here down I shall use 'IPn64' to indicate the number of 64-bit words needed to hold the input arrays.]
In general, when IPn64 is small, the simplest methods are fastest (the administrative time must be as small as possible). As array sizes grow so more complex methods become faster than simpler ones (more administrative time can be tolerated so long as number crunching time savings are even greater). Yet further growth in IPn64 enables yet more complex methods to be the fastest ...
[This workload breakdown explains the historical evolution of this subject since the early days of x86 processors.
A reference book tells me that a 286 processor took 21/24 ticks to perform a (16 bit) MUL opcode. Today an AMD A8 has a latency of 5 ticks (register operands) and a throughput of 2 ticks for a 64 bit MUL.
This means that 'number crunching' times have shrunk dramatically. Simultaneously, the MOVs and ADDs (and a few MULs) that make up the administrative workload have only been reduced on the back of increased register size. This, in turn, means that the advantage of the more complex methods has shrunk to not far from the point of disappearing.]
Furthermore, the more complex the method, the longer it takes to write - some of the methods discussed below might take 3 months or more to write, even if a suitable suite of assembler-written subroutines is available. If these need writing as well, a doubling or trebling of writing time is possible.
From the project management viewpoint, the above points raise serious problems of money and skilled manpower allocation and project completion dates. A conclusion from the above is that, for the more complex methods discussed below, speed gains will be moderate and the expense of obtaining them will be unpleasantly high.
One fact that eases these problems: when timing results are plotted against IPn64 for all the methods discussed below, the curves intersect one another at very fine angles. If one defines an insignificant speed benefit as, say, a 10% speed gain, then there is little point in implementing some more complex method unless your project needs to handle input sizes much larger than the IPn64 of the intersection point.
On top of the fine intersection angles, when IPn64 is steadily incremented, algorithm run time increases in a slightly jerky manner. This is caused mainly by data line movement counts increasing somewhat erratically, by variations of flow paths within methods, and by variations in the internal 'state' of the processor look-up tables at any instant. These points mean that the phrase 'intersection zone' might be more appropriate than 'intersection point'.
My regular test machine is a 64-bit AMD A8-5500 with 6 GB of RAM with a 64-bit Windows 8.1 operating system. I sometimes use an Intel i3 machine (64-bit Windows 7) for comparison. The i3 takes 30% - 50% less time when IPn64 is small; but the A8 is faster when IPn64 is large (see below).
The development system is Visual Studio (2008).
All results below use 64-bit code and 64-bit assembler. This point is important because, for the type of work discussed in this document, a speed gain factor of 2-4 is available merely by making best use of the capabilities of the host processor. This factor is of the same order as that available from the 'cunning plans' discussed on this site.
[This point is also important if you are attempting to sell to the general public without restriction. There are many 32-bit OS's out there and they are not going away any time soon. If you intend to provide the best possible performance for your clients you will have to develop both 32-bit and 64-bit versions, find the OS word size of the host machine 'on the fly', then use this to select between code versions. Alternatively, just write a 32-bit version as this will run on a 64-bit OS with only a tiny loss of performance (relative to a 32-bit OS).]
All the results below are taken in 'hot cache' conditions (see 'speed' for the meaning of this). Such results are minimum times, unlikely to be found when a module is embedded in a larger project. Such timings have the benefits of experimental repeatability along with ease of measurement. Cold cache timings can be much larger (even 10 times larger). Warm cache times (typical of use in large projects) will be adjacent to hot cache times.
In ordinary development work I take 4 timings in rapid succession; but below I use 16 timings because some of the timings are very large and then Windows task switching becomes more likely to interfere with measurements - the only way to subdue this is to increase the number of timings and use the minimum.
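The minimum-of-N idea can be sketched as follows (a hedged illustration: time_min, NTIMES and demo_work are names invented here, and the portable clock() stands in for the RDTSC tick counter described in 'speed'):

```c
#include <time.h>

/* Take NTIMES back-to-back measurements and keep the minimum.
   Interference (e.g. a task switch) can only inflate a timing,
   never shrink it, so the minimum is the cleanest estimate.
   clock() stands in for the RDTSC tick counter of 'speed'. */
#define NTIMES 16

static volatile long demo_calls = 0;            /* demo workload only */
static void demo_work(void) { demo_calls++; }

static clock_t time_min(void (*fn)(void))
{
    clock_t best = 0;
    for (int i = 0; i < NTIMES; i++) {
        clock_t t0 = clock();
        fn();
        clock_t dt = clock() - t0;
        if (i == 0 || dt < best)
            best = dt;
    }
    return best;
}
```

Taking the minimum, rather than the mean, is what subdues the task-switching outliers mentioned above.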
4 further effects need consideration: -
[On a technical note, all timings are taken in Visual Studio's 'Release' mode (as opposed to 'Debug' mode).
[Typically, Debug mode increases code volume by 10-30%, which increases timings (increases code line movements as well as conditional jumps).]
On my system, even in Release mode a '/Zi' flag (= add symbolic debug info) is left set in the compiler command line (check your compiler 'Properties'). I have deleted this flag for all timings in this document to remove any trace of debug padding.]
I shall discuss algorithms in order of increasing complexity.
This algorithm gets its name from the way it mimics long multiplication as taught in a schoolroom. In this method products of individual digits are found one by one, working from the right, arranged in a rhombus skewed to the right at the top, and finally summed down the columns, making allowance for carries. In software it is possible to 'add in' the rows, but summing down the columns is slightly faster since line movements to the output array are then reduced - column sums can be done instead into 3 registers (2 for products, 1 for carries).
Writing n-by-n modules along these lines is very simple, if tedious. I have written straight line (i.e. loop free) modules up to IPn64 = 16 for both multiplication and squaring.
Beyond this I use a module that can find the products for a single rhombus column of any length and sum them, initially into registers then, when complete, into the output array. This is used in conjunction with modules that can evaluate the right/left hand 16 columns of the rhombus. This compound module can find schoolroom products up to any IPn64.
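To make the column-summing idea concrete, here is a minimal C sketch. This is my own illustration, not the assembler modules described above: the name schoolroom_mul is invented, and the compiler's unsigned __int128 type stands in for the MUL/ADC opcode sequences.

```c
#include <stdint.h>

/* Column-wise schoolroom multiplication (hypothetical sketch).
   g[], h[]: inputs of n 64-bit words (little-endian); out[]: 2n words.
   Each rhombus column is summed into a 192-bit accumulator
   (lo, hi: product words; carry: column carries) so that only one
   store per column reaches the output array. */
static void schoolroom_mul(const uint64_t *g, const uint64_t *h,
                           uint64_t *out, int n)
{
    uint64_t lo = 0, hi = 0, carry = 0;
    for (int col = 0; col < 2 * n - 1; col++) {
        int i_lo = (col < n) ? 0 : col - n + 1;
        int i_hi = (col < n) ? col : n - 1;
        for (int i = i_lo; i <= i_hi; i++) {
            unsigned __int128 p = (unsigned __int128)g[i] * h[col - i];
            unsigned __int128 s = (unsigned __int128)lo + (uint64_t)p;
            lo = (uint64_t)s;
            s = (unsigned __int128)hi + (uint64_t)(p >> 64)
                                      + (uint64_t)(s >> 64);
            hi = (uint64_t)s;
            carry += (uint64_t)(s >> 64);
        }
        out[col] = lo;            /* the single store for this column */
        lo = hi; hi = carry; carry = 0;
    }
    out[2 * n - 1] = lo;
}
```

The 3 accumulator registers (2 for products, 1 for carries) mirror the line-movement saving described above: the output array is written once per column rather than once per row addition.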
For the A8 the speed of these modules is close to: - 5 * IPn64^2 ticks.
If IPn64 is small, the constant is a little above '5' because of subroutine calling overhead and the shortness of the rhombus columns; if large the constant drops off to about 4.8.
[This undercutting of the latency time is small because the administrative workload between MUL calls is large enough to restrict it.]
Testing a 24-by-24 sized schoolroom product: -
A8 time = 0xB53 (2899 decimal) ticks
hence ticks/MUL = 2899 / 24^2 = 5.03
i3 time = 0x780 (1920 decimal) ticks
hence ticks/MUL = 1920 / 24^2 = 3.33
[I have been unable to find the i3 MUL latency time on the Intel website, but it is probably no more than '3' and may be '2'.]
Clearly, the i3 is superior with the schoolroom method, and it would be reasonable to suppose that this superiority extends to all methods. Below I demonstrate that it does not.
[Just to give some context to these figures: - the speed of light is about 3*10^8 m/sec. For a PC clock speed of 3 GHz (3*10^9 ticks/sec) a photon would travel 0.1m per tick. Hence finding a 320-by-320-bit product (5-by-5 64-bit words, so 25 MULs at 5 ticks/MUL = 125 ticks) would take the same time as a photon takes to travel 12.5m!
This blistering speed demonstrates that schoolroom methods will be impossible to beat for speed so long as the input array sizes are small. However, run time is going to go up very close to the square of IPn64. If a method can be found whose run time grows with a smaller exponent than '2' (it can!), then the 2 timing curves will intersect and beyond the intersection, the method will be faster than the school room method.]
The next point is obvious, but needs to be made clear. The more sophisticated methods split the inputs into parts (either 2, 4 or 8 parts in this document), then call another routine (possibly themselves or another method) to find the products of these parts. Which method is called depends on the answer to - "What is the fastest way to multiply the parts?" The parts may be split again and again ... eventually the parts will be small enough for the answer to this question to be "the schoolroom method". In other words, the schoolroom method underpins all the more sophisticated methods. This in turn means that any hint of poor performance in implementing the schoolroom method is going to be passed on to the performance of the more sophisticated methods.
[The schoolroom method plays a valuable role in debugging the more sophisticated methods. It is so simple that the method should work correctly 'straight out of the box'. If it doesn't, errors can be easily spotted and rectified. This means that it can act as a 'gold standard' for accuracy. So long as no upper limit on IPn64 is built in (as is achieved by the technique above) it can be used in a go/no-go test.]
The points above force attention onto the question - "What algorithm timing curve intersects the schoolroom timing curve at the lowest IPn64; and what is the value of this intersection IPn64?"
The Karatsuba method can be developed as the simplest possible Toom Cook method.
(See the ZURAS papers in 'zuras'. Karatsuba's paper (1962) predates Toom's (1963) and Cook's (1966), so presumably he was unaware of the connection.)
In the search for the method with the lowest possible intersection point with the schoolroom method, and given that the administrative overhead for the Toom Cook methods grows sharply with increasing complexity, the obvious place to start looking is the simplest possible Toom Cook method.
The development of the Karatsuba method can be found in any numerical algorithm textbook (e.g. 'Ref' 11 pg 295) so I shall just give an algorithmic summary: -
Given 2 input arrays, say g[] and h[], split them into a most significant half (MS) and a least significant half (LS): -

    +----------+----------+
    |   MS_g   |   LS_g   |
    +----------+----------+

    +----------+----------+
    |   MS_h   |   LS_h   |
    +----------+----------+

[Data movement should be avoided here - just set 2 pointers to relevant addresses.
(For simplicity I assume that the IPn64 of both g[] and h[] are even numbers here. There are several ways to handle the even/odd length problem - I recommend using different subroutines; this keeps things simple and works well.)
Division at a word boundary is essential for best speed.]

(a) Multiply MS_g by MS_h (call this MS_prod).

(b) Multiply LS_g by LS_h (call this LS_prod).

[For the next step, I use an Abs_Diff() module (finds the absolute difference) that returns an int (diffsign) which indicates which of the input arrays was greater. This is used to find Abs_Diff(MS, LS).
(Note that Abs_Diff() will 'nearly always' only need to compare the most significant word of MS_g and LS_g to determine which way round the subtraction needs to be done to get the absolute value and set the return flag.)
Temporary spaces are needed for both differences.]

(c) Form: -
    (i)  diffsign_g = Abs_Diff (MS_g, LS_g, Diff_g, n64)
         (Puts absolute difference of MS_g[n64] and LS_g[n64] into Diff_g[n64])
    (ii) diffsign_h = Abs_Diff (MS_h, LS_h, Diff_h, n64)

(d) Form Diff_g * Diff_h -> Diff_prod
    [Another temporary space for Diff_prod needed.]
    [If diffsign_g != diffsign_h, add Diff_prod below.
     If diffsign_g == diffsign_h, subtract Diff_prod below.]

(e) Add the results from above as: -

    +----------+----------+----------+----------+
    |       MS_prod       |       LS_prod       |
    +----------+----------+----------+----------+
               |       +MS_prod      |
               +---------------------+
               |       +LS_prod      |
               +---------------------+
               |      +-Diff_prod    |
               +---------------------+
              ^ carries or borrow possible here

(f) Incorporate carries/borrow (cannot ripple out).
#----------------------------#
As an example consider 3521 * 4122 (= 14513562) [all numbers in decimal here.]

MS_prod = 35 * 41 = 1435
LS_prod = 21 * 22 = 0462
Diff_g  = abs (35 - 21) = 14 (with MS_g > LS_g)
Diff_h  = abs (41 - 22) = 19 (with MS_h > LS_h)
hence Diff_prod = 14 * 19 = 0266
(since both have MS > LS, this needs subtracting)

hence result formed as: -

         14350462
           1435
           0462
Sum    = 14540162
          -0266
Result = 14513562

Reversing the MS & LS halves of 1st term above gives another example: -

2135 * 4122 (= 08800470)

MS_prod = 21 * 41 = 0861
LS_prod = 35 * 22 = 0770
Diff_g  = abs (21 - 35) = 14 (with MS_g < LS_g)
Diff_h  = abs (41 - 22) = 19 (with MS_h > LS_h)
hence Diff_prod = 14 * 19 = 0266
(since diffsign_g != diffsign_h this needs to be added)

hence result formed as: -

         08610770
           0861
           0770
Sum    = 08773870
          +0266
Result = 08800470
#----------------------------#
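For illustration, steps (a)-(f) can be sketched in C as follows. This is my own hedged sketch, not the author's modules: all names (mw_mul, mw_add, mw_sub, abs_diff, karatsuba_1) are invented, the scratch arrays are fixed-size (so the input length n must be even and at most 64 words here), and unsigned __int128 stands in for assembler MUL/ADC sequences.

```c
#include <stdint.h>
#include <string.h>

/* plain schoolroom product: out[2n] = g[n] * h[n] */
static void mw_mul(const uint64_t *g, const uint64_t *h,
                   uint64_t *out, int n)
{
    memset(out, 0, 2 * (size_t)n * sizeof(uint64_t));
    for (int i = 0; i < n; i++) {
        uint64_t c = 0;
        for (int j = 0; j < n; j++) {
            unsigned __int128 t = (unsigned __int128)g[i] * h[j]
                                + out[i + j] + c;
            out[i + j] = (uint64_t)t;
            c = (uint64_t)(t >> 64);
        }
        out[i + n] = c;
    }
}

/* a += b over n words; returns carry out */
static uint64_t mw_add(uint64_t *a, const uint64_t *b, int n)
{
    unsigned __int128 t = 0;
    for (int i = 0; i < n; i++) {
        t += (unsigned __int128)a[i] + b[i];
        a[i] = (uint64_t)t;
        t >>= 64;
    }
    return (uint64_t)t;
}

/* a -= b over n words; returns borrow out */
static uint64_t mw_sub(uint64_t *a, const uint64_t *b, int n)
{
    uint64_t borrow = 0;
    for (int i = 0; i < n; i++) {
        unsigned __int128 t = (unsigned __int128)a[i] - b[i] - borrow;
        a[i] = (uint64_t)t;
        borrow = (uint64_t)(t >> 64) & 1;
    }
    return borrow;
}

/* d = |x - y| over n words; returns 0 if x >= y, else 1 (diffsign) */
static int abs_diff(const uint64_t *x, const uint64_t *y,
                    uint64_t *d, int n)
{
    int i = n - 1;
    while (i > 0 && x[i] == y[i]) i--;   /* 'nearly always' one compare */
    if (x[i] >= y[i]) {
        memcpy(d, x, n * sizeof(uint64_t));
        mw_sub(d, y, n);
        return 0;
    }
    memcpy(d, y, n * sizeof(uint64_t));
    mw_sub(d, x, n);
    return 1;
}

/* out[2n] = g[n] * h[n] via one Karatsuba split; n even, n <= 64 */
static void karatsuba_1(const uint64_t *g, const uint64_t *h,
                        uint64_t *out, int n)
{
    int m = n / 2;
    uint64_t dg[32], dh[32], dp[64], mid[65];
    const uint64_t *lsg = g, *msg = g + m;
    const uint64_t *lsh = h, *msh = h + m;

    mw_mul(msg, msh, out + n, m);               /* (a) MS_prod   */
    mw_mul(lsg, lsh, out, m);                   /* (b) LS_prod   */
    int sg = abs_diff(msg, lsg, dg, m);         /* (c)(i)        */
    int sh = abs_diff(msh, lsh, dh, m);         /* (c)(ii)       */
    mw_mul(dg, dh, dp, m);                      /* (d) Diff_prod */

    /* (e) middle term = MS_prod + LS_prod +/- Diff_prod */
    memcpy(mid, out + n, n * sizeof(uint64_t)); /* MS_prod copy  */
    mid[n] = mw_add(mid, out, n);               /* + LS_prod     */
    if (sg != sh)
        mid[n] += mw_add(mid, dp, n);           /* + Diff_prod   */
    else
        mid[n] -= mw_sub(mid, dp, n);           /* - Diff_prod   */

    /* (f) add middle term at word offset m; cannot ripple out */
    uint64_t c = mw_add(out + m, mid, n + 1);
    for (int i = m + n + 1; c && i < 2 * n; i++)
        c = (++out[i] == 0) ? 1 : 0;
}
```

Note that, as the text recommends, the split itself is just pointer arithmetic (lsg, msg, lsh, msh); only the differences and the middle term need temporary space.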
If we think about Karatsuba from the speed viewpoint, the 'number crunching' part is now 3# half-sized products - a 25% reduction in schoolroom MUL calls. Simultaneously, the administrative workload is going up by the following items: -
(a) 2# half-IPn64 sized Abs_Diff() calls.
(b) 3# IPn64 sized add/subs to recover the output.
(c) Carry/borrow adjustments.
Items (a) and (b) grow linearly with IPn64. Item (c) will 'nearly always' require adjustment to a single word - trivial. This means that, as IPn64 grows large, Karatsuba's execution time will drift towards 75% of the schoolroom execution time. If a second Karatsuba call is made on the split parts of the first, another reduction towards 75% is possible ... For multiple levels of Karatsuba, the overall speed gain can rise to factors of 'several'.
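The compounding can be put into numbers. The sketch below (pure arithmetic, my own illustration; predicted_gain is an invented name) computes the predicted MUL-count gain (4/3)^k for k levels of splitting. It ignores the administrative workload, which is exactly why measured gains fall short of these figures.

```c
/* Each Karatsuba level replaces 4 half-sized products by 3, so k
   levels of splitting predict a MUL-count gain of (4/3)^k.
   Administrative workload is ignored here. */
static double predicted_gain(int levels)
{
    double gain = 1.0;
    for (int k = 0; k < levels; k++)
        gain *= 4.0 / 3.0;
    return gain;
}
```

Six levels already predict a factor of about 5.6 - the 'several' mentioned above.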
The IPn64 beyond which a single stage of Karatsuba (calling schoolroom on the split parts) is faster than the schoolroom method applied to the whole can only be found by experiment: -
TIMINGS FOR THE SCHOOLROOM (SR) METHOD vs THE KARATSUBA METHOD
[IPn64 in decimal, timings in 'hex']

            AMD A8         ||      INTEL i3
IPn64    SR    KARATSUBA   ||   SR    KARATSUBA
12      2A8      354       ||  1BC      2E4
14      393      437       ||  258      390
16      4BC      4CD       ||  314      448
18      64E      5F3       ||  420      50C
20      7CA      745       ||  524      5F8
22      9BB      85B       ||  648      710
24      B4F      A26       ||  784      81C
26      CEB      B88       ||  8D0      938
28      F1F      D0D       ||  A38      A60
30     1185      E7E       ||  BC0      B98
32     1455     108B       ||  D44      CE8

CONCLUSIONS FROM TABLE
(1) For the AMD A8, the cross-over point for the 2 methods is IPn64 = 16.
(2) For the Intel i3, the cross-over point for the 2 methods is IPn64 = 30.
(3) For both methods, the timings are small so that 'jitter' is apparent. Nevertheless, underlying trends are clear.
The fact that the different processors perform differently is a minor irritation. It would be possible for your executable to identify the manufacturer of the host processor 'on the fly' by using opcode CPUID, then use the result to branch to the appropriate code; but this is a step too far for me. I use a fixed '#define' constant KSTART to control the cross-over point.
In my implementation, the Karatsuba subroutine can handle any value of IPn64 (assuming dynamic memory is available). It carries on repeatedly splitting the input arrays and calling itself until the parts are <= KSTART, at which point the schoolroom method is called. Then as the subroutine returns, the partial products are assembled stage by stage.
The performance of my Karatsuba subroutine as KSTART varies is: -
TIMINGS FOR THE KARATSUBA METHOD when IPn64 is large
[IPn64 and timings in 'hex']

For KSTART = 16
           AMD A8    ||   Intel i3
IPn64
0x40         34--    ||     2E--
0x100       1F4--    ||    1EB--
0x400       121---   ||    127---
0x1000      9D3---   ||    AA6---
0x4000      5AB----  ||    60F----
0x10000     329----- ||    36E-----

For KSTART = 24
           AMD A8    ||   Intel i3
IPn64
0x40         35--    ||     2F--
0x100       1F1--    ||    1EC--
0x400       122---   ||    127---
0x1000      9CB---   ||    AA9---
0x4000      5A6----  ||    60F----
0x10000     32B----- ||    36D-----

For KSTART = 32
           AMD A8    ||   Intel i3
IPn64
0x40         3F--    ||     30--
0x100       26---    ||    1F4--
0x400       15B---   ||    12B---
0x1000      BB2---   ||    ACD---
0x4000      6C1----  ||    626----
0x10000     3C8----- ||    37A-----

CONCLUSIONS FROM TABLE
(1) The A8 is slower than the i3 when IPn64 is small. However, when IPn64 is large, it is a little quicker. Why should this be? I am not sure; the following is my guess. When IPn64 is small, the work tests the processor's ability to execute a 'MUL'. From above, the i3 is superior at this. When IPn64 is large, data line movements are large and then the work tests cache performance. At this the A8 is superior. AMD have replaced Intel's northbridge/southbridge architecture with their own technology, and this appears to be superior.
(2) The i3 is insensitive to KSTART values. The A8 performs best when KSTART = 24; and performance falls off for KSTART = 32.
On the basis of the results above I settled on KSTART = 24.
The speed gain for Karatsuba above schoolroom was found to be: -
RATIO OF THE SCHOOLROOM METHOD TIME DIVIDED BY THE KARATSUBA METHOD TIME
[IPn64 in 'hex'] [KSTART fixed at 24]

           AMD A8  ||  Intel i3
IPn64
0x40        1.62   ||    1.14
0x100       2.65   ||    1.70
0x400       4.50   ||    2.81
0x1000      7.63   ||    4.77
0x4000      13.7   ||    8.34
0x10000     23.6   ||    15.1

CONCLUSIONS FROM TABLE
(1) The larger speed gain for the A8 relative to the i3 should be noted. This is caused by a combination of the larger MUL time of the A8 increasing the numerator (= schoolroom time), and superior A8 cache performance reducing the denominator (= Karatsuba time).
(2) The Karatsuba performance is good when IPn64 > 100 (decimal). Also, the schoolroom performance becomes steadily more unacceptable.
(3) Projects that need to handle only IPn64 < 100 might get away with just a schoolroom module. This would save a good deal of software effort and few clients would notice.
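As a rough cross-check on the A8 column: each doubling of IPn64 multiplies schoolroom time by 4 (time grows as IPn64^2) but multiplies Karatsuba time by roughly 3 (3 half-sized calls), so the ratio should grow by about 4/3 per doubling. The sketch below is my own arithmetic, not a measurement; predicted_ratio is an invented name, and the curve is normalised to the measured 1.62 at IPn64 = 0x40. It then tracks the measured A8 column to within roughly 25%, the shortfall being the administrative workload.

```c
/* Predicted schoolroom/Karatsuba time ratio for
   IPn64 = 0x40 * 2^doublings, normalised to the measured 1.62 at
   IPn64 = 0x40 (A8 column).  Each doubling of IPn64 multiplies
   schoolroom time by 4 and Karatsuba time by about 3, so the
   ratio grows by 4/3 per doubling.  My own arithmetic only. */
static double predicted_ratio(int doublings)
{
    double r = 1.62;
    for (int k = 0; k < doublings; k++)
        r *= 4.0 / 3.0;
    return r;
}
```

For IPn64 = 0x10000 (10 doublings from 0x40) this predicts about 28.8 against the measured 23.6.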
Toom Cook methods can be categorized by the number of parts that the input arrays are split into. There are theoretical reasons why merely incrementing the number of parts should deliver ever smaller speed gains. At the same time, to write many Toom Cook splitting modules implies a great deal of effort.
Faced with these problems, I have implemented only methods that split the inputs into 2 (i.e. Karatsuba, above), 4 and 8 parts and skipped other possibilities. Doubling up like this seems a reasonable trade-off between software effort and speed payoff.
Even after a splitting number is selected, there are many variations of the method available. For splitting into 4 parts I have implemented 2 of these variations. The first follows Toom and Cook's original presentation and principles are covered in 'tc_notes'.
The second follows the recommendations of Dan Zuras (see references ZURAS_A and ZURAS_B in 'zuras').
For the 8 part splitting method only 1 method has been implemented - see 'tc_notes' again.
[All theory from here down is in either 'tc_notes' or 'zuras' - consult these for further understanding.]
I have implemented an 'original form Toom Cook' 4-way splitting module I call "TC_4", and also a more recent Toom Cook variant I call "Zuras_4" which follows the signal flow diagrams given in Dan Zuras's documents ZURAS_A and ZURAS_B (see 'zuras' for references). I found Zuras_4 faster than TC_4 by several % across the board, hence I will not discuss TC_4 further.
Zuras_4 is not complicated to implement, but it is bulky. The time taken to implement it will depend on whether you have available 'atomic' assembler modules for multiword add/subtract plus a few others. Even if you already have these available, you could be looking at 2-4 months of work to get it running. Below I demonstrate that Zuras_4 is only significantly faster than Karatsuba when IPn64 is greater than several hundred words. If your project has no use for IPn64's as large as this, do not waste your time with it.
[In 'zuras' I discuss a variation of Zuras_4 with a different 'end-game'. Times below are for this variation, though the speed gains from it are tiny.]
[Below I discuss an 8-way splitting module I call TC_8 which is faster than Zuras_4 when IPn64 is very large. This module calls down to Zuras_4 (when the parts are small enough) or to itself. This means that the speed gains from Zuras_4 are passed directly along to TC_8. This provides a handy (but not transformative) boost to the speed of TC_8. Hence, if you need a multiword multiplication module to handle very large IPn64 (i.e. you need TC_8), you may want to reconsider implementing Zuras_4 beneath it. Since TC_4 is easier to implement, it could serve as an alternative.]
RATIO OF Zuras_4 TO Karatsuba TIMES
[IPn64 in 'hex'] [KSTART 24]

            A8   ||   i3
IPn64
0x200      1.06  ||  1.27
0x400      0.97  ||  1.09
0x600      0.95  ||  1.03
0x800      0.94  ||  1.0
0xA00      0.91  ||  0.96
0xC00      0.89  ||  0.95

EXPERIMENTAL CONDITIONS
For the table above, the Zuras_4 module is called once i.e. makes 1 and only 1 splitting; then the split parts are handed down to Karatsuba.

CONCLUSIONS FROM TABLE
(1) For the A8, the intersection point is IPn64 approx= 0x400; for the i3 approx= 0x800.
(2) The tiny angle at which the times for the 2 modules approach each other should be noted. This means tiny changes to the code will have large effects on intersection points.
(3) I use a '#define' constant called Z4START set to 0x600 to control switching between Karatsuba and Zuras_4. Because of point (2), when IPn64 is much larger than Z4START, multiplication times will be insensitive to Z4START.
For the range of IPn64 in the table above, it will be clear that implementing Zuras_4 is a poor prospect. The gains in speed are small and the software effort large, meaning few projects could justify it. So how do the speed gains develop as IPn64 grows? The following tables answer this.
TIMES OF Zuras_4 AND Karatsuba METHODS AS IPn64 GROWS
[IPn64 & times in 'hex'] [KSTART 24; Z4START 0x600]

Processor = AMD A8
          Karatsuba  ||  Zuras_4
IPn64
0x1000      9A9---   ||   888---
0x2000     1D5----   ||  1718---
0x4000      591----  ||   415----
0x8000     108F----  ||   AD3----
0x10000     318----- ||  1DCF----
0x20000     95D----- ||   4ED-----
0x40000    1BFC----- ||   D76-----

Processor = Intel i3
          Karatsuba  ||  Zuras_4
IPn64
0x1000      A41---   ||   9E5---
0x2000     1EF4---   ||  1A9B---
0x4000      5D6----  ||   4CC----
0x8000     1190----  ||   C8D----
0x10000     34F----- ||   236-----
0x20000     9ED----- ||   5B6-----
0x40000    1E62----- ||   FEE-----

EXPERIMENTAL CONDITIONS
With Z4START fixed at 0x600, when IPn64 is large, Zuras_4 hands down to itself, possibly several times, before handing down to Karatsuba.

RATIO OF Zuras_4 TO Karatsuba TIMES
[IPn64 in 'hex'] [KSTART 24; Z4START 0x600]

            A8   ||   i3
IPn64
0x1000     0.88  ||  0.96
0x2000     0.79  ||  0.86
0x4000     0.73  ||  0.82
0x8000     0.65  ||  0.71
0x10000    0.60  ||  0.67
0x20000    0.53  ||  0.58
0x40000    0.48  ||  0.52

CONCLUSIONS FROM TABLES
(1) Zuras_4 continues to pull away from Karatsuba, but IPn64 has to be very large before significant gains are found.
(2) The A8 continues to edge slowly ahead of the i3.
(3) Jitter from experimental problems is easy to spot.
The 8-way splitting method implemented is an 'original' form Toom Cook method that I call "TC_8" (see 'tc_notes' document for principles).
TC_8 is faster than all the other methods discussed in this document when IPn64 is very large.
Since Zuras_4 is the fastest method to this point, the question is "What is the cross over point at which TC_8 becomes faster than Zuras_4?"
The following timings answer this: -
TIMINGS FOR Zuras_4 AND TC_8 WITH AMD A8 PROCESSOR
[All numbers in 'hex'] [KSTART 24; Z4START 0x600]

           Zuras_4  ||   TC_8
IPn64
0x1000     8AC---   ||   978---
0x1200     AB8---   ||   B0B---
0x1400     CE0---   ||   CC1---
0x1600     F22---   ||   E94---
0x1800    10C7---   ||  1070---
0x1A00    117B---   ||  118A---
0x1C00    1354---   ||  1346---
0x1E00    1540---   ||  1503---
0x2000    1798---   ||  173F---

TIMINGS FOR Zuras_4 AND TC_8 WITH Intel i3 PROCESSOR
[All numbers in 'hex'] [KSTART 24; Z4START 0x600]

           Zuras_4  ||   TC_8
IPn64
0x2800     261----  ||   26C----
0x2A00     28B----  ||   293----
0x2C00     2B7----  ||   2BC----
0x2E00     2E8----  ||   2E7----
0x3000     31F----  ||   311----
0x3200     387----  ||   36A----
0x3400     393----  ||   374----

EXPERIMENTAL CONDITIONS
Above, TC_8 is called once and once only; then the parts are handed down to either Zuras_4 or Karatsuba, whichever is faster.

CONCLUSIONS FROM TABLE
(1) For the A8, the intersection point is IPn64 = 0x1300; for the i3, 0x2E00. This looks like a significant difference, but is caused by the very fine angle at which the 2 sequences cross each other.
(2) Experimental jitter is sufficient to cause the A8 sequences to cross more than once! This indicates that precision is meaningless here.
(3) I use a '#define' constant called TC8START set to 0x1A00 to control switching between TC_8 and Zuras_4.
[Note that at this value, some IPn64 values will cause TC_8 to miss out Zuras_4 altogether and go straight to Karatsuba. This is because TC8START < 8*Z4START.]
Once again, for IPn64 in the range of the tables above, TC_8 offers negligible speed benefit in return for a significant software writing effort. What happens with IPn64 values larger than above?
TIMINGS FOR TC_8
[All numbers in 'hex'] [KSTART 24; Z4START 0x600; TC8START 0x1A00]

            AMD A8    ||  Intel i3
IPn64
0x2000     178C---    ||   1D3F---
0x8000     A399---    ||   BA66---
0x20000    48EB----   ||   501A----
0x80000   1E122----   ||  22179----
0x200000  BC9EC----   ||  C5C09----

RELATIVE SPEED FOR TC_8 vs KARATSUBA AND Zuras_4 ON AMD A8 PROCESSOR
[IPn64 in 'hex'] [KSTART 24; Z4START 0x600; TC8START 0x1A00]

          % Karatsuba  ||  % Zuras_4
IPn64
0x2000       82.6      ||    100.3
0x8000       61.7      ||     93.4
0x20000      48.4      ||     91.9
0x80000      35.4      ||     83.8
0x200000     24.6      ||     75.1

RELATIVE SPEED FOR TC_8 vs KARATSUBA AND Zuras_4 ON Intel i3 PROCESSOR
[IPn64 in 'hex'] [KSTART 24; Z4START 0x600; TC8START 0x1A00]

          % Karatsuba  ||  % Zuras_4
IPn64
0x2000       94.4      ||    109.4
0x8000       66.4      ||     92.4
0x20000      50.5      ||     87.5
0x80000      37.9      ||     81.7
0x200000     24.5      ||     67.5

CONCLUSIONS FROM TABLES
(1) The table just above indicates TC_8 is slower than Zuras_4 when IPn64 is 0x2000. The explanation for this is that after splitting, TC_8 will hand down 8 parts with IPn64 = 0x400 (approx) to Karatsuba - not calling Zuras_4. This is not ideal. A little cunning can eliminate problems like this at transitions.
(2) TC_8 is not much faster than Zuras_4 until IPn64 is very large.
(3) Karatsuba becomes less attractive as IPn64 grows.
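Pulling the '#define' thresholds together, the top-level dispatch described in this document might be structured as below. This is a structural sketch only: mw_mult is an invented name, the four method routines are reduced to stubs that merely record which path was chosen, and the real signatures (input arrays, output array, IPn64) are omitted.

```c
/* crossover thresholds arrived at in this document */
#define KSTART   24
#define Z4START  0x600
#define TC8START 0x1A00

/* stubs: a real build would contain the four multiply modules;
   here each just records its name so the dispatch is visible */
static const char *last_method = "";
static void schoolroom(int n) { (void)n; last_method = "schoolroom"; }
static void karatsuba(int n)  { (void)n; last_method = "karatsuba"; }
static void zuras_4(int n)    { (void)n; last_method = "zuras_4"; }
static void tc_8(int n)       { (void)n; last_method = "tc_8"; }

/* select the fastest method for input size IPn64 = n */
static void mw_mult(int n)
{
    if      (n <= KSTART)   schoolroom(n);
    else if (n <= Z4START)  karatsuba(n);
    else if (n <= TC8START) zuras_4(n);
    else                    tc_8(n);
}
```

In the real modules each method recurses through this same decision when it splits, which is how the gains of the lower layers are passed up to TC_8.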
Multiword integer multiplication methods are numerous and all these methods have many variations. The Toom Cook method is merely one of these. I set out to find its potential for speed and the above is the result. I do not claim to have extracted every last scrap of performance from the method - tweaks and tricks constantly suggest themselves as writing proceeds. Nevertheless, I hope I have provided basic guidance for future projects.