Anyone setting out to write fast-executing software, or who has written software modules that are unacceptably slow and need speeding up, needs to be aware of the information in this document. Faced with slow software, the first step is to avoid despair and take on board the statement - "All is not lost; there are remedial measures available!". This document concentrates on ensuring that a Pentium processor is working flat out; or, put another way, that the processor is idle as little as possible. This approach can easily give a speed gain of 2, perhaps more, and can give results quickly.
[The alternative approach is to find faster algorithms; this approach requires a lot of development work and successful speed ups will be elusive. (This document contains little on algorithm selection.)]
[Some other documents on this site are referred to repeatedly
within. All the links to these documents are given here; return here if
you want to view the contents.
eucnotes.html and fastmuli.html
- These discuss finding word and multi word multiplicative inverses, respectively,
and use the techniques below.
msummary.html
- Not referred to below but provides a 'case study' for the techniques below. Might be considered "Further Reading".
refs.html - The technical references used in this document are listed within.
Below I shall just write e.g. 'Ref 11' or 'eucnotes' (single
quotes) when referring to any of the files above.]
[A good deal of the information on the workings of processors, used below, comes from 'Ref 7' and 'Ref 8']
Most software writers have at least a basic idea of how a processor works. Generally, however, this knowledge is focused on the execution units, and in particular on the atomic operations (MUL, ADD, ADC and the rest) that a processor can perform. They may have access to a textbook on assembler that lists the time (in clock ticks) that the processor needs to execute the atomic operations. From there, it is only a hop, skip and jump to (in theory at least) using this information to obtain an estimate of the time any module will take to execute, merely by counting operations, multiplying by the execution time for each, and summing.
This estimate will always be hopelessly wide of the mark - generally on the low side, and generally wrong by a factor of 'several'. Anyone doubting this might care to scan down this document to where (towards the bottom of the 'Speed Testing' section) 4 timings (see 'Results Set 2'), obtained in quick succession, are given for executing the same piece of code. The 4 timings cover a range of several to one. The estimate from above would be nearest to the last of the 4 - it may overestimate this value if the processor's parallel capabilities are neglected; but it will be wildly out (on the low side) for the first of the 4.
Definition :- a clock tick in this document means 1 cycle of the CPU clock. Thus a 1 GHz processor 'ticks' 1 billion times per second.
This document is about explaining how Pentium processors come to yield slow execution times, and then offering some tips to avoid the worst of the problems. Once timing experiments are made, the value of speed estimates derived as above will be found to be somewhere between worthless and poor.
The scatter in the values in 'Results Set 2' needs explanation, and contemplation of the execution units does not provide it. An estimate derived as above is not wrong per se. The execution units do indeed perform as advertised - but the instruction execution times given in books are minimum values and there are other significant influences acting to slow down execution. Hence if a software module is taking much longer to execute than expected, only one possibility presents itself - the execution units are idle for a fraction (possibly the bulk) of the time. This being the case, 'What is slowing down the execution units?' becomes the obvious question. To understand the influences that interrupt the smooth working of Pentium processors, AMD and Intel documentation has to be searched.
['Ref 7' is Intel's overview of Pentium operation. 'Ref 8' is typical of the documentation for the operation of a specific AMD processor type. Documents like 'Ref 8' (and Intel's similar productions) are intended for compiler writers, so may be too heavy for general software writers - for these, 'Ref 7' is the better option.]
[Note that AMD and Intel produce many publications to assist software writers to produce faster executables. Their interests lie in that direction. In other words there is nothing hush hush in this document. It is just that many software writers do not bother to study the problem.]
Searching processor literature, you will come across sketches of processor 'pipelines'; these are chains of hardware modules that immediately precede the execution units. The overall task of the pipeline is to keep the execution units busy. On the Pentium Pro pipeline sketch in front of me as I write this, there are 10 modules leading to 5 ports (each leading to an execution unit), followed by a further 2 modules to dispatch results. Detailed discussion of these modules is not appropriate here, but the explanation for the slow timings found experimentally above can be laid at the pipeline's door. Hence, a basic grasp of those parts of pipeline operation that affect overall speed is a prerequisite to tuning software.
Processor instructions (compiled code) come into the pipeline at one end, move through it module by module, are eventually passed through the ports to the execution units, then ejected. At the head of the pipeline are two 'branch prediction' modules followed immediately by two 'instruction fetch' modules. These modules are the most important for this document; the remaining modules will be ignored. However, one point about the other modules is worth bearing in mind - they hold a (tiny) stockpile of instructions, so that, should the flow of instructions into the pipeline be interrupted, the execution units can keep going for a short while before having to stop.
[The fact that there are 2 branch predictor modules and 2 instruction fetch modules may be of interest to hardware designers, but is not important to this document. From here down I shall treat each pair as a single module.]
Every software writer has sketched out a subroutine flow chart many times. An important component of flow chart designs is the rhomboid shape with comments within it such as 'Is so-and-so TRUE?' - if it is, take the right hand path; if not, take the left hand path. At the processor level this if-else construct forces the pipeline branch predictor to guess which path is more likely to be executed next. The 'guess' comes about because the execution units are downstream in the pipeline; only they know the values of the variables changed within themselves, and only they can determine the correct outcome of the 'Is so-and-so TRUE?' test in the rhomboid without error. So the branch predictor is forced to guess; having guessed, the instruction fetch module (next in the pipeline) sets about 'fetching' the relevant instructions. (The actual fetching circuitry is a dedicated fetch execution unit (on the end of a pipeline port) directed by the pipeline fetch module, but the distinction between the two is not important here.) The relevant instructions may be in the pipeline, or the level 1 cache, or the level 2 cache, or in RAM. The further to the left of this list they are, the faster the instructions can be transferred to the pipeline.
If they are in RAM the delay in the instructions reaching the pipeline will be very large - 'Ref 7', Chapter 8, pg. 111 quotes:-
"... the absolute minimum memory latency on the Intel Xeon MP is about 800 clock cycles, assuming the [data] bus is not in use. On average, the latencies would be much higher since the [data] bus is often in use, and so the memory request has to wait for earlier [data] bus requests to be processed ...".
[Definition :- 'latency' is the number of clock ticks the processor needs to finish some operation, once started. A related parameter is 'throughput' - this is the number of ticks that have to elapse before another operation of the same type can start. Throughput is always less than or equal to latency.]
[The latency of level 1 cache fetches is 2-4 clock ticks, of level 2 cache fetches 7-20 clock ticks, and of level 3 cache fetches about 12-160 ('Ref 7' Table 8.1).]
For a software writer attempting to speed up an executable, a figure of 800-plus-plenty clock ticks immediately becomes the 'elephant in the room'. The execution units would be able to work, for a short while, with the instructions in the pipeline, but after this short interval they would have to stop and wait. This problem drives a coach and horses through the naive speed estimating method at the top of this document. Fortunately, hardware designers have come up with a technique to ease this severe problem - hardware prefetching (more discussion below). But despite their efforts, speeding up memory fetches by software prefetching (i.e. prefetching instigated by the software writer) is often the most productive speedup technique. Determining the delays that the hardware 'fetch' (discussed above) and hardware 'prefetch' mechanisms (discussed below) are imposing on your executable is an essential first step in tuning work.
If the delays are small in percentage terms, trying to reduce them is a waste of time. If they are large, then the techniques in this document can be used to reduce them.
What happens when the branch predictor guesses wrongly? The time penalty for recovering from a wrong guess is only 2-10 clock ticks ('Ref 7' pg. 58), assuming the correct code is in a low level cache.
Definition :- A 'pipeline crash' is the name given to a wrong branch predictor guess. To recover from this the correct code has to be fetched (refer to timings above) then fed through the pipeline (a handful of clock ticks more) and only when it reaches the execution units can calculation restart.
Individual pipeline crashes are a minor irritation; nevertheless total time lost might build up if many mispredicted conditional branches are in play in some subroutine - a doubling of run time is possible.
To reduce branching problems, the branch prediction block maintains histories of previous outcomes (in a buffer called the Branch Target Buffer or BTB) for all recently encountered conditional jumps, and it refers to these histories to improve the chance of 'guess' success. So long as previous jumps were either nearly always TRUE or nearly always FALSE, or follow a short repetitive pattern, the branch predictor will have a good chance of a successful guess, and will reduce wasted clock ticks. If the sequence of outcomes at some jump is erratic (e.g. jumping in response to a random bit pattern), the branch predictor will guess poorly (the correct guess ratio will drop towards 50%).
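As an illustration - my own sketch, not from the references - the following C fragment contains exactly the kind of data-dependent conditional jump being described. Fed random data, the 'if' jumps erratically and the predictor drifts towards 50% correct; sort the data first and the outcomes become one long run of FALSEs followed by one long run of TRUEs, which the BTB histories handle almost perfectly :-

unsigned long sum_large_entries (const unsigned long *data, int count)
{
    unsigned long total = 0 ;
    for (int i = 0 ; i < count ; i++)
    {
        if (data[i] >= 128)        // erratic for random data; predictable for sorted data
            total += data[i] ;
    }
    return total ;
}

// Hypothetical usage :- sorting the data first (e.g. with std::sort) often makes
// sum_large_entries() noticeably faster, purely through better branch prediction.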
[Both Intel and AMD regard the details of their branch predictor techniques as commercial secrets and they are not available to the general public; hence detailed discussion of them is not possible.]
'Ref 7' says that a respectable branch prediction performance is a 95% correct guess ratio, and that if it drops below 90% a tuning effort starts to have a chance of paying off.
[These high figures indicate that branch prediction performance is impressive.]
What can be done if branching problems are suspected of causing slow running executables? The first thing to do is to check whether your suspicions are well founded. Intel provides a utility with their compiler called VTune which can count branches correctly predicted along with mispredicted branches. These values will tell you whether you have a real problem or it's just your imagination. (see Tip 3, below, for more details on VTune)
[These counts are maintained by internal registers on Pentium processors which Intel is coy about. Intel's literature maintains that these registers are only accessible in Real mode - I do not believe this. It beggars belief that VTune has to drop in and out of Real mode to recover these register contents. Is this even possible? Nevertheless, as things stand, anyone tempted to write modules to access these registers, will have to adapt their software to run under DOS!]
If a problem is positively identified, 'Ref 7' Chapter 7 contains many suggestions for work arounds that may help; some of these are discussed below but this document is too small to contain all of them.
What happens if the BTB contains no history of some branch's outcomes (a first encounter, or a very rare one whose previous history has been flushed)? The prediction block falls back on its default response - which is to 'fall through', i.e. NOT to make the (assembler level) jump. This sounds simple enough, but high level software writers cannot know, at the assembler level, how their compiler has implemented some if-else construct.
['Ref 7' suggests exploiting the default behavior in 'first use, speed critical' modules; it also suggests such efforts are a wild goose chase!]
Below, I discuss a case where the default branch predictor behavior, in conjunction with the worst case fetching performance (above) wrecks the first use execution time.
The great problem with the discussions above is that modern processors are immensely complicated, and trying to second-guess how they will perform is hopeless. Different Pentium models may appear very similar to software writers, but they have radically different internal architectures. What is needed for tuning work on some specific processor is a precise method of performance measurement. Once this is available, experimental testing of different strategies becomes possible, enabling steady progress towards faster executables. The software needed to do this is presented in the 'Speed Testing' section next. Following that I provide some general purpose tips for speeding up software.
The code below can be used to measure the time any module takes to execute. The code exploits the fact that Intel Pentium processors (excluding a tiny number of Pentium 1's) have a register on board (the Time Stamp Counter or TSC register) which does nothing more than increment each time a processor clock tick occurs (being set to zero on PC startup or reset). Put another way, a 2 GHz Pentium chip would increment this register 2,000,000,000 times per second. The code loads the current contents of the TSC (a 64 bit register) into the edx:eax pair, from which it can be copied into software variables. Then, by merely reading the TSC at the start and end of some module execution and subtracting the two values, a precise execution time in processor clock ticks can be obtained.
This little piece of code exposes the speed performance of Pentium processors warts and all; there will be times when you will stare at results from this code in utter disbelief - but believe it! It never lies. Nevertheless, the values can be 'corrupted' as discussed below.
There are 2 ways of using assembler instructions within 'C' high level code - write a subroutine completely in assembler, or (if you are writing 32 bit code) make use of C's embedded assembler capability - an asm{ ... } block with the assembler instructions listed between the {} brackets. This latter option is easier to implement and is the one demonstrated in (a) and (b) next. (Unfortunately, embedded assembler is not available in Visual Studio 64 bit C code. For this, see items (c) and (d) below.)
(a) If you are lucky you might have access to an assembler which will recognize the RDTSC (Read Time Stamp Counter) assembler op. Having got the TSC values into edx:eax copy them to local/class/global variables as needed.
(b) If the above does not fly, use the DB assembler directive (DB = define byte) to force bytes equivalent to RDTSC into the code stream as follows :-
unsigned long most_sig, least_sig ;   // local variables being used as result holders
asm {
    DB 0x0F,0x31
    mov most_sig,edx
    mov least_sig,eax
}
// optional - copy local variables most_sig, least_sig to where needed.
(c) Use a Visual C++ 'intrinsic' called __rdtscp (see the Visual Studio documentation). This obtains the clock reading without recourse to assembler at all, and will be the best method for most software writers.
(d) If you insist on keeping all code under your control, the following 64-bit MASM snippet may be for you :-
// In a convenient dot-h file :-
extern "C" UL_64 __fastcall get_current_ticks (void) ;
// Module returns a 64 bit unsigned integer of the TSC contents.
// ('extern "C"' stops C++ name mangling.)
// I use UL_64 'define'd as unsigned __int64 for convenience.

.code
PUBLIC get_current_ticks
get_current_ticks PROC
    ; Note neither rax nor rdx is non-volatile in 64-bit code
    ; - hence no pushing/popping is needed to preserve them.
    ; To check upper 32 bits of rax are cleared by rdtsc (in debugger)
    ; fill rax with junk. Yes, it is cleared!
    ; mov rax,1234987612349876h
    rdtsc
    ; ml64.exe, the Microsoft 64 bit assembler, recognises op 'rdtsc'.
    ; Even with 64 bit assembler, results are still in edx:eax here.
    ; Now amalgamate edx:eax (both 32 bits) into rax (64 bits),
    ; the returned value.
    shl rdx,32
    or  rax,rdx
    RET
get_current_ticks ENDP
End
;=================================
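As a minimal sketch of option (c) - assuming the <intrin.h> header supplied with Visual C++, and with the function and variable names being mine :-

#include <intrin.h>

typedef unsigned __int64 UL_64 ;          // matches the UL_64 used elsewhere in this document

UL_64 ticks_for_some_module (void)
{
    unsigned int aux ;                    // __rdtscp also deposits a processor ID here
    UL_64 start_ticks = __rdtscp(&aux) ;  // read the TSC before the work
    // ... call the module being timed here ...
    UL_64 end_ticks   = __rdtscp(&aux) ;  // read the TSC again afterwards
    return end_ticks - start_ticks ;      // execution time in clock ticks
}

// __rdtsc() (no trailing 'p') is also available and takes no argument.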
Some authors recommend just using the eax register contents. At current processor speeds the least significant 32 bits of the TSC overflow into the most significant 32 bits every 1-3 seconds. This means that if a short module is being timed (a timing value of, say, 10,000 clock ticks), an overflow is unlikely to occur. Hence, just using eax will probably be OK. On the other hand why take the chance?
(a) All recent versions of Windows grab the processor at regular intervals
to check if any other task needs processor time - since Windows 3.1 days
this cannot be disabled. Even if no other task is waiting, this Windows
operation will add clock ticks to the TSC register - i.e. contaminating
the result if it happens to occur in mid-timing. Fortunately, so long as
you are timing small modules, this will be rare; but you need to be on the lookout for it. If you start timing large swathes of code, or small modules that
need timing many thousands of times (e.g. modules whose speed depends on the input values),
this problem will become more common and it may be necessary to post process the timing
results to strip out Windows contaminated values. If other tasks simultaneously
need processor time, this contamination will be more serious.
The upper limit on module size that can be timed by the
TSC without worries about Windows contamination is about 1 million clock
ticks. Above this the problem gets more intrusive. My usual practice (when
timing individual subroutines) is to take 4 timings in rapid succession;
but when large modules are being timed increasing this number may provide
a way around the problem. If, say, 16 timings are taken, those that are
contaminated by Windows will be obvious.
(b) A corollary of (a) is that on a PC connected to any form of network, and/or one that is busy multi-tasking, the contamination will increase. It may be necessary to set up a stand-alone speed testing machine (i.e. totally disconnected from your office networks and with all superfluous tasks disabled) to control this.
In general 2 sets of timing results will be needed :-
(i) For comparing the speed of competing algorithms, an uncontaminated
result is essential.
(ii) For timing completed projects (or significant sections of a project)
on a client's machine, it is possible that the client will refuse to disable
networks or deactivate tasks. In this scenario, only contaminated timings
will be available and many hundreds of these may be needed to get a balanced
picture; but then again, the results will be more representative of end
user conditions.
[In 'eucnotes' and 'fastmuli' I present some results obtained from testing modules whose speed is input data dependent. Averaging out the data dependency required many thousands of timings to be taken. (The extended Euclidean algorithm was being timed.) Although the probability of any single timing being corrupted was small, timings were nevertheless corrupted occasionally. The solution used to eliminate this problem was to check that the last timing of 4 (see below) was less than the penultimate one (the regular outcome); and if not, to use the penultimate one.]
(c) The RDTSC op code acts, for our purposes, rather like what Intel call a 'serializing instruction' (strictly, Intel do not class RDTSC as serializing, but its effect on timing work is similar). As you may know, Pentiums have multiple execution units, multiple copies of the main registers and multiple pathways to them to enable a good deal of on chip parallel processing.
[This activity is not to be confused with multi-core parallel processing. In this document everything refers to the operation of a single core.]
All of this internal parallel activity is forced to stop by RDTSC doing its thing; and then it takes a while, after RDTSC completes, before full parallel activity is restored. This means that RDTSC costs many more clock ticks than might be presumed for merely doing a register to register copy. Consequently, the last 1 or 2 decimal digits in a timing should be ignored, and timing anything with a run-time below, say, 20 clock ticks gives an almost worthless result.
[The above means that RDTSC is not useful for timing individual mispredicted branches. As mentioned above, this event may waste only 10 clock ticks - below RDTSC's resolution. Nevertheless, the timing technique, below, does reveal the expected steady improvement in branch prediction performance.]
(d) Shortly I shall present some results from using RDTSC. If you think that a Pentium is a Pentium - you are WRONG! Timings will vary considerably between processor models. This means that if you are writing code for a specific client, you may need to get a machine representative of the client's to use for timing work.
[The timings below cover more than 10 years of Pentium development. The timings on more recent models are approximately 2-3 times faster than older ones. This ratio is testament to the brilliant success of Intel/AMD hardware designers.]
The example module - multiplying multi-word numbers (very large integers) for cryptographic purposes - is a type of module for which high speed performance is essential. Almost by itself, it can transform the slickness of a cryptographic executable as experienced by a user.
Consider the problem of multiplying two 512 bit positive integers into a 1024 bit product. For 32 bit code, such values would have to be stored in unsigned long arrays (I shall use UL_32 for convenience from here on). Thus a declaration of :-
UL_32 A[16], B[16], OP[32] ; // multiplying A[] * B[] -> OP[]
If we consult textbooks for suggestions on performing this calculation, we find several algorithms exist. The simplest method is usually called the 'schoolroom' method because it consists of taking each UL_32 of B[], multiplying array A[] by it, then, after suitable shifting, adding the 17 UL_32 partial product into OP[] - analogous to the technique used in schoolrooms to multiply multi-digit decimal numbers.
[If a sketch is made of the position that the partial products occupy in the final result, a rhomboid shape arises. In the school room, the usual method finds the rows of the rhombus. In software, a small amount of time is saved by working down the columns.]
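For readers who prefer to see the schoolroom method in high level code, here is a C sketch of my own (working along the rows; the assembler modules discussed below work down the columns, but perform the same number of MUL's) :-

typedef unsigned long    UL_32 ;
typedef unsigned __int64 UL_64 ;                 // holds a full 32 x 32 bit product

// Schoolroom multiply :  OP[0..31] = A[0..15] * B[0..15]
void school_mul (const UL_32 A[16], const UL_32 B[16], UL_32 OP[32])
{
    for (int k = 0 ; k < 32 ; k++) OP[k] = 0 ;

    for (int j = 0 ; j < 16 ; j++)               // take each UL_32 of B[] in turn ...
    {
        UL_64 carry = 0 ;
        for (int i = 0 ; i < 16 ; i++)           // ... and multiply the whole of A[] by it
        {
            UL_64 t = (UL_64)A[i] * B[j] + OP[i + j] + carry ;
            OP[i + j] = (UL_32)t ;               // low 32 bits into the accumulating product
            carry     = t >> 32 ;                // high 32 bits carried forward
        }
        OP[j + 16] = (UL_32)carry ;              // 17th word of this partial product
    }
}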
Clearly, 256 calls to a Pentium's 32 bit MUL op code will be needed to do the work. This op code is generally quoted as taking 5-13 clock ticks depending on Pentium model, hence a rough estimate of the runtime for this work is 256*(5 to 13) = 1280 to 3328 clock ticks. We might add on, say, 20% to this number to allow for all the adding and moving data around that has to accompany the 'MUL' work; giving an initial time guesstimate of 1500 to 4000 clock ticks.
[If you consult Intel/AMD reference sheets for the time taken by the 'MUL' op, the general trend is downwards over the years, but only in an irregular way. (It actually shot up for one Intel model.) This is another case where detailed knowledge of a client's processors helps towards understanding timing results.]
As a part of developing modules for this type of work, I wrote 2 methods capable of tackling the problem above, both using the schoolroom algorithm. The first I'll call MUL_16_SL() which straight lined the whole calculation. This method was written in assembler. The work proceeds by moving down columns (accumulating results into registers) so as to minimize data moves on and off the chip. This method makes 256 explicit calls to 'MUL'.
The second I'll call MUL() which was written to calculate the product of 2 different sized input arrays, say A[m] and B[n] with m != n. This listing, also written in assembler, only contains 1 explicit call to 'MUL' and also proceeds by moving down columns. The work is controlled by using nested loops, loop counters and moving pointers. Although not designed specifically to handle equal sized inputs, it can do so. (Note, both modules end up making the same number of calls to 'MUL'.) Note that whereas MUL_16_SL() is devoid of conditional jumps; MUL() has many conditional jumps to control the nested loops. Also note that whereas the object module from MUL_16_SL() is large, that from MUL() is about 1/3 the size.
What happens when we time these methods? For reasons that will be explained later, I usually take 4 timings in quick succession using a loop that contains no work other than saving the clock tick results into an array. After loop exit, only then are the results displayed or filed.
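A minimal sketch of that timing loop - assuming the get_current_ticks() routine from item (d) above (any of the RDTSC methods would do) and a hypothetical module_under_test() :-

typedef unsigned __int64 UL_64 ;
extern "C" UL_64 __fastcall get_current_ticks (void) ;   // from item (d) above
void module_under_test (void) ;                          // hypothetical - e.g. MUL_16_SL()

const int N_TIMINGS = 4 ;

void time_module (UL_64 results[N_TIMINGS])
{
    for (int i = 0 ; i < N_TIMINGS ; i++)
    {
        UL_64 start = get_current_ticks() ;              // TSC before this call
        module_under_test() ;
        results[i] = get_current_ticks() - start ;       // ticks taken by this call
    }
    // Only after loop exit should the results be displayed or filed - calling
    // printf() (or anything else bulky) inside the loop would disturb the caches.
}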
[Some jargon definitions :- The speed of any module depends significantly on the cache conditions when it is called; and also the branch histories maintained in the BTB. The 4 timings below move from what I call 'cold cache' to 'hot cache' conditions. If no part of a module's code or data is in any cache and if there is no branch history of the conditional jumps, I call this 'cold cache' conditions. If all of a module's code and data is in the level 1 cache and if there is a full branch history of the conditional jumps I call this 'hot cache' conditions.]
Three machines were used to obtain timings (32 bit code being timed
unless stated otherwise) :-
(A) An old Pentium III in a desktop (32 bit chip/32 bit OS).
(B) An AMD Athlon from 2005 in a desktop (64 bit chip/32 bit OS).
(C) An AMD A8 from early 2012 in a laptop (64 bit chip/64 bit OS).
Results Set 1 - Typical 'hot cache' timings :-
from processor A :- MUL_16_SL() = 1768 ; MUL() = 3705
from processor B :- MUL_16_SL() = 1244 ; MUL() = 2908
[Using 64 bit code (+ a 64 bit OS) on an Athlon the straight line calculation takes 231 ticks! This, of course, is for a module containing 64 calls to a 64 bit MUL.]
from processor C :- MUL_16_SL() = 636 ; MUL() = 1872
I normally only use processors A or B for timing development work.
Results Set 2 - Typical experimental 4-timing sets from processor B (moving from cold to hot cache conditions) :-
(i)   MUL_16_SL()          = {21153, 1316, 1276, 1276}
(ii)  MUL()                = { 4532, 3154, 2975, 2876}
(iii) Modified MUL_16_SL() = {14244, 1247, 1248, 1248}
A good deal of information can be obtained by studying these results :-
The most peculiar result is the cold cache timing (the 1st timing
of the 4) returned from Result Set 2 (i). This timing is roughly 16 times
larger than the hot cache timing (the last timing). What is going on? Actually,
more than 1 thing is going on - below, merely by minor adjustments of some
C code, I obtained the 'modified' timings in row (iii). Leaving that discussion
aside for the moment, I shall concentrate here on the 'cache loading' problem.
The size of the MUL_16_SL() object module was 4762 bytes. Now Pentiums load code (or data) in chunks that Intel/AMD call a 'line' and the size of a line is usually 64 bytes. Hence the object module comprises 75 lines.
[Some Pentium models boast 128 byte lines - these just read lines in pairs. The size of a line can be obtained, by an executable, from the data returned from the CPUID assembler op code. (Cache sizes are also available from CPUID.) From here down I shall assume that a Pentium has a line size of 64 bytes, and the word 'line' will always refer to a Pentium line.]
On Pentium processors, when a line of code is loaded, what is done about the next line depends on processor model. On early Pentiums, only a 'fetch' pipeline module existed. This module would inspect the most recent line to arrive, decide (possibly guess) the next line it needed, then speculatively request it. This next line might take a long time to arrive and the 'fetch' module, on its own, left a lot to be desired. (This 'fetch' mechanism is present in late models as well.) From above, the waiting time for a line, after a request has been sent, may be very large if it is in RAM (as is the case for 'cold cache' results).
From the 1st timing of Results Set 2 (i), 'fetching' the code is seen to take many times longer than executing it! For this timing, the processor is, on average, active for approximately 9% of the elapsed time and waiting for more code for the other 91%! In reality, these 1st timings are generally a test of the capabilities of the RAM memory chips and the front side bus (easily the slowest components in the code's journey from RAM to processor) and of the capability of the on chip line loading circuits to keep the processor supplied with code (these circuits will often be overloaded in cold cache conditions).
[In a tuning exercise, the time of a single line 'fetch' from RAM for the target machine is an important item of information - use RDTSC to find it. The comments, below, on obtaining 'cold cache' times will apply.]
After the first module timing is completed, 'warm' cache conditions
will prevail and the processor can rapidly access code and data for second
and subsequent executions.
On all recent processors, AMD and Intel's response to the poor 'fetch' performance has been to incorporate circuitry into their pipelines to 'prefetch' code or data lines speculatively. The 'prefetch' circuits do not displace the 'fetch' mechanism discussed above; rather they augment it. A hardware prefetcher does not start work until 2 contiguous lines have been requested in succession (this stops it overloading the data highways and wasting cache space). The 3rd line is then skipped - ordinary 'fetch' requests from the pipeline must handle this one too, since the prefetcher always tries to look further ahead than the fetcher so that the 2 mechanisms do not compete or interfere. The 4th and subsequent lines are then speculatively prefetched until sequential accesses stop.
[Early hardware prefetchers could only work in the direction of increasing addresses, but all recent Pentium prefetchers can work either way.]
[The hardware prefetcher will stop working at a page boundary (a 4096 byte/64 line boundary). After that, it will restart its 'wind up' procedure given above. For the 75 line module, MUL_16_SL(), it is possible that 2 page boundaries are being crossed - needing 2 hardware prefetcher restarts!]
[Another aspect of the code/data fetching problem is the 'stride' pattern i.e. the relationship between successive lines. For the MUL_16_SL() code, lines are in a simple sequence (a regular stride pattern) and this is the ideal pattern for the hardware prefetcher. Irregular stride patterns will reduce, possibly wreck, the effectiveness of hardware prefetchers.]
Hot cache timings are the easiest to obtain - just call the module several times (4-8 is generally enough); by the end the code is going to be firmly in the level 1 cache and the branch predictors will be 'locked on' to the branching patterns (i.e. hot cache conditions). As mentioned above, one thing to watch out for is to ensure that the loop obtaining the timings is free of all extraneous work. For instance, the results above were obtained from a Win32 console type application, where the old printf() method is still available. Now printf() is over 10k bytes in size. The temptation to call printf() (or anything else) from within the timing loop has to be resisted, since it might eject lines of the tested code from the level 1 cache (cool it down a bit) and clog up data highways.
Being easy to obtain, hot cache timings are the ones I normally use to drive experimental work and to log - hence the Set 1 values above.
Cold cache timings are trickier. The objective is to obtain a timing when no
residue of the code, data or branching history is on chip. The best way of ensuring this, is
to do a PC turn-off / turn-on cycle; then run your timing program. All
the 1st entries in any set above were obtained in this way.
[You might think that just doing a Windows 'restart' would be good enough. It isn't. If you don't believe me, conduct your own experiments. Some residue of previous work will survive a Windows restart, spoiling the cold cache value. This difference gives a hint about the large size of the caches and branch histories in the BTB on modern processors.]
Unfortunately, there are influences which undermine the validity of
the cold cache timings e.g. :-
After turning a PC on, and the display has stabilized, if you watch
the disk activity LED you will notice that it continues flickering away
for some minutes. This is Windows carrying out low priority house-keeping
tasks. Windows does this to (fraudulently?) give the illusion of a rapid
startup. Hence it will be several minutes before Windows finally enters
a 'waiting' state. If you attempt to take your first timing during this
'house-keeping' period, as like as not you will also be including an unwanted
component from disk activity.
[The processor will always respond immediately to requests from high speed hardware, like disks.]
All of this means that the first cold cache timing results are likely to be very erratic on the high side - varying by factors of even many thousands. If you are unaware of this effect, you might worry yourself silly. Once your executable starts, Windows will put all outstanding house-keeping work on hold - giving your code priority - but work already in train (involving hardware activity) may require processor effort, and Windows will see that it receives it promptly. This description of events leads to a solution. In the timings above, I timed both modules in the same executable. Having got one set of timings, I cut and pasted to swap the order of the modules, formed a new executable and ran that. The second module timing in each case is likely to be the required result (Windows based activity having (hopefully) been suppressed by the time the second module starts). This trick only works so long as the code and data sets of the modules are mutually exclusive - otherwise the first module will 'warm up' the second.
Even after going to all this trouble, you would still be advised to
repeat the process several times to build confidence in your cold cache
timing. If all this is too tedious (and it might take half a day to get
a good quality cold cache timing), ignore the first timing as 'junk'.
A cold cache time will be needed when it comes to timing a module in
the final product. The cold-hot cache 'spread' is the only way to identify :-
(a) Windows corrupted readings.
(b) The possible pay off from software prefetching. (more below)
The problem with cold and hot cache timings raises the question 'Which multiply module would be best in my final app?' In cryptography (the target app for the modules tested above) a wide range of scenarios exist. Many multiplies in succession may occur (e.g. in calculating exponentials); furthermore, in general, a multiply is followed by a 'mod' operation, an operation with a significant code volume that may 'cool down' the multiply code in the level 1 cache. Nevertheless warm cache conditions will probably prevail. In this situation the cold cache time penalty can be 'amortized' over, possibly, several hundred calls. On the other hand if only a few multiply calls are made, together with large intervals between calls, a large fraction of calls might well be in cold, or chilly, cache conditions and the cold cache time penalty may be unacceptable; in this scenario, a smaller code size module would be a better bet.
Note that the only way to prove the relative merit of these theories is to use RDTSC! In recent model i3, i5, i7 processors a level 3 cache is included and the level 3 caches may be several megabytes in size. For these processors, complete ejection from the caches of any module is unlikely - even when a module is called rarely. Hence the cold cache time will probably only occur once - the first call.
The other big cause of slow executables is branch mispredictions. Their influence is likely to be an order of magnitude less than the fetch problems discussed above, and also much harder to tune away. Consequently, problems with these should only be tackled after the 'low hanging fruit' available from tackling fetch problems has been exhausted.
The timing sets for the 2 modules given above give a feel for the time penalty from pipeline crashing. MUL() has several local variables holding pointers and loop counters which cost clock ticks as they are changed, but a fraction of the time penalty associated with MUL() is from pipeline crashes. Notice also how the sequence of timings (in the set of 4) from MUL() continues to drift down - this is a demonstration of the branch predictor circuits steadily improving their guesswork. MUL_16_SL() has no conditional jumps, so branch prediction improvements do not occur. Once the caches are filled (1st call), its times are constant.
Above I guesstimated the run time for the modules being considered as 1500 to 4000 clock ticks. In the event, some of the hot cache timings are less than this. This is a measure of the skill of hardware designers in improving on chip parallelism and branch prediction on Pentium processors.
[It also depends on the difference between 'latency' and 'throughput' times - see below.]
[------------
For comparison, when the 512 by 512 bit multiply work above was done in 64 bit assembler, the Athlon completed the work in 19%, and the A8 in 47%, of the times given above for the straight line module (both hot cache
times). By 64 bit assembler I mean using 64 bit operations - for the problem
above, this means 64 calls to the 64 bit MUL were needed versus the 256
calls for the results above. This being the case, it might be expected
that the 64 bit code would run in 25% of the times above.
Tables give a 64 bit MUL as generally requiring
1 or 2 clock ticks longer to execute than a 32 bit MUL. This effect might
increase the estimated run time to 30-35%. (At the same time the code volume
shrinks to 25% - having a dramatic effect on the cold cache times.) However,
for more recent chips (e.g. the AMD Athlon) MUL throughput times (discussed
below) of 2 ticks are on offer, and straight line code can approach this.
Note that whatever the speed gain from upgrading
to 64 bit assembler,
it will be considerable, at least when heavy
integer arithmetic is being handled, and this provides another 'get out
of jail' possibility to speed up code. Today, any cryptographic software
provider not exploiting this option is providing poor service.
[If you are an end user of software, you need to be aware
of a possible 'scam' on this topic. It is possible to convert 32 bit assembler
code to run under a 64 bit OS - only the interface needs changing - calculations
using the old 32 bit registers (or even 16 bit registers...) are still
permitted (the hardware with only a few 'oddball' exceptions is backwards
compatible).
If you go to the GoAsm site you will find a free
utility called AdaptAsm which does precisely that! Negligible speed gain
or loss will occur from this. However it provides a lazy software provider
a means to convert some old 32 bit assembler modules into modules able
to run under a 64 bit OS - while simultaneously claiming to have upgraded
his product!
The only way to get to the truth of such a possible
fraud is to have an expert examine the executable using a disassembler.
The scenario sketched above might be called the
'sheep in wolf's clothing' scenario. A 'wolf in sheep's clothing' scenario also exists.
[My version of Visual Studio is from 2008; I do not know if the following comments apply to later versions.]
When developing 64 bit assembler modules, the console
display is just as handy for inputting and outputting results as it was
for 32 bit code and VS still provides it. The console VS provides is the
old 32 bit version but with all the subroutine interfaces upgraded to 64
bit compatibility.
[This work will have been done by Microsoft's version
of AdaptAsm. If you search through the VS directories into the 64 bit section,
you will find all the old stalwart binaries like kernel32.dll and the rest.
These are 64 bit-ised versions of the old 32 bit .dll's, where the interfaces have been upgraded but the underlying code is still 32 bit, i.e. these .dll's are 'sheep in wolf's clothing' code. For the console, there is negligible
benefit from reworking all these subroutines.]
[One way to test whether your 64 bit system performs
the same as the 32 bit version, is to use printf() to try to print out
a 64 bit integer. Since the 32 bit version knew nothing of 64 bit integers,
it could not print them out correctly. Merely updating printf()'s interface
will, of course, leave the underlying behaviour unchanged, and the 64 bit
printf() will not correctly display a 64 bit integer. My system is of this
type.
[A work around for this problem is to call printf() twice with both halves in turn. This only works in hexadecimal format, and the little endian problem means the halves have to be reversed.]]
This means that 64 bit assembler code can be called
from the console, along with any speed benefit that might accrue. On seeing
a console screen, nearly every software writer would say 'Ah, a 32 bit
application'. But that may not be correct. It might be a 'wolf in sheep's
clothing'. Again, only an expert with a disassembler could get at the truth.
All the timings, above, for 64 bit assembler modules, were obtained in
this way.]
------------]
RDTSC is a wonderful forensic tool for optimizing software for speed. However it opens a can of worms relating to the inner workings of processors. To get the best from it, you will need to brush up your hardware knowledge - and the first place to find this is 'Ref 7'. To fine tune your knowledge, download the equivalent of 'Ref 8'.
A word to the wise. Using RDTSC becomes addictive. It is essential to stop members of your software team spending days/weeks just 'messing about' with it. It will particularly beguile 'puzzle freaks', and if you suspect them of no longer pursuing project objectives, management control will need to be tightened.
In a similar vein, it is necessary to 'let things go'. Do not go hunting clock ticks. Do not even go hunting for the last 5% of speed gains when tuning software. Once convinced that only slim pickings remain, draw a line under it and move on.
The most important idea that you must have clear in your mind, when starting out to speed up some software, is that the processor cannot be speeded up; the only thing that can be done is to find sections of code that are either 'superfluous' or causing the processor to be idle, and bear down on them.
Above, I presented an exceptional case where the execution units were idle for over 90% of the elapsed time. Finding such idle moments is a precursor to tuning work. Until they are identified, tuning is rudderless.
The comments below are intended to provide some ideas for tuning experiments. 'Eureka' moments are rare in tuning work; more commonly progress is made by a sequence of hard pounding, cheese paring moves accompanied by many false starts.
I make no mention of competing algorithms below - the topic is too large. If you are unsure of the benefits of competing algorithms the only way forward is to implement them and time them.
[In my own work I have developed a set of 'atomic' modules
that enable me to put together many complex modules in number theory quickly.
I know that such crude implementations of complex modules can be polished
to cut perhaps 30-50% off their run time, but this technique gives a rough
initial estimate of the relative performance of competing algorithms.
(In military circles this technique might be called
'doing a reconnaissance' - and it is just as likely to save you a lot of
grief.)
Unfortunately, when the complexity rises, the practicality
of doing this decreases. In aho_gcd.html I discuss
one such case. This algorithm is complex and the time taken to 'do a reconnaissance'
would not have differed greatly from the time for 'a full assault'.]
This may seem an odd way to start a discussion of speed-up tricks; but everything below will cost a good deal of time and effort, probably by your most talented software writers. From a business management standpoint, they will lower productivity, increase costs and push back completion dates. It is only sensible to resist using the techniques discussed below until necessity compels you to use them. These tips run loosely from 'cheap' (this one!) to more expensive.
Try to determine, as early as possible in project development, "Is speed
performance going to be adequate"?. High speed code might take perhaps,
10 times longer to develop (and hence cost 10 times as much), so, if a
switch to more intense development for speed is not made early, the project
will almost certainly over-run its target completion date. Nevertheless
the answer to "Is speed performance going to be adequate?" can only be
obtained by experiment; and the experiments needed will only be possible
well into project development.
Early in project development, if performance is
doubtful :-
(a) Write 'place-holders' for speed critical modules
to enable a project to proceed. Such place-holders should be as simple
as possible. You might think that a simple module is bound to be slow,
but in my experience simple modules are likely to be the fastest.
(b) Hand the place-holders over to a 'speed-up'
team (with instructions to leave the interface unchanged) to see what they
can come up with. If the speed up work achieves significant improvements,
the improved modules can then be fitted back into the project with little
difficulty.
This immediately means a C++ compiler is needed. 'C' was originally developed as a language for writing operating systems for which fast execution is essential. It may be nice and convenient to use Java or C# to write your masterpiece (i.e. short development times and hence cheap) but an execution speed 'several' times slower, relative to C-code speed, will be a consequence.
[Note that such things as native code compilers for Java exist, so if your client finds your Java developed product too clunky, these may get you out of trouble.]
There are compilers and there are compilers. Some years ago the Advanced
Encryption Standard (AES) competition invited all and sundry to submit
block encryption algorithms to replace DES. Speed of operation was one
of the metrics used to judge entrants and Borland C++ version 5 was the
specified compiler. Well known author, Bruce Schneier, was a co-author
of the Twofish entrant. In testing he found that the Borland compiler produced
an executable that ran about 30% slower than the equivalent Visual C++
one! Hence, if all you need is a speed increase of a few 10's of %, trying
alternative compilers may get you there.
Today, this almost certainly means that you need the Intel compiler
(a bolt-on for Microsoft Visual C++). This optimizing compiler can produce
code targeted to get the best results from any specific Pentium processor
(or even several targeted processors) - and if Intel doesn't know how to
do this, who does?
[Compilers can be "assisted" to produce faster executables using good high level coding practice :- e.g. see Chapter 2 of 40546.pdf on the AMD web site.]
It makes no sense to attempt to speed up every line of code in a project. Instead, identify 'speed critical' sections and modules which can provide the best return from more intense development. Your executable will be spending a large part of its time in these critical modules. Fewer than 20-30 modules may make it onto your 'critical' list. If you sketch out a subroutine tree for your project, those at the bottom of the tree, particularly those embedded in loops in higher level subroutines, will provide the most likely candidates for more intense development.
If you are in doubt as to which modules should be put on the 'critical' list, get out your RDTSC code and take measurements! It is just silly to try to speed up some module that contributes, say, 5% towards the run time of some section. For such a module, even if a speed gain of 2 could be achieved (optimistic), the overall effect would only be a 2.5% speed gain - which no one will notice.
[This type of project testing is called 'profiling' and
specialist software can be found to do it. Microsoft Visual C++ has its own profiler available.
Intel provides a better utility called VTune. A
feeling for VTune's capabilities can be got from 'Ref 7', Chapter 3. Intel's
book "VTune Performance Analyzer Essentials" by J. Reinders, ISBN 0-9743649-5-9
covers this profiler in depth.
(A VTune evaluation version can be downloaded for
free from the Intel Software Development Products web site.)]
Take each of your 'critical' modules and think about possible high level
code rearrangements to obtain speed improvements e.g. :-
(1) Cut out as many references to 'new' as you can.
This is the C++ dynamic memory allocator - an operating system call. More
realistically, try to move calls to 'new' as high up the subroutine tree
as you can. I obtained the following timings (in ticks) with a Windows
7 system (64 bit) running on an AMD A8 :-
[Table of 'new' + 'delete' timings (in clock ticks) - Windows 7 (64 bit) on an AMD A8.]
[I cannot explain the jitter on these numbers.]
[Running the software that gave the results above on a Windows XP (64 bit) system gave results that were several times larger (up to 16 times for some of the entries!). Values for a 32 bit OS were much worse! Hence it would seem that, latterly, Microsoft have worked hard to speed up this aspect of their operating systems.]
Despite the improvements Microsoft have made, moving 'new' + 'delete' calls away from modules at the bottom of your subroutine tree will give an instant speed gain. It is often better to set up a class, high in the subroutine tree, to do the administrative work of holding a large 'pool' of the structures that require calls to 'new', and to pass a reference to the 'pool' controlling object down the tree. At the bottom of the tree, methods can then ask for pointers to structures, use them, then return them. In this way the use of 'new' at low levels can be eliminated. (This technique is called 'pooling'; a sketch is given after the notes below.)
[Keep in mind that any space obtained from 'new' is much more likely to be cool or cold as far as the caches are concerned. In other words apart from the times needed to get a pointer to it, there is likely to be a heavy time penalty to pay as soon as you start to use it - it will probably have to come from RAM! On the other hand pooling will keep the data spaces mostly hot.]
[C++ sneaks in a few calls to 'new' without some software writers being aware - e.g. copy constructors. Brush up on your C++.]
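A minimal sketch of the pooling idea - the class and its names are mine, not from any library, and a production version would need decisions about growth and thread safety :-

#include <cstddef>
#include <vector>

struct Work_item { unsigned long data[16] ; } ;          // hypothetical structure needing 'new'

// Created ONCE, high in the subroutine tree; a reference is passed down the tree.
class Item_pool
{
public :
    explicit Item_pool (std::size_t count)               // all the 'new' work happens here
    {
        store.resize(count) ;
        for (std::size_t i = 0 ; i < count ; i++) free_list.push_back(&store[i]) ;
    }
    Work_item * get ()                                   // low level code asks for a pointer ...
    {
        if (free_list.empty()) return 0 ;                // (or grow the pool - a design choice)
        Work_item *p = free_list.back() ;
        free_list.pop_back() ;
        return p ;
    }
    void put_back (Work_item *p) { free_list.push_back(p) ; }   // ... and hands it back when done
private :
    std::vector<Work_item>   store ;                     // the 'pool' itself - stays warm with use
    std::vector<Work_item *> free_list ;
} ;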
More generally, all calls to the operating system (updating the screen, updating disk files etc.) should be moved as high up the subroutine tree as possible and thinned out as far as possible. Experiment with reducing screen updates until 'jitter' becomes unacceptable, then back off until performance is 'just' acceptable.
(2) Look carefully at your critical subroutines and strip out nonessentials. This includes things like copying and bureaucratic tasks of all sorts. When you copy some data item from A to B, in essence you have moved no nearer to the end of a calculation. When confronting a copy instruction, adopt a 'no way' mindset, try to 'work around' the need for the copy, and only accept it as a last resort. Similarly, updating records and the like has no place in a speed critical module; put records into a buffer and let a less critical module sort out the updating and filing away - and even then, only after a significant number have accumulated.
One example where copying dominates is the transposition of large square arrays.
[In transposition, array element A[i, j] is set to A[j, i] if i != j]
It is a trivially simple piece of code to implement the formula above
using nested loops. Such a module requires zero calculation effort, but,
when the array size is large, ties up a Pentium processor spectacularly
effectively since all data moves have to pass through it.
If a Boolean flag is attached to the array to indicate transposed/not transposed, toggling this single variable achieves precisely the same result - subroutines downstream only need to check this flag, and swap indices if necessary, to retrieve any required array item.
If this trick is unacceptable, the following technique should be considered.
[Next, I shall assume the array is composed of (4 byte unsigned long's) UL_32's]
Think about the array in terms of lines - i.e. 64 byte wide (16
UL_32's) by 16 element deep tiles.
Firstly the start of each row needs to be 'line aligned'.
[In Visual Studio C++ this can be done using __declspec(align(64)) at the start of each row.]
The tiles on the diagonal can be transposed within themselves. For the rest, load an off diagonal 'tile' into a scratch pad (16 line reads); load its mirror image (about the leading diagonal) into a 2nd scratch pad (another 16 line reads).
[Optional - start prefetching the next tiles here into another pair of scratch pads while the work next is being done. Note that the rows of the tiles are unlikely to be contiguous, hence the hardware prefetcher will not be working; hence software prefetching is essential for best speed here.]
Transpose the two scratch pads within themselves and copy the tiles back into swapped positions ...
The logic behind this is to think about the task from the processor's point of view. Failure to do this could increase execution time by a factor of several dozen!
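The following C sketch - my own, and only indicative - shows the tiled transpose just described for an N x N array of UL_32's (N a multiple of 16, the array line aligned). The _mm_prefetch intrinsic used for the optional prefetch is the compiler-supplied route; the PREFETCH assembler op could be used instead :-

#include <xmmintrin.h>                                // for _mm_prefetch / _MM_HINT_T0

typedef unsigned long UL_32 ;
const int TILE = 16 ;                                 // 16 UL_32's = one 64 byte line

// Copy one TILE x TILE block out of the array, transposed, into a scratch pad.
static void lift_tile_transposed (const UL_32 *src, int stride, UL_32 pad[TILE][TILE])
{
    for (int r = 0 ; r < TILE ; r++)                  // 16 line reads
        for (int c = 0 ; c < TILE ; c++)
            pad[c][r] = src[r * stride + c] ;
}

void tiled_transpose (UL_32 *A, int N)                // in-place transpose of A[N][N]
{
    UL_32 pad1[TILE][TILE], pad2[TILE][TILE] ;        // the two scratch pads

    for (int i = 0 ; i < N ; i += TILE)
    {
        lift_tile_transposed(&A[i * N + i], N, pad1) ;            // diagonal tile ...
        for (int r = 0 ; r < TILE ; r++)                          // ... goes straight back
            for (int c = 0 ; c < TILE ; c++)
                A[(i + r) * N + i + c] = pad1[r][c] ;

        for (int j = i + TILE ; j < N ; j += TILE)
        {
            // Optional - prefetch the NEXT pair of tiles; their rows are not contiguous,
            // so the hardware prefetcher gets little chance to help here.
            if (j + TILE < N)
                for (int r = 0 ; r < TILE ; r++)
                {
                    _mm_prefetch((const char *)&A[(i + r) * N + j + TILE], _MM_HINT_T0) ;
                    _mm_prefetch((const char *)&A[(j + TILE + r) * N + i], _MM_HINT_T0) ;
                }

            lift_tile_transposed(&A[i * N + j], N, pad1) ;        // off diagonal tile ...
            lift_tile_transposed(&A[j * N + i], N, pad2) ;        // ... and its mirror image

            for (int r = 0 ; r < TILE ; r++)                      // copy back into swapped positions
                for (int c = 0 ; c < TILE ; c++)
                {
                    A[(j + r) * N + i + c] = pad1[r][c] ;
                    A[(i + r) * N + j + c] = pad2[r][c] ;
                }
        }
    }
}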
[Once a fast transposition technique is available, it might
be used to speed up square array multiplication. The second square array
has to access the elements in column order (a form of line reading multiplier).
Once the second array has been transposed, row order (i.e. moving through
lines) accesses are needed, requiring fewer line reads.
Your software prefetcher can be profitably used here.]
[Note that no cunning algorithm is in play here; just understanding
how the work will be done and giving the processor a helping hand.
The key question in play here is 'How many line reads will each alternative method require?'. When the transposition problem is broken down into 'tiles', the number of line moves is at a minimum. Also,
in this case, there is a nice pause while tiles are being transposed during
which the next pair of tiles can be prefetched keeping the data highways
busy.
For a problem like this (calculation negligible,
data volumes high) the minimum elapsed time is given when :-
(a) The number of line moves is minimized.
(b) Given (a), the data highways are never allowed to be idle.]
Most software writers spend their time focused on implementing algorithms, with little if any regard for the difficulties processors have to overcome. Merely adopting the maxim "think processor" can yield speed gains of 'several'.
As noted above, cache problems are likely to be an order of magnitude
more severe than pipeline crashes in causing processor idle time. Hence
the objective here is always to keep the caches as 'warm' as possible as
far as work 'in the offing' is concerned.
Above, when discussing MUL() I concentrated on code
cache problems, but data caches suffer similar, possibly worse, problems.
A fundamental question at the start of tuning work is to decide whether :-
(a) fetching instructions is the main problem (code volumes high, data volumes low), or
(b) fetching data is the main problem (code volumes low, data volumes high).
The following comments refer to case (b).
(1) Line align your data structures. If your project works with data structures that are small (say, 1-5 lines in size) and you line align them, they will be contained in significantly fewer lines. This might avoid several % of line reads and keep items in the caches for longer, since there will be less wasted space in the caches. If structures are large, say over 10 lines in size, the benefit from line aligning fades away.
Definition :- line aligning means the start address of a structure will have its least significant 6 bits set to zero - line start addresses are always of this form. You can do this yourself by dynamically allocating an oversize memory slot, then suitably setting the structure start pointer within the slot. Your compiler will likely be able to do this for you - check the capabilities of the ALIGN macro.
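A sketch of the 'do it yourself' route just described - over-allocate, then round the pointer up to the next line boundary. (Visual C++ also provides _aligned_malloc()/_aligned_free(), which do the same job.) The function names here are mine :-

#include <stdlib.h>

const size_t LINE = 64 ;                         // Pentium line size in bytes

// Returns a pointer to 'bytes' bytes of storage starting on a line boundary.
// The raw pointer is stashed just before the returned block for line_free().
void * line_aligned_alloc (size_t bytes)
{
    char *raw = (char *)malloc(bytes + LINE + sizeof(void *)) ;
    if (raw == 0) return 0 ;
    char *p = raw + sizeof(void *) ;
    p += LINE - ((size_t)p & (LINE - 1)) ;       // round up to the next 64 byte boundary
    ((void **)p)[-1] = raw ;                     // remember where the block really starts
    return p ;
}

void line_free (void *p)
{
    if (p) free(((void **)p)[-1]) ;
}

// e.g.  UL_32 *A = (UL_32 *)line_aligned_alloc(16 * sizeof(UL_32)) ;   // exactly 1 line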
In the multiply examples above, the arrays A[] and B[] were precisely 1 line in size, and if no attempt to line align them were made, they would almost certainly straddle 2 lines. Hence line aligning would usually save 50% of line reads. The array OP[] needs 2 lines as a minimum to hold it, and would almost certainly straddle 3; here line aligning would usually save 33% of line reads. Notice the diminishing returns from line aligning as a structure grows in size.
Sometimes, large data structures will be in play, but the most commonly accessed data fields are few in number. For this situation, setting up a line aligned sub-structure of the commonly accessed data items, and just accessing this, as a first step, will save on line reads.
[Note that variables in complex structures are allocated space in the same order as they are listed in the '.h' file. Most software writers will unthinkingly list variables in groups that are in some way associated. To cut down on line loads, they need to be listed in 'frequency of access' order.]
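A sketch of the 'line aligned sub-structure' idea from above (all names are illustrative assumptions) :-

    // The few frequently accessed fields pulled out into a single line sized,
    // line aligned block; the rarely touched bulk lives elsewhere and costs
    // no line reads on the common path.
    struct CustomerCold {
        char name[256] ;
        char address[512] ;
        // ... other rarely accessed fields
    } ;

    struct alignas(64) CustomerHot {     // <= 64 bytes : 1 line read fetches it all
        unsigned int  id ;
        unsigned int  state ;
        double        balance ;
        CustomerCold *cold ;             // followed only when a rare field is needed
    } ;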
In a similar vein, a little homespun data compression might pay off.
Seriously contemplate each data structure variable in turn with a view
to compression :-
(a) A Boolean flag needs only 1 bit - hence 32 of them can be
held in a single UL_32. Do not allocate a UL_32 to each of them.
(b) Many real world variables only require 16 or even 8 bits
(air temperature perhaps) to hold them. These can be packed into 32 bit
(or 64 bit) variables by shifting and ORing and recovered by ANDing and
shifting (both very fast operations). Using a 'float' or even a 'double'
to hold them will do little for result quality but work wonders at overloading
processor data highways.
[Note compressed data items MUST be packed - if your compiler is instructed to 4 byte align data items (a common default), an unpacked byte will be stored as 3 unused bytes followed by the data byte. This achieves precisely nothing as far as reducing line counts goes.]
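A sketch of the packing and unpacking (UL_32 = unsigned 32 bit integer; the field layout is an illustrative assumption) :-

    typedef unsigned int UL_32 ;

    UL_32 flags = 0 ;                                     // 32 Booleans in 1 word
    inline void set_flag (int n)  { flags |= (1u << n) ; }
    inline bool test_flag(int n)  { return (flags >> n) & 1u ; }

    // An 8 bit temperature and a 16 bit pressure packed into one UL_32 by
    // shifting and ORing; recovered by ANDing and shifting.
    inline UL_32    pack(unsigned temp8, unsigned pres16)
                    { return (temp8 << 16) | (pres16 & 0xFFFFu) ; }
    inline unsigned unpack_temp(UL_32 w) { return (w >> 16) & 0xFFu ; }
    inline unsigned unpack_pres(UL_32 w) { return  w        & 0xFFFFu ; }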
(2) In a situation where a data structure passes through several stages of processing (with each stage significant enough to require a weighty subroutine) think about setting up intermediate buffers for the structures; i.e. processing many structures through the first subroutine (with output to a buffer) then passing the buffer to the next subroutine etc. In this way subroutine code in the caches will be mainly hot. If you do the opposite - take the first structure, pass it from subroutine to subroutine, then take the next structure etc. - each succeeding subroutine will partially displace the previous one(s) from the caches (cool them down) and take longer.
[Note that you should never copy structures when high speed is needed - just hand over a pointer. Copying will generate even more line moves.]
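A sketch of the buffered layout (names are illustrative; structures are handed over by pointer) :-

    #include <vector>

    struct Item { /* ... */ } ;
    void stage1(Item *) ;                    // the weighty subroutines
    void stage2(Item *) ;

    void process_batch(std::vector<Item *> &items)
    {
        for (Item *p : items) stage1(p) ;    // stage1 code stays hot throughout
        for (Item *p : items) stage2(p) ;    // then stage2 code stays hot
    }
    // The 'cool' alternative - stage1(p) then stage2(p) per item - lets the two
    // subroutines repeatedly displace one another from the caches.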
(3) A software prefetcher can reduce cache problems significantly. The
software prefetcher that I have written makes use of data/code reads into
the MMX registers. The main problem with this is that the MMX registers
are dual purpose; they are used in floating point calculations. In my own
work, floating point calculations are absent so that there is no conflict.
If this is not the case, use prefetch macros (see Visual Studio documentation)
or the PREFETCH assembler op, available on all recent Pentiums. There are
several variants of the assembler op and they all have the benefit of not
tying up any processor registers. The benefits of a software prefetcher are :-
(a) A software writer will generally know well ahead of events
when a code or data item is going to be needed, enabling a beneficial early
prefetch to be launched. (see the array transposition discussion above)
The difference between cold and hot cache timings, above, gives some idea
of the benefit from this.
(b) If an executable is working with a large number (large enough
to overwhelm the caches) of structures that are small and erratically scattered
in RAM (irregular stride pattern), the hardware prefetcher will not work
well. If the structures are less than 4 lines in size, it will not work
at all! (see above) In such a scenario, the processor data highways will
be in deep trouble (even if you don't know it!), and a software prefetcher
can dramatically decrease execution time.
If you write your own software prefetcher, avoid the following :-
(i) Do not use your software prefetcher in situations where
the hardware fetch or hardware prefetch mechanisms can be expected to be
reasonably successful. These are highly tuned and any interference with
them will probably result in a loss of speed. Only use it when you have
good reason to believe that both mechanisms are going to perform poorly
or when it is possible to predict data needs well ahead of the hardware
prefetcher (a minimum of perhaps a dozen lines ahead of the hardware prefetcher
might be OK).
(ii) The maximum number of 'in flight' line requests for your
target processor is an important parameter for software prefetcher use.
For my AMD Athlon desktop, this number is 8. Once 8 line requests are in
the queue, the execution units may be able to continue for a short while,
but once a new line request needs to be sent, the pipeline may be forced
to stop until an earlier request has been completed. If you send off 20
line requests in rapid succession, this will be very successful at creating
processor idle time!
Perhaps sending line requests off in bunches of 4 would be OK
- leaving the other 4 for the processor's own use. However, in a scenario
where many small structures have to be processed, each requiring little
execution unit effort, the data highways will be the choke point, and the
line request queue must never fall empty. Here the execution units
are going to be idle no matter what you do, so sending off more than 8
at a time, may give the fastest results. These scenarios need careful experimentation
with RDTSC to get right.
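A sketch of the intrinsic form of software prefetching (the structure, the table of pointers and the look ahead distance are illustrative assumptions; the distance needs tuning with RDTSC on the target processor) :-

    #include <xmmintrin.h>                       // _mm_prefetch (PREFETCHT0 etc.)

    struct Small { unsigned int data[8] ; } ;    // small, scattered structures
    unsigned int do_work(const Small *) ;        // little execution unit effort

    unsigned int sum_all(Small *const *table, int count)
    {
        const int AHEAD = 4 ;                    // requests sent in modest bunches
        unsigned int total = 0 ;
        for (int i = 0 ; i < count ; i++) {
            if (i + AHEAD < count)               // launch the line request early ...
                _mm_prefetch(reinterpret_cast<const char *>(table[i + AHEAD]),
                             _MM_HINT_T0) ;
            total += do_work(table[i]) ;         // ... while working on the current item
        }
        return total ;
    }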
(4) 'Ref 7', chapter 8 contains many more tips on reducing cache problems than I can include here.
Pure pipeline crash problems are difficult to find and difficult to reduce. Some pipeline crash problems can have the secondary effect of overloading data highways and I concentrate on these.
Anywhere that conditional jumps are clustered together is likely to be a source of pipeline crash problems. One situation where this applies, and is open to a little cunning, is the switch{} block. A moment's thought will tell you that a switch block is a mass of conditional jumps - hence a fruitful source of pipeline crash trouble and also (less obviously) of useless line loads.
If, in the switch block, one of the 'case's is a 'nearly always case' (i.e. eminently predictable) and the others are rare, it will generally pay off to put the 'nearly always case' into an if{} block and the rarities into a switch{} block within the corresponding else{} block. This technique might be extended with further nested if-else blocks.
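For example (the message codes and handlers are illustrative) :-

    enum MsgCode { MSG_DATA, MSG_OPEN, MSG_CLOSE } ;
    struct Msg { MsgCode code ; /* ... */ } ;
    void handle_data (Msg &) ;  void handle_open (Msg &) ;
    void handle_close(Msg &) ;  void handle_error(Msg &) ;

    void dispatch(Msg &msg)
    {
        if (msg.code == MSG_DATA) {              // the 'nearly always' case -
            handle_data(msg) ;                   // a single, well predicted test
        } else {
            switch (msg.code) {                  // the rarities
                case MSG_OPEN:  handle_open(msg) ;  break ;
                case MSG_CLOSE: handle_close(msg) ; break ;
                default:        handle_error(msg) ; break ;
            }
        }
    }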
At the other extreme, if all the 'case's are of similar likelihood, remove the switch block completely and replace it with nested if-else blocks. This replaces a linear list of conditional jumps with a set of conditional jumps arranged in a binary tree. The former maximizes the number of conditional jumps executed on a typical pass; the latter minimizes it.
Note that in the extreme situation where the cases are large
in number and the selection is close to random, the BTB and the fetch mechanisms
will not function well. For this situation definitely replace the switch
with a binary tree (discussed below).
As an example, consider a switch with 16 cases numbered 1-16. If a binary tree structure were substituted, it would look like the following ('s', the switch parameter, is 7 in the walkthrough below) :-
    if (s <= 8) {                     // split the cases in two.
        if (s <= 4) {                 // and in two again ...
            . . .
        } else {                      // s=7 will come here when (s <= 4) fails
            if (s <= 6) {             // split in 2 again
                . . .                 // s = 5 or 6 handled here
            } else {                  // s=7 will come here when (s <= 6) fails
                call required subroutine      // success
            }
        }
    } else {                          // similar layout for s > 8 cases
        . . .
    }
When s=7, three conditional jumps will occur in the binary tree layout,
and seven in a simple linear 'switch'. As an extreme example of the problems
that can occur, consider the following :-
Background information for the example :-
The 2 multiply modules speed tested above were methods in a class I call Mul_Sq, which multiplies and squares multi-UL_32 integers. The constructor
for this class receives a MAX_n_UL_32 parameter which indicates the upper
limit of input array size that an instance of the class is expected to
handle. The class uses the 'schoolroom' method for smaller sized inputs
and Karatsuba's method for larger sizes.
The Karatsuba section requires dynamically allocated scratch pads, sized
using the MAX_n_UL_32 parameter. The class has to be able to handle all
smaller input array sizes up to, and including MAX_n_UL_32.
In the 'multiply' public method to handle the different n_UL_32
sized arrays, a switch block was (initially) used. This used the method
input parameter, n_UL_32, to select the relevant 'case' as follows :-
    switch (n_UL_32) {
        case 1:  call module to handle n_UL_32 = 1 ;  break ;
        case 2:  call module to handle n_UL_32 = 2 ;  break ;
        ... (other cases in ascending order)
        case 16: call module to handle n_UL_32 = 16 ; break ;
        ... (more cases up to the maximum set in the constructor)
    }
A cold cache timing of this module gave the 'Results Set 2(i)' values above.
To recap, for the Athlon, the cold cache time found for this was a dazzlingly
slow 21153 ticks with a hot cache time of 1276 ticks. I found it hard to account for the cold cache time merely in terms of code line fetching, and after experimentation found the switch block was the culprit :-
At the assembler level the piece of code, above, would probably
be implemented by something like :-
    case1:  cmp n_UL_32,1                        // cmp = compare (only sets flags)
            jne case2                            // jne = (conditional) jump if not equal
                                                 // ** Point 'X' ** (see below)
            call (module to handle n_UL_32=1)
            jmp switch_end                       // unconditional jump
    case2:  cmp n_UL_32,2
            jne case3
            call (module to handle n_UL_32=2)
            jmp switch_end
            . . .
    case16: cmp n_UL_32,16
            jne case17
            call (module to handle n_UL_32=16)
            jmp switch_end
            . . .
    switch_end:
The 'jne' (at ** Point 'X' **) is a conditional jump whose outcome cannot
be known for sure until the execution unit carries out the immediately
preceding 'cmp'. If cold cache conditions prevail, there will be no branching
history to consult. Hence the branch predictor falls back on its default
decision - fall through.
It will guess that :-
call (module to handle n_UL_32=1)
is about to be executed - and notify the 'load' execution unit to load
the first line of the wrong module, simultaneously crashing the
pipeline!
After about 10 clock ticks, the pipeline will have been reset, and
the branch predictor will initiate (erroneously) a fetch for the first
line of (module to handle n_UL_32=2) ... and so on.
Now, since the Athlon processor is limited to 8 'in flight'
line requests, the line request queue will quickly fill up and then execution
will stop until such time as space for another line request appears - possibly
a relatively long time! Clearly the line request system will stay saturated
through all the subsequent cases all the way to case 16.
In short, many lines would be fetched from RAM without any benefit whatsoever! Simultaneously, repeated pipeline crashes and execution unit freezes are occurring. All this pushes up the slow cold cache time found.
[Note that if the switch were replaced by a 'binary tree' of if-else blocks the 'fetch' circuits would never saturate (there would never be 8 erroneous line loads - not unless the switch contained 256 'cases'!) and the bulk of the time wasted would not occur.]
The unpleasant performance of the 'switch' above was removed as follows :-
Suppose when the constructor was called a parameter of 16 was handed
to it. This would indicate that the class instance had to be able to handle
input array sizes from 1 UL_32 up to 16 UL_32's. But the likeliest input
array size could be expected to be 16 UL_32's.
It is possible to declare a pointer (called pmodule below) to a multiplication function in the Mul_Sq 'dot-h' file as :-
    void (Mul_Sq::*pmodule) (UL_32 *, UL_32 *, UL_32 *) ;
            // The multiply modules take 2 input array pointers
            // and 1 output (product) array pointer.
This pointer can be set to the likeliest function by :-
    pmodule = &Mul_Sq::mul16 ;      // mul16() mults A[16]*B[16] into OP[32]
    module_nlw = 16 ;
Then at the start of the publicly accessible multiply module - 1 of whose parameters is nlwIP (the current input array size) :-
    if (nlwIP == module_nlw) {
        (this->*pmodule)(IP1, IP2, OP) ;     // mul16() called here
    } else {
        switch block put here
    }
When the construct above was used, the cold cache timing dropped to the 'Modified' MUL_16_SL() cold cache time of 14244 clock ticks. This removes about 1/3 of the original cold cache time, but is still a long way above the hot cache speed. The remaining speed penalty comes from the fact that mul16() is 75 lines long. There is little to be done about this because, after the first 3 lines, the hardware prefetcher would be working flat out (mul16() is a straight line module with a regular stride pattern, albeit crossing page boundaries), and the speed penalty is due to fetch circuitry saturation.
Software prefetching would not be able to improve this unless the call to mul16() could be anticipated a long time ahead. Note that the time penalty, from code fetching, disappears on second and subsequent calls.
[The large cache capacity on modern processors will hold onto code and data for a long time, unless a flood of either causes the caches to be flushed. This largely prevents the penalty recurring.]
The other modules will suffer the penalty of a few pipeline crashes over and over - but the penalty from erroneously fetching the first lines of the modules will not recur - these lines will be in the caches after the first pass. Long term, the pipeline crashes cause the bulk of the time penalty.
[You may be puzzled by the concept of code prefetching here. The variable 'pmodule' above is the memory address of the relevant module, i.e. the point in RAM where the code starts.
[The following snippets of background information are needed here.
On all recent Pentiums, what I have called 'the
Level 1 cache', above, is actually 2 caches - 1 for code and 1 for data.
(On the Pentium 4, the code cache is a 'trace cache'
which contains micro code rather than the compiler output code - all Pentiums
actually execute micro code rather than compiler output (one of the pipeline
blocks performs this transformation))
The Level 2 (and Level 3) caches are what Intel
calls 'unified' caches. This means that they contain lines from RAM - and
whether they happen to be code or data is irrelevant. The processor ensures
that only code is moved to the Level 1 code cache, and only data moved
to the Level 1 data cache (they are easy to discriminate since they are
in different regions of RAM). Hence code can be prefetched into the Level
2 cache at will.]
It should be clear, from the above, that the overall speedup from software prefetching code is likely to be small if only because code volumes are unlikely to overpower the caches. But software prefetching data can save a good deal of time if the data volumes are large.
[Incidentally, this example highlights the benefit of taking the trouble to get a cold cache timing.]]
[This paragraph provides a possible alternative solution
to the problem above. These comments are extracted from the GoAsm Assembler
Manual (http://www.godevtool.com/). I have not experimented with them, so
can vouch for nothing. They are only suitable for switch blocks written
in assembler - and few software writers would be prepared to go to the
trouble of writing such a thing!
From the Manual (verbatim) :-
" In the P4 (and possibly in some earlier processors) you can give the processor a 'hint' as to whether or not the branch prediction is likely to occur. This is done using 2Eh and 3Eh branch hint bytes which are inserted immediately before the conditional jump instruction. Respectively they mean as follows :- 2Eh - hint that the branch will not occur most of the time. 3Eh - hint that the branch will occur most of the time."
In effect these 'hint' bytes can be used to override the default processor branching behaviour.]
'Ref 7', chapter 7 contains many more tips to avoid branching problems than I can include here.
(a) I have no hands-on experience with this one. Anyone working with multi-media problems needs to be aware of the immense power lurking below the surface of Pentium chips in the MMX and SSE registers etc. In 'Ref 7' a few chapters, towards the end, are given over to speeding up a piece of code involving the fading away of one image on screen while simultaneously bringing another one up. Using just the old x86 register set, the results were jerky; but by switching the work to the multi-media registers and instructions, a speed gain of over 10 was obtained, giving a smooth transition.
(b) Try to assist the Pentium's parallel execution capabilities :-
(All the discussion of 'parallel capabilities' here relates to either
a single core chip or a single thread on a multi-core chip.)
[Any textbook on parallel processing emphasizes the importance
of threads (on multi-core chips) working on independent data streams if
parallel processing is to be slick.
An individual Pentium core's parallel capabilities can be assisted in a
similar way. In this section, by 'data independent' I mean 'data that is not the output of
some closely preceding op'. If a data dependency between successive calculations exists,
parallel activity is suspended and the processor has to wait until the result from
the preceding calculation becomes available.
(this is usually true, but some Pentiums will 'carry on calculating' speculatively - with these models the execution unit has to return to the start if it finds that its input data was wrong after the event)]
(i) If you look in tables of Pentium op-code execution times both latency and throughput values are given. The difference between the two gives a feel for possible gains from the execution unit's maximum capabilities. Take floating point multiply - the latency of this instruction varies between processor models but is in the 6-12 clock tick region, whereas the throughput (late models) is 2 clock ticks. In other words several floating point multiplies can be moving through the hardware simultaneously!
An example of code rearrangement to exploit execution unit parallelism :-
The problem is to find x = a + b + c + d + e + f + g + h ; (all variables are floats).
[On a Pentium IV, FADD has a latency of 5 ticks and a throughput of 1 tick.]
    Bad :-
        x = a+b+c+d+e+f+g+h ;       // 7 adds = 35 ticks

    Good :-
        temp1 = a + b ;             // tempn are locals
        temp2 = c + d ;
        temp3 = e + f ;
        temp4 = g + h ;             // 3 ticks to start of this op
        temp1 += temp2 ;            // temp1 with temp2 - not available till 6 ticks
        temp3 += temp4 ;            // temp3 with temp4 - not available till 8 ticks
        temp1 += temp3 ;            // temp1 with temp3 - not available till 13 ticks
                                    // 5 more ticks to complete :- total = 18 ticks
In favorable circumstances, (e.g. adding of vectors) a speed gain of over 2 is possible exploiting this principle! Most optimizing compilers will spot the potential gains above automatically; if your compiler comes from the stone age, you will need to rearrange your code manually.
(ii) Floating point divisions (Pentium IV) have a latency and also a throughput of 23 clock ticks. Hence multiple divisions cannot occur in parallel. But all is not lost! If a data independent calculation is available for insertion between divisions, the processor will execute it in parallel with the division. This principle can be used to relieve the time penalty of large latency ops, e.g. division, square root, logarithms and trigonometric functions.
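As an illustration (variable names are assumptions; an optimizing compiler and the out of order hardware do the overlapping for you, provided the inserted work is data independent) :-

    float scale_pair(float a, float b, float d1, float d2, float *count)
    {
        float r1 = a / d1 ;      // long latency divide starts
        *count += 1.0f ;         // data independent work - overlaps the divide
        float r2 = b / d2 ;      // the 2nd divide cannot overlap the 1st
        return r1 + r2 ;         //   (latency = throughput for FDIV)
    }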
For a Pentium IV (from Intel doc 24896601.pdf, Tables C-6 & C-7) some latency/throughput values (clock ticks) are :-
    IA-32 Floating Point op    | Latency | Throughput
    FDIV                       | 23      | 23
    FCOS                       | 190-240 | 130
    FSIN                       | 160-180 | 130
    IA-32 General Purpose op   | Latency | Throughput
    ADC                        | 8       | 3
    MUL                        | 14-18   | 5
    IMUL                       | 14      | 3
    DIV                        | 56-70   | 23
    IDIV                       | 56-70   | 23
    ROL/ROR                    | 4       | 1
[The full tables are many times larger than the above. Such tables differ between processor models and cannot be generalized.]
The table above helps to explain the difference in speed between MUL_16_SL() and MUL(). In the latter there is significant 'admin.' work (i.e. saving results, moving pointers, changing loop counters) between MUL calls (more than the '5' MUL throughput figure) - hence the execution unit that handles the MUL's could start the next MUL but the next values do not arrive in good time. In the former the 'admin.' work is much less (nearer to the '5') and hence the MUL execution unit (there is only 1 of these) is working closer to maximum (throughput) capacity.
The general approach for speeding up work using op-codes with large latency and (significantly) lower throughput is to ensure that the next op-code is presented to the execution unit as close to the throughput figure as possible. In this way the time spent processing the total sequence of such op-codes is minimized. Ideally, you will want to bunch a large number of such operations as close together as possible; store the results in a buffer, then process the buffer contents in a second processing stage to achieve best speed.
In cases, like FCOS, where both latency and throughput are large, a moderate amount of admin. work can be inserted between FCOS calls so long as this does not exceed the 130 tick throughput value. This sort of 'dodge' is meat and drink to an optimizing compiler; but the use of a buffer to multi-stage the flow of operations will probably be beyond it. Suitable multi-staging of high level code should help steer the optimizing compiler towards better speed.
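A sketch of the multi-staging (the buffer size and the 'admin.' stage are illustrative; std::cos stands in for FCOS) :-

    #include <cmath>

    void process_angles(const double *angle, double *out, int n)   // n <= 1024
    {
        static double buf[1024] ;                // intermediate buffer
        for (int i = 0 ; i < n ; i++)            // stage 1 : keep the slow op fed,
            buf[i] = std::cos(angle[i]) ;        //           back to back
        for (int i = 0 ; i < n ; i++)            // stage 2 : the 'admin.' work,
            out[i] = 0.5 * (buf[i] + 1.0) ;      //           kept out of stage 1
    }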
The time to execute a MUL (latency) can be a startlingly low 3 clock ticks with register operands (6 with memory operands) on AMD64 processors (see Table 13, Appendix C, AMD document 25112.pdf) with a throughput value of 2. For such processors, chasing the throughput value, as above, is a hopeless task. Notice how 'knowing your target processor' can guide a speed up exercise. The low MUL latencies for AMD64 processors explain the high speed of the straight line 16 by 16 word multiply modules for an Athlon given in 'Results Set 1; processor (b)' above.
Elsewhere on this site I include a discussion of an extension of the Rijndael block cipher which I call RijnType. Rijndael can be speeded up considerably by the use of lookup tables as suggested in rtref 2. (the relevant lookup tables are included in rtdotcpp.html) These lookup tables are not too large and require some moderately heavy calculation to derive. Both of these conditions are needed simultaneously, if the lookup table technique is going to be a success. If the lookup table is large (relative to the caches), then each line read is likely to be done in 'cold cache' conditions - needing 100's of clock ticks. If the needed table entry can be calculated in a few 10's of clock ticks, use of a lookup table might easily slow progress down! The ideal lookup table size is probably less than half the available Level 1 data cache space; and then it might be software prefetched en masse at constructor time.
Assembler modules can provide considerable speed gains in some circumstances, but are error prone and very expensive to write (1 line per man hour will be a typical production rate!). The 'downside' of using them will almost always hold sway and only when the 'upside' is significant and necessary for project success, should they be used.
It is a fact that the functional capabilities of a Pentium processor are greater than the functionality that a C++ compiler can reach. Only if you need functionality in the 'gap' between these two, should an assembler module be contemplated. Items such as retrieving processor flags, multiplying 2 32/64 bit numbers into 64/128 bit products, division of a 64/128 bit numerator by a 32/64 bit denominator, need an assembler to access easily. In general, any task that requires extracting the ultimate arithmetical performance from a processor (e.g. cryptography) will require serious consideration of the use of assembler modules. Similarly, writing speed sensitive code to extract best performance by forcing a slow op-code to work flat out (bottom of Tip 7) may require assembler code.
It may also be possible to squeeze a module's input and local data into the tiny amount of Pentium register space available - if this is the case, access to the caches is not required; hence cache problems disappear! Elsewhere on this site an assembler module to find the multiplicative inverse of a 32 bit unsigned integer, relative to another, is listed. This module needs no more storage space than the Pentium registers (mainly) and so is very fast. (see 'eucnotes' and also 'fastmuli'.) I have yet to read a textbook that even mentions tackling word sized problems in this way; in fact it should be the first tactic tried.
RDTSC, as all experimental methods do, tells you 'the way it is' which
may well not coincide with the way you think it is. If your 'need for
speed' is a dominant consideration, RDTSC will set you right.
Having said that, you have to be aware of the influences that might act to corrupt its results.
The worst mistake a beginner can make is to call it just once. As should be obvious at this point, the first call will be heavily corrupted by cache loading problems. If single tests are persisted with, the sequence of values obtained will jump around wildly and be useless as a guide for tuning software. After a while a beginner may conclude that the method is useless and try to find an alternative technique.
This is quite wrong! The ability of RDTSC to measure cache performance is a very valuable asset - since cache problems are likely to be the main source of processor idle time; and a first step towards reducing this is to measure it. If idle time is excessive, try the techniques above to reduce it; if not move your efforts elsewhere.
Most software writers, most of the time, will mainly be interested in
measuring the 'hot cache' time a piece of their code is taking to execute,
in order to compare it to other algorithms. The way to get at this is to use RDTSC a few times (perhaps 4 or 8) in rapid succession. Then the
cache loading overhead will only corrupt the first of these and the 2nd,
3rd ... results will be the desired results. The corruption from branch
prediction errors will be small; that from operating system activity can
be large. I have discussed above how to counter these.
Even though the last timing of the set may be the most important,
it is a good idea to display and contemplate the whole set.
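A minimal sketch of this procedure (the __rdtsc() intrinsic is declared in <intrin.h> for Visual Studio and <x86intrin.h> for gcc/clang; the module under test and the repeat count of 8 are illustrative) :-

    #include <cstdio>
    #ifdef _MSC_VER
    #include <intrin.h>
    #else
    #include <x86intrin.h>
    #endif

    void module_under_test() ;                   // the code being timed

    void time_module()
    {
        unsigned long long t[9] ;
        for (int i = 0 ; i < 8 ; i++) {          // 8 back to back timings
            t[i] = __rdtsc() ;
            module_under_test() ;
        }
        t[8] = __rdtsc() ;
        for (int i = 0 ; i < 8 ; i++)            // display the whole set :-
            std::printf("run %d : %llu ticks\n", //   run 0 ~ cold caches,
                        i, t[i + 1] - t[i]) ;    //   later runs ~ hot caches
    }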
In summary, RDTSC is a very precise experimental method for timing software. However, anyone using it has to be alert to spot corrupting influences. Only a moderate skill level is needed to prevent corruption; failure to prevent it can result in nonsense being generated.
Suppose, now, that you are about to embark on writing some modules which the project management insists on being made as slick as possible. How to proceed? A very profound statement has to be taken on board at this point :-
it is impossible to estimate the time
a Pentium will take to execute an algorithm.
The underlying reason for this is that the performance of Pentium processors
is far too complicated to predict. I doubt if even the Intel chief hardware
designer could get within, say, 20-50% of any algorithm's run time.
Given the above, the only procedure that is guaranteed to end up with
high speed code is :-
(a) Get a short list of candidates from any trusted reference
(possibly human, possibly textbook) you can find.
(b) Implement them all; then time them all and select the fastest.
[Somehow I doubt that this procedure will be met with enthusiasm! However, the only way it can fail is if your list omits the best candidate.]
Above, I mentioned that I have a set of 'atomic' modules that enable me to piece together many number theory algorithms quickly. Such a preliminary step enables your candidate list to be arranged in a rough order of speed - enabling the likeliest candidate to be worked up first. It also enables a 'place holding' module to be handed over to the main project builders to enable their work to proceed.
One false prophet I should warn you about when writing high speed code
- the 'big-O' estimates of run time to be found in any textbook.
If you start using RDTSC, eventually you will end up sat at
your desk with your log book of results on one side and some heavy algorithmic
textbook on the other. As your eyes move back and forth from the one to
the other, this thought will come into your mind :-
the 'big-O' estimates in the textbook bear little relation to the timings in my log book.
I have wasted a great deal of time down the years by ignoring the previous sentence. The poor advice peddled by algorithmic textbook writers, on the subject of high speed code, might be a joke if it were harmless - but it isn't. By all means consult textbooks as a source of methods; but if you have real need to find the fastest possible executable, your options are one in number - conduct experiments.