[Some documents on this site are referred to repeatedly below. All the links to these documents are given here; return here if you want to view them.
speed.html - Discusses the basics of writing high speed code for x86 processors. Some jargon is introduced in 'speed', which is used below.
eucnotes.html - Discusses the problem of finding multiplicative inverses.
I believe the 32-bit module, presented within using a method of my devising,
is the fastest in the world.
My 64-bit (unpublished) version is used below.
tc_notes.html - Discusses the performance of some multi-word multiplication algorithms.
refs.html - Lists the technical references used on this site, together with ISBN numbers should you wish to obtain a copy.
Below, I write e.g. 'speed' when referring to the above.
The references of particular relevance in this document are :-
Ref [15] "Prime Numbers - a Computational Perspective"
by Crandall and Pomerance.
(Below I refer to this as 'Crandall'.
This is the principal reference for this document.)
Ref [9] "Algorithmic Number Theory - Vol 1"
by Bach and Shallit.
(Below I refer to this as 'Bach')
Ref [10] "A Course in Computational Algebraic Number Theory"
by Cohen.
(Below I refer to this as 'Cohen')
Ref [11] "The Art of Computer Programming, Vol 2,
Seminumerical Algorithms" by Knuth.
(Below I refer to this as 'Knuth')
Ref [16] "Handbook of Applied Cryptography"
by A.J. Menezes, P.C. van Oorschot, S.A. Vanstone.
(Below I refer to this as 'Menezes')]
[This document has some fixed assumptions. These are :-
(A) My standing problem was for primes of a size up to a most significant bit at 233 set.
('Crandall' (see article 9.2.2) uses a function B() which returns the exponent of the most significant bit of an integer. Hence maximum B(prime) = 33 within.)
(B) As a consequence of [A], most of the work below is in 64-bit assembler. This can be assumed unless I state otherwise, when high level 'C' code is used.
(Hence the term 'word' below always means a 64-bit word.)
(C) All results below were obtained using a PC with an AMD A8 processor and Windows 8.1 64-bit OS. Visual Studio (2008) was the development system.
(D) Issues not part of the main theme of this document, (like this block) or items that I guess most readers will be familiar with are enclosed in square brackets.]
Projects that use square root mod prime modules tend to use them either a small number of times or else in massive numbers; there seems to be no middle ground. If you are faced with the first scenario, the time and expense needed to develop high speed modules (as discussed below) could not be justified by project managers.
On the other hand if you are working with e.g. the Quadratic Sieve (QS) or the Number Field Sieve (NFS) algorithms, many millions (even many billions) of calls to these modules may be needed. In this case, the need for speed justifies just about any speed-up efforts.
(The QS and NFS algorithms are used to factorize composite integers - as B() for the composite grows, so the number of calls grows sharply. Extracting maximum performance from all the component modules these algorithms contain pushes back the point at which they become impractical. Run times for these algorithms can run to days or even weeks.)
A glance at any presentation of square root mod p algorithms (e.g. 'Crandall' algorithm 2.3.8) indicates that exponentiating mod p is going to be the main task.
Several cunning windowing schemes to speedup the 'exponentiating' side of this statement exist; but when 1-word arguments are in play, performance variations between them are small. It is possible that a 2-bit window method would pay off (not implemented).
The 'mod p' side of the statement is a different story. For this 2 methods suggest thenselves :-
# -------- #
Inspecting e.g. 'Crandall' algorithm 2.3.8 indicates that 3 cases have to be handled :-
These methods can be found in any of the references, so I shall not reproduce them here.
The first 2 cases are straightforward; the 3rd case is not; so what follows concentrates on the prime = 1 (mod 8) case.
[Algorithms that target primes of this form can be shown to find a square root for any type of prime. Anyone deciding to write just 1 module, would have to implement this one.
Comments below are specific to primes that are 1 mod 8; for other primes some of the comments need reworking.]
For this case all the references, given above, start from an algorithm originally developed by A. Tonelli (1891) and latterly improved by D. Shanks. Variants appear in all the references above.
'Bach' presents an algorithm he labels as Tonelli() (pg 165) which presumably is close to Tonelli's original presentation. The other references rely on Shank's development of Tonelli's algorithm; and are similar but not identical. I implemented 2 of these - algorithm 3.34 of 'Menezes' and algorithm 1.5.1 of Cohen.
[The following theory is needed shortly :-
I define the p* set as {1, 2, ... (p-1)}.
Clearly for an odd prime 'p', the p* set has an even number of elements, and this set splits into 2 equal order ('order' means 'number of elements') sets - based on whether the element has a square-root or not. In text books, the first is called the Quadratic Residue set and the 2nd the non-Quadratic Residue set.
I shall use {QR} and {NQR} as labels for these sets.
(The Legendre symbol is used to indicate which of the 2 sets an element belongs to :-
the elements of {QR} have a Legendre symbol of +1 and
the elements of {NQR} have a Legendre symbol of -1.)
(Clearly the input value to any square-root mod p algorithm has to be in {QR}, otherwise the algorithms will be attempting to find a value that does not exist. In the textbooks, this problem is handled internal to the algorithms; but in my target application this problem is handled externally - so I have deleted these steps.)
(It can also be proven that if 'sqrt' is the value that is the square root of some element of {QR} then (prime - sqrt) is also a square root .
There is no guarantee as to which of these 2 values will be returned by the algorithms under discussion; failure to allow for both possibilities during testing will result in a "drat and double drat" moment.)]
The 1 mod 8 algorithms, discussed below, require finding an element of {NQR}. As stated above, elements of {NQR} have a Legendre symbol of -1. Textbooks recommend 2 methods for doing this. First, try random values; second, try small prime values in ascending order (3, 5, 7, 11, ...).
[Recall that, for primes of form 1 mod 8, the Legendre symbol of '2' is always +1. i.e. '2' is always in {QR} for these primes;
(doodle with e.g. 'Crandall' eqn 2.10 to see this)
and also that the Legendre symbol of a composite is the product of the Legendre symbol of the composite's factors (see e.g. 'Crandall' eqn 2.8).]
#
~
~
]
]This suggests a third method :- try the odd integers {3, 5, 7, 9,...}, in ascending order. The idea here was that since '2' is in {QR} then {6, 8, 10 ...} will have the same Legendre symbol as {3, 4, 5 ...}; but '3' will have been tested, and rejected
before '6' is reached, so it must have been an element of {QR}
, and so '6' will have the same Legendre symbol as '3' (i.e. +1), and need not be tested. So this scheme steps over some elements of {QR} and there should be a bias for small {NQR} elements to be odd.
[The attraction of this idea is reduced when considering what happens to squares (larger than the prime) when the 'mod p' operation is done. The resulting values are scattered all over the p* set. Indeed, it can be shown that over the whole p* set there is no bias of evens/odds being in {NQR}/{QR}.]
There is some wasted effort with this scheme :- {9, 25...}, being odd natural squares, have a Legendre symbol of +1, without need of a test. My code 'steps over' 9 but ignors 25 and higher odd squares since the routine will almost certainly have succeeded before reaching them.
Nevertheless, the scheme will always succeed (some odd elements of {NQR} will always exist) and the wasted effort is small.
To see if any advantage is available from one of the three methods, I found 4096 random primes with B() values of 33, then tried the candidate methods.
[Of the 4096, 1162 were found to be of the form '1 mod 8'.
(The expected value is 1024. The standard deviation of the binomial distribution for this problem is 27.7 and so 1162 is 4.98 standard deviations away from the mean, a suspiciously high figure. On the other hand, the total probability for a value greater than 1162 being found is 0.088.)]
[Since the size of the {QR} set equals the size of the {NQR} set, there is a 50-50 chance of success from any trial. Hence the chance of succeeding on the first trial is 50%; the chance of success on the second trial is 25% (50% of the 50% chance of failure from the first trial) ... The expected number of trials before success, for this probability distribution, is 2.]
I found little difference between the 3 methods :-
Average number of trials using random selection = 1.99
Average number of trials using prime numbers = 2.01
Average number of trials using odd numbers = 2.05
[The experimental error from these tests is surely greater than the differences between these values, so these values are, as far as the experiment here is concerned, equal.
Notice, also, that there is no evidence from these tests that the 'odd numbers' method or the 'prime numbers' method gain any advantage over the 'random number' method.]
Consider prime = 41 (clearly of the form 1 mod 8) :-
{QR} = { 1, 2, 4, 5, 8, 9, 10, 16, 18, 20, 21, 23, 25, 31, 32, 33, 36, 37, 39, 40} (order = 20) {NQR} = { 3, 6, 7, 11, 12, 13, 14, 15, 17, 19, 22, 24, 26, 27, 28, 29, 30, 34, 35, 38} (order = 20)
Note :- 4 of the first 6 entries in {NQR} are odd compared with only 1 entry in {QR} (excluding {1, 9} - not members of the search sequence) being odd. Hence, for '41' and small entries, the chances of success from the 'odd number method' are good.
Note also that the number of evens and odds is equal in both sets; indicating that the even/odd bias does not apply to the sets as a whole, i.e. it applies only to smaller values.
In conclusion :-
The 'odd number' method is simple, cannot fail, and is as good as any other.
The 'prime number' method (assuming a look-up table of small primes is used) may fail by running off the end of the table; although the probability of this can be made as tiny as you like by making the table large. On the other hand why take the risk?
The 'random number' method is probably superior to the others, but any advantage is tiny.
One final point. The Legendre symbol is conventionally found by e.g. 'Crandall' algorithm 2.3.5, which exploits symbol properties and reduces argument size, step by step.
[My version of this is written in assembler, and hence very fast.
Note that inspection of any Legendre symbol algorithm presentation, has a line (after swapping input values) :-
... a = a mod m.
Here, for the first pass, 'a' will typically be several 10's of bits in size, and 'm' probably < 4 bits in size. Repeated subtraction to perform such a 'mod' would be slow. Similarly, use of 'DIV' (in every pass) would be slow. I use a modding technique which aligns the most significant bits of 'a' and 'm' then repeatedly attempts subtraction followed by a shift to get to the 'mod'.
An alternative technique would be to use 'DIV' on the first pass then repeated subtraction with subsequent passes. (not tested)]
A second possibility for finding the Legendre symbol is to use the defining formula :-
Legendre symbol (a/p) = a(p-1)/2 (mod p)
In tests, when B(prime) = 33, the Legendre algorithm was a little over twice as fast as the defining formula (with the Montgomery method in use) - saving about 500 ticks. This seems conclusive - but there is room for debate.
Consider the case where a project decision has been taken to implement the square root mod p algorithm using the Montgomery technique. In this case, evaluating the Legendre symbol using the defining formula is trivial. Since evaluating exponentials rapidly is the principal part of finding square roots, nearly all the code needed will be in place and only a handful of extra lines are needed to evaluate the defining formula.
[This compares with several man weeks of effort to write a fast assembler version of e.g. 'Crandall' 2.3.5.]
Consider, also, a project where 1 million calls to the Legendre symbol module are to be made :- a simple calculation gives the time wasted by using the defining formula as 0.2 secs on a 2.5 GHz processor - for many projects irrelevant.
The 'Menezes' algorithm 3.34 needs the multiplicative inverse of the value being square rooted; whereas the 'Cohen' algorithm 1.5.1 does not. (This is the principal difference between them.)
In 'eucnotes' I present the world's fastest 32-bit word multiplicative inverse finder, which I developed some years ago.
[The claim that the 'eucnotes' module was the world's fastest was put on the web at the same time as the document, and nobody has challenged it yet!]
[For this document (where B(prime) = 33 i.e. beyond the reach of the 'eucnotes' module) I used an unpublished 64 bit version.
When testing the 32-bit module, I found it 100's of times faster than a gauche effort written previously.
Speed was not important when I wrote this earlier version and it was a crude cobbling together of other modules.]
[You may want to compare the speed of your multiplicative inverse module to the results presented towards the end of 'eucnotes'.
(I recommend being sat down with a stiff drink on hand when doing this - you will likely find that your masterpiece is a pile of dingo's kidneys. If it isn't let me know!)
(Speed testing of software is a task with hidden pitfalls. Read document 'speed' to both learn the basics and avoid the worst of the pitfalls.)
If your multiplicative inverse module is not competitive for speed, then the speed performance of the 'Menezes' algorithm is going to be compromised. If this is the case you will have 2 options; either write a fast multiplicative inverse algorithm using the method presented in 'eucnotes' (requires a significant effort) or go with 'Cohen' algorithm 1.5.1.]
I am not going to present the full theory of this method; instead I just discuss the nuts and bolts. Any textbook will supply the full theory and its application.
[What follows applies to a single mult-mod. The Montgomery method provides no speed benefit for such a task. However, finding exponentials involves many mult-mods and then the Montgomery method provides considerable speed gains. Studying the single mult-mod case just provides an introduction to the method.]
The method uses 3 stages to perform the mult-modding of 2 integers :-
Stage 1 :- Map the integers into the 'Montgomery domain'.
Stage 2 :- Multiply the Montgomery domain values.
Stage 3 :- Map the Montgomery domain values back to the integers.
The method makes use of a variable 'R' = 2n; where 'n' is any convenient value such that R > prime.
[I tried using :-
(a) n = B(Prime) + 1 (here 34)
(b) n = 64 (word boundary)
I found (a) to be about 30% slower than (b); so (b) was adopted.]
[I shall represent variables in the Montgomery domain by adding '-hat' to the variable name. Thus a variable e.g. 'x' in the integer domain becomes 'x-hat' in the Montgomery domain.
The Montgomery method is not limited to prime modulii - it might be used with a composite modulus (so long as this modulus is odd). However in this document, the modulus is always prime and the word 'prime' is reserved for the modulus.
('Crandall' uses 'N' to represent an odd composite modulus - I follow his convention below whenever a statement applies beyond the reach of this document.)
Below I assume 'x' and 'y' are the input integers and z = x.y mod prime is the output required.
(Everything below relates to the 1 word variables case. However, the Montgomery method applies to the the multi-word problem with minor adjustments.)]
The Montgomery method :-
Stage 1 maps integers into the Montgomery domain using :-
x-hat = x.R mod prime
and with R = 264
x.R = {x, 0} (64 bit words)
so x-hat is found using opcode 'DIV'
with {x, 0} as numerator and prime as divisor,
whereupon x-hat is the remainder.
[Note that x-hat < prime
and zero maps to zero.]
y-hat is found in a similar way.
Below, I shall need a value 1-hat.
1-hat = {1, 0} (= R) mod prime,
in words, 1-hat is the value '1' mapped
into the Montgomery domain.
[1-hat is used as both a variable start value
and in conditional statements such as
if (variable == 1) then ...
It is impractical to map out of the
Montgomery domain, make such tests
then map back in again before proceeding.
Instead the following can be used :-
if (variable-hat == 1-hat) then ...]
Stage 2 is given in detail in 'Crandall' algorithm 9.2.5 which he calls algorithm M().
M() takes 2 Montgomery domain values and finds their product, also in the Montgomery domain.
i.e. z-hat = M(x-hat, y-hat) ;
M() is a 2-step module :-
Step 1 :- Find t = x-hat.y-hat
('t' an intermediate value)
[This is ordinary multiplication.]
Step 2 :- Find z-hat = f(t)
f(t) is given in 'Crandall' eqn 9.7 as :-
z-hat = (t + prime*((t*N')&(R-1))) >> n ----(A)
[Here '&' is bitwise ANDing of 2 words
and '>>' is the 'C' right-shift operator.]
It is possible for (z-hat >= prime) to occur.
If so, set z-hat -= prime ;
(It can be proven that only 1 subtraction
will ever be needed.)
[Note that 1 MUL call occurs in 'Step 1'
and 2 in 'Step 2' - a total of 3.]
[Below, I show that the other operations
in eqn (A) are "do-nothing ops" -
- so long as 'n' is a multiple of the word size.]
Stage 3 maps Montgomery domain values back to the integers. This is done by calling M() :-
integer = M(integer-hat, 1)
In this document exponentiation is the central problem. 'Crandall' covers this in algorithm 9.2.6. In this algorithm the value sought, (say v-hat) starts as 1-hat, evolves depending on the exponent, then at the end, is mapped to the integers by calling M(v-hat, 1) which returns the integer sought.
The above is the skeleton of the method. I shall now flesh it out.
Details for implementing the Montgomery method.
Notes on N'. Variable N' is defined as :-
N' = (-prime-1) mod R
[Note B(N') < n (= 64 here)]
Finding prime-1 must be done rapidly for the standing problem of this document. The obvious method is to use the module in 'eucnotes', but here it is not the fastest. Crandall suggests using a Newtonian method detailed in his Exercise 9.12. I tried both methods and found the latter about 3.2 times faster.
[This might seem to contradict my claim that the method in 'eucnotes' is the worlds fastest inverse finder. The Newtonian method of Exercise 9.12 is only fast when the modulus, R, is of the form 2n. For any other modulus it's speed is poor. Hence the Newtonian method is only fastest for these special forms of modulus; whereas the 'eucnotes' method is much faster with general modulii.]
[This point is important whenever the Montgomery method is being contemplated with either single or multi-word modulii.
Conventional inverse finding methods should never be used to find prime-1 (or N-1).]
In tests, I found the Newtonian method found prime-1 in about 7 iterations with B(prime) = 33, with an elapsed time of about 240 ticks.]
Notes on evaluating Eqn (A).
The workload to evaluate Eqn (A) depends on whether 'n' (the exponent of 2 in the expression for 'R') is (a multiple of) wordsize or not. Above I found that fractional wordsize n's are not competitive for speed. Hence, what follows only applies when n = (a multiple of) wordsize (and multiple = 1 here).
[(Next you will need to know that the results of a (64 bit) MUL are placed in rdx:rax, with rdx the most significant word...).
I shall use get_rdx(), get_rax() as code for 'extract the most-significant word' and 'extract the least-significant word' from the output of a MUL. Note that these are "do-nothing" modules in assembler, since the values are sat in registers, ready to be used.]
Starting with the shift at the right of Eqn (A) :-
{128-bit expression} >> n
since 'n' = 64 this is equivalent to :-
get_rdx({128-bit expression})
Secondly consider the "&(R-1)" part :-
Noting that (R-1) = a word with all bits set,
then the central part of Eqn (A) becomes :-
get_rax (t * N').
Putting these 2 formulae together gives :-
get_rdx (t + (prime * get_rax (t * N'))) -- simplified Eqn (A)
Working with 'n' = 4 (i.e. R = 16) and prime = 7 to simplify calculations :-
[The use of R=16 here (rather than R=8) means the
get_rax() method can be replaced by "get least significant hex_digit()"
and get_rdx() by "get most significant hex digit()", which eases things next.]
The "map-in" is found to be :-
integer value - 0 1 2 3 4 5 6 Montgomery domain value - 0 2 4 6 1 3 5 [As an example, consider the map-in of 5 :- we need to find {5, 0} mod 7 (numbers in nibbles) = 0x50 = 80 (base 10), and 80 mod 7 = 3.] The (inverse of 7) mod 16 = 7 (by trial and error) then N' = N - (inverse of 7) mod R = 16 - 7 = 9. [Note that N' is an 'integer domain' value but is used as-is in the Montgomery domain.] Suppose (2 * 5) mod 7 is being sought. (result = 3 with result-hat = 6) Stage 1 gives :- 2 maps in to 4; and 5 to 3. Stage 2; Step 1 gives :- (Stage 2 is 'Crandall's M()) t = 4 * 3 = 12 Stage 2; Step 2 gives :- Simplified Eqn (A) with t=12 becomes :- get_rdx (12 + (7 * get_rax (12 * 9))) 12 * 9 = 108 = 0x6C and get_rax(0x6C) = 0xC (= 12 base 10) Then (12 + (7 * 12)) = 96 = 0x60 and get_rdx (0x60) = 6 (note '6' is the the correct result-hat value) [Note, get_rax(0x60) = zero ALWAYS. This provides a useful debugging check.] Stage 3 :- To map-out 6 to the integers we call M(6, 1) :- Step 1 :- t = 6 * 1 = 6 Step 2 :- get_rdx (6 + (7 * get_rax (6 * 9))) 6 * 9 = 54 = 0x36 and get_rax (0x36) = 6 then (6 + (7 * 6)) = 48 = 0x30 and get_rdx (0x30) = 3 (Phew! It works!) All this looks ugly on first encounter - but it isn't too bad. IT'S GREAT ADVANTAGE IS THAT IT IS FAST :- There are only 3 MUL calls in each call to M() with no DIV calls. The end call to M() to 'map-out' is a one-off and when finding an exponential, which incurs maybe 50 calls to Stage 2, is a negligible overhead. Ditto the 'DIV' calls in the 'map-in' Stage 1.
[An alternative map-out method exists; this uses :-
x = (x-hat.R-1) mod prime (or mod N)
(If a mapping table exists R-1 mod prime can be found by inspection - find the value of '1' in the x-hat row, and take the x-value it came from.
In the mapping of the example, this gives R-1 mod 7 = 4.)
In general, finding R-1 is a time wasting task; hence is unacceptable when speed is important. However, whenever a mapping table has been worked out, (i.e. some form of doodling exercise is being done) it provides an easier method of mapping out than the Stage 3 method.]
Points from the example.
A problem when debugging is that Montgomery domain values are close to meaningless, and this reduces the value of the debugger.
I found that, when checking the exponential module, I had to work a non-Montgomery test case by hand; 'map in' all the intermediate values, before debugging, then use the debugger and compare the pre-calculated variable-hat values with what appeared on screen. This technique is limited to small primes (say less than 2 decimal digits) otherwise the hand calculation is likely to disintegrate.
The speed gain from using the Montgomery method to evaluate 'mod N exponentials', with either word or multi-word sized arguments, should be obvious at this point. The case for using it with multi-word one-off multiplications is much weaker.
The Barrett method provides an alternative to the Montgomery method, but for 1 word arguments they are essentially the same.
The performance of square root mod prime algorithms is complicated by the fact that results depend upon the distribution of bits in the exponents, which are derived from the prime. When evaluating exponentials the number of calls to Stage 2 (above) ('Crandall's M()) depends upon the number of set bits in the exponent - this effect causes a scatter of about 20%.
Also, primes of the forms 5 mod 8 and 1 mod 8 both contain conditional statements which may/may not lead to extra exponentials having to be evaluated. This causes a scatter of about 50%.
The only way to handle these problems is to time many primes, and take averages. Additionally, best case and worst case results give a feel for scatter.
Below, I feed 32 primes into the modules to gather statistics.
[The results below are all based on B(prime)=33.]
[All timings, below, are 'hot cache' - see 'speed' for the meaning of this term.]
Only 1 exponential ever needs evaluation for primes of either form. Square roots for primes of this form will execute faster than the other 2 types below.
Results found were :-
Speed results for Primes of form 3 mod 8 :- Average ticks = 1340; Worst ticks = 1447; Best ticks = 1225 Speed results for Primes of form 7 mod 8 :- Average ticks = 1326; Worst ticks = 1450; Best ticks = 1206
[These results should be identical - the variation is due to the vagaries of speed testing on x86 processors. Invariably, the last 1-1½ hexadecimal digits have to be ignored. This scatter is of minor significance.]
[Also, these results indicate the time needed to perform a single Legendre symbol test by means of the 'defining formula'. Such tests need to be evaluated twice on average for any algorithm requiring an element of {NQR}.]
For these primes I found :-
Speed results for Primes of form 5 mod 8 :- Average ticks = 1914; Worst ticks = 2600; Best ticks = 1246
Note that the 'best' speed is little different from the {3, 7} mod 8 case. The 'worst' speed is about double the best speeds for the {3, 7} cases. Also, the 'average' speed roughly splits the 'best' and 'worst' speeds, indicating that about half the input primes force the execution of a second exponential.
[No elements of {NQR} and no multiplicative inverses are needed by the 5 mod 8 algorithm.]
For these primes I found :-
With Cohen's algorithm 1.5.1 :- Average ticks = 3132; Worst ticks = 5966; Best ticks = 2455
These times include the external calculation needed to find an element of {NQR} by my Legendre symbol finding module.
With Menezes algorithm 3.34 :- Average ticks = 3956; Worst ticks = 6470; Best ticks = 3378
These times include the external calculation needed to find an element of {NQR} by my Legendre symbol finding module, along with finding a multiplicative inverse of the value being square rooted.
The average time is about 25% more for the Menezes algorithm than the Cohen algorithm.
The Menezes algorithm is slightly the easier of the two to implement. However, if your multiplicative inverse module is not competitive with the module presented in 'eucnotes', the time to execute the Menezes algorithm will grow by the difference, which may be large.
[All the above exclude the check that the input value is an element of {QR} (this is handled within my target application).
Also, the text book descriptions all start with a line to reduce the input value mod p; this is not included in the timings either, since my target application prevents this.]
'Knuth' algorithm 'S', exercise 19, section 4.6.2 is similar to Cohen's algorithm, except that step S1 of the algorithm uses the 'defining formula' for the Legendre symbol to find an element of {NQR}. As found above, this wastes a little time, but should run Cohen's algorithm close for speed (not implemented).
(A cautionary note about the 'Knuth' presentation :-
The last line of section S3 might easily be misunderstood. What the line intends to convey is that the set of 4 parameters should be updated with each new parameter calculated using the previous parameter set; then when complete, overwrite the old set by the new. Treating the parameters singly (rather than as a set) will result in a mess.)
An algorithm not discussed in this document is Cipolla's method (aka Cornacchia's method) (see e.g. Menezes algorithm 3.39). This requires exponentiating indeterminate 'x' mod a degree 2 polynomial with Finite Field coefficients. This is a daunting task in assembler and I have not implemented it. It is difficult to believe that it would be competitive with the methods above.