Notes on the block cipher RijnType,

an extended fast Rijndael clone

WARNING :- RijnType has NOT been subjected to independent assessment.
I guarantee only that it represents my best efforts.
I give no warranty that it is fit for any purpose!

This document is one of 5 inter linked documents. The others are :-
(a) The RijnType home page (no technical content).
(b) RijnType.cpp C++ code.
(c) The RijnType.h file to accompany (b).
(d) Technical references.

Go to Site Home Page

This document describes the block cipher RijnType. This cipher takes all of its inspiration from the block cipher Rijndael, the winner of the AES competition in 2000.
[Note the version of Rijndael submitted to the AES had a fixed data block bit size of 128 bits, and key block bit size of 128 or 192 or 256 bits. Rijndael itself is defined with data block bit sizes of 128 to 256 bits, in steps of 32 bits, and key block bit sizes of 128 to 256 bits, also in steps of 32 bits. Below, I shall refer to these 2 versions as either Rijndael (implying the full version) or AES Rijndael (implying the reduced capability version).]

In modern cryptography, ciphers generally consist of 2 components :- a 'public key' component (which might be RSA, or an elliptic curve system, to name but the two most common systems) followed by a block cipher. The public key component arranges a 'session key' for use by a block cipher which carries out the message encryption/decryption. Public key ciphers are typically several hundred times slower than block ciphers in operation, but are included to solve the difficult problem of 'session key' set up. Hence these two components combine to provide a cipher with a good balance of ease of use and speed.

The RSA algorithm has been in the public domain for over 20 years and has been the focus of relentless attack. It is now so weak that it would not be considered in 'new build' projects. Elliptic curve (EC) systems have superseded RSA for public key components.

From the security viewpoint, the two cipher components can be thought of as a chain with two links; with the weaker of the two dictating overall penetration resistance. When an elliptic curve system is the provider of the public key role, the bit size of the Field prime of the elliptic curve has to be double the bit size of the block cipher if the two components are to be matched in strength. (see e.g. ref 1 Section 1.3) (The ratio of '2' is best thought of as a 'rule of thumb' rather than an iron law. A few percent difference in the ratio of '2' is neither here nor there. The ratio could in any case be changed at any time by advances in theory or technology.)

Currently, (mid 2008), the main purpose of my business is the sale of elliptic curves for embedding into the cryptographic products of other software teams. These curves are some of the strongest elliptic curves available to the general public and are large enough to provide communications and database security, for any commercial purpose, over at least a medium term time scale.
If sensitive commercial information is to remain secure for, say, 10 years, then security planning has to estimate the computer power available to an attacker 10 years hence. Contemplation of the advances in processor power over the past 10 years, leads to the conclusion that block cipher bit strengths in the mid hundreds of bits matched to elliptic curve bit strengths up to 1000 bits are needed to guarantee security.
Block ciphers of this strength are hard to find and elliptic curves to match them even harder. As a result of developing software to implement the Atkin Morain method for setting up elliptic curves, I can generate prime order elliptic curves of any bit size at will. However, the commercial value of the elliptic curves that I can generate is much reduced by the absence of a 'matching' strength block cipher to go with them. For this reason, I have reluctantly been obliged to enter the world of block cipher design, and RijnType is the result.

The assumptions underlying the design of RijnType :-

(a) RijnType is founded on the design principles of block cipher 'Rijndael' as given in refs. [2] and [3].

(b) Where I have gone beyond the scope of Rijndael, I have introduced some new features but these remain 'in the spirit' of Rijndael.

(c) If a mode of operation for RijnType is within the limits of Rijndael, RijnType produces identical results to Rijndael. (however, see the section on 'compatibility' below)

(d) Rijndael was developed to be satisfactory on smart cards and equipment with tiny memory resources. No such considerations have influenced the design of RijnType. My target machine is a PC with a 32 bit Intel x86 processor and with unrestricted memory resources.
Both Rijndael and RijnType benefit considerably in speed when lookup tables are introduced. Tables T0[], T1[], T2[] and T3[] for encryption and iT0[], iT1[], iT2[] and iT3[] for decryption are used in the code file rtdotcpp.html. The theory and use of these tables is discussed in ref. 2 section 4.2.
The speed gain that results from these tables, and the functional equivalence of RijnType to Rijndael (so long as RijnType is restricted to 256 bit (or below) block size and key size) makes RijnType of interest to those software teams looking for nothing more than a fast implementation of Rijndael.

(e) RijnType is intended to interface with elliptic curve public key cipher systems.
/* ------------------------------------------------
The interface between elliptic curves and block ciphers
Since elliptic curve public key cipher systems are several hundred times slower than block ciphers, It is generally expedient to call them just once per 'session'.
The secret information passed through the elliptic curve is a scalar multiplier less than the 'order' of the elliptic curve. Hasse's Theorem indicates that the 'order' is limited to (very nearly) the same bit size as the elliptic curve Field bit size. As mentioned above, for balanced security the elliptic curve module Field bit size has to be double the bit size of the block cipher. Hence the possibility exists for the transmission of a key of bit size up to double that of the block cipher.
[From here down, I shall use the label ECscalar[] when referring to this item. From above, ECscalar[] will be a word32 array of double the word32 size of the block cipher.]
In fact, operating RijnType with a double sized key would be unusual, since ECscalar[] would typically include fields for items like MAC's (message authenticating codes) time stamps and others, along with the block cipher key material. If the elliptic curve is to be called just once, and it is not considered wise to include these fields into the block cipher key, this will reduce the space available for block cipher key material. Hence the key material needs to be adjustable in size to handle different project requirements. Allocating fields for the input to the elliptic curve stage needs careful thought on a project by project basis.
For RijnType the minimum cipher block bit size = the minimum key bit size = 128 bits (also the Rijndael minimum); and these bit sizes can be independently increased in steps of 32 bits to a maximum cipher block bit size of 512 bits and a maximum key bit size of 1024 bits. This has been done to enable RijnType to be adaptable to handling the considerations in the paragraphs above.
[RijnType (as Rijndael) processes data in a rectangular array of bytes with 4 rows and a variable number of columns (given the variable name Nb). The upper limit of 512 bits on the data block bit size causes the rectangular array of bytes used by RijnType to be 4 rows by 16 columns. This compares to a maximum of 4 rows by 8 columns in Rijndael and a maximum of 4 rows by 4 columns for AES Rijndael.
[At some point in the future I may develop a RijnType8 block cipher which will have 8 rows and 8-16 columns for data block bit sizes of 512-1024 bits in steps of 64 bits. RijnType8::MixColumn() would need a mutually inverse pair of degree 8 polynomials over GF(2^8) ideally with small coefficients. Such a block cipher would work very well on a processor capable of 64 bit arithmetic.]]
-------------------------------------------------- */

The block cipher Rijndael

This section is intended to give a very brief outline of Rijndael (see refs. 2 or 3 for full details). RijnType is based on identical theory but uses a different software implementation.
Rijndael takes the 'key' material passed to it and expands it to a length suitable for the block encryption being done. Blocks of this expanded key are XOR'd to intermediate results at the beginning, end, and between each round. The key material exerts a profound influence on the security of any block cipher.
A Rijndael round is the 'engine room' of the block cipher. Within a round, data is arranged in a rectangular array of bytes and operations are performed on the individual bytes (the SubBytes() function) the rows (the ShiftRow() function) and the columns (the MixColumn() function). The design idea behind these functions is to provide non linearity (from SubBytes()) together with very rapid 'diffusion' from the other 2 functions (which can be taken to mean that every bit in a data block has an influence on a large number of other bits within the block), round by round.

Rijndael (and RijnType) uses the following variables to control its working :-
   Nr = the number of rounds for the block cipher, dependent on Nb and Nk.
   Nb = the number of 32 bit words for the data block = data block bit size/32.
   Nk = the number of 32 bit words for the key block = key block bit size/32.
   C0, C1, C2, C3 = the rotations applied to the 4 rows of the data block within ShiftRow().

From the data already given Nb for Rijndael is el{4, 5, 6, 7, 8} and {4} for AES Rijndael.
Similarly Nk for Rijndael is el{4, 5, 6, 7, 8} or el{4, 6, 8} for AES Rijndael.
[For RijnType Nb is el{4,...16} and Nk is el{4, ...32}.]

Nr is obtained from Nk and Nb as :- Nr = 6 + max(Nb, Nk) (For Rijndael 10 <= Nr <= 14)
[This formula is also used in RijnType; for which 10 <= Nr <= 38]

The values of C0, C1, C2, C3 are given in ref. 2 table 3.1 (C0 is always 0, and C1 always 1). These values are used unchanged in RijnType. For values in RijnType beyond the limits of Rijndael, see comments on class ShiftRow, below.

A discussion of the block cipher RijnType

[In this section I shall assume that you have some familiarity with Rijndael theory and practice.]
RijnType was written for use on a desktop PC. This implies that, where possible, the code needs to be 32 bit oriented (Rijndael is strongly byte oriented) and that memory usage is unrestricted. RijnType has been written to be as fast as possible without the use of assembler. It is written in object oriented C++ code, whereas the Rijndael reference code is in 'C'. (see ref. 2 Appendix E)
The authors of Rijndael give as their design criteria 3 items :- 'Resistance against all known attacks', 'Speed and Code Compactness' and 'Design Simplicity' (see ref. 3 section 3) and it is widely accepted that they achieved all these. I have given more emphasis to speed, principally by the introduction of large look-up tables Tn[] and iTn[] (which are discussed, but not listed, in ref. 2 section 4.2). With the introduction of these tables, the 'Simplicity' and the 'Code Compactness' objectives are weakened, but the underlying 'simple' Rijndael theory is unchanged.
In summary, RijnType leaves the Rijndael theory unchanged, extends the principal variables to handle larger block sizes, and makes software changes to enhance speed.

I shall now discuss the RijnType classes, one by one, to explain the working of the software. These classes are largely self documenting in the files rtdotcpp.html and rtdoth.html which contain code listings, hence the following merely fills in a few gaps.

Class RT_ByteOps
This carries out the standard Rijndael S-box substitutions and word32 rotations.

Class ShiftRow
This class both holds the intermediate data values as word32 items, and it's do_shift() method performs the same task as ShiftRow() in the Rijndael reference code. Although functionally identical to ShiftRow(), it works differently in that no word32 shifts left or right are done.
Two word32 arrays colsource[] and coldest[] are used to hold the data block. In methods RijnType::Encrypt() and RijnType::Decrypt() (see below) data is read from colsource[] and written to coldest[] after use of the look up tables Tn[] and iTn[]. The ShiftRow::do_shift() is called at the end of each round simultaneously moving the data from coldest[] back to colsource[]. This is done for 2 reasons :-
(a) It avoids 'byte' MOV's which on Pentium machines involve a time penalty (this processor actually 'moves' 64 bit quantities and then has to shift the data to extract a byte. This shift carries a time penalty. 32 bit items (so long as the last 2 bits of their address are zero) are not affected by this. Failure to align data on a pentium in this way carries a typical 30-40% time penalty and possibly more. (Alignment of data is ensured at compile time by setting 'flags' - an important detail when preparing a RijnType executable.) RijnType is intended to be very fast and a time penalty of this sort is unacceptable.
(b) No shifts (as in '<<' or '>>' C code) are needed in ShiftRow::do_shift(). These operations translate into SHL and SHR machine operations which are not what Intel calls 'UV' operations. This means that they cause V-pipeline stalls, which implies time penalties. A glance down the code of ShiftRow::do_shift() indicates that only AND, OR, MOV will be needed all of which are in the 'UV' set; hence V pipeline stalls (time penalties) will be avoided.
Given that C0=0 and C1=1 always (as Rijndael), the C2 and C3 values used in ShiftRow were derived using the following formulae :-
    C2 = max (2, floor(Nb/2) - 1) ;
       if (Nb > 8) {
          if (Nb > 12) C2-- ;
          while (gcd (Nb, C2) != 1) {C2--} ;
       }
    C3 = min (3, ceil(Nb/2)
[These formulae were set using the following deliberations :-
(a) They give the Rijndael values when Nb <= 8.
(No formulae for these coefficients is given in the Rijndael references and these formulae are not unique.)
(b) Row 3 is intended to perform the largest shifts possible, and this requires C3 to be near to the 'half cyclic shift' value Nb/2.
(c) C2 is intended to shift row 2 roughly mid way between row 1 and row 3; with the additional condition that row 2 should 'visit' every possible position in Nb calls, which leads to the gcd() condition.]

Class KeyedLFSR
This class is the only item in RijnType which has no connection to Rijndael. It can be viewed as performing a secondary key-expansion.
[Any user wanting a 'pure' Rijndael extension, can just go through RijnType.h and RijnType.cpp stripping out all references to this class. Please rename the block cipher that results to something of your choice, perhaps RijndaelE - 'E' indicating 'extended'.]
With RijnType working at its maximum capacity, the expanded key (array KeySpace[]) occupies about 5 times the space needed by Rijndael at its maximum. During development of RijnType, it occurred to me that the key-expansion method used in Rijndael would benefit from a 'boost' to handle this larger expanded key.
KeyedLFSR uses a highly populated 32 bit LFSR together with the key material handed to RijnType to form a moderate quality key driven 32 bit random number generator. It takes 3 successive word32's of key material and 'expands' them into 8 word32's for use in KeyHandler::KeyExpansion(); then starts again with the next 3 word32's of key material .... The output word32's are intermittently XOR'd into the expanded key formed using the Rijndael key expansion. Hence the Rijndael key expansion process is left unchanged and this secondary key expansion technique is added to reinforce it.
[If Nb and Nk are both <= 8 (the Rijndael limits) then KeyedLFSR is not called into action - this is necessary to maintain RijnType as a Rijndael 'clone']
LFSR's, when operated in the 'multiply by x' mode, are very fast in software, and, for cryptographic products, have some attractive properties and some unattractive ones. The main problem with them is that the auto correlation curve of the output bit pattern is not ideal. Various schemes are in use to improve the quality of LFSR output e.g. see ref.4 chapter 16.
For this application a very high quality output is not essential - the quality of the output of KeyedLFSR::next_value() is mainly dependent on the quality of the key material. Nevertheless, I have included some quality enhancing features.
Three variables are used to improve the output quality - 'state', 'run length' and 'selector' all set from the 'raw' key material. After 8 word32's have been output from method KeyedLFSR::next_value(), all 3 variables are replaced by using new 'raw' key material. The quality enhancing technique uses multiple output values from the LFSR and XOR's them together to form the composite output value of KeyedLFSR::next_value(). The 3 variables control this process - 'State' is the LFSR initial contents, nibbles from 'run length' give the number of raw LFSR outputs that are XOR'd together, and the bit pattern in 'selector' is used to decide whether an LFSR output value is XOR'd into the output value of KeyedLFSR::next_value(), or not.

Class KeyHandler
This class performs the same function as Rijndael's KeyExpansion() method, but with the following exceptions :-
(a) When Nk is beyond the limit of that covered by Rijndael (max Nk=8) then KeyedLFSR discussed above is introduced. This is a secondary key expansion process that has no part in Rijndael.
(b) As mentioned above, RijnType makes a greater effort for speed than Rijndael. The main speed enhancing step is the introduction of the arrays Tn[] and iTn[], which are used in the RijnType::Encrypt() and RijnType::Decrypt() methods discussed below. This technique is introduced in ref. 2 section 4.2, (Ref.2. Fig 3.12 is particularly useful for clarifying the process.). The use of these tables requires changes to the expanded key when decrypting a message. Excluding the first and last Nb word32's of the expanded key, the expanded key has then to be passed through Rijndael's InvMixColumns() method.
[If you search through rtdotcpp.html for method MixColumns() you will not find it. The Rijndael methods MixColumns() and InvMixColumns() are redundant (for encryption and decryption) once tables Tn[] and iTn[] are brought into play. However, InvMixColumns() is needed by KeyHandler, and is a private method (cannot be accessed by other classes) in KeyHandler.]
The method KeyHandler::InvMixColumns() is functionally equivalent to Rijndael's InvMixColumns() but performs the calculation on word32's rather than bytes. This, together with KeyHandler::Mod(), which 'mod's 4 bytes simultaneously, provides another small speed gain.
[see Ref.2 eqn 3.15 to understand the code in KeyHandler::InvMixColumns(). Matrix multiplication eqn 3.15 of Ref.2 can be rearranged into multiplications by 0xE, 0xB, 0xD and 0x9; then adding (XORing) after suitable rotations.]

Class RijnType
This class contains the Encrypt() and Decrypt() methods and inherits all the above. From the comments on class KeyHandler, it will be clear that because the use of the lookup arrays Tn[] and iTn[] is built into RijnType, the expanded key when encrypting differs from the expanded key when decrypting. This means that only one of RijnType::Encrypt() or RijnType::Decrypt() can be used with any specific RijnType object. Which of these is accessible is fixed by 'Mode' in the RijnType constructor; following which a call to RijnType::Process_block() is sent to the appropriate function. Users must be aware of this behaviour! Clearly, where simultaneous encryption's and decryption's (with a key held constant for a long period) can occur (a server perhaps), 2 objects must be set up - 1 to access Encrypt() and 1 to access Decrypt().
The 2 methods Encrypt() and Decrypt() organize the computational core of RijnType. The other methods handle administrative details only. Inspecting the listing for, say, Encrypt() the use of the Tn[] arrays at the center of the nested loops will be clear. In brief, the 4 table look-ups replace the Rijndael calls to (SubBytes() + ShiftRows() + MixColumns()) and this saves a great deal of time; indeed it saves more time than all the other RijnType speed enhancing features put together. (see ref.2 section 4.2 and Fig. 3.12 for details) (A similar sentence applies to Decrypt() with all references to tables/methods using inverses.)
[The tables Tn[] and iTn[] are derived from Rijndael's (SubBytes() + ShiftRows() + MixColumns()).]

RijnType::Process_block() performs what is technically called an 'Electronic Code Book' (ECB) encryption or decryption. This gets its name from the fact that (given a fixed key) if asked to encrypt some plaintext block repeatedly it will always return an identical ciphertext - a behaviour which mimics 'code books'.
[These were popular up to about WW II. They enabled common letters, words, phrases and even sentences to be changed into typically 5 digit numbers. Code book security systems can be steadily 'solved' by statistical analysis of letter and word frequencies typical of any language. A good flow of ciphertext had to be available and the process was analogous to putting together a huge jig-saw puzzle. To foil statistical attack, code book's needed regular replacement, something very difficult to organize in the chaos of war. An even better attack method was to go and steal one of the books! Better yet, photo-copy the code book so as to avoid alerting the user to the collapse of his message security system!]
A similar situation might arise with RijnType working in ECB mode. Consider the following scenario :- suppose a secure server, holding simple text files, is operated with a fixed key in ECB mode for a long period. If an attacker could obtain a large portion of the encrypted files (but not the key), it is just conceivable that a statistical attack, on the lines of the 'Code Book' attack above, might succeed. Of course, if he can obtain the key as well, the attacker can skip the hard work involved in a statistical attack, and use the source code of RijnType (or any block cipher) to decrypt away.
For the above and other reasons ECB mode is rarely used in cryptographic projects; rather it's importance is that it forms a building block for more sophisticated schemes (see RT_CBC next).

An Initialization Vector (I.V.) is a random bit string the same size as the block length (Nb), which can (optionally) be fed into the RijnType constructor. If RijnType is paired with an elliptic curve (the target situation) the I.V. will have to be embedded into ECscalar[], reducing the space for key material. The decryption of a ciphertext which was obtained using an I.V., requires the same I.V. to be available at decrypt time and this requires organizing.
The use of IV's will defeat the statistical attack discussed above, but at the cost of having to securely archive the I.V.'s themselves. Hence, the use, or otherwise, of I.V.'s requires careful thought on a project by project basis.

Class RT_CBC
This class performs 'Cipher Block Chaining' mode encryption/decryption with RijnType as the block cipher. ( ref. 4 Article 9.3 guidelines are used) In essence, the output of each block encryption is mixed (XOR'd) into the next input block. This might be thought of as internal I.V. generation process to prevent the 'Code Book' statistical attack discussed above. When RijnType is operated in CBC mode, only the first block is susceptible to statistical attack and, ordinarily, there will be too few of these for statistical attack to have any chance of success.
RT_CBC can accept an I.V. for which similar comments to RijnType I.V.'s apply.
An alternative scheme to protect the first block when RT_CBC is in use is to part (or completely) fill the first block from a secure random number generator. Then, on decryption, this part can be stripped out.

Class RT_MAC
This class generates a Message Authenticating Code a.k.a. a MAC a.k.a. a keyed Hash. These are a 'Modification Detection Code' (see RT_Hash below) with a bit string 'signature', authenticating the sender, attached.
[These items are invariably a component of paper less office schemes. Such schemes seem further from fruition than ever, but hope springs eternal.]
RT_MAC inherits RT_CBC which it uses to implement the CBC_MAC algorithm (see ref. 5 algorithm 9.58).
The idea is that someone wishing to send a message such that :-
   (a) any attempt to modify it will be detected,
   (b) it can be irrefutably linked to the sender,
      can use a MAC to achieve these ends.
[The 'signature' referred to is actually a very large number, uniquely associated with an individual. For very large networks, e.g. the World Wide Web, they would be issued by some business founded on that line of work. For smaller networks a nominated 'Key Master' might issue them.]
When MAC's are in use, the message itself need not be encrypted. Many messages in commerce are of the form "Deliver 10 lengths of 4x2 to ...". Such messages are meat and drink for many businesses and are of no interest whatsoever to an eavesdropper, but carry financial connotations. The receiver wants to know whether the person sending the message is authorized to do so by a valid client (i.e. he will get paid!) and that the message is not either a total invention or a slight adjustment of a valid order (e.g. the number of items surreptitiously changed from '1' to '10').
[The discussions in the paragraph above may seem (on first sight) to be the germ of an idea for a huge business empire, since it is a service with universal appeal. Beware :- many very clever people have tried to set up paper less systems; some have had slight success, most have failed.]

Class RT_Hash
This class generates a 'Modification Detection Code' a.k.a. a 'Message Digest' a.k.a. an unkeyed-Hash a.k.a. a Hash a.k.a. a message/file 'fingerprint'.
RT_Hash uses the Miyaguchi-Preneel algorithm (see ref. 5, algorithm 9.43) with RijnType to achieve this. The output of RT_Hash depends only on the message/file fed into it; whereas the output of RT_MAC depends on both the message/file and the signature material (used in RT_MAC as the key material) fed into it. Hashes are used for the purpose of detecting tampering with a message/file. A typical use is to prevent 'hackers' replacing operating system 'executables' with versions of their own devising with the objective of easing future system penetration.
[Hashes are actually cryptographic primitives with a wider use than the last sentence may indicate.]
The Miyaguchi-Preneel algorithm requires the block cipher key to be updated after each block is entered; hence RT_Hash has to call KeyHandler::KeyExpansion() at the end of each block, to update the key. For RijnType, this step takes as much time as encrypting 1.5 blocks, and makes RT_Hash 2.5 times slower than RT_MAC (see the comments on 'speed performance' below for the time to execute the constructor - most of of which is due to KeyExpansion()).
It is possible to generate a MAC by using any Hash generator by pre pending the message with the 'signature number', then Hash the composite message.

++++++++++++++++++++++++++++++++++

The Compatibility of RijnType with Rijndael

As stated above RijnType is a 'clone' of Rijndael (so long as RijnType is restricted to the Rijndael limits i.e. 4<=Nb<=8; 4<=Nk<=8). However, software writers, in particular, need to be aware of the problem discussed in this section.
The Rijndael reference code (ref. 2, Appendix E) is byte orientated code, whereas RijnType is word32 (unsigned long) orientated. As software writers will be aware, x86 processors are 'little endian' - bytes are placed into 32 bit registers in right-to-left order. Thus the following code snippet :-
/***************/
char buff[64] ;
word32 *pbuff32 ;
strcpy (buff, "Hello sailor. Come up and see me sometime.") ;
pbuff32 = (word32 *) buff ; /* using a 'cast' to set pbuff32 */

Feed 'buff' to Rijndael
and feed pbuff32 into RijnType ...
/***************/
will generate different ciphertext between Rijndael and RijnType. The little endian effect is the cause of this.

# ------------------------- #
Some technical stuff To flesh out this example a little; let us assume Nb=Nk=4 (both 128 bits or 16 bytes) and that, for simplicity, the Key[] is all zero. (Note that the Key[] array is affected in a similar way to the data array being focused on here.) Somewhat idiosyncratically, Rijndael requires the input data array to be arranged as a 2 dimensional array of word8's with 4 rows and 4 columns. Hence the first 16 bytes of input (buff[] = "Hello sailor. Co"), giving character and ASCII equivalent value, for Rijndael would need to be arranged as :-

H (0x48)	o (0x6F)	i (0x69)	. (0x2E)
e (0x65)	" " (0x20)	l (0x6C)	" " (0x20)
l (0x6C)	s (0x73)	o (0x6F)	C (0x43)
l (0x6C)	a (0x61)	r (0x72)	o (0x6F)

This encrypts in Rijndael to :-

0xE0 0x46 0x20 0x2E

0x68 0xDA 0xF0 0xFD

0x0F 0x20 0xD3 0xE5

0x2E 0x24 0x3F 0xEA

Using a 'cast' to set pbuff32, as in the code snippet above, we might print out the array pbuff32[] to find [0x6C6C6548, 0x6173206F, 0x726F6C69, 0x6F43202E] which, it will be noticed, has it's bytes swapped round (the little endian effect at work!). Encrypting this in RijnType the output is [0x2F2AA57C, 0x3EDD46AE, 0x8FEC3C30, 0x2B7E2965] which is unrelated to the Rijndael ciphertext.

If function byte_swap() (listed towards the end of this section) is used on the array pbuff32[], it becomes [0x48656C6C, 0x6F207361, 0x696C6F72, 0x2E20436F] (the little endian effect having been reversed). Encrypting this with RijnType we get a ciphertext of [0xE0680F2E, 0x46DA2024, 0x20F0D33F, 0x2EFDE5EA] which are seen to be the columns of the Rijndael ciphertext.

The little endian problem is still not quite sorted. Suppose that the ciphertext now has to be filed. If we want a Rijndael compatible result (which requires the bytes copied out of the 2-D ciphertext array, column by column), we might use a C 'cast' such as :-
/****************/
word8 *pOP ;
word32 ciphertext[] ;

** execute a RijnType encryption into ciphertext[] here, then **

pOP = (word8 *) ciphertext ;
... then start feeding the pOP[] array byte by byte to the required file.
/****************/
Doing this sends bytes [0x2E, 0x0F, 0x68, 0xE0, ...], in that order, to the file. The little endian effect strikes again! To prevent this use byte_swap() on ciphertext[] to reverse the order, then the 'cast', then write to the file.

# ------------------------- #

A high level way to load/unload RijnType data arrays, if results identical to Rijndael are required, is as follows :-
(This was part of a real test program that I used to prove compatibility. In order to make a full test program, you will need to include the Rijndael reference code from ref. 2, Appendix E.)

/***************/
int main()
/* A DOS test program to run Rijndael and RijnType head-to-head. */
{
word16 i, j, k, Nk, Nb ;
word8 aIP[4][8], a[4][8], rk[8+6+1][4][8] ; /* Rijndael variables */
word8 sk[4][8] ;
word32 RTa[8], RTk[8], RTop[8], RTop2[8] ; /* RijnType variables */
word32 random ;
BOOL flag ;

   for (Nb=4; Nb<=8; Nb++) {
    for (Nk=4; Nk<=8; Nk++) {
      printf("Nb=%u Nk=%u\n", Nb, Nk) ;
      for (k=0; k<1000; k++) {
        for (i=0; i<Nb; i++) {
           /* fill variable 'random' from a random number generator here, then */
          RTa[i] = random ;
           /* Rijndael overwrites the IP array a[][] with the encrypted
               results. Using aIP[][] as a record holder of the IP. */
          a[0][i] = aIP[0][i] = (word8) (random>>24) ;
          a[1][i] = aIP[1][i] = (word8) (random>>16) ;
          a[2][i] = aIP[2][i] = (word8) (random>>8) ;
          a[3][i] = aIP[3][i] = (word8) random ;
        }
        for (i=0; i<Nk; i++) {
           /* fill variable 'random' from a random number generator here, then */
          RTk[i] = random ;
          sk[0][i] = (word8) (random>>24) ;
          sk[1][i] = (word8) (random>>16) ;
          sk[2][i] = (word8) (random>>8) ;
          sk[3][i] = (word8) random ;
        }
          /* recall that BC, KC, ROUNDS are global variables in the Rijndael reference code. In RijnType Nb=BC, Nk=KC, and Nr=ROUNDS */
        BC = Nb ;
        KC = Nk ;
        ROUNDS = numrounds[KC-4][BC-4] ;
          /* do a Rijndael encryption :- */
        KeyExpansion (sk, rk) ;
        Encrypt (a, rk) ;
          /* and now a RijnType encryption :- */
        RijnType rt (ENCRYPT_MODE, Nb, RTk, Nk) ;
        rt.Process_block (RTa, RTop) ;

          /* Check encryption's are identical :- */
        j = 0 ;
        flag = TRUE ;
        while ((j<Nb)&& flag) {
          if (a[0][j] != (word8) (RTop[j] >> 24)) flag = FALSE ;
          if (a[1][j] != (word8) (RTop[j] >> 16)) flag = FALSE ;
          if (a[2][j] != (word8) (RTop[j] >> 8)) flag = FALSE ;
          if (a[3][j] != (word8) (RTop[j])) flag = FALSE ;
          j++ ;
        }
        if (!flag) {
          printf("Cipher failure with Nb=%u Nk=%u j=%u k=%u.\n", Nb, Nk, j, k) ;
          exit(0) ;
        }
          /* do a RijnType decryption :- */
        RijnType rt2 (DECRYPT_MODE, Nb, RTk, Nk) ;
        rt2.Process_block (RTop, RTop2) ;
          /* and compare this with the original IP :- */
        j = 0 ;
        flag = TRUE ;
        while ((j<Nb) && flag) {
          if (aIP[0][j] != (word8) (RTop2[j] >> 24)) flag = FALSE ;
          if (aIP[1][j] != (word8) (RTop2[j] >> 16)) flag = FALSE ;
          if (aIP[2][j] != (word8) (RTop2[j] >> 8)) flag = FALSE ;
          if (aIP[3][j] != (word8) (RTop2[j])) flag = FALSE ;
          j++ ;
        }
        if (!flag) {
          printf("Plain failure with Nb=%u Nk=%u j=%u k=%u.\n", Nb, Nk, j, k) ;
          exit(0) ;
       }
      }
    }
}
puts("All OK") ;
return 0 ;
}
/***************/

Note carefully how RijnType's word32's are packed into (and unpacked from) Rijndael's two dimensional arrays, byte by byte.

These deliberations impact on project planning as follows :-
(a) If you are looking only for a fast implementation of Rijndael, and the extensions in RijnType are of no interest to you, then pack/unpack data arrays in one of the ways discussed above. This must be done if your target network, now or at any point in the future, uses, or might use, a mixture of Rijndael and RijnType code. This will generally be the case if your objective is no more than to speed up sluggish secure servers, without having to bother with upgrading the client's whole network.
[A fast packing/unpacking technique, if the machines to be upgraded are all Intel powered, is to use the Intel (486 and later) assembler instruction BSWAP (byte-swap) (a 1 clock operation). This achieves precisely the same effect as all the sliding and ORing in the high level code above, with a negligible time penalty.
A possible implementation in C :-
/************/
word32 byte_swap (word32 IP)
{
asm {
mov eax,IP
bswap eax
}
return _EAX ;
}
/************/]

(b) If you are looking to enhance the security of a subset of your target network, already using Rijndael, and RijnType meets the enhanced security requirements, then the considerations in (a) apply again. This would mean that only the subset would need a software upgrade, and it's interface with the remainder of the network would be unaffected.
(c) In summary, only if you are starting out on a self-contained 'green field' security project with no prospect of ever having to interface with a Rijndael equipped user should you not pack/unpack data arrays as discussed in this section.

Speed performance of block cipher RijnType

[All results in this section were generated on a PC with an AMD 2204 MHz x86 processor, and a Windows XP operating system. In this section all results are obtained with Nb=Nk unless explicitly stated differently.]
Timing RijnType block encryption speed against Rijndael gave the following results (these figures exclude the time to execute Rijndael's KeyExpansion() and RijnType's constructor (which performs a similar task)) :-

Nb,Nk	4	5	6	7	8
Rijndael	309	379	498	627	785
RijnType	22	28	36	45	55

This gives an average speed gain ratio for RijnType of nearly 14.
The main cause of this speed gain is the use of the look-up tables Tn[] and iTn[] which are built into RijnType. They could just as easily be built into Rijndael, since they are derived from the main Rijndael reference: ref.2 section 4.2. All the other features included in RijnType contribute only a few % speed gain.

Inspecting RijnType::Encrypt() with a view to obtaining yet greater speed, the method ShiftRow::do_shift() seems a likely candidate. It is possible to write modules dedicated to each permutation of Encrypt/Decrypt and Nb. A total of 10 such modules would be needed for a Rijndael clone, and 26 for a full RijnType class. The module for Encrypt/Nb=8 would look something like :-

/************/
void
ShiftRow::do_shift_E8 (void)
// dedicated to ENCRYPT_MODE with Nb=8 with the code 'straight-lined'.
// Notice that C2 and C3 are embedded in this code
{
word32 blank[4] = {0xFF000000L, 0xFF0000L, 0xFF00L, 0xFFL} ;

colsource[0] = (coldest[0] & blank[0])
        | (coldest[1] & blank[1])
        | (coldest[3] & blank[2])
        | (coldest[4] & blank[3]) ;
colsource[1] = (coldest[1] & blank[0])
        | (coldest[2] & blank[1])
        | (coldest[4] & blank[2])
        | (coldest[5] & blank[3]) ;
colsource[2] = (coldest[2] & blank[0])
        | (coldest[3] & blank[1])
        | (coldest[5] & blank[2])
        | (coldest[6] & blank[3]) ;
colsource[3] = (coldest[3] & blank[0])
        | (coldest[4] & blank[1])
        | (coldest[6] & blank[2])
        | (coldest[7] & blank[3]) ;
colsource[4] = (coldest[4] & blank[0])
        | (coldest[5] & blank[1])
        | (coldest[7] & blank[2])
        | (coldest[0] & blank[3]) ;
colsource[5] = (coldest[5] & blank[0])
        | (coldest[6] & blank[1])
        | (coldest[0] & blank[2])
        | (coldest[1] & blank[3]) ;
colsource[6] = (coldest[6] & blank[0])
        | (coldest[7] & blank[1])
        | (coldest[1] & blank[2])
        | (coldest[2] & blank[3]) ;
colsource[7] = (coldest[7] & blank[0])
        | (coldest[0] & blank[1])
        | (coldest[2] & blank[2])
        | (coldest[3] & blank[3]) ;
}
/************/

Test results from the above were disappointing - the figure of '55' in the table above dropping to 50, which raises the speed gain to about 15.5. This seems hardly worth the effort for all but the most speed sensitive of projects.

Rijndael's speed for decryption is slightly slower than encryption; this being caused by InvMixColumns() being slower than MixColumns(). This effect is absent from RijnType - the lookup tables Tn[] and iTn[] nullify it.
However, the time to execute a RijnType ENCRYPT_MODE constructor differs from that for a DECRYPT_MODE constructor. Recall, from above, that a DECRYPT_MODE constructor has to call InvMixColumns() on all the columns of all the intermediate blocks of the expanded key, and this makes the DECRYPT_MODE constructor roughly 4 times slower than the ENCRYPT_MODE constructor. Very roughly, the time to execute an ENCRYPT_MODE constructor is 1.5 * (the time to encrypt 1 block), and the time to execute a DECRYPT_MODE constructor is 6 * (the time to encrypt 1 block).

A large set of timed results could be generated for RijnType when worked beyond the scope of Rijndael since Nb and Nk can vary independently. The following might be enough to indicate trends :-

Block encryption time (relative to Nb=Nk=8 case) with Nb=Nk and 8 <= Nb <= 16

Nb,Nk 8 9 10 11 12 13 14 15 16

rel. time 1.0 1.2 1.4 1.63 1.85 2.11 2.38 2.67 3.0

Block encryption time (relative to Nb=Nk=8 case) with Nb=4 and 4 <= Nk <= 32

Nk 4 6 8 12 16 20 24 28 32

rel. time 0.49 0.6 0.69 0.89 1.11 1.27 1.45 1.65 1.85

Block encryption time (relative to Nb=Nk=8 case) with Nb=8 and 8 <= Nk <= 32

Nk 8 12 16 20 24 28 32

rel. time 1.0 1.26 1.55 1.85 2.08 2.37 2.64

Block encryption time (relative to Nb=Nk=8 case) with Nb=16 and 8 <= Nk <= 32

Nk 8 12 16 20 24 28 32

rel. time 3.0 3.0 3.0 3.50 4.05 4.58 5.11

From the above we can find that the speed ratio for RijnType at it's maximum security level (Nb=16, Nk=32) relative to it's minimum (Nb=4, Nk=4) (also the AES Rijndael and the Rijndael minima) is 5.11/0.49=10.4. Since RijnType is about 14 times faster than Rijndael, this means that RijnType at its maximum security level is faster than Rijndael or AES Rijndael at their minima. In other words substituting RijnType for Rijndael always brings a speed benefit, even when comparing worst case to best case, along with a huge increase in security.

0xE0	0x46	0x20	0x2E
0x68	0xDA	0xF0	0xFD
0x0F	0x20	0xD3	0xE5
0x2E	0x24	0x3F	0xEA

Nb,Nk	8	9	10	11	12	13	14	15	16
rel. time	1.0	1.2	1.4	1.63	1.85	2.11	2.38	2.67	3.0

Nk	4	6	8	12	16	20	24	28	32
rel. time	0.49	0.6	0.69	0.89	1.11	1.27	1.45	1.65	1.85