📄 phelix.tex
字号:
\end{tabbing}
Given the function $H$, the block function, shown in figure~\ref{fig:block},
is computed as follows:
\begin{align*}
(Y^{(i)}_0,Y^{(i)}_1,Y^{(i)}_2,Y^{(i)}_3,Y^{(i)}_4) &:=
H(Z^{(i)}_0,Z^{(i)}_1,Z^{(i)}_2,Z^{(i)}_3,Z^{(i)}_4,&0,&X_{i,0}) \\
(Z^{(i+1)}_0,Z^{(i+1)}_1,Z^{(i+1)}_2,Z^{(i+1)}_3,Z^{(i+1)}_4) &:=
H(Y^{(i)}_0,Y^{(i)}_1,Y^{(i)}_2,Y^{(i)}_3,Y^{(i)}_4,&P_i,&X_{i,1})
\end{align*}
Each block produces one word of keystream $S_i := Y^{(i)}_4 + Z^{(i-4)}_4$.
The ciphertext words are defined by $C_i := P_i \xor S_i$.
In the remainder of this paper, the terms ``block,'' and ``block function''
are used interchangeably.
\subsection{Key Words for Each Block}\label{sec:BlockKeys}
The expanded key words are derived from the working key $K_0,\ldots,K_7$,
the nonce $N_0,\ldots,N_3$, the input key length $\ell(U)$, and the block
number $i$. We first extend the nonce to 8 words by defining $N_k := (k
\bmod 4) - N_{k-4} \pmod{2^{32}}$ for $k=4,\ldots,7$. The key words for
block $i$ are then defined by
\begin{align*}
X_{i,0} &:= K_{i \bmod 8}\\
% Took out a superfluous mod 2^{32}. --JMK
X_{i,1} &:= K_{(i+4) \bmod 8} + N_{i \bmod 8}+ X'_i + i + 8 \\
X'_i &:=
\begin{cases}
\lfloor (i+8)/2^{31} \rfloor & \text{if $i \bmod 4 = 3$}\\
4\cdot\ell(U) & \text{if $i \bmod 4 = 1$}\\
0&\text{otherwise}
\end{cases}
\end{align*}
where all additions are taken modulo $2^{32}$. Note that $X'_i$ encodes bits
31 to 62 of the value $i+8$; this is not the same as the upper 32 bits of
$i+8$.
\subsection{Initialization}\label{sec:Initialization}
A Phelix encryption is started by setting
\begin{tabular}{lllllrr} \\ [-4pt] % align at a bunch of points
\hspace{0.5in}
&$Z^{(-8)}_j$ & := & $K_{j+3} \xor N_j$ &for $j=$& $0,\ldots,$&$ 3$ \\ [3pt]
&$Z^{(-8)}_4$ & := & $K_7$ \\ [3pt]
&$Z^{(i)}_4$ & := \quad & 0 \hspace{1.0in} &for $i=$&$-12,\ldots,$&$-9$ \\ [3pt]
&$P_{i}$ & := & 0 &for $i=$&$ -8,\ldots,$&$-1$ \\ [6pt]
\end{tabular}
Eight blocks are then applied, using block number $i = -8,\ldots,-1$. For
these blocks, the generated keystream words are discarded.
\subsection{Encryption}
After the initialization, the plaintext is encrypted. Let $k := \lfloor
(\ell(P)+3)/4\rfloor$ be the number of words in the plaintext. The
encryption consists of $k$ blocks numbered 0 to $k-1$. Each block generates
one word of keystream, which is used to encrypt one word of the plaintext.
Depending on $\ell(P) \bmod 4$, between 1 and 4 of the bytes of the last
keystream word are used.
\subsection{Computing the MAC}\label{sec:ComputeMAC}
Just after the block that encrypted the last plaintext byte, one of the state
words is modified. The internal state word $Z^{(k)}_0$ is \xored with the value
0x912d94f1.\footnote{This constant is constructed by taking the six least
significant bits of each of the ASCII characters of the string ``Helix,'' and
setting the bits before and after these 30 bits.} Using this modified state,
eight blocks, numbered $k,\ldots,k+7$ are applied for post-mixing. For these
blocks, the plaintext word $P_i$ is defined as $\ell(P) \bmod 4$, and
the generated keystream is discarded . After the post-mixing, four more blocks,
numbered $k+8,\ldots,k+11$, are applied, using the same plaintext input word
(i.e., $\ell(P) \bmod 4$). The keystream words generated by these four
blocks form the MAC tag.
\subsection{Key mixing}\label{sec:keymixing}
The key mixing converts a variable-length input key $U$ to the fixed-length
working key, $K$.
First, the Phelix block function is used to create a round function $R$ that
maps 128 bits to 128 bits, as follows:
\begin{tabbing}
\hspace{1.0in} \= Function $R(w_0,w_1,w_2,w_3)$ \+ \\
Begin\= \+ \\
Local Variable $w_4 $ \hspace{0.25em} \= $:= \ell(U) + 64; $ \\
$ (w_0,w_1,w_2,w_3,w_4) $ \> $:= H(w_0,w_1,w_2,w_3,w_4,0,0); $ \\
$ (w_0,w_1,w_2,w_3,w_4) $ \> $:= H(w_0,w_1,w_2,w_3,w_4,0,0); $ \\
Return $(w_0,w_1,w_2,w_3);$ \- \\
End.
\end{tabbing} % need to take out some vertical space here...
The input key $U$ is first extended with $32 - \ell(U)$ zero bytes. The 32
key bytes are converted to 8 words $K_{32},\ldots,K_{39}$. Further key words
are defined by the equation
\begin{equation*}
(K_{4i}, \ldots, K_{4i+3}) := R( K_{4i+4},\ldots,K_{4i+7} ) \xor
(K_{4i+8},\ldots,K_{4i+11})
\end{equation*}
for $i=7,\ldots,0$. The words $K_0,\ldots,K_7$ form the working key of the
cipher. (This recursion defines a Feistel-type cipher on 256-bit blocks.)
\subsection{Decryption}
Decryption is almost identical to encryption. The only differences are:
\begin{itemize}
\item The keystream $S_i$ generated after the first application of the $H$
function in each block is used to decrypt the ciphertext, producing the
plaintext word that is used in the second application of the $H$ function
within the block. The implementation must insure that any unused bytes of
the final plaintext word are taken as zero for purposes of computing the block
function, regardless of value of the extra keystream bytes.
\item Once the tag has been generated, it is compared to the tag provided. If
the two values are not identical, all generated data (i.e., the keystream,
plaintext, and tag) is destroyed.
\end{itemize}
\section{Implementation}\label{sec:implementation}
Compared to many ciphers, Phelix is relatively easy to implement in software.
If 32-bit addition, exclusive or, and rotation functions are available, all
the functions are easily implemented. Phelix is also fast. A single round
takes only one clock cycle to compute on most current Pentium CPUs, because the
superscalar architecture can perform an addition or \xor simultaneously with
a 32-bit rotation. A block of Phelix takes 20 cycles plus some overhead for
the handling of the plaintext, keystream, and ciphertext. Our assembly language
implementation requires less than 7 clock cycles per byte on a Pentium M CPU,
and under 14 clocks per byte using optimized C (gcc version 3.2).
This compares to about 14 clock cycles per byte for the best known AES
implementation (in assembler) on the same platform. \footnote{This is a somewhat unfair
comparison. Most block cipher modes only provide encryption \emph{or}
authentication, so two passes over the message are required. The alternative is
to use one of the new authenticated encryption modes, such as
\cite{Jutla:IACBC01}, but they are all patented and require a license.}
The per-packet processing overhead of Phelix is due mainly to the
eight words of initialization and twelve words of MAC processing. To
first order, this overhead can be thought of as processing an extra
20 words (80 bytes) per packet. In assembler on the Pentium M, there
is also roughly an additional 200 clocks of general setup overhead
(including nonce mixing into $X_{i,1}$), for a total of 750-850 clocks
per packet. By comparison, this overhead is approximately the time
required to process a single 64-byte block of SHA-1 or three 16-byte
blocks of AES on the same platforms, but Phelix performs both
encryption and MAC in the same operation.
\vspace{6 pt}
\newcommand\T{\rule{0pt}{2.6ex}} % top strut (vertical spacing in table below)
\newcommand\B{\rule[-1.2ex]{0pt}{0pt}} % bottom strut
\begin{tabular}{|c|c|r|r|r|rcrl|} \hline
\multicolumn{1}{|c}{}&\multicolumn{1}{|c}{}&
\multicolumn{3}{|c}{Packet Size ($N$)\T\B}&\multicolumn{4}{|c|}{Approximate}\\ \cline{3-5}
\, Operation \, &\,Version\T\B& \multicolumn{1}{r|}{64 bytes\,} &
\multicolumn{1}{r|}{\,256 bytes\,} &\,1024 bytes\, & \multicolumn{4}{c|}{Equation (clks)} \\ \hline
Encrypt& C &\,41.6 cpb \,& 20.3 cpb \quad& 15.0 cpb \quad&\ 1810 &+& 13.2 &$N$ \ \T\\ [3pt]
Decrypt& C & 42.3 cpb \,& 21.1 cpb \quad& 15.8 cpb \quad& 1610 &+& 14.0 &$N$ \ \\ [3pt]
Encrypt&ASM& 18.5 cpb \,& 9.8 cpb \quad& 7.4 cpb \quad& 810 &+& 6.6 &$N$ \ \\ [3pt]
Decrypt&ASM& 18.2 cpb \,& 9.6 cpb \quad& 7.4 cpb \quad& 750 &+& 6.7 &$N$ \ \\ \hline
\end{tabular}
\vspace{12pt}
Empirical Phelix software speeds on a Pentium M CPU are given in the table
above, where ``cpb'' indicates clocks per byte. These speeds are for an
``all-in-one'' call, where the entire packet is processed with a single call.
The basic software model for Phelix is a slope-intercept equation, where the
slope is under 7 clocks/byte and the intercept is about 800 clocks per packet,
in assembler on a Pentium M. Minor speed differences occur due to encryption
vs. decryption, and to various compilers and CPU versions, For example, on a
Pentium III CPU, the slope is about 7.5 cpb, and the intercept is about 850
clocks. Using an ``incremental'' API, where each phase of the operation
(e.g., nonce setup, data encryption/decryption, MAC output) is performed with
a separate call, adds an extra per-packet overhead of about 300 CPU clocks in
assembler and 100 clocks in C.
On a Pentium 4 CPU (P4), the same Phelix assembler code runs in about
$1100 + 10 N$ clocks per packet, and the C code runs in about $3100 +
26 N$ clocks per packet. Both of these numbers are considerably
slower than the Pentium M benchmarks. Perhaps recoding Phelix
specifically for a P4 CPU could improve these numbers, but the P4 has
a (well deserved) reputation for very high clock frequencies and
relatively low instructions per clock, compared to the Pentium III
and M CPUs.
The key mixing only needs to be done once for each key value and
implements a Feistel cipher, so it can be done in place. The key
mixing function takes about 400 clocks in assembler on a Pentium M,
and under 600 clocks in C (gcc version 3.2). Within a packet, the
$X_{i,1}$ key words can be mostly precomputed, with only the block
number being added every block. Implementations that limit the
plaintext size to $2^{32}$ bytes can ignore the upper bits of the
block number in the definition of $X'_i$, because these bits will
always be zero.
No algorithm setup is required for Phelix. There are no tables to be
built. The code size on a Pentium is under 5K bytes for the
optimized assembler version, including both all-in-one and
incremental APIs, and under 6K bytes for a C version. The code size
obviously depends on the amount of optimization and loop unrolling
used. For example, the all-in-one assembler version could be coded in
under 2K bytes at some cost in performance, although we have not
implemented such a version.
Phelix is also fast in hardware. The rotations incur no gate delays, only
wiring delays, although they do consume routing resources in chip layouts.
The critical path through the block function consists of 6 additions and 5
\xors. As the critical path contains no rotations, a certain amount of
ripple of the 32-bit adders can be overlapped, with the lower bits being produced
and used before the upper bits are available. A more detailed analysis of
this overlapping is required for any high-speed implementation. A
conservative estimate for a relatively low-cost ASIC layout is 2.5 ns per
32-bit adder and 0.5 ns per \xorComma which adds up to 17.5 ns/block. This
translates to more than 200 MByte per second, or just under 2 Gbit per
second. We roughly estimate that such a design would consume fewer
than 20,000 gates. Such a design would incur about a 20 clock overhead
per packet in order to process the initialization and MAC blocks.
We are not aware of any special efforts required to avoid implementation
weaknesses, other than standard practices required for dealing with
cryptographic primitives.
\comment{ (DLW)
FWIW, this gate estimate was derived as follows:
13 adders/block ~= 13*32*10 gates = 4160
12 xors /block ~= 12*32* 3 gates = 1152
9 state words ~= 9*32*10 gates = 2880
8 key words ~= 8*32*10 gates = 2560
2 i/o words ~= 2*32*10 gates = 640
other muxes/adders/xors ~= 1000
control logic ~= 5000
------------------------------------------
Total ~= 18K
}
\section{Use}\label{sec:use}
One of the dangers of a stream cipher is that the keystream will be reused.
To avoid this problem, Phelix imposes a few restrictions on the sender and
receiver:
\begin{itemize}
\item The sender \emph{must} ensure that each ($K$,$N$) pair is used at most
once to encrypt a message. A single sender must use a new and unique nonce for
each message. Multiple senders that want to use the same key must employ
a scheme that divides the nonce space into non-overlapping sets, in order
to ensure that the same nonce is \emph{never} used twice.
If two different messages are ever encrypted with the
same ($K$,$N$) pair, Phelix loses most of its security properties.
We claim, however, that even in such a case it remains infeasible to recover
the key.\footnote{Note that Helix actually was vulnerable to a key recovery
attack if (Key,Nonce) pairs where reused \cite{Muller:Helix04}.}
\item The receiver may not release the plaintext $P$, or the keystream,
until it has verified the tag successfully. In most situations, this requires
the receiver to buffer the entire plaintext before it is released.
\end{itemize}
These requirements seem restrictive, but they are in fact
required by all stream ciphers and many block cipher modes (e.g.,
OCB \cite{RBBK:OCB01,RBBK:OCB01sep} and CCM \cite{WHF:CCM02}).
Although Phelix allows the use of short keys, we strongly recommend the use
of keys of at least 128 bits, preferably 256 bits.
⌨️ 快捷键说明
复制代码
Ctrl + C
搜索代码
Ctrl + F
全屏模式
F11
切换主题
Ctrl + Shift + D
显示快捷键
?
增大字号
Ctrl + =
减小字号
Ctrl + -