Re: HAVAL (was Re: crypto benchmarks)
On Sat, 20 Jan 1996, John Lull wrote:
> I didn't see your original post, but did see Deranged's response. I
> would be interested to see whatever you come up with.
I ended up doing it like this:
/* 32 steps of one pass: the eight working registers rotate one position
   each step, and wi2/mc2 supply the word index and additive constant
   for each step */
for (i=0; i<4; i++)
{
    FF_42(t7, t6, t5, t4, t3, t2, t1, t0, w[wi2[8*i+0]], mc2[8*i+0]);
    FF_42(t6, t5, t4, t3, t2, t1, t0, t7, w[wi2[8*i+1]], mc2[8*i+1]);
    FF_42(t5, t4, t3, t2, t1, t0, t7, t6, w[wi2[8*i+2]], mc2[8*i+2]);
    FF_42(t4, t3, t2, t1, t0, t7, t6, t5, w[wi2[8*i+3]], mc2[8*i+3]);
    FF_42(t3, t2, t1, t0, t7, t6, t5, t4, w[wi2[8*i+4]], mc2[8*i+4]);
    FF_42(t2, t1, t0, t7, t6, t5, t4, t3, w[wi2[8*i+5]], mc2[8*i+5]);
    FF_42(t1, t0, t7, t6, t5, t4, t3, t2, w[wi2[8*i+6]], mc2[8*i+6]);
    FF_42(t0, t7, t6, t5, t4, t3, t2, t1, w[wi2[8*i+7]], mc2[8*i+7]);
}
This allows all of the transform functions to fit into the L1 cache, but at
a cost: besides the overhead of the for loop, each macro call now does two
extra table lookups (into wi2 and mc2). The net result is still a ~100%
speedup over the reference implementation.
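In case it helps anyone following along without the reference code handy,
here is a rough sketch of the shape of an FF_42-style step macro and the two
lookup tables. Some assumptions: rotr32 is a helper name I'm inventing here,
the 7/11 rotation counts are the ones given in the published HAVAL
description, and the argument permutation the real FF_42 applies before f_2
is left out. The actual contents of wi2 and mc2 (the pass's 32-entry
word-processing order and additive constants from the HAVAL spec) are
omitted.

#include <stdint.h>

/* generic 32-bit rotate-right helper (assumed name) */
#define rotr32(x, n) (((x) >> (n)) | ((x) << (32 - (n))))

/* wi2: the 32 word indices for this pass, mc2: the 32 additive
   constants -- both flattened from the HAVAL spec, values omitted */
extern const int      wi2[32];
extern const uint32_t mc2[32];

/* sketch of one step: mix the seven low registers through the boolean
   function, then fold the result into the eighth register together
   with the message word and the constant */
#define FF_42(x7, x6, x5, x4, x3, x2, x1, x0, w, c) do {       \
    uint32_t t = f_2(x6, x5, x4, x3, x2, x1, x0);              \
    (x7) = rotr32(t, 7) + rotr32((x7), 11) + (w) + (c);        \
} while (0)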
Also, FYI, the boolean functions used in the reference implementation can be
optimized. Thanks to Deranged Mutant for these:
/* reference version: */
/*
#define f_2(x6, x5, x4, x3, x2, x1, x0) \
    ((x2) & ((x1) & ~(x3) ^ (x4) & (x5) ^ (x6) ^ (x0)) ^ \
     (x4) & ((x1) ^ (x5)) ^ (x3) & (x5) ^ (x0))
*/
/* optimized version: */
#define f_2(x6, x5, x4, x3, x2, x1, x0) \
    (((x4&x5)|x2) ^ (x0|x2) ^ x2&(x1&(~x3)^x6) ^ x3&x5 ^ x1&x4)

/* reference version: */
/*
#define f_4(x6, x5, x4, x3, x2, x1, x0) \
    ((x4) & ((x5) & ~(x2) ^ (x3) & ~(x6) ^ (x1) ^ (x6) ^ (x0)) ^ \
     (x3) & ((x1) & (x2) ^ (x5) ^ (x6)) ^ \
     (x2) & (x6) ^ (x0))
*/
/* optimized version: */
#define f_4(x6, x5, x4, x3, x2, x1, x0) \
    ((((~x2&x5)^(x3|x6)^x1^x0)&x4) ^ ((x1&x2^x5^x6)&x3) ^ (x2&x6) ^ x0)

/* reference version: */
/*
#define f_5(x6, x5, x4, x3, x2, x1, x0) \
    ((x0) & ((x1) & (x2) & (x3) ^ ~(x5)) ^ \
     (x1) & (x4) ^ (x2) & (x5) ^ (x3) & (x6))
*/
/* optimized version: */
#define f_5(x6, x5, x4, x3, x2, x1, x0) \
    ((((x0&x2&x3)^x4)&x1) ^ ((x0^x2)&x5) ^ (x3&x6) ^ x0)
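Since these are pure bitwise expressions, the rewrites are easy to check
exhaustively: each output bit depends only on the corresponding bit of each
input word, so feeding every argument either all-zeros or all-ones covers
all 2^7 per-bit cases. A small throwaway test along those lines (the
_ref/_opt suffixes are just so both versions can coexist):

#include <stdint.h>
#include <stdio.h>

/* reference forms, copied from the commented-out macros above */
#define f_2_ref(x6,x5,x4,x3,x2,x1,x0) \
    ((x2) & ((x1) & ~(x3) ^ (x4) & (x5) ^ (x6) ^ (x0)) ^ \
     (x4) & ((x1) ^ (x5)) ^ (x3) & (x5) ^ (x0))
#define f_4_ref(x6,x5,x4,x3,x2,x1,x0) \
    ((x4) & ((x5) & ~(x2) ^ (x3) & ~(x6) ^ (x1) ^ (x6) ^ (x0)) ^ \
     (x3) & ((x1) & (x2) ^ (x5) ^ (x6)) ^ (x2) & (x6) ^ (x0))
#define f_5_ref(x6,x5,x4,x3,x2,x1,x0) \
    ((x0) & ((x1) & (x2) & (x3) ^ ~(x5)) ^ \
     (x1) & (x4) ^ (x2) & (x5) ^ (x3) & (x6))

/* rewritten forms from above */
#define f_2_opt(x6,x5,x4,x3,x2,x1,x0) \
    (((x4&x5)|x2) ^ (x0|x2) ^ x2&(x1&(~x3)^x6) ^ x3&x5 ^ x1&x4)
#define f_4_opt(x6,x5,x4,x3,x2,x1,x0) \
    ((((~x2&x5)^(x3|x6)^x1^x0)&x4) ^ ((x1&x2^x5^x6)&x3) ^ (x2&x6) ^ x0)
#define f_5_opt(x6,x5,x4,x3,x2,x1,x0) \
    ((((x0&x2&x3)^x4)&x1) ^ ((x0^x2)&x5) ^ (x3&x6) ^ x0)

int main(void)
{
    int m, k, bad = 0;
    for (m = 0; m < 128; m++) {             /* all 7-bit input patterns */
        uint32_t v[7];
        for (k = 0; k < 7; k++)
            v[k] = ((m >> k) & 1) ? 0xFFFFFFFFu : 0;
        if (f_2_ref(v[6],v[5],v[4],v[3],v[2],v[1],v[0]) !=
            f_2_opt(v[6],v[5],v[4],v[3],v[2],v[1],v[0])) bad++;
        if (f_4_ref(v[6],v[5],v[4],v[3],v[2],v[1],v[0]) !=
            f_4_opt(v[6],v[5],v[4],v[3],v[2],v[1],v[0])) bad++;
        if (f_5_ref(v[6],v[5],v[4],v[3],v[2],v[1],v[0]) !=
            f_5_opt(v[6],v[5],v[4],v[3],v[2],v[1],v[0])) bad++;
    }
    if (bad)
        printf("MISMATCH in %d cases\n", bad);
    else
        printf("all three rewrites match the reference\n");
    return bad != 0;
}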
Wei Dai