NEON SHA3 2x

NEON ARMv8 Keccak2x Implementation.

Since there was no 128-bit SIMD (NEON) Keccak implementation for ARMv8, I decided to write one.

The result is not impressive, for two reasons:

SHA3 is built from simple bitwise operations such as AND, NOT, and XOR. Each of these takes only about 1 cycle on the CPU, therefore:

  • There is no benefit from pipelining

  • There is no significant improvement from a 128-bit SIMD width. The native ARMv8 register width is 64 bits, and I suppose the frequency in NEON mode is lower than in scalar mode. (I don’t know the proper term for this, please let me know.)

This code could do better than this benchmark if:

  • The SIMD registers were wider, e.g. 256 or 512 bits

  • The frequency in NEON mode were more than 0.5 * (scalar frequency)
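To make the "2x" idea concrete, here is a minimal sketch of the Keccak chi step on one row of five lanes, applied to two independent states at once. It is not taken from this package: the names `lane2`, `lane2_make`, `lane2_get`, and `chi_row_2x` are hypothetical. On AArch64 it uses the NEON intrinsics `veorq_u64` and `vbicq_u64`; elsewhere it falls back to a plain two-element struct so the sketch still compiles.

```c
#include <stdint.h>

#if defined(__aarch64__)
#include <arm_neon.h>
/* Two 64-bit lanes: lane 0 belongs to state 0, lane 1 to state 1. */
typedef uint64x2_t lane2;
static inline lane2 lane2_make(uint64_t s0, uint64_t s1) {
    return vcombine_u64(vcreate_u64(s0), vcreate_u64(s1));
}
static inline uint64_t lane2_get(lane2 v, int i) {
    /* vgetq_lane_u64 needs a compile-time constant lane index. */
    return i == 0 ? vgetq_lane_u64(v, 0) : vgetq_lane_u64(v, 1);
}
static inline lane2 lane2_xor(lane2 a, lane2 b) { return veorq_u64(a, b); }
/* a & ~b in a single instruction (BIC). */
static inline lane2 lane2_andnot(lane2 a, lane2 b) { return vbicq_u64(a, b); }
#else
/* Portable fallback (assumption: only here so the sketch builds on non-ARM hosts). */
typedef struct { uint64_t v[2]; } lane2;
static inline lane2 lane2_make(uint64_t s0, uint64_t s1) {
    lane2 r = { { s0, s1 } };
    return r;
}
static inline uint64_t lane2_get(lane2 v, int i) { return v.v[i]; }
static inline lane2 lane2_xor(lane2 a, lane2 b) {
    return lane2_make(a.v[0] ^ b.v[0], a.v[1] ^ b.v[1]);
}
static inline lane2 lane2_andnot(lane2 a, lane2 b) {
    return lane2_make(a.v[0] & ~b.v[0], a.v[1] & ~b.v[1]);
}
#endif

/* Keccak chi on one row of five lanes, for two independent states at once:
 * row[i] = row[i] ^ (~row[i+1] & row[i+2]), indices taken mod 5. */
static void chi_row_2x(lane2 row[5]) {
    lane2 in[5];
    for (int i = 0; i < 5; i++) in[i] = row[i];
    for (int i = 0; i < 5; i++)
        row[i] = lane2_xor(in[i], lane2_andnot(in[(i + 2) % 5], in[(i + 1) % 5]));
}
```

Each vector instruction here replaces two scalar instructions that already cost about 1 cycle each, which is why the gain from a 128-bit width is small.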

What is inside this package?

NEON (ASIMD) ARMv8 implementation of:

  • KeccakP-1600

  • SHAKE128 : Absorb, Squeeze

  • SHAKE256 : Absorb, Squeeze

  • SHA3_256

  • SHA3_512

Result

System Information

Here is the system I benchmarked on, an ARMv8 Raspberry Pi running 64-bit Manjaro:

OS
Distributor ID: Manjaro-ARM
Description:    Manjaro ARM Linux
Release:        20.10
CPU
Architecture:                    aarch64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
CPU(s):                          4
On-line CPU(s) list:             0-3
Thread(s) per core:              1
Core(s) per socket:              4
Socket(s):                       1
Vendor ID:                       ARM
Model:                           3
Model name:                      Cortex-A72
Stepping:                        r0p3
CPU max MHz:                     1900.0000
CPU min MHz:                     600.0000
Flags:                           fp asimd evtstrm crc32 cpuid

I overclocked the Raspberry Pi to 1900 MHz. The default CPU frequency is 1500 MHz.

Result

All benchmarks were run via this command:

make all
taskset 0x1 ./benchmark_SHAKE128_256_1000.bin

The taskset command pins the process to a single CPU, which avoids the cost of switching between CPUs.

Table 1. Result

Output Length   Input Length   FIPS202x2 NEON   FIPS202x2 C
42              672            487              514
294             336            390              413
1008            42             586              606
2772            1008           2228             2287
3318            504            2230             2286
4074            1008           3004             3099

The results above were measured over 1000 iterations, as set by #define TESTS 1000.

You can view the full results, for 1,000 and 1,000,000 iterations, in: data/
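For readers unfamiliar with this style of benchmark, the following is a rough sketch of a loop driven by a `TESTS` constant. It is an assumption about the general shape, not the package's actual bench() code: `dummy_work` and `bench_average_seconds` are hypothetical names, and it measures CPU time with the standard `clock()` function, which may differ from how the real benchmark counts.

```c
#include <time.h>

#define TESTS 1000  /* iteration count, matching the value quoted in the text */

/* Hypothetical stand-in workload; replace with the function under test. */
static void dummy_work(void) {
    volatile unsigned long long x = 0;
    for (int i = 0; i < 1000; i++) x ^= (unsigned long long)i;
}

/* Run fn TESTS times and return the average CPU time per call, in seconds. */
static double bench_average_seconds(void (*fn)(void)) {
    clock_t start = clock();
    for (int i = 0; i < TESTS; i++) fn();
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC / TESTS;
}
```

Calling `bench_average_seconds(dummy_work)` returns the mean cost of one call; raising `TESTS` to 1,000,000 smooths out noise at the price of a longer run.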

Graph

If the raw data in data/ is hard to read, here are some graphs:

shake128

shake256

  • The orange line is the difference between the C reference code and the NEON implementation (C_ref - NEON)

  • The green line is the average of C_ref - NEON over 24 samples

Note that in some cases C Ref is better than NEON. For small output lengths, NEON is better than C Ref by about 5%.
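The "about 5%" figure can be checked against Table 1 with a short calculation. The sketch below hard-codes the six rows of Table 1; the struct and function names (`bench_row`, `neon_saving_percent`) are hypothetical. The saving is expressed as a percentage of the C value.

```c
/* One row of Table 1. */
struct bench_row { int out_len, in_len, neon, c_ref; };

/* Values copied from Table 1. */
static const struct bench_row table1[6] = {
    {   42,  672,  487,  514 },
    {  294,  336,  390,  413 },
    { 1008,   42,  586,  606 },
    { 2772, 1008, 2228, 2287 },
    { 3318,  504, 2230, 2286 },
    { 4074, 1008, 3004, 3099 },
};

/* Relative saving of NEON over the C reference, in percent of the C value. */
static double neon_saving_percent(double c_ref, double neon) {
    return (c_ref - neon) / c_ref * 100.0;
}
```

For the first two rows this gives a saving of a little over 5%; for the four larger rows it falls to roughly 2.5% to 3.3%.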

Conclusion

One call to the Keccak2x NEON version is always faster than two calls to the Keccak C version. See the bench() function.

  • If you only call Keccak once, use the C version; it’s faster.

  • If you call Keccak multiple times, use the NEON version; it saves some time.

Duc Tri Nguyen
Graduate Research Assistant

My research interests include the implementation of Post-Quantum Cryptography using High-Level Synthesis on FPGAs and NEON instructions on ARM platforms. Besides that, I sometimes play CTFs; Crypto and Reverse Engineering are my favorite categories.
