)
*Duc T. Nguyen*, Viet B. Dang and Kris Gaj <dnguye69,vdang6,[email protected]>
Post-Quantum Cryptography (PQC) is cryptography scheme run in classical computers, replacement of modern Cryptography such as RSA, ECC… due to the threat of Quantum Computer
12 Lattice-base PQC candidates
5 of them implement Number Theoretic Transform (NTT) for polynomial multiplication
Polynomial multiplication is major execution function
⇒ We implement NTT as hardware accelerator for NewHope and Kyber.
NTT is similar to FFT and has complexity of O(n log n)
To multiply 2 polynomials, we do:
\$\circ\$ point-wise multiplication ⇒ straight forward
\$\text{NTT}, \text{NTT}^{-1}\$ Forward and Inverse NTT operation ⇒ not easy
Pragmas are heavily depend on the code structure and algorithm, most important pragmas:
Pipeline
Unroll
Array Partition
In [12] Kawamura, K., et. al., 8 strategies were applied at 1st, 2nd, 3rd loops.
Develop presumed optimal hardware architecture, write HLS code follow this block diagram.
Use design space exploration based on HLS directives, the final hardware architecture is unknown until the best result is achieved
Shorter development and verification time compare to traditional RTL.
[12] Kawamura, K., et. al.
Lower is better
[12] Kawamura, K., et. al.
Lower is better
Bare Metal versus SDSoc transfer overhead affect over total execution time.
Bare Metal used in-house DMA kernel
SDSoC transfer driver handled by tool
Same transfer size between CPU and FPGA in Bare Metal and SDSoc
Smaller is better
Smaller is better
RTL and HLS implementation use the same number of DSP and BRAM
RTL achieved 500 Mhz in Kyber-R1 and Kyber-R2 implementation
Result benchmark on Zynq UltraScale ZCU104
HLS increases resources in range 3-70% in LUT, Slices, FF and decrease frequency about 5-11%.
RTL and SDSoC implementation run at their maximum frequency
Both has similar clock cycles
RTL + Bare Metal: High frequency and minimal transfer overhead
SDSoC + HLS: Slower frequency and higher transfer overhead
Both approaches give very close result.
Higher is better
Higher is better
BD/HLS approach give similar speedup as traditional RTL implementation.
BD/HLS has smaller in term of resources utilization and faster in term of cycle count than SE/HLS
BD/HLS is much more efficient than SE/HLS in NTT implementation
[1] F. Farahmand, V. B. Dang, D. T. Nguyen, and K. Gaj, “Evaluating the potential for hardware acceleration of Four NTRU-based key encapsulation mechanisms using Software/Hardware Codesign,” in PQCrypto, 2019, pp. 23–43.
[2] E. C.-h. Chu and A. George, Inside the FFT Black Box: Serial and Parallel Fast Fourier Transform Algorithms, ser. Computational Mathematics Series.
[3] C. Du, G. Bai, and X. Wu, “High-Speed Polynomial Multiplier Architecture for Ring-LWE Based Public Key Cryptosystems,”
[4] P. Longa, M. Naehrig, P. Longa, and M. Naehrig, “Speeding up the Number Theoretic Transform for Faster Ideal Lattice-Based Cryptography,” in Cryptology and Network Security - CANS 2016, vol. 10052.
[5] T. Poppelmann and T. Guneysu, “Towards Efficient Arithmetic for Lattice-Based Cryptography on Reconfigurable Hardware,” in LATINCRYPT 2012.
[6] T. Oder and T. Guneysu, “Implementing the NewHope-Simple Key Exchange on Low-Cost FPGAs,” in LATINCRYPT 2017, Havana, Cuba, Sep. 2017
[7] P.-C. Kuo et al., “High Performance Post-Quantum Key Exchange on FPGAs,”Cryptology ePrint Archive 2017/690, Feb. 2018
[8] E. Homsirikamol and K. Gaj, “Hardware Benchmarking of Cryptographic Algorithms Using HLS Tools: The SHA-3 Contest Case Study” in ARC 2015
[9] E. Homsirikamol and K. Gaj, "A new HLS-based methodology for FPGA benchmarking of candidates in cryptographic competitions: The CAESAR contest case study” inFPT 2017