)

# Implementing and Benchmarking Number Theoretic Transform in Lattice-based Post-Quantum Cryptography using SW/HW Codesign and High Level Synthesis

## Introduction

• Post-Quantum Cryptography (PQC) is cryptography scheme run in classical computers, replacement of modern Cryptography such as RSA, ECC…​ due to the threat of Quantum Computer

NIST PQC Round 2
• 12 Lattice-base PQC candidates

• 5 of them implement Number Theoretic Transform (NTT) for polynomial multiplication

• Polynomial multiplication is major execution function

• ⇒ We implement NTT as hardware accelerator for NewHope and Kyber.

## Number Theoretic Transform (NTT)

• NTT is similar to FFT and has complexity of O(n log n)

• To multiply 2 polynomials, we do:

C = A \times B = \text{NTT}^{-1}(\text{NTT}(A) \circ \text{NTT}(B))

\circ point-wise multiplication ⇒ straight forward

\text{NTT}, \text{NTT}^{-1} Forward and Inverse NTT operation ⇒ not easy

## High Level Synthesis (HLS)

Pragmas are heavily depend on the code structure and algorithm, most important pragmas:

• Pipeline

• Unroll

• Array Partition

## Implement Number Theoretic Transform in HLS

• In [12] Kawamura, K., et. al., 8 strategies were applied at 1st, 2nd, 3rd loops.

## Block Diagram vs. Space Exploration Development

Block Diagram/HLS (BD/HLS)

Develop presumed optimal hardware architecture, write HLS code follow this block diagram.

Space Exploration/HLS (SE/HLS)

Use design space exploration based on HLS directives, the final hardware architecture is unknown until the best result is achieved

SE/HLS and BD/HLS inherit the advantages of HLS

Shorter development and verification time compare to traditional RTL.

## BD/HLS vs. SE/HLS: Resources

[12] Kawamura, K., et. al.

Lower is better

## BD/HLS vs. SE/HLS: Speedup

[12] Kawamura, K., et. al.

Lower is better

Bare Metal versus SDSoc transfer overhead affect over total execution time.

• Bare Metal used in-house DMA kernel

• SDSoC transfer driver handled by tool

• Same transfer size between CPU and FPGA in Bare Metal and SDSoc

## Transfer Overhead: Encapsulation

Smaller is better

## Transfer Overhead: Decapsulation

Smaller is better

## Resources utilization for HLS and RTL

• RTL and HLS implementation use the same number of DSP and BRAM

• RTL achieved 500 Mhz in Kyber-R1 and Kyber-R2 implementation

• Result benchmark on Zynq UltraScale ZCU104

## HLS vs RTL resources utilization

• HLS increases resources in range 3-70% in LUT, Slices, FF and decrease frequency about 5-11%.

## Speedup of Software/Hardware Co-Design vs Software Only

RTL and SDSoC implementation run at their maximum frequency

• Both has similar clock cycles

• RTL + Bare Metal: High frequency and minimal transfer overhead

• SDSoC + HLS: Slower frequency and higher transfer overhead

• Both approaches give very close result.

Higher is better

Higher is better

## Conclusion

• BD/HLS approach give similar speedup as traditional RTL implementation.

• BD/HLS has smaller in term of resources utilization and faster in term of cycle count than SE/HLS

• BD/HLS is much more efficient than SE/HLS in NTT implementation

## References

• [1] F. Farahmand, V. B. Dang, D. T. Nguyen, and K. Gaj, “Evaluating the potential for hardware acceleration of Four NTRU-based key encapsulation mechanisms using Software/Hardware Codesign,” in PQCrypto, 2019, pp. 23–43.

• [2] E. C.-h. Chu and A. George, Inside the FFT Black Box: Serial and Parallel Fast Fourier Transform Algorithms, ser. Computational Mathematics Series.

• [3] C. Du, G. Bai, and X. Wu, “High-Speed Polynomial Multiplier Architecture for Ring-LWE Based Public Key Cryptosystems,”

## References

• [4] P. Longa, M. Naehrig, P. Longa, and M. Naehrig, “Speeding up the Number Theoretic Transform for Faster Ideal Lattice-Based Cryptography,” in Cryptology and Network Security - CANS 2016, vol. 10052.

• [5] T. Poppelmann and T. Guneysu, “Towards Efficient Arithmetic for Lattice-Based Cryptography on Reconfigurable Hardware,” in LATINCRYPT 2012.

• [6] T. Oder and T. Guneysu, “Implementing the NewHope-Simple Key Exchange on Low-Cost FPGAs,” in LATINCRYPT 2017, Havana, Cuba, Sep. 2017

## References

• [7] P.-C. Kuo et al., “High Performance Post-Quantum Key Exchange on FPGAs,”Cryptology ePrint Archive 2017/690, Feb. 2018

• [8] E. Homsirikamol and K. Gaj, “Hardware Benchmarking of Cryptographic Algorithms Using HLS Tools: The SHA-3 Contest Case Study” in ARC 2015

• [9] E. Homsirikamol and K. Gaj, "A new HLS-based methodology for FPGA benchmarking of candidates in cryptographic competitions: The CAESAR contest case study” inFPT 2017