Implementing and Benchmarking Number Theoretic Transform in Lattice-based Post-Quantum Cryptography using SW/HW Codesign and High Level Synthesis

*Duc T. Nguyen*, Viet B. Dang and Kris Gaj <dnguye69,vdang6,[email protected]>


  • Post-Quantum Cryptography (PQC) is cryptography scheme run in classical computers, replacement of modern Cryptography such as RSA, ECC…​ due to the threat of Quantum Computer

NIST PQC Round 2
  • 12 Lattice-base PQC candidates

  • 5 of them implement Number Theoretic Transform (NTT) for polynomial multiplication

  • Polynomial multiplication is major execution function

  • ⇒ We implement NTT as hardware accelerator for NewHope and Kyber.

Number Theoretic Transform (NTT)

  • NTT is similar to FFT and has complexity of O(n log n)

  • To multiply 2 polynomials, we do:

\$C = A \times B = \text{NTT}^{-1}(\text{NTT}(A) \circ \text{NTT}(B))\$

\$\circ\$ point-wise multiplication ⇒ straight forward

\$\text{NTT}, \text{NTT}^{-1}\$ Forward and Inverse NTT operation ⇒ not easy

High Level Synthesis (HLS)

Pragmas are heavily depend on the code structure and algorithm, most important pragmas:

  • Pipeline

  • Unroll

  • Array Partition

Implement Number Theoretic Transform in HLS

  • In [12] Kawamura, K., et. al., 8 strategies were applied at 1st, 2nd, 3rd loops.


Block Diagram vs. Space Exploration Development

Block Diagram/HLS (BD/HLS)

Develop presumed optimal hardware architecture, write HLS code follow this block diagram.

Space Exploration/HLS (SE/HLS)

Use design space exploration based on HLS directives, the final hardware architecture is unknown until the best result is achieved

SE/HLS and BD/HLS inherit the advantages of HLS

Shorter development and verification time compare to traditional RTL.



BD/HLS vs. SE/HLS: Resources

[12] Kawamura, K., et. al.

Lower is better

BD/HLS vs. SE/HLS: Speedup

[12] Kawamura, K., et. al.

Lower is better

Transfer Overhead

Bare Metal versus SDSoc transfer overhead affect over total execution time.

  • Bare Metal used in-house DMA kernel

  • SDSoC transfer driver handled by tool

  • Same transfer size between CPU and FPGA in Bare Metal and SDSoc

Transfer Overhead: Encapsulation

Smaller is better

Transfer Overhead: Decapsulation

Smaller is better

Resources utilization for HLS and RTL

  • RTL and HLS implementation use the same number of DSP and BRAM

  • RTL achieved 500 Mhz in Kyber-R1 and Kyber-R2 implementation

  • Result benchmark on Zynq UltraScale ZCU104

HLS vs RTL resources utilization

  • HLS increases resources in range 3-70% in LUT, Slices, FF and decrease frequency about 5-11%.

Speedup of Software/Hardware Co-Design vs Software Only

RTL and SDSoC implementation run at their maximum frequency

  • Both has similar clock cycles

  • RTL + Bare Metal: High frequency and minimal transfer overhead

  • SDSoC + HLS: Slower frequency and higher transfer overhead

  • Both approaches give very close result.

Speedup: Encapsulation

Higher is better

Speedup: Decapsulation

Higher is better


  • BD/HLS approach give similar speedup as traditional RTL implementation.

  • BD/HLS has smaller in term of resources utilization and faster in term of cycle count than SE/HLS

  • BD/HLS is much more efficient than SE/HLS in NTT implementation


  • [1] F. Farahmand, V. B. Dang, D. T. Nguyen, and K. Gaj, “Evaluating the potential for hardware acceleration of Four NTRU-based key encapsulation mechanisms using Software/Hardware Codesign,” in PQCrypto, 2019, pp. 23–43.

  • [2] E. C.-h. Chu and A. George, Inside the FFT Black Box: Serial and Parallel Fast Fourier Transform Algorithms, ser. Computational Mathematics Series.

  • [3] C. Du, G. Bai, and X. Wu, “High-Speed Polynomial Multiplier Architecture for Ring-LWE Based Public Key Cryptosystems,”


  • [4] P. Longa, M. Naehrig, P. Longa, and M. Naehrig, “Speeding up the Number Theoretic Transform for Faster Ideal Lattice-Based Cryptography,” in Cryptology and Network Security - CANS 2016, vol. 10052.

  • [5] T. Poppelmann and T. Guneysu, “Towards Efficient Arithmetic for Lattice-Based Cryptography on Reconfigurable Hardware,” in LATINCRYPT 2012.

  • [6] T. Oder and T. Guneysu, “Implementing the NewHope-Simple Key Exchange on Low-Cost FPGAs,” in LATINCRYPT 2017, Havana, Cuba, Sep. 2017


  • [7] P.-C. Kuo et al., “High Performance Post-Quantum Key Exchange on FPGAs,”Cryptology ePrint Archive 2017/690, Feb. 2018

  • [8] E. Homsirikamol and K. Gaj, “Hardware Benchmarking of Cryptographic Algorithms Using HLS Tools: The SHA-3 Contest Case Study” in ARC 2015

  • [9] E. Homsirikamol and K. Gaj, "A new HLS-based methodology for FPGA benchmarking of candidates in cryptographic competitions: The CAESAR contest case study” inFPT 2017