*Duc T. Nguyen*, Viet B. Dang and Kris Gaj <dnguye69,vdang6,[email protected]>

Post-Quantum Cryptography (PQC) is cryptography scheme run in classical computers,

**replacement**of modern Cryptography such as RSA, ECC… due to the threat of Quantum Computer

NIST PQC Round 2

12 Lattice-base PQC candidates

5 of them implement Number Theoretic Transform (NTT) for polynomial multiplication

Polynomial multiplication is major execution function

⇒ We implement NTT as hardware accelerator for NewHope and Kyber.

NTT is similar to FFT and has complexity of

**O(n log n)**To multiply 2 polynomials, we do:

\$C = A \times B = \text{NTT}^{-1}(\text{NTT}(A) \circ \text{NTT}(B))\$

\$\circ\$ point-wise multiplication ⇒ **straight forward**

\$\text{NTT}, \text{NTT}^{-1}\$ Forward and Inverse NTT operation ⇒ **not easy**

Pragmas are heavily depend on the code structure and algorithm, most important **pragmas**:

Pipeline

Unroll

Array Partition

In [12] Kawamura, K., et. al., 8 strategies were applied at

**1st, 2nd, 3rd**loops.

- Block Diagram/HLS (BD/HLS)
Develop presumed optimal hardware architecture, write HLS code follow this block diagram.

- Space Exploration/HLS (SE/HLS)
Use design space exploration based on HLS directives, the final hardware architecture is unknown until the best result is achieved

- SE/HLS and BD/HLS inherit the advantages of HLS
Shorter development and verification time compare to traditional RTL.

[12] Kawamura, K., et. al.

Lower is better

[12] Kawamura, K., et. al.

Lower is better

Bare Metal versus SDSoc **transfer overhead** affect over **total execution time**.

Bare Metal used in-house DMA kernel

SDSoC transfer driver handled by tool

Same transfer size between CPU and FPGA in Bare Metal and SDSoc

Smaller is better

Smaller is better

RTL and HLS implementation use the same number of DSP and BRAM

RTL achieved 500 Mhz in Kyber-R1 and Kyber-R2 implementation

Result benchmark on Zynq UltraScale ZCU104

HLS increases resources in range 3-70% in LUT, Slices, FF and decrease frequency about 5-11%.

RTL and SDSoC implementation run at their maximum frequency

Both has similar clock cycles

RTL + Bare Metal: High frequency and minimal transfer overhead

SDSoC + HLS: Slower frequency and higher transfer overhead

Both approaches give

**very close**result.

Higher is better

Higher is better

BD/HLS approach give similar speedup as traditional RTL implementation.

BD/HLS has

**smaller**in term of resources utilization and**faster**in term of cycle count than SE/HLSBD/HLS is much more

**efficient**than SE/HLS in NTT implementation

[1] F. Farahmand, V. B. Dang, D. T. Nguyen, and K. Gaj, “Evaluating the potential for hardware acceleration of Four NTRU-based key encapsulation mechanisms using Software/Hardware Codesign,” in PQCrypto, 2019, pp. 23–43.

[2] E. C.-h. Chu and A. George, Inside the FFT Black Box: Serial and Parallel Fast Fourier Transform Algorithms, ser. Computational Mathematics Series.

[3] C. Du, G. Bai, and X. Wu, “High-Speed Polynomial Multiplier Architecture for Ring-LWE Based Public Key Cryptosystems,”

[4] P. Longa, M. Naehrig, P. Longa, and M. Naehrig, “Speeding up the Number Theoretic Transform for Faster Ideal Lattice-Based Cryptography,” in Cryptology and Network Security - CANS 2016, vol. 10052.

[5] T. Poppelmann and T. Guneysu, “Towards Efficient Arithmetic for Lattice-Based Cryptography on Reconfigurable Hardware,” in LATINCRYPT 2012.

[6] T. Oder and T. Guneysu, “Implementing the NewHope-Simple Key Exchange on Low-Cost FPGAs,” in LATINCRYPT 2017, Havana, Cuba, Sep. 2017

[7] P.-C. Kuo et al., “High Performance Post-Quantum Key Exchange on FPGAs,”Cryptology ePrint Archive 2017/690, Feb. 2018

[8] E. Homsirikamol and K. Gaj, “Hardware Benchmarking of Cryptographic Algorithms Using HLS Tools: The SHA-3 Contest Case Study” in ARC 2015

[9] E. Homsirikamol and K. Gaj, "A new HLS-based methodology for FPGA benchmarking of candidates in cryptographic competitions: The CAESAR contest case study” inFPT 2017