# **Two-Electron Integral Evaluation on FPGA, Cell and GPU accelerators**

Guochun Shi, Volodymyr Kindratenko (University of Illinois), Ivan Ufimtsev, Todd Martinez (Stanford University)

$$(\mu\nu|\lambda\sigma) = \sum_{p=1}^{N_{\mu}} \sum_{q=1}^{N_{\nu}} \sum_{r=1}^{N_{\lambda}} \sum_{s=1}^{N_{\sigma}} d_{\mu p} d_{\nu q} d_{\lambda r} d_{\sigma s} [pq|rs]$$

$$[s_{1}s_{2}|s_{3}s_{4}] = \frac{\pi^{3}}{AB\sqrt{A+B}} K_{12}(\vec{\mathbf{R}}_{12}) K_{34}(\vec{\mathbf{R}}_{34}) F_{0}\left(\frac{AB}{A+B}[\vec{\mathbf{R}}_{\mathbf{P}}-\vec{\mathbf{R}}_{\mathbf{Q}}]^{2}\right)$$

$$A = \alpha_{1} + \alpha_{2}, \quad B = \alpha_{3} + \alpha_{4}, \quad F_{0}(t) = \frac{\mathrm{erf}(\sqrt{t})}{\sqrt{t}},$$

$$\vec{\mathbf{R}}_{\mathbf{k}\mathbf{l}} = \vec{\mathbf{R}}_{\mathbf{k}} - \vec{\mathbf{R}}_{\mathbf{l}}, \vec{\mathbf{R}}_{\mathbf{P}} = \frac{\alpha_{1}\vec{\mathbf{R}}_{1} + \alpha_{2}\vec{\mathbf{R}}_{2}}{A}, \vec{\mathbf{R}}_{\mathbf{Q}} = \frac{\alpha_{3}\vec{\mathbf{R}}_{3} + \alpha_{4}\vec{\mathbf{R}}_{4}}{B},$$

$$K_{ij}\left(\vec{\mathbf{R}}_{ij}\right) = \exp\left(-\frac{\alpha_{i}\alpha_{j}}{\alpha_{i} + \alpha_{j}}\left[\vec{\mathbf{R}}_{i} - \vec{\mathbf{R}}_{j}\right]^{2}\right)$$
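The formulas above can be sketched directly in code. The following is a minimal, unoptimized Python illustration, not the accelerator kernels described on this poster. The function names, the `(center, [(d, alpha), ...])` shell layout, and the small-`t` cutoff used to avoid dividing by zero are all assumptions made for the example.

```python
import math

def f0(t):
    """F0(t) = erf(sqrt(t)) / sqrt(t); its t -> 0 limit is 2 / sqrt(pi)."""
    if t < 1e-12:  # assumed cutoff, only to avoid 0/0 at t = 0
        return 2.0 / math.sqrt(math.pi)
    root = math.sqrt(t)
    return math.erf(root) / root

def dist2(u, v):
    """Squared Euclidean distance between two 3-vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def k_factor(ai, ri, aj, rj):
    """K_ij = exp(-ai*aj/(ai+aj) * |Ri - Rj|^2)."""
    return math.exp(-ai * aj / (ai + aj) * dist2(ri, rj))

def weighted_center(ai, ri, aj, rj):
    """Exponent-weighted center, used for R_P and R_Q."""
    return tuple((ai * x + aj * y) / (ai + aj) for x, y in zip(ri, rj))

def primitive_ssss(a1, r1, a2, r2, a3, r3, a4, r4):
    """Primitive integral [s1 s2 | s3 s4] over unnormalized s Gaussians."""
    A = a1 + a2
    B = a3 + a4
    rp = weighted_center(a1, r1, a2, r2)
    rq = weighted_center(a3, r3, a4, r4)
    prefactor = math.pi ** 3 / (A * B * math.sqrt(A + B))
    t = A * B / (A + B) * dist2(rp, rq)
    return prefactor * k_factor(a1, r1, a2, r2) * k_factor(a3, r3, a4, r4) * f0(t)

def contracted_ssss(mu, nu, lam, sig):
    """Contracted integral: coefficient-weighted sum over all primitive quartets.

    Each shell is (center, [(d, alpha), ...]), a layout assumed for this sketch.
    """
    total = 0.0
    for dp, ap in mu[1]:
        for dq, aq in nu[1]:
            for dr, ar in lam[1]:
                for ds, a_s in sig[1]:
                    total += dp * dq * dr * ds * primitive_ssss(
                        ap, mu[0], aq, nu[0], ar, lam[0], a_s, sig[0])
    return total
```

As a sanity check, four unit-exponent primitives on the same center give $t = 0$ and $A = B = 2$, so the formula reduces to $\pi^{5/2}/4 \approx 4.373$, which `primitive_ssss` reproduces.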







| Characteristic                                     | Value |
|----------------------------------------------------|-------|
| GFLOPS                                             | 3     |
| Bandwidth between host memory and PE memory (GB/s) | 2     |
| Local memory to PE bandwidth (GB/s)                | 8     |
| Frequency (GHz)                                    | 1     |
| # of processing elements                           | 1     |



| Characteristic                                     | Value |
|----------------------------------------------------|-------|
| GFLOPS                                             | 24    |
| Bandwidth between host memory and PE memory (GB/s) | 1.    |
| Local memory to PE bandwidth (GB/s)                | 4.    |
| Frequency (GHz)                                    | 0.    |
| # of processing elements                           |       |

## **GPU Implementation**

| Mapping to GPU | Description                                  | Granularity    | Load balance       |
|----------------|----------------------------------------------|----------------|--------------------|
| 1B1CI          | One thread block <-> one contracted integral | Intermediate   | Intermediate       |
| 1T1CI          | One thread <-> one contracted integral       | Coarse-grained | Not balanced       |
| 1T1PI          | One thread <-> one primitive integral        | Fine-grained   | Perfectly balanced |

Figure (flowchart): load data from the input files; allocate CPU and GPU memory. Grid labels: [ket<sub>00</sub>], [ket<sub>01</sub>].
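The three mappings differ only in which unit of GPU work owns which primitive integrals. The sketch below emulates that ownership on the CPU in plain Python, using precomputed primitive values as stand-ins for the real integrals. It is an illustration under assumptions: the strided assignment of primitives to the threads of a block, the block-level reduction, the offset lookup in the one-thread-per-primitive case, and all function names are hypothetical and not taken from the poster's kernels.

```python
from bisect import bisect_right
from itertools import accumulate

def one_thread_one_ci(prims):
    """1T1CI: thread i loops over every primitive of contracted integral i."""
    return [sum(p) for p in prims]

def one_block_one_ci(prims, block_size=4):
    """1B1CI: block b owns contracted integral b; its threads share the primitives."""
    results = []
    for p in prims:
        # Thread j takes primitives j, j + block_size, ... (assumed strided split).
        partial = [sum(p[j::block_size]) for j in range(block_size)]
        # Summing the partials stands in for a block-level reduction.
        results.append(sum(partial))
    return results

def one_thread_one_pi(prims):
    """1T1PI: thread t owns exactly one primitive; results are summed afterwards."""
    ends = list(accumulate(len(p) for p in prims))  # end offset of each integral
    flat = [value for p in prims for value in p]
    results = [0.0] * len(prims)
    for t, value in enumerate(flat):
        owner = bisect_right(ends, t)  # contracted integral that primitive t belongs to
        results[owner] += value
    return results

# Two contracted integrals with different contraction lengths.
primitive_values = [[1.0, 2.0, 3.0], [4.0, 5.0]]
print(one_thread_one_ci(primitive_values))
print(one_block_one_ci(primitive_values))
print(one_thread_one_pi(primitive_values))
```

All three functions return the same contracted values. What changes is the amount of work per thread: in the one-thread-per-contracted-integral case it follows the contraction length, while in the one-thread-per-primitive case it is always a single primitive.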



### Results

Evaluation of two-electron repulsion integrals for a 64 H atom system using the STO-6G basis set

| Accelerator type | Mapping | Kernel run time (s) | CPU pre-calculation (s) | GPU-CPU transfer (s) | Overall runtime (s) | Overall speedup+ |
|------------------|---------|---------------------|-------------------------|----------------------|---------------------|------------------|
| FPGA (SRC-6)     |         |                     |                         |                      | 42.85               | 2.6              |
| GPU (8800 GTX)   | 1B1CI   | 1.608               | 0.012                   | 0.012                | 1.632               | 68               |
| GPU (8800 GTX)   | 1T1CI   | 1.099               | 0.012                   | 0.012                | 1.123               | 100              |
| GPU (8800 GTX)   | 1T1PI   | 2.863               | 0.012                   | 0.012                | 2.88                | 39               |
| Cell BE          |         | 1.25                | 0.089                   | 0*                   | 1.34                | 84               |

\* On the Cell B/E, double buffering hides the communication behind the computation.

\+ Speedups are relative to the CPU code running on a 2.33 GHz Intel Xeon, compiled with icc. The CPU code runs in 112.55 seconds.
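The speedup column can be checked against the stated CPU reference time. The short sketch below assumes that speedup is simply the CPU runtime divided by the overall accelerator runtime, and that the overall runtime is the sum of kernel, pre-calculation, and transfer times. Both relations are inferred from the numbers rather than stated on the poster, and the variable names are illustrative.

```python
CPU_SECONDS = 112.55  # reference CPU runtime from the footnote

# name: (overall runtime in seconds, reported speedup), copied from the table
reported = {
    "FPGA (SRC-6)": (42.85, 2.6),
    "GPU 1B1CI": (1.632, 68),
    "GPU 1T1CI": (1.123, 100),
    "GPU 1T1PI": (2.88, 39),
    "Cell BE": (1.34, 84),
}

def speedup(overall_seconds):
    """Assumed definition: CPU runtime divided by accelerator overall runtime."""
    return CPU_SECONDS / overall_seconds

for name, (overall, listed) in reported.items():
    print(f"{name:13s} computed {speedup(overall):7.2f}  listed {listed}")

# Rows with every component listed: kernel + CPU pre-calculation + transfer.
components = {
    "GPU 1T1PI": (2.863, 0.012, 0.012, 2.88),
    "Cell BE": (1.25, 0.089, 0.0, 1.34),
}
for name, (kernel, precalc, transfer, overall) in components.items():
    print(f"{name:13s} sum {kernel + precalc + transfer:.3f}  listed {overall}")
```

Under these assumptions the computed speedups agree with the listed ones to within about one unit, and the component sums match the overall runtimes to within 0.01 s.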

### Conclusions

The best speedup was obtained on the GPU platform (100x with the 1T1CI mapping), closely followed by the Cell B/E based system (84x). The speedup obtained on the FPGA system was marginal (2.6x).

The FPGA's performance is limited by its low operating frequency (100 MHz) and by the limited raw resources available for implementing multiple instances of the compute kernel.

The Cell processor achieves a higher percentage of its peak FLOPS than the GPU does.

Programming efforts for GPU and Cell B/E platforms center on parallelizing the code to take advantage of SIMD architecture. Programming efforts on the FPGA platform center on merging nested loops and pipelining the innermost loop. All three implementations resulted in a considerable expansion of the code base.
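One common reading of "merging nested loops" is flattening the four summation loops over primitives into a single loop whose counter is decoded back into the four indices, which leaves one loop body to pipeline. The Python sketch below shows only that index arithmetic. It assumes this interpretation, and it is not the poster's FPGA code; the function names are hypothetical.

```python
def nested_order(n_p, n_q, n_r, n_s):
    """Reference: four nested loops over the primitive indices."""
    visited = []
    for p in range(n_p):
        for q in range(n_q):
            for r in range(n_r):
                for s in range(n_s):
                    visited.append((p, q, r, s))
    return visited

def merged_order(n_p, n_q, n_r, n_s):
    """Single merged loop: one counter, decoded into (p, q, r, s) each iteration."""
    visited = []
    for i in range(n_p * n_q * n_r * n_s):
        rest, s = divmod(i, n_s)   # s varies fastest, as in the innermost loop
        rest, r = divmod(rest, n_r)
        p, q = divmod(rest, n_q)
        visited.append((p, q, r, s))
    return visited

# Both versions visit the same index quartets in the same order.
print(nested_order(2, 3, 2, 2) == merged_order(2, 3, 2, 2))
```

In hardware the decoding would more likely be done with cascaded counters than with division. The point of the sketch is only that a single loop covers the same iteration space.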







