General execution flow (Tim's understanding as of Sept 19, 2008):
- ComputeFStatBatch (new function)
- Marshal data as arrays: SFTs and related parameters. We might have to create new data structures, but everything is fairly obvious; most things can be flattened, and the things that would not get coalesced memory accesses are accessed fairly infrequently anyway. Open question: do we marshal on the CPU, or make lots of small cudaMemcpys to the GPU? Something to look at.
- Allocate data on the GPU and copy it over from the marshaled region (or straight from the original arrays)
- For each skypoint in skypoints:
- Precompute AE coefficients and SSB times. In theory we could do this on the card, but it would mean porting much more code (XLALBarycenter and friends), and if we use streams appropriately it is essentially free to do on the CPU
- Copy AE coeffs and SSB times to the GPU
- Run the kernel. Inside the kernel:
- Load as many SFTs as desired (a parameter) into shared memory
- n threads operate on one SFT in parallel (ComputeFaFb). Basically, each thread moves across horizontally and stores its result in a partial output buffer.
- We can keep reusing the same partial output buffer as an accumulation buffer if we process only one SFT at a time; otherwise we will need an additional reduction. Which is better depends on the number of stacks
- The mapping is basically one thread per element of the output buffer, with one block per stack (if there are enough stacks) or one block per SFT (followed by an additional reduction at the end). Since there are 2000 elements in the output vector, each thread will have to iterate (so one thread is actually responsible for more than one element).
- Perform reductions as needed, copy results.
Additional work:
- Write a stripped-down ComputeFStatFreqBand that calls ComputeFStatBatch
- Maybe check CFSv2 to see whether this can be used there as well
- Check precision requirements
- Build a test harness that can call ComputeFStatBatch on a very simple test case
Bruce's cost estimates of a CUDA-enabled cluster
This ignores the practical issues of whether we can actually fit the boards into our compute nodes, and whether we can carry away the heat.
- Current compute nodes: 800 Euros / 40 Gflops = 20 Euros/Gflop. This ignores the fact that we also get 500 GB of storage and 8 GB of RAM. These are easy to program for and develop on: vanilla C/C++/Matlab.
- NVIDIA C870: 1000 Euros / 500 Gflops = 2 Euros/Gflop
- NVIDIA D870: 5000 Euros / 1000 Gflops = 5 Euros/Gflop
- NVIDIA S870: 8000 Euros / 2000 Gflops = 4 Euros/Gflop
The ideal would be to purchase relatively inexpensive consumer NVIDIA cards, such as:
- NVIDIA GTX 260 (1 GB memory, 500 Gflops): 200 Euros = 0.40 Euros/Gflop