General execution flow (Tim's understanding as of Sept 19, 2008):

  • ComputeFStatBatch (new function)
    • Marshal data as arrays: SFTs and related parameters. We might have to create new data structures, but everything is fairly obvious; most things can be flattened, and the things that would not have coalesced memory accesses are accessed fairly infrequently anyway. The open question is: do we marshal on the CPU and copy once, or make many small cudaMemcpy calls to the GPU? Something to look at.
    • Allocate buffers on the GPU and copy from the marshalled region (or straight from the original arrays)
    • For each skypoint in skypoints:
      • Precompute AE coefficients and SSB times. We could in theory do this on the card, but that would mean porting much more code (XLALBarycenter and friends); if we use streams appropriately, doing it on the CPU is effectively free
      • Copy AE coeffs and SSB times to the GPU
      • Run the kernel. Inside the kernel:
        • Load as many SFTs as desired (a parameter) into shared memory
        • n threads operate on one SFT in parallel (ComputeFaFb): each thread moves horizontally across the frequency bins and stores its result in a partial output buffer.
        • We can keep reusing the same partial output buffer as an accumulation buffer if we process only one SFT at a time; otherwise we will have to do an additional reduction. Which is better depends on the number of stacks.
        • The mapping is basically one thread per element in the output buffer, with one block per stack (if there are enough stacks) or one block per SFT (followed by an additional reduction at the end). Since there are 2000 elements in the output vector, each thread will have to iterate (so one thread is actually responsible for more than one element).
      • Perform reductions as needed, copy results.
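
The marshalling step above can be sketched in plain C. The SFT struct here (an epoch plus an array of complex bins) is a stand-in for the real LAL types, and the function name is hypothetical; the point is the struct-of-arrays layout, which keeps consecutive bins contiguous so consecutive GPU threads get coalesced loads.

```c
/* Hypothetical stand-in for the LAL SFT type: one SFT holds nBins complex bins. */
typedef struct { float re, im; } Complex8;
typedef struct {
    double epoch;      /* GPS start time of this SFT */
    Complex8 *data;    /* nBins frequency bins */
} SFT;

/* Flatten an array of SFTs into contiguous buffers (struct-of-arrays):
 * all real parts in one buffer, all imaginary parts in another, so that
 * consecutive threads reading consecutive bins access adjacent memory. */
static void marshal_sfts(const SFT *sfts, int nSFTs, int nBins,
                         float *re_out, float *im_out, double *epochs_out)
{
    for (int s = 0; s < nSFTs; s++) {
        epochs_out[s] = sfts[s].epoch;
        for (int b = 0; b < nBins; b++) {
            re_out[s * nBins + b] = sfts[s].data[b].re;
            im_out[s * nBins + b] = sfts[s].data[b].im;
        }
    }
}
```

With this layout, the host-to-device transfer is one cudaMemcpy per flat buffer instead of one per SFT, which bears on the "marshal on the CPU vs. many small copies" question above.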
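
The thread-to-element mapping described above (one thread per output element, iterating when there are more elements than threads) is a strided loop, and the per-stack partial buffers then need a final reduction. A serial C emulation of both steps, with buffer layout and names assumed for illustration:

```c
/* Emulate the kernel's mapping: nThreads threads cover nElems output
 * elements by striding, so each thread handles more than one element
 * when nElems > nThreads (e.g. 2000 elements, 256 threads per block). */
static void accumulate_stack(const float *stack_in, float *partial,
                             int nElems, int nThreads)
{
    for (int tid = 0; tid < nThreads; tid++)
        for (int i = tid; i < nElems; i += nThreads)
            partial[i] += stack_in[i];
}

/* Final reduction: sum the per-stack partial buffers into one output vector.
 * partials holds nStacks buffers of nElems floats, back to back. */
static void reduce_stacks(const float *partials, int nStacks, int nElems,
                          float *out)
{
    for (int i = 0; i < nElems; i++) {
        float sum = 0.0f;
        for (int s = 0; s < nStacks; s++)
            sum += partials[s * nElems + i];
        out[i] = sum;
    }
}
```

If only one SFT is processed at a time, accumulate_stack alone suffices (the partial buffer doubles as the accumulator); the separate reduce_stacks pass is only needed in the one-block-per-SFT variant.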
Additional work:
  1. Write a stripped-down ComputeFStatFreqBand that calls ComputeFStatBatch
  2. Maybe check CFSv2 to see if this can be used there as well
  3. Check precision requirements
  4. Build a test harness that can call ComputeFStatBatch on a very simple test case
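
Item 4 might look like the following minimal harness. ComputeFStatBatch's real signature is not fixed yet, so the signature and the identity stub here are placeholders; the harness pattern (trivial deterministic input, comparison against a reference within a tolerance) is the part that carries over.

```c
#include <math.h>

/* Placeholder signature: the real arguments (SFTs, sky positions, etc.) are TBD. */
static void ComputeFStatBatch(const float *input, int n, float *fstat_out)
{
    /* Stub: identity, so the harness has something deterministic to check. */
    for (int i = 0; i < n; i++)
        fstat_out[i] = input[i];
}

/* Minimal harness: feed a trivial input and compare against a reference. */
static int test_compute_fstat_batch(void)
{
    float in[4] = {0.0f, 1.0f, 2.0f, 3.0f};
    float out[4];
    ComputeFStatBatch(in, 4, out);
    for (int i = 0; i < 4; i++)
        if (fabsf(out[i] - in[i]) > 1e-6f)
            return 0;  /* mismatch against reference */
    return 1;          /* all elements within tolerance */
}
```

The tolerance comparison rather than exact equality matters for item 3 as well: single-precision GPU results will not match CPU double-precision reference values bit for bit.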

Bruce's cost estimates of a CUDA-enabled cluster

This ignores the practical issue of whether we can physically fit the boards into our compute nodes, or carry away the heat they generate.

  • Current compute nodes: 800 Euros / 40 Gflops = 20 Euros/Gflop. This ignores the fact that we also get 500 GB of storage and 8 GB of RAM. These machines are also easy to program for and develop on: vanilla C/C++/Matlab.
  • NVIDIA C870: 1000 Euros / 500 Gflops = 2 Euros/Gflop.
  • NVIDIA D870: 5000 Euros / 1000 Gflops = 5 Euros/Gflop.
  • NVIDIA S870: 8000 Euros / 2000 Gflops = 4 Euros/Gflop.

Ideally we would purchase relatively inexpensive consumer NVIDIA cards, such as:

  • NVIDIA GTX 260 (1 GB memory, 500 Gflops): 200 Euros = 0.40 Euros/Gflop.
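
The price-per-Gflop figures above are just price divided by peak Gflops; as a sanity check:

```c
/* Euros per Gflop for each option listed above: price / peak throughput. */
static double euros_per_gflop(double price_euros, double gflops)
{
    return price_euros / gflops;
}
```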
Topic attachments:
  • cuda_1.jpg, 184 K, 19 Sep 2008: CUDA on Blackboard 1
  • cuda_2.jpg, 192 K, 19 Sep 2008: CUDA on Blackboard 2
  • cuda_3.jpg, 189 K, 19 Sep 2008: CUDA on Blackboard 3
Topic revision: r5 - 13 Jul 2018, OliverBock