Arecibo Binary Pulsar Search "CUDAfication"
General
The code base is available in separate branches of my private ABP git repo (as the official one is still private, too). I'll merge the code back into mainline as soon as the experimental phase is completed. The "Basic" approach (see below) can already be found in the official repo (src/cuda/app).
Status
The following overview covers the approaches listed below.

Linux version (Timing done on gpu06 using a "Tesla C1060" with 933.12 GFLOPS):
| Approach | Revision | Builds? | Runs? | Validates? | Host Memory | CPU Timing | GPU Timing | Gain factor |
|----------|----------|---------|-------|------------|-------------|------------|------------|-------------|
| CPU only | n/a | yes | yes | yes | 107 MB | 371m17.769s | n/a | n/a |
| FFTW3 / CUFFT | Basic | yes | yes | yes | 102 MB | n/a | 174m18.538s | ~2.1 |
| FFTW3 / CUFFT | Batch (chunks of 5) | yes | yes | yes | 294 MB | n/a | 171m41.472s | ~2.2 |
| FFTW3 / CUFFT | Batch (chunks of 25) | yes | yes | yes | 1.2 GB | n/a | 174m23.515s | ~2.1 |
| FFTW3 / CUFFT | Batch (chunks of 50) | yes | yes | yes | 2.4 GB | n/a | 187m50.079s | ~2.0 |
| FFTW3 / CUFFT | Batch + GPU Resampling | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| CPU only (O3 + SSE) | n/a | yes | yes | yes | 111 MB | 311m35.854s | n/a | ~1.2 (vs. non-SSE CPU) |
| FFTW3 / CUFFT / O3 + SSE | Basic + PS-Kernel | yes | yes | yes | pending | n/a | 166m15.787s | ~1.9 |
Windows version (Timing done on gpu08 using a "GeForce GTX 285"):
Mac OS X (Intel) version (Timing done on douglas using a "GeForce GT 120" with 52.80 GFLOPS):
Approaches
The following sections roughly describe the different approaches that are planned, tried, and tested.
FFTW3 / CUFFT substitution
An easy way to CUDAfy demod_binary.c is to substitute the currently used FFT library (FFTW3) with the CUDA equivalent (CUFFT).
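As a minimal sketch (assuming a single-precision real-to-complex 1D transform; the names below are placeholders, not the actual identifiers in demod_binary.c), the substitution boils down to moving the data to the device and swapping the library calls:

```c
/* Sketch: replacing a single FFTW3 r2c transform with its CUFFT
 * counterpart. Variable names are placeholders. */
#include <cuda_runtime.h>
#include <cufft.h>

int fft_on_device(const float *h_in, cufftComplex *h_out, int n)
{
    cufftReal    *d_in;
    cufftComplex *d_out;
    cufftHandle   plan;

    cudaMalloc((void **)&d_in,  sizeof(cufftReal) * n);
    cudaMalloc((void **)&d_out, sizeof(cufftComplex) * (n / 2 + 1));

    /* replaces fftwf_plan_dft_r2c_1d(n, in, out, FFTW_ESTIMATE) */
    cufftPlan1d(&plan, n, CUFFT_R2C, 1);

    cudaMemcpy(d_in, h_in, sizeof(cufftReal) * n, cudaMemcpyHostToDevice);
    cufftExecR2C(plan, d_in, d_out);        /* replaces fftwf_execute(plan) */
    cudaMemcpy(h_out, d_out, sizeof(cufftComplex) * (n / 2 + 1),
               cudaMemcpyDeviceToHost);

    cufftDestroy(plan);                     /* replaces fftwf_destroy_plan(plan) */
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```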
Same basic algorithm - Status: tested
This is probably the fastest way to incorporate CUDA support into our current search code, but it can only serve as a first example (for testing purposes).
As we keep all the main loops as they are used in the original version, not much performance gain is expected due to the many inefficient PCI bus transfers. In fact, this version could even result in a larger wall clock time than the CPU version.
Anyway, it's a good starting point to get our hands dirty.
Algorithm:
- Loop over all templates (process single templates - serial)
  - Resampling on host
  - Mean padding on host
  - Copy single resampled time series from host to device
  - FFT on device
  - Copy single FFT data from device to host
  - Harmonic summing on host
  - Update screensaver power spectrum on host (could be omitted in the non-BOINC version!)
  - Store candidates on host
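A rough sketch of this loop (all identifiers hypothetical, not the actual names used in demod_binary.c) makes the transfer overhead explicit: each of the ~22k templates pays one PCIe upload and one download.

```c
#include <cuda_runtime.h>
#include <cufft.h>

/* host-side steps, assumed to exist elsewhere */
extern void resample_on_host(const float *raw, int tmpl, float *out, int n);
extern void mean_pad_on_host(float *series, int n, int n_padded);
extern void harmonic_summing_on_host(const cufftComplex *fft, int n_bins);
extern void store_candidates_on_host(void);

void process_templates(const float *raw, int n_templates,
                       int n, int n_padded,
                       float *h_resampled, cufftComplex *h_fft,
                       cufftReal *d_in, cufftComplex *d_out,
                       cufftHandle plan)
{
    for (int t = 0; t < n_templates; t++) {
        resample_on_host(raw, t, h_resampled, n);       /* host */
        mean_pad_on_host(h_resampled, n, n_padded);     /* host */

        cudaMemcpy(d_in, h_resampled, sizeof(float) * n_padded,
                   cudaMemcpyHostToDevice);             /* PCIe transfer #1 */
        cufftExecR2C(plan, d_in, d_out);                /* device FFT */
        cudaMemcpy(h_fft, d_out, sizeof(cufftComplex) * (n_padded / 2 + 1),
                   cudaMemcpyDeviceToHost);             /* PCIe transfer #2 */

        harmonic_summing_on_host(h_fft, n_padded / 2 + 1);  /* host */
        store_candidates_on_host();                          /* host */
    }
}
```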
Changed algorithm part I (FFT batch processing) - Status: tested
A first improvement over the previous approach is to operate in batch mode. CUDA provides means to run 1D FFTs in batches. This means we could compile a number of resampled time series, transfer them en bloc to the device, and FFT them in parallel (in addition to each FFT itself running in parallel!). When one set is finished, the whole set is transferred back to the host where harmonic summing and candidate selection follow. The actual batch/chunk size (number of templates, i.e. number of resampled time series) obviously depends on the available memory and should hence be determined dynamically based on the given hardware capabilities/resources. Note: a single resampled time series requires approx. 12 MB.
Algorithm:
- Loop over all templates (process templates in chunks - serial)
  - Loop over template chunk (process single template - serial)
    - Resampling on host
    - Mean padding on host
  - Copy resampled time series of current chunk from host to device
  - Batch FFT of current chunk on device
  - Copy FFT data of current chunk from device to host
  - Loop over template chunk (process single template - serial)
    - Harmonic summing on host
    - Update screensaver power spectrum on host (could be omitted in the non-BOINC version!)
    - Store candidates on host
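A minimal sketch of the batched transform, using CUFFT's batched 1D plan interface (buffer names and sizes are hypothetical):

```c
#include <cuda_runtime.h>
#include <cufft.h>

int batch_fft_chunk(const float *h_series, cufftComplex *h_fft,
                    int n, int chunk)
{
    cufftReal    *d_in;
    cufftComplex *d_out;
    cufftHandle   plan;

    cudaMalloc((void **)&d_in,  sizeof(cufftReal) * n * chunk);
    cudaMalloc((void **)&d_out, sizeof(cufftComplex) * (n / 2 + 1) * chunk);

    /* the last argument makes this a batched plan: one cufftExecR2C
       call transforms 'chunk' consecutive series in parallel */
    cufftPlan1d(&plan, n, CUFFT_R2C, chunk);

    cudaMemcpy(d_in, h_series, sizeof(cufftReal) * n * chunk,
               cudaMemcpyHostToDevice);                 /* en-bloc upload */
    cufftExecR2C(plan, d_in, d_out);
    cudaMemcpy(h_fft, d_out, sizeof(cufftComplex) * (n / 2 + 1) * chunk,
               cudaMemcpyDeviceToHost);                 /* en-bloc download */

    cufftDestroy(plan);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```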
This approach has one major drawback: the time series chunks as well as the FFT chunks have to be stored on the host, which increases the memory footprint significantly! This is a problem for usage in a BOINC environment where the user can impose RAM constraints. These limits determine the actual chunk size (see the sketch below the following list). To be clarified:
- Are these limits defined on a per-client basis (as opposed to a per-task basis)?
- Can an app find out its own RAM allowance so that it can adjust to it dynamically (and hence optimize efficiency)?
- Related: how does the CUFFT plan's memory consumption scale with increasing FFT/batch sizes?
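Assuming the app could learn its RAM allowance (which is exactly the open question above), the chunk size could be derived along these lines. The per-template cost model (padded real series plus complex r2c spectrum kept on the host) and the ~100 MB baseline are assumptions based on the figures in the table above, not measured values:

```c
#include <stddef.h>

static int chunk_size_for(size_t ram_allowance, int n_padded)
{
    /* assumption: per in-flight template the host keeps the padded
       real time series plus its complex (r2c) spectrum */
    size_t per_template = (size_t)n_padded * sizeof(float)
                        + ((size_t)n_padded / 2 + 1) * 2 * sizeof(float);

    /* assumption: ~100 MB fixed footprint, cf. the "Basic" row above */
    size_t baseline = (size_t)100 * 1024 * 1024;

    if (ram_allowance <= baseline + per_template)
        return 1;   /* fall back to single-template processing */

    return (int)((ram_allowance - baseline) / per_template);
}
```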
Findings:
- This is not faster than the trivial approach without batch FFTs
- Different batch sizes don't change the situation significantly
- The second loop (harmonic summing, candidate selection) now seems to be the limiting factor
- The previous findings indicate that the following improvement (batch FFT plus GPU resampling kernel) won't improve the situation either, as it only speeds up the steps before that loop
Changed algorithm part II (FFT batch processing incl. time series resampling) - Status: ruled out (for the time being)
A further improved version of the previous approach could run the resampling part of the algorithm on the device. This would require a custom kernel, which wasn't needed by the previous approaches. However, this way data transfer could be reduced significantly, as the original time series data would be transferred only once, as opposed to once per template (22161 times!). We need to find out whether the actual FFT would have to be part of the kernel (custom implementation) or if we could still use the CUFFT implementation (see below, in italics). I'd prefer the latter.
Algorithm:
- Copy original time series from host to device
- Copy template bank from host to device
- Loop over all templates (process templates in chunks - serial)
  - Loop over template chunk (process single templates - parallel)
    - Run kernel on device
      - Resampling on device
      - Mean padding on device
      - FFT on device (custom FFT?)
  - *FFT on device (CUFFT operating on pre-processed kernel data?)*
  - Copy FFT chunk data from device to host
  - Harmonic summing on host
  - Update screensaver power spectrum on host (could be omitted in the non-BOINC version!)
  - Store candidates on host
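A very rough sketch of what such a resampling kernel could look like. The actual demodulation formula in demod_binary.c is certainly more involved, so the orbital time-delay model below is purely a placeholder, the resampling is nearest-neighbour only, and mean padding is omitted for brevity:

```c
/* Hypothetical device-side resampling kernel: one thread per output
 * sample, one template per grid row (launch with a dim3 grid whose
 * y-dimension equals the chunk size). */
__global__ void resample_kernel(const float *d_series, int n_in,
                                const float *d_templates, int n_params,
                                float *d_resampled, int n_out)
{
    int tmpl = blockIdx.y;                             /* template index */
    int i = blockIdx.x * blockDim.x + threadIdx.x;     /* output sample */
    if (i >= n_out)
        return;

    const float *p = d_templates + tmpl * n_params;

    /* placeholder for the real orbital time-delay model tau(t; p) */
    float t_src = (float)i + p[0] * sinf(p[1] * (float)i + p[2]);

    /* nearest-neighbour resampling, clamped at the edges */
    int j = (int)(t_src + 0.5f);
    if (j < 0)     j = 0;
    if (j >= n_in) j = n_in - 1;

    d_resampled[tmpl * n_out + i] = d_series[j];
}
```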
Custom Kernel (implementing the whole template processing chain)
This approach would try to implement the whole template processing chain as a custom kernel. The goal is to be able to process each template from A to Z independently, and thus to process as many templates in parallel as possible (open question: how to store intermediate candidates / maintain the candidate list?).
How to proceed in this direction:
- (Replace CUFFT calls with a custom FFT kernel based on http://forums.nvidia.com/index.php?showtopic=69801)
  - This might speed up the FFT part by a factor of 3 compared to CUFFT
  - This gives us experience in using custom kernels (in particular regarding the build toolchain and potential cross-compile problems)
  - This is not going to happen as this FFT implementation supports only sizes up to 8k and is no longer maintained
- If the FFT-only kernel builds and runs on all platforms, we can move more and more parts of the current template loop into the kernel
  - Even serial code is recommended to be run on the device
  - Thus, let's start by "just" moving resampling, power spectrum and harmonic summing to the device (one by one)
  - Parallelize the device code itself where possible (there are a number of loops)
Notes
- Use in-place FFT to save device memory (done!)
- Use cudaHostAlloc (for input/output buffers) to increase host<->device bandwidth (done!, including write-combined mode where useful; see the sketch below)
- Both batched FFT approaches need to take into account the available global device memory to determine the chunk size
- Use OpenGL buffer objects to pass power spectrum data from device to screensaver
- I tested CUDA 2.3 because CUFFT has reportedly received significant performance improvements - unfortunately they don't show up for our current setup (gain factor still ~2.2)
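A minimal sketch of the two notes marked "done!" above, i.e. pinned (optionally write-combined) host buffers and an in-place device FFT. Names, sizes, and the c2c transform type are illustrative, not taken from the actual code:

```c
#include <cuda_runtime.h>
#include <cufft.h>

int setup_and_fft(size_t bytes_in, size_t bytes_out,
                  float **h_in, cufftComplex **h_out,
                  cufftHandle plan, cufftComplex *d_buf)
{
    /* write-combined pinned memory: fast host->device streaming, but
       slow for CPU reads - use it only for upload-only buffers */
    if (cudaHostAlloc((void **)h_in, bytes_in,
                      cudaHostAllocWriteCombined) != cudaSuccess)
        return -1;

    /* results are read back on the host: default pinned memory */
    if (cudaHostAlloc((void **)h_out, bytes_out,
                      cudaHostAllocDefault) != cudaSuccess)
        return -1;

    /* in-place FFT: input and output share one device buffer,
       roughly halving the global memory needed per transform */
    return cufftExecC2C(plan, d_buf, d_buf, CUFFT_FORWARD) == CUFFT_SUCCESS
           ? 0 : -1;
}
```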
Relevant known CUDA issues
OS independent
- OpenGL interoperability
- OpenGL cannot access a buffer that is currently mapped. If the buffer is registered but not mapped, OpenGL can do any requested operations on the buffer.
- Deleting a buffer while it is mapped for CUDA results in undefined behavior.
- Attempting to map or unmap while a different context is bound than was current during the buffer register operation will generally result in a program error and should thus be avoided.
- Interoperability will use a software path on SLI
- Interoperability will use a software path if monitors are attached to multiple GPUs and a single desktop spans more than one GPU (i.e. WinXP dualview).
Windows XP / Vista
- Malloc may fail due to running out of virtual memory space. The address space limitation is fixed by a Microsoft issued hotfix. Please install the patch located at http://support.microsoft.com/kb/940105 if this is an issue. Windows Vista SP1 includes this hotfix.
Windows XP
- Individual GPU program launches are limited to a run time of less than 5 seconds on a GPU with a display attached. Exceeding this time limit usually causes a launch failure reported through the CUDA driver or the CUDA runtime. GPUs without a display attached are not subject to the 5 second runtime restriction. For this reason it is recommended that CUDA be run on a GPU that is NOT attached to a display and does not have the Windows desktop extended onto it. In this case, the system must contain at least one NVIDIA GPU that serves as the primary graphics adapter.
Windows Vista
Linux
- Individual GPU program launches are limited to a run time of less than 5 seconds on a GPU with a display attached. Exceeding this time limit causes a launch failure reported through the CUDA driver or the CUDA runtime. GPUs without a display attached are not subject to the 5 second run time restriction. For this reason it is recommended that CUDA is run on a GPU that is NOT attached to an X display.
- In order to run CUDA applications, the CUDA module must be loaded and the entries in /dev created. This may be achieved by initializing X Windows, or by creating a script to load the kernel module and create the entries (see release notes for details).
Mac OS X
- OpenGL interop will always use a software path leading to reduced performance when compared to interop on other platforms.
- CUDA kernels which do not terminate, or which run without interruption for several tens of seconds, may trigger the GPU to reset, causing a disruption of any attached displays. This may cause the display image to become corrupted; it will disappear upon a reboot.
- If a GPU is used without a display attached, it may not exit a reduced power state, causing CUDA programs to perform poorly when run on that GPU. Cycling the system's power saving state or rebooting should reset the GPU. In general it is best to use a GPU with a display attached.
- The kernel driver may leak wired (i.e. unpageable) memory if CUDA applications terminate in unexpected ways. Continued leaks will lead to severely degraded system performance and require a reboot to fix.
- On systems with multiple GPUs installed or systems with multiple monitors connected to a single GPU, OpenGL interoperability always copies shared buffers through host memory.
- The MacBook Pro currently presents both GPUs as available for use in Performance mode. This is incorrect behavior, as only one GPU is available at a time. CUDA applications that try to run on the second GPU (device ID 1) will potentially hang. This hang may be terminated by pressing ctrl-C or closing the offending application.
- The following APIs exhibit high CPU utilization if they wait for the hardware for a significant amount of time. As a workaround, apps may use cu(da)StreamQuery and/or cu(da)EventQuery to check whether the GPU is busy and yield the thread as desired (see the sketch after this list).
- cuCtxSynchronize
- cuEventSynchronize
- cuStreamSynchronize
- cudaThreadSynchronize
- cudaEventSynchronize
- cudaStreamSynchronize
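A sketch of the suggested workaround for the runtime API, polling the stream and yielding the thread (POSIX sched_yield shown; the same pattern works for events via cudaEventQuery):

```c
#include <cuda_runtime.h>
#include <sched.h>

/* Poll instead of blocking in cudaStreamSynchronize: yield the CPU
 * to other threads while the GPU is still busy. */
static void sync_stream_with_yield(cudaStream_t stream)
{
    while (cudaStreamQuery(stream) == cudaErrorNotReady)
        sched_yield();
}
```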