
Arecibo Binary Pulsar Search "CUDAfication"

General

The code base is available in separate branches of my private ABP git repo (as the official repo is still private, too). I'll merge the code back into the mainline as soon as the experimental phase is completed. The "Basic" approach (see below) can already be found in the official repo (src/cuda/app).

Status

The following overview covers the approaches listed below.

Linux version (Timing done on gpu06 using a "Tesla C1060" with 933.12 GFLOPS):
| *Approach* | *Revision* | *Builds?* | *Runs?* | *Validates?* | *Host Memory* | *CPU Timing* | *GPU Timing* | *Gain factor* |
| CPU only | n/a | yes | yes | yes | 107 MB | 371m17.769s | n/a | n/a |
| FFTW3 / CUFFT | Basic | yes | yes | yes | 102 MB | n/a | 174m18.538s | ~2.1 |
| FFTW3 / CUFFT | Batch (chunks of 5) | yes | yes | yes | 294 MB | n/a | 171m41.472s | ~2.2 |
| FFTW3 / CUFFT | Batch (chunks of 25) | yes | yes | yes | 1.2 GB | n/a | 174m23.515s | ~2.1 |
| FFTW3 / CUFFT | Batch (chunks of 50) | yes | yes | yes | 2.4 GB | n/a | 187m50.079s | ~2.0 |
| FFTW3 / CUFFT | Batch + GPU Resampling | n/a | n/a | n/a | n/a | n/a | n/a | n/a |
| CPU only (O3 + SSE) | n/a | yes | yes | yes | 111 MB | 311m35.854s | n/a | ~1.2 (non-SSE CPU) |
| FFTW3 / CUFFT / O3 + SSE | Basic + PS-Kernel | yes | yes | yes | pending | n/a | 166m15.787s | ~1.9 |

Windows version (Timing done on gpu08 using a "GeForce GTX 285"):
| *Approach* | *Revision* | *Builds?* | *Runs?* | *Validates?* | *Host Memory* | *CPU Timing* | *GPU Timing* | *Gain factor* |
| CPU only | n/a | yes | yes | yes | pending | 08:00 hours | n/a | n/a |
| FFTW3 / CUFFT | Basic | yes | yes | yes | pending | n/a | 15:38 hours | ~2.0 |

Mac OS X (Intel) version (Timing done on douglas using a "GeForce GT 120" with 52.80 GFLOPS):
| *Approach* | *Revision* | *Builds?* | *Runs?* | *Validates?* | *Host Memory* | *CPU Timing* | *GPU Timing* | *Gain factor* |
| CPU only | n/a | yes | yes | yes | 106 MB | 327m40.132s | n/a | n/a |
| FFTW3 / CUFFT | Basic | yes | yes | yes | 101 MB | n/a | 159m50.491s | ~2.1 |

Approaches

The following sections roughly describe the different approaches that are planned, tried, or tested.

FFTW3 / CUFFT substitution

An easy way to CUDAfy demod_binary.c is to substitute the currently used FFT library (FFTW3) with the CUDA equivalent (CUFFT).

Same basic algorithm - Status: tested

This is probably the fastest way to incorporate CUDA support into our current search code, but it can only serve as a first example (for testing purposes). As we keep all the main loops as they are used in the original version, not much performance gain is expected due to the many inefficient PCI bus transfers. In fact, this version could even result in a larger wall-clock time than the CPU version. Still, it's a good starting point to get our hands dirty.

Algorithm:
  • Loop over all templates (process single templates - serial)
    • Resampling on host
    • Mean padding on host
    • Copy single resampled time series from host to device
    • FFT on device
    • Copy single FFT data from device to host
    • Harmonic summing on host
    • Update screensaver power spectrum on host (could be omitted on non-BOINC version!)
    • Store candidates on host
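Assuming the host-side helpers keep their current role, the loop above might be sketched like this (resample_on_host, harmonic_summing and nsamples are placeholder names, not the actual demod_binary.c identifiers):

```c
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cufft.h>

/* Sketch only: one FFT per template, all pre-/post-processing on the host.
   resample_on_host() and harmonic_summing() are placeholders for the real
   host-side routines. */
void process_templates(const float *timeseries, int nsamples, int ntemplates)
{
    cufftReal    *d_in;
    cufftComplex *d_out;
    cufftComplex *fft_result = malloc(sizeof(cufftComplex) * (nsamples / 2 + 1));
    cufftHandle   plan;

    cudaMalloc((void **)&d_in,  sizeof(cufftReal) * nsamples);
    cudaMalloc((void **)&d_out, sizeof(cufftComplex) * (nsamples / 2 + 1));
    cufftPlan1d(&plan, nsamples, CUFFT_R2C, 1);

    for (int t = 0; t < ntemplates; t++) {
        const float *resampled = resample_on_host(timeseries, nsamples, t); /* incl. mean padding */

        /* One host->device and one device->host copy per template: these
           transfers are the PCI bus bottleneck mentioned above. */
        cudaMemcpy(d_in, resampled, sizeof(cufftReal) * nsamples, cudaMemcpyHostToDevice);
        cufftExecR2C(plan, d_in, d_out);
        cudaMemcpy(fft_result, d_out, sizeof(cufftComplex) * (nsamples / 2 + 1),
                   cudaMemcpyDeviceToHost);

        harmonic_summing(fft_result, nsamples, t); /* candidate selection on host */
    }

    cufftDestroy(plan);
    cudaFree(d_in);
    cudaFree(d_out);
    free(fft_result);
}
```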

Changed algorithm part I (FFT batch processing) - Status: tested

A first improvement to the previous approach could be to operate in batch mode. CUDA provides means to run 1D-FFTs in batches. This means we could compile a number of resampled time series, transfer them en bloc to the device and FFT them in parallel (in addition to the FFT itself running in parallel!). Once one set is finished, the whole set is transferred back to the host where harmonic summing and candidate selection follow. The actual batch/chunk size (number of templates, i.e. number of resampled time series) obviously depends on the available memory, hence should be determined dynamically based on the given hardware capabilities/resources. Note: a single resampled time series requires approx. 12 MB.

Algorithm:
  • Loop over all templates (process templates in chunks - serial)
    • Loop over template chunk (process single template - serial)
      • Resampling on host
      • Mean padding on host
    • Copy resampled time series of current chunk from host to device
    • Batch FFT of current chunk on device
    • Copy FFT data of current chunk from device to host
    • Loop over template chunk (process single template - serial)
      • Harmonic summing on host
      • Update screensaver power spectrum on host (could be omitted on non-BOINC version!)
      • Store candidates on host
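The core change compared to the basic approach is the batched CUFFT plan; a sketch (chunk, nsamples and the buffer names are placeholders):

```c
/* Sketch: one CUFFT plan covering `chunk` transforms of length nsamples each.
   CUFFT executes the whole batch with a single call; the buffers simply hold
   `chunk` contiguous series back to back. */
cufftHandle plan;
cufftPlan1d(&plan, nsamples, CUFFT_R2C, chunk);          /* batched plan */

cudaMemcpy(d_in, h_chunk, sizeof(cufftReal) * nsamples * chunk,
           cudaMemcpyHostToDevice);                      /* one bulk transfer */
cufftExecR2C(plan, d_in, d_out);                         /* `chunk` FFTs at once */
cudaMemcpy(h_out, d_out, sizeof(cufftComplex) * (nsamples / 2 + 1) * chunk,
           cudaMemcpyDeviceToHost);

cufftDestroy(plan);
```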

This approach has one major drawback: the time series chunks as well as the FFT chunks have to be stored on the host which increases the memory footprint significantly! This is a problem for usage in a BOINC environment where the user can impose RAM constraints. These limits determine the actual chunk size. To be clarified:
  • Are these limits defined on a per-client basis (as opposed to a per-task basis)?
  • Can an app find out its own RAM allowance so that the app can adjust to it dynamically (hence optimize efficiency)?
  • Related: how does the CUFFT plan memory consumption scale with increased FFT/batch sizes?
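Independent of those open questions, the chunk size could be derived from a given RAM allowance along these lines (a sketch; the factor of two for input plus FFT output chunks and the cap are assumptions, the ~12 MB per series is the figure quoted above):

```c
#include <stddef.h>

/* Returns how many resampled time series fit into the given RAM allowance,
   assuming each series occupies `per_series` bytes on the host twice
   (time series chunk + FFT output chunk), capped at `max_chunk`. */
static size_t chunk_size(size_t ram_allowance, size_t per_series, size_t max_chunk)
{
    size_t fit = ram_allowance / (2 * per_series);
    if (fit < 1)
        return 1;                      /* always process at least one template */
    return fit < max_chunk ? fit : max_chunk;
}
```

With a 256 MB allowance and 12 MB per series this yields chunks of 10; a tight allowance degenerates to the basic one-by-one approach.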

Findings:
  • This is not faster than the trivial approach without batch FFTs
  • Different batch sizes don't change the situation significantly
  • Second loop (harmonic summing, candidate selection) seems to be the limiting factor now
  • The previous findings indicate that the following improvement (batch + resampling kernel) won't improve the situation either

Changed algorithm part II (FFT batch processing incl. time series resampling) - Status: ruled out (for the time being)

A further improved version of the previous approach could run the resampling part of the algorithm on the device. This would require a custom kernel, which wasn't required by the previous approaches. However, this way data transfer could be reduced significantly, as the original time series data would be transferred once, as opposed to once per template (22161 times!). We need to find out whether the actual FFT would have to be part of the kernel (custom implementation) or if we could still use the CUFFT implementation (see the alternative step in the algorithm below). I'd prefer the latter.

Algorithm:
  • Copy original time series from host to device
  • Copy template bank from host to device
  • Loop over all templates (process templates in chunks - serial)
    • Loop over template chunk (process single templates - parallel)
      • Run kernel on device
        • Resampling on device
        • Mean padding on device
        • FFT on device (custom FFT?)
      • Alternatively: FFT on device (CUFFT operating on the pre-processed kernel data?)
      • Copy FFT chunk data from device to host
      • Harmonic summing on host
      • Update screensaver power spectrum on host (could be omitted on non-BOINC version!)
      • Store candidates on host
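The resampling kernel itself could follow a one-thread-per-output-bin pattern; a rough sketch (OrbitParams and orbit_delay stand in for the actual template phase model and are pure assumptions):

```c
/* Sketch: one thread per output bin. Each thread computes which input bin its
   orbit-corrected time maps to and copies that sample, or writes the mean for
   padding. orbit_delay() is a placeholder for the real template phase model. */
__global__ void resample_kernel(const float *in, float *out, int n,
                                float mean, float dt, OrbitParams params)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n)
        return;

    float t = i * dt + orbit_delay(i * dt, params);  /* demodulated time */
    int   j = (int)(t / dt + 0.5f);                  /* nearest input bin */

    out[i] = (j >= 0 && j < n) ? in[j] : mean;       /* mean padding outside range */
}
```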

Custom Kernel (implementing whole template processing chain)

This approach would try to implement the whole template processing chain as a custom kernel. The goal is to be able to process each template from A to Z independently, and thus process as many templates in parallel as possible (open question: how to store intermediate candidates / maintain the candidate list?).

How to proceed in this direction

  1. (Replace CUFFT calls with custom FFT kernel based on http://forums.nvidia.com/index.php?showtopic=69801)
    • This might speed up the FFT part by a factor of 3 compared to CUFFT
    • This gives us experience in using custom kernels (in particular regarding the build toolchain and potential cross-compile problems)
    • Update: this is not going to happen, as this FFT implementation supports only sizes up to 8k and is no longer maintained
  2. If the FFT-only kernel builds and runs on all platforms we can move more and more parts of the current template loop into the kernel
    • Even serial code is recommended to be run on the device (avoiding additional host<->device transfers)
    • Thus let's start by "just" moving resampling, power spectrum and harmonic summing to device (one by one)
  3. Parallelize the device code itself where possible (there are a number of loops)

Notes

  • Use in-place FFT to save device memory (done!)
  • Use cudaHostAlloc (for input/output buffers) to increase host<->device bandwidth (done!, including write-combined mode where useful)
  • Both batched FFT approaches need to take into account the available global device memory to determine the chunk size
  • Use OpenGL buffer objects to pass power spectrum data from device to screensaver
  • I tested CUDA 2.3 because CUFFT has reportedly received significant performance improvements - unfortunately not for our current setup (gain: ~2.2)
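For reference, the cudaHostAlloc note above amounts to something like this (a sketch; buffer names and sizes are placeholders):

```c
/* Sketch: pinned (page-locked) buffers speed up host<->device copies.
   Write-combined memory is fast for the host->device input buffer but very
   slow to read back on the host, so the output buffer uses the default flags. */
float *h_in, *h_out;
cudaHostAlloc((void **)&h_in,  bytes_in,  cudaHostAllocWriteCombined);
cudaHostAlloc((void **)&h_out, bytes_out, cudaHostAllocDefault);
/* ... use h_in/h_out in cudaMemcpy calls ... */
cudaFreeHost(h_in);
cudaFreeHost(h_out);
```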

Relevant known CUDA issues

OS independent

  • OpenGL interoperability
    • OpenGL cannot access a buffer that is currently mapped. If the buffer is registered but not mapped, OpenGL can do any requested operations on the buffer.
    • Deleting a buffer while it is mapped for CUDA results in undefined behavior.
    • Attempting to map or unmap while a different context is bound than was current during the buffer register operation will generally result in a program error and should thus be avoided.
    • Interoperability will use a software path on SLI
    • Interoperability will use a software path if monitors are attached to multiple GPUs and a single desktop spans more than one GPU (i.e. WinXP dualview).

Windows XP / Vista

  • Malloc may fail due to running out of virtual memory space. The address space limitation is fixed by a Microsoft issued hotfix. Please install the patch located at http://support.microsoft.com/kb/940105 if this is an issue. Windows Vista SP1 includes this hotfix.

Windows XP

  • Individual GPU program launches are limited to a run time of less than 5 seconds on a GPU with a display attached. Exceeding this time limit usually causes a launch failure reported through the CUDA driver or the CUDA runtime. GPUs without a display attached are not subject to the 5 second runtime restriction. For this reason it is recommended that CUDA be run on a GPU that is NOT attached to a display and does not have the Windows desktop extended onto it. In this case, the system must contain at least one NVIDIA GPU that serves as the primary graphics adapter.

Windows Vista

  • In order to run CUDA on a non-TESLA GPU, either the Windows desktop must be extended onto the GPU, or the GPU must be selected as the PhysX GPU.
  • Individual kernels are limited to a 2-second runtime by Windows Vista. Kernels that run for longer than 2 seconds will trigger the Timeout Detection and Recovery (TDR) mechanism. TDR can be disabled in the registry
        Windows Registry Editor Version 5.00
    
        [HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers]
        "TdrLevel"=dword:00000000

Linux

  • Individual GPU program launches are limited to a run time of less than 5 seconds on a GPU with a display attached. Exceeding this time limit causes a launch failure reported through the CUDA driver or the CUDA runtime. GPUs without a display attached are not subject to the 5 second run time restriction. For this reason it is recommended that CUDA is run on a GPU that is NOT attached to an X display.
  • In order to run CUDA applications, the CUDA module must be loaded and the entries in /dev created. This may be achieved by initializing X Windows, or by creating a script to load the kernel module and create the entries (see release notes for details)

Mac OS X

  • OpenGL interop will always use a software path leading to reduced performance when compared to interop on other platforms.
  • CUDA kernels which do not terminate, or run without interruption for several tens of seconds, may trigger the GPU to reset, causing a disruption of any attached displays. The display image may become corrupted; this disappears upon a reboot.
  • If a GPU is used without a display attached, it may not exit a reduced power state, causing CUDA programs to perform poorly when run on that GPU. Cycling the system's power saving state or rebooting should reset the GPU. In general it is best to use a GPU with a display attached.
  • The kernel driver may leak wired (i.e. unpageable) memory if CUDA applications terminate in unexpected ways. Continued leaks will lead to severely degraded system performance and require a reboot to fix.
  • On systems with multiple GPUs installed or systems with multiple monitors connected to a single GPU, OpenGL interoperability always copies shared buffers through host memory.
  • The MacBook Pro currently presents both GPUs as available for use in Performance mode. This is incorrect behavior, as only one GPU is available at a time. CUDA applications that try to run on the second GPU (device ID 1) will potentially hang. This hang may be terminated by pressing ctrl-C or closing the offending application.
  • The following APIs exhibit high CPU utilization if they wait for the hardware for a significant amount of time. As a workaround, apps may use cu(da)StreamQuery and/or cu(da)EventQuery to check whether the GPU is busy and yield the thread as desired.
    • cuCtxSynchronize
    • cuEventSynchronize
    • cuStreamSynchronize
    • cudaThreadSynchronize
    • cudaEventSynchronize
    • cudaStreamSynchronize
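The suggested query-and-yield workaround could look like this on the runtime API (a sketch; my_kernel and its launch configuration are placeholders, sched_yield is the POSIX call from <sched.h>):

```c
/* Sketch of the workaround: record an event after the kernel launch and poll
   it, yielding the CPU instead of busy-waiting in cudaThreadSynchronize(). */
cudaEvent_t done;
cudaEventCreate(&done);
my_kernel<<<grid, block>>>(args);   /* placeholder kernel launch */
cudaEventRecord(done, 0);
while (cudaEventQuery(done) == cudaErrorNotReady)
    sched_yield();                  /* give up the CPU time slice while waiting */
cudaEventDestroy(done);
```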
Attachment: app_info.xml (1 K, 21 Jul 2009) - Example of app_info.xml (required to run the CUDA version as beta/anonymous app under BOINC control)
Topic revision: r55 - 13 Jul 2018, OliverBock