splitSFTs
The program was originally written to prepare SFT data for Einstein@home, so this page describes it in terms of that project. It was later also used in the PowerFlux search.
Purpose
The current Einstein@home search of S5 data (S5R4) uses SFT data of ~10,000 30-minute time segments from two detectors, 1450 Hz wide (50-1500 Hz), making > 200 GB in total (now arranged in 58,000 SFT files, each 0.05 Hz wide and about 4 MB in size). One limiting factor of Einstein@home is the bandwidth of the clients, so we can't easily ship all the data to the clients at once. However, we don't need to. A single program run on the client (which we call a "workunit") only analyzes data of a very narrow frequency band, so for such a workunit we only need to ship these few frequency bins of each SFT. As handling thousands of small files is very inefficient (on both the server and the client side), we defined a new data file format, the "merged SFT file". This is nothing more than a concatenation of binary SFT files that conform to certain rules:
- same timebase
- same detector
- increasing timestamps
This format is now specified in the "binary SFT file format specification v2", and routines for reading and writing it are implemented in the SFT reference library of Bruce Allen.
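The three rules above can be checked mechanically. The following Python sketch is only an illustration: `SFTHeader` and `is_valid_merged` are hypothetical names, the header is a simplified stand-in rather than the actual SFT v2 header layout, and "increasing" is assumed here to mean strictly increasing start times.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class SFTHeader:
    """Hypothetical, simplified header; NOT the real SFT v2 layout."""
    detector: str    # e.g. "H1"
    timebase: float  # SFT duration in seconds, e.g. 1800.0
    gps_start: int   # GPS start time in seconds


def is_valid_merged(headers: Iterable[SFTHeader]) -> bool:
    """Return True if consecutive headers obey the three rules listed above."""
    previous = None
    for header in headers:
        if previous is not None:
            if header.timebase != previous.timebase:
                return False  # rule: same timebase
            if header.detector != previous.detector:
                return False  # rule: same detector
            if header.gps_start <= previous.gps_start:
                return False  # rule: increasing timestamps (assumed strict)
        previous = header
    return True
```

A writer that appends to an existing merged file would only need to compare the new SFT against the last one already in the file, which is the same pairwise check.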
old way
Previously the merged SFT data files for Einstein@home were constructed from ordinary SFT data using a script that repeatedly called convertToSFTv2 (in LALApps) for every single narrow-band SFT segment, i.e.
- for all frequency segments
  - for all SFTs
    - call convertToSFTv2:
      - open input SFT
      - seek to first desired frequency bin
      - read narrow range of frequency bins
      - write them out to an SFT file
      - close files
  - concatenate all narrow-band SFT files into a merged SFT file
This was very inefficient and in the end literally took weeks to finish. I thought that time would be better spent developing a program that could do this faster.
new idea
The idea of the program is:
- for all input SFTs
  - read the input SFT file (once!)
  - for all frequency segments
    - write out the desired frequency range to the output SFT file (appending(!) if it already exists)
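The loop order above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not how splitSFTs itself is implemented: `split_sft` and `narrowband` are hypothetical names, each input SFT is represented as an in-memory list of floats instead of a real SFT file, and the output files contain raw little-endian doubles without any SFT header.

```python
import os
import struct


def split_sft(bins, band_width, out_dir, prefix="band"):
    """Append each narrow band of one input SFT to its own output file.

    bins:       list of floats standing in for the frequency bins of one SFT
    band_width: number of bins per narrow-band output file
    """
    for first in range(0, len(bins), band_width):
        band = bins[first:first + band_width]
        path = os.path.join(out_dir, f"{prefix}_{first:06d}.bin")
        # Mode "ab" creates the file if it is missing and appends otherwise.
        with open(path, "ab") as out:
            out.write(struct.pack(f"<{len(band)}d", *band))


def narrowband(input_sfts, band_width, out_dir):
    """Outer loop over input SFTs, inner loop over frequency segments."""
    for bins in input_sfts:  # each input is touched exactly once
        split_sft(bins, band_width, out_dir)
```

Because blocks are appended in the order the inputs are processed, feeding the inputs sorted by start time keeps the timestamps in each output file increasing.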
Benefits
This has several advantages:
- saves I/O bandwidth by reading the SFTs only once
- saves the step of concatenating (saving I/O time and space)
- makes it possible to ensure that the merged SFT files are valid
- narrow-band generation can be split up into multiple, possibly parallel calls to the program (generate n chunks that just need to be concatenated at the end)
- sped up the "narrowbanding" for Einstein@home from a few weeks to a few days (on a single machine)
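The point about splitting the work into chunks can be illustrated with a small Python sketch. It assumes the input SFTs are sorted by time and that the chunks are contiguous, so that concatenating the per-chunk outputs in chunk order gives the same result as one single run. `contiguous_chunks` and `merge_band` are hypothetical helper names, and lists of numbers stand in for real SFT data.

```python
def contiguous_chunks(items, n):
    """Split a time-ordered list into at most n contiguous chunks, keeping order."""
    size = max(1, -(-len(items) // n))  # ceiling division, at least 1
    return [items[i:i + size] for i in range(0, len(items), size)]


def merge_band(sfts, lo, hi):
    """Concatenate bins lo..hi-1 of every SFT in order.

    The returned list stands in for one merged narrow-band output file.
    """
    merged = []
    for bins in sfts:
        merged.extend(bins[lo:hi])
    return merged


def merge_band_chunked(sfts, lo, hi, n):
    """Process n chunks independently, then concatenate the results in order."""
    merged = []
    for chunk in contiguous_chunks(sfts, n):
        merged.extend(merge_band(chunk, lo, hi))  # could run in parallel
    return merged
```

In this toy model `merge_band_chunked(sfts, lo, hi, n)` returns the same list as `merge_band(sfts, lo, hi)` for any `n >= 1`, which is the property that makes the final concatenation step sufficient.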
Though originally meant for Einstein@home, this program also turned out to be useful for pre-distributing the data on clusters (ATLAS), where the bandwidth of the file servers limits the speed of the analysis. Some features (e.g. throttling) were added to the program to better suit this purpose.
Code
possible Improvements
- keep output files open (if there aren't too many)
- if an output SFT file already exists, read the parameters of the last SFT in it to make sure the resulting merged SFT file conforms to the specs