This page has moved here: Einstein@Home generic postprocessing tools

General Remarks on the Prototype for a New, Generic E@H Postprocessing Pipeline Software

The general idea behind this software is to provide tools that make E@H postprocessing more generic, so that code does not need to be rewritten from scratch for each analysis of each run.

To achieve this, the following ideas were explored in a prototype:

  • Common Configuration file: All parameters and constants that are relevant for the postprocessing should be held in a single configuration file. The file is then passed to each program/script that is part of the pipeline, so it can access the setup of the run and of the analysis (see the description of the configuration file below).

  • Flexible input format: The prototype currently supports two input file formats: ASCII files (with fields separated by a single space) in gzip-compressed (ending in *.gz) or uncompressed form. It might be worth looking into more formats (e.g. FITS) in the future. The number and meaning of the fields (columns) in the input candidate files is not fixed in the code; instead it is part of the configuration file (see below).

  • More general, reusable code: For example, the code should be prepared to work with higher orders (> 2) of the frequency derivative.

  • Separation of "Plumbing" and "Science" code: The new postprocessing pipeline software collection provides a "plumbing" toolbox of ready-made classes that implement the generic workflow, such as: reading candidate files, filtering out candidates, combining and resorting several input streams of candidates, etc. In a plug-in fashion, the scientists in charge of postprocessing will use these building blocks to build complex scripts, extending the pre-made components where necessary (e.g. by providing code for a predicate that decides whether a candidate should be filtered out or not).

  • High performance: The goal is to minimize the tasks that need to be run distributed via Condor on the cluster, in order to allow a more exploratory postprocessing where needed: run a pipeline stage, inspect results, rethink the choice of parameters, run again, and so on.

  • Visualization: To assist with the evaluation of results and as a quality control measure, the software will provide some convenience scripts to quickly plot the raw output data if needed.

  • The prototype was implemented in Java; however, the final software will be written in Python for easy scripting by the scientists in charge of the postprocessing. Note, however, that only the "Science" code segments of the scripts need to be touched (if at all) for a new analysis project; the "plumbing" part of the code should not need any modification unless some new functionality is required.
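As an illustration of the "plumbing"/"science" separation, here is a minimal Python sketch. All names here are hypothetical, not taken from the prototype: a generic filter (plumbing) is parameterized by a project-specific predicate (science).

```python
# Hypothetical sketch of the plug-in idea: generic plumbing code that
# filters a candidate stream, with the science-specific decision supplied
# as a predicate function.

def filter_candidates(candidates, keep_predicate):
    """Plumbing: keep only the candidates for which the predicate is true."""
    return [c for c in candidates if keep_predicate(c)]

# "Science" part: a project-specific predicate, e.g. a simple 2F threshold
# (the threshold value is purely illustrative).
def loud_enough(candidate, threshold=6.0):
    return candidate["avg2F"] >= threshold

candidates = [
    {"f": 59.54, "avg2F": 7.2},
    {"f": 59.55, "avg2F": 4.1},
]
survivors = filter_candidates(candidates, loud_enough)
```

A new analysis project would only swap out the predicate; the plumbing stays unchanged.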

Configuration File

The configuration file is in XML format. A more human-friendly format could be supported as well if needed. The configuration file has three sections; each section contains data that is configured or needs to be edited at a specific time of an E@H project:

Constants Section

This section contains (in most cases astrophysical) constants that should be defined identically across the scripts doing the postprocessing, so they should be kept in one place instead of being hard-coded in several places. The contents of this section should, in most cases, never change during the course of the postprocessing, except for adding new constants as needed when writing new postprocessing pipeline components. The following snippet lists the part of the configuration file with the constants for the S5GC1 postprocessing:

   <constants>
      <constant name="c">299792458.0</constant>     <!-- m/s -->
      <constant name="vOrbMax">2.9785e04</constant> <!-- m/s -->
      <constant name="vSpinMax">465.10</constant>   <!-- m/s -->
   </constants>

Note: text between <!-- and --> is treated as a comment in XML. This example defines three constants with the names "c", "vOrbMax" and "vSpinMax" respectively, which can then be used by software of the postprocessing pipeline. You can add additional constants by inserting a new line like this:

   <constants>
...
      <constant name="myNewConstantName">some value</constant>   <!-- m/s -->
   </constants>
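For illustration, the constants section can be read with Python's standard library XML parser. This is only a sketch (the prototype's actual code may differ); the tag and attribute names follow the snippet above.

```python
# Sketch: parse the <constants> section into a name -> value dictionary.
import xml.etree.ElementTree as ET

xml_text = """
<config>
   <constants>
      <constant name="c">299792458.0</constant>
      <constant name="vOrbMax">2.9785e04</constant>
      <constant name="vSpinMax">465.10</constant>
   </constants>
</config>
"""

root = ET.fromstring(xml_text)
# Each <constant> child contributes one entry, keyed by its name attribute.
constants = {el.get("name"): float(el.text)
             for el in root.find("constants")}
```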

Run-Setup Section

This section covers settings that are fixed by the design of the E@H run, so this section in most cases needs to be edited only once at the beginning of a postprocessing project.

Currently the following values are supported; more will follow as more pipeline stages are implemented:
   <run-setup>
      <run-name>S6Bucket</run-name> <!-- symbolic name for the run, e.g. S5R5, S6Bucket etc. Must match the result file name prefix --> 
      <obs-datetime-start>818845553</obs-datetime-start>  <!-- start of observation in sec, GPS time-->
      <obs-datetime-ref>847063082.5</obs-datetime-ref>    <!-- reference time of candidates in sec, GPS time-->
      <obs-datetime-end>875280612</obs-datetime-end> <!-- end of observation in sec, GPS time-->
      <parameter-space>
         <!-- boundaries of the overall parameter space, e.g. for range checks 
            etc -->
         <limits>
            <freq-min>50</freq-min>
            <freq-max>1400</freq-max>
            <f1dot-min>-1.20794e-9</f1dot-min>
            <f1dot-max>9.44829e-10</f1dot-max>
            <f2dot-min>0</f2dot-min>
            <f2dot-max>0</f2dot-max>
            <f3dot-min>0</f3dot-min>
            <f3dot-max>0</f3dot-max>
         </limits>
         <sky type="ALL"></sky>
      </parameter-space>
   </run-setup>
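As a sketch of how such settings might be consumed (hypothetical code, with tag names taken from the run-setup snippet above), a range check on a candidate frequency could look like this:

```python
# Sketch: load the <limits> block and range-check a candidate frequency.
import xml.etree.ElementTree as ET

xml_text = """
<run-setup>
   <parameter-space>
      <limits>
         <freq-min>50</freq-min>
         <freq-max>1400</freq-max>
      </limits>
   </parameter-space>
</run-setup>
"""

limits = ET.fromstring(xml_text).find("parameter-space/limits")
freq_min = float(limits.findtext("freq-min"))
freq_max = float(limits.findtext("freq-max"))

def in_range(freq):
    """True if the frequency lies inside the configured parameter space."""
    return freq_min <= freq <= freq_max
```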

Postprocessing Analysis-Setup Section

This section covers settings that are configuring the postprocessing itself but are not already defined by the setup of the run (see previous section). The following code is an example for the currently defined settings, more will be added later as more postprocessing stages are implemented.

    <!-- all parameters that are specific to the way we analyse the data should 
      go into this section -->
   <analysis-setup>
      <analysis-name>PP1S5GC1</analysis-name> <!-- just some symbolic name. E.g. there could be more than one postprocessing pipelines per run -->
      <input-data>
         <candidate-format>
            <type>ascii</type>
            <columns> <!-- names for each column in the candidate files that are relevant. Note column indexing starts with 1 (one) -->    
               <column nr="1">f</column>
               <column nr="2">alpha</column>
               <column nr="3">delta</column>
               <column nr="4">f1dot</column>
               <column nr="5">nc</column>
               <column nr="6">avg2F</column>
               <column nr="7">avg2Ffg</column>
               <column nr="8">avg2Fdet1</column>
                <column nr="9">avg2Fdet2</column>
            </columns> 
                                <!-- Some fields are mandatory for EVERY E@H candidate, here we define in which column each can be found -->
            <sky-alpha>2</sky-alpha>   <!-- equatorial sky coordinates, RA, in rad. This defines that this coordinate can be found in the 2nd column --> 
            <sky-delta>3</sky-delta>     <!-- equatorial  sky coordinates, declination, in rad -->
            <frequency order="0">1</frequency> <!-- frequency (SSB frame at Tref) , in Hz, "zero-order derivative" of frequency -->
            <frequency order="1">4</frequency> <!-- first order derivative of freq., in Hz/s . negative means spin-down, positive spin-up -->
             <!-- add higher orders of spindown here if needed --> 
            <primary-detection-stat>6</primary-detection-stat>  <!-- The primary detection statistic. It is assumed that candidate files are sorted in descending order of this field  -->
                                <!-- additional detection statistics can be defined as well : -->
            <aux-detection-stat nr="1">5</aux-detection-stat>
            <aux-detection-stat nr="2">7</aux-detection-stat>
                                <aux-detection-stat nr="3">8</aux-detection-stat> 
                                <aux-detection-stat nr="4">9</aux-detection-stat>          
         </candidate-format>
      </input-data>
   </analysis-setup>
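To illustrate how the configured column mapping might be used (a hypothetical sketch, not the prototype's code), one line of a space-separated ASCII candidate file can be turned into a dictionary like this:

```python
# Sketch: the <columns> configuration as a dict of 1-based column index
# to field name, applied to one candidate line (all values illustrative).
columns = {1: "f", 2: "alpha", 3: "delta", 4: "f1dot",
           5: "nc", 6: "avg2F", 7: "avg2Ffg", 8: "avg2Fdet1", 9: "avg2Fdet2"}

def parse_candidate(line):
    """Split a candidate line on single spaces and name the fields."""
    fields = line.split(" ")
    # Column indexing in the configuration starts at 1, hence nr - 1.
    return {name: float(fields[nr - 1]) for nr, name in columns.items()}

cand = parse_candidate("59.539 1.234 -0.5 -1.0e-10 5 7.5 8.1 6.0 6.3")
```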

Note: When implementing a postprocessing pipeline, make sure to add the config.xml file, along with the other source code you used, to the revision control system to make the postprocessing repeatable by others.

Ready-to-use pipeline tools (so far; stay tuned for more to come)

LineRemovalLoudestNBatch (="Find the loudest N candidates and remove candidates potentially affected by instrument lines")

The command line options are as follows (see the next sections for a detailed description):
    LineRemovalLoudestNBatch {config-file} {input-file-listfile} {output-file} {instrument-lines-file} {N} {use-sky-specific-doppler-drift} {remove-lines-first}

The command line parameters are explained below:

  • {config-file} (mandatory): the configuration file as described above

  • {input-file-listfile} (mandatory): a file that lists the input files containing the candidates, one per line. Optionally each line also contains (separated by a space) lower and upper bounds for the frequency of a candidate (at time Tref). (Rationale: lower and upper bounds on the frequency allow the code to pre-filter the list of line bands that are applicable for this file, which increases performance.) The format of each line in this file is therefore
{inputfile-path (in double quotes)} {lower bound on freq. in Hz (optional)} {upper bound on freq. in Hz (optional)}
Here is an example:
"/home/einstein/EinsteinAtHome_Runs/S6Bucket/results/sorted_canonical/0050/S6Bucket_0059.539.band.gz" 59.5390 59.5900
"/home/einstein/EinsteinAtHome_Runs/S6Bucket/results/sorted_canonical/0050/S6Bucket_0059.589.band.gz" 59.5890 59.6400
"/home/einstein/EinsteinAtHome_Runs/S6Bucket/results/sorted_canonical/0050/S6Bucket_0059.639.band.gz" 59.6390 59.6900
"/home/einstein/EinsteinAtHome_Runs/S6Bucket/results/sorted_canonical/0050/S6Bucket_0059.689.band.gz" 59.6890 59.7400
"/home/einstein/EinsteinAtHome_Runs/S6Bucket/results/sorted_canonical/0050/S6Bucket_0059.739.band.gz" 59.7390 59.7900
"/home/einstein/EinsteinAtHome_Runs/S6Bucket/results/sorted_canonical/0050/S6Bucket_0059.789.band.gz" 59.7890 59.8400
"/home/einstein/EinsteinAtHome_Runs/S6Bucket/results/sorted_canonical/0050/S6Bucket_0059.839.band.gz" 59.8390 59.8900
"/home/einstein/EinsteinAtHome_Runs/S6Bucket/results/sorted_canonical/0050/S6Bucket_0059.889.band.gz" 59.8890 59.9400
"/home/einstein/EinsteinAtHome_Runs/S6Bucket/results/sorted_canonical/0050/S6Bucket_0059.939.band.gz" 59.9390 59.9900
"/home/einstein/EinsteinAtHome_Runs/S6Bucket/results/sorted_canonical/0050/S6Bucket_0059.989.band.gz" 59.9890 60.0400

Note1: Input files ending in .gz are assumed to be in gzipped format and are decompressed on the fly while reading from them.

Note 2: The software assumes that each input file is sorted on the primary detection statistic (as configured in the configuration file, see above) in descending order.

Note 3: A simple script to help generate such filelists is provided (collect_filelist.sh).

Note 4: Using lower and upper bounds of the frequency range to filter the instrument lines requires that you properly set the limits of the parameter space on f1dot, f2dot etc. in the configuration file (see above).
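A hypothetical sketch of parsing one line of such a listfile in Python (the quoted path plus the optional frequency bounds); the path below is illustrative:

```python
# Sketch: parse one listfile line of the form
#   "path" [lower-freq-bound] [upper-freq-bound]
# shlex.split strips the double quotes around the path for us.
import shlex

def parse_listfile_line(line):
    parts = shlex.split(line)
    path = parts[0]
    lo = float(parts[1]) if len(parts) > 1 else None
    hi = float(parts[2]) if len(parts) > 2 else None
    return path, lo, hi

path, lo, hi = parse_listfile_line(
    '"/data/S6Bucket_0059.539.band.gz" 59.5390 59.5900')
```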

  • {output-file} (mandatory): The path of a file to which the surviving candidates will be written. The output file consists of lines copied from the input files (after decompression where necessary); therefore it has the same format as the input files.

  • {instrument-lines-file} (mandatory): An ASCII file containing instrument line bands, one per line. Here is an example:
1.000000000 1000.000000000  0.999919400  1.000080600 
46.700000000  1.000000000 46.693200000 46.706800000 
48.000000000  1.000000000 47.960000000 48.040000000 
The meaning of the individual fields is as follows:
  • Field 1: center frequency in Hz (currently ignored)
  • Field 2: Number of harmonics (positive integer): a number n means that up to the nth harmonic of the band should be considered. If lo and hi are the bounds of a band, (see the next two fields), this means that the bands [2*lo,2*hi], [3*lo,3*hi] .... [n*lo,n*hi] are also considered to be instrument line bands.
  • Field 3: lower bound of the band
  • Field 4: upper bound of the band

The entries in this file need not be sorted.

Note: An empty file (no instrument lines) is allowed as input. However, in this case there are probably more efficient ways to compute the desired output (the loudest N candidates, see below).
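The harmonic expansion described for field 2 can be sketched as follows (illustrative code, not the prototype's): a band [lo, hi] with n harmonics expands to the bands [k*lo, k*hi] for k = 1..n.

```python
# Sketch: expand an instrument line band up to its n-th harmonic,
# as described for field 2 of the instrument-lines file.
def expand_harmonics(lo, hi, n_harmonics):
    return [(k * lo, k * hi) for k in range(1, n_harmonics + 1)]

# The 46.7 Hz example band above has 1 harmonic, so it expands to itself.
bands = expand_harmonics(46.6932, 46.7068, 1)
```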

  • {N} (mandatory): The number of loudest candidates to compute from the candidates in the input files.

  • {use-sky-specific-doppler-drift} (0 or 1, optional): This flag toggles between two available algorithms to compute whether a candidate signal could be affected by a given instrument line band: either 0 ("don't use sky-specific Doppler drift formula" = compute the Doppler drift of the signal in the detector frame using the worst-case Doppler drift) or 1 ("use sky-specific Doppler drift formula" = take the candidate signal's sky position into account when computing the maximum Doppler drift that this signal experiences in the detector frame). The latter setting is more selective in rejecting candidates.

  • {remove-lines-first} (0 or 1, optional): This flag toggles between two alternative filter algorithms, differing in the order of the steps "eliminate candidates affected by line bands" and "filter out the loudest N candidates". The value 0 causes the software to first identify the loudest N candidates from all input files and only then remove candidates affected by lines; therefore the number of candidates in the final output can be less than the parameter N. The value 1 reverses this order: it produces the (up to) N loudest candidates that are not affected by instrument lines. In this case the output can have fewer than N candidates only when there are fewer than N candidates in total that are unaffected by lines; otherwise the output will always have exactly N elements.
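The two orderings can be sketched as follows (illustrative Python with a placeholder line-band test; not the prototype's actual code):

```python
# Sketch of the two filter orderings controlled by {remove-lines-first}.
import heapq

def loudest_n(cands, n, stat="avg2F"):
    # The real tool merges several files pre-sorted on the primary
    # detection statistic; nlargest stands in for that merge here.
    return heapq.nlargest(n, cands, key=lambda c: c[stat])

def in_line_band(cand):
    return cand["line"]  # placeholder for the real line-band check

def pipeline(cands, n, remove_lines_first):
    if remove_lines_first:
        survivors = [c for c in cands if not in_line_band(c)]
        return loudest_n(survivors, n)   # exactly n, unless fewer survive
    survivors = loudest_n(cands, n)      # may shrink below n afterwards
    return [c for c in survivors if not in_line_band(c)]

cands = [{"avg2F": 9.0, "line": True},
         {"avg2F": 8.0, "line": False},
         {"avg2F": 7.0, "line": False}]
```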

Note: In order for the instrument line filtering to work properly, make sure to set up meaningful values for the constants "c" (speed of light in vacuum), "vOrbMax" (maximum orbital velocity of Earth in its orbit around the Sun) and "vSpinMax" (maximum velocity of a point on the surface of Earth at any of the detector sites, in a frame centered on the barycenter of Earth).
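To illustrate why these constants matter (a simplified sketch, not the prototype's actual algorithm): in the worst case, without the sky-specific formula, a line band's edges can be widened by the maximum Doppler factor v/c, with v = vOrbMax + vSpinMax.

```python
# Illustrative worst-case Doppler broadening of a line band: a signal at
# frequency f can be shifted by up to f * v/c in the detector frame.
c = 299792458.0        # speed of light in vacuum, m/s
v_orb_max = 2.9785e4   # max orbital velocity of Earth, m/s
v_spin_max = 465.10    # max surface velocity of Earth, m/s

def doppler_broadened(lo, hi):
    v = v_orb_max + v_spin_max
    return lo * (1.0 - v / c), hi * (1.0 + v / c)

lo, hi = doppler_broadened(46.6932, 46.7068)
```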

Note: TODO: It might be good to have a third option where the number N is scaled by the relative overlap of the frequency interval with the vetoed bands; e.g. if we are looking for the loudest 100 candidates per 0.5 Hz band, but 50% of a band is vetoed by lines, we would instead look for the loudest 50 candidates only, to avoid a bias towards quieter signals in bands heavily overlapped by instrument lines. It is not immediately clear, though, how to compute this "overlap", as the effective width of a vetoed band depends on the spindown of the candidates (and potentially on the sky position as well, see the previous option). More thinking required.

S6BLineRemovalLoudestNBatchConsistVeto (="Find the loudest N candidates, remove candidates potentially affected by instrument lines and apply the 2F consistency veto")

This script is an extension of LineRemovalLoudestNBatch, specifically for the S6Bucket run, where recalculated avg2F values for the individual detectors were added to the results so that a consistency veto can be applied later in the postprocessing (the veto is not applied in the search application itself).

The command line options are as follows (see the next sections for a detailed description):
    S6BLineRemovalLoudestNBatchConsistVeto {config-file} {input-file-listfile} {output-file} {instrument-lines-file} {N} {use-sky-specific-doppler-drift} {remove-lines-first} {apply-consistency-veto}

The meaning of the parameters is exactly the same as in LineRemovalLoudestNBatch if the last boolean parameter is omitted or set to 0 (=false).

  • {apply-consistency-veto} (0 or 1, optional): If set to 1, candidates will be filtered out that satisfy the following condition: the recomputed, fine-grid avg2F value is less than at least one of the recomputed single-detector avg2F values ("consistency veto").

Note: If the parameter {remove-lines-first} is set to 1, the consistency veto is also applied first, and then the {N} loudest candidates surviving line removal and veto are produced. Otherwise ({remove-lines-first} set to 0), the veto is applied after selecting the N loudest candidates.
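Assuming the column names from the configuration example above (avg2Ffg for the recomputed fine-grid value, avg2Fdet1 and avg2Fdet2 for the recomputed single-detector values), the veto condition can be sketched as:

```python
# Sketch of the 2F consistency veto: a candidate is vetoed when its
# recomputed multi-detector avg2F is smaller than at least one of the
# recomputed single-detector values. Field names are assumed from the
# candidate-format configuration example, not confirmed by the prototype.
def consistency_vetoed(cand):
    return cand["avg2Ffg"] < max(cand["avg2Fdet1"], cand["avg2Fdet2"])

vetoed = consistency_vetoed(
    {"avg2Ffg": 6.0, "avg2Fdet1": 7.5, "avg2Fdet2": 5.0})
```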

collect_filelist.sh

A tiny helper script to generate input-listfiles (see the description of the second parameter of LineRemovalLoudestNBatch). The standard usage is as follows: use ls -1 {some file mask} to select the paths of the input files you want to process, and pipe them into the helper script collect_filelist.sh. The first argument is the configuration file for the pipeline (see above) and the second is the offset to add to the frequency in the filename to compute the upper limit of the candidate frequency contained in the file.

Example:
ls -1 /home/einstein/EinsteinAtHome_Runs/S5GC1/results/sorted_canonical/1210/*.gz  | ./collect_filelist.sh config.xml  0.051

Note: If files have bandwidth 0.05 Hz, a file named (say) S5GC1_1219.267.band.gz will contain frequencies of at least 1219.267 Hz, but might (because of rounding) contain candidates slightly above 1219.267 + 0.05 Hz, therefore it's necessary to add 0.051 as an offset in this example to get a valid upper limit.

Note: Make sure the configuration setting <run-name> is set to match the prefix of the files (S5GC1 in this example).

quick_plot_cand.sh (="generate a 'quick look' scatter plot from a file of candidates")

Syntax:
quick_plot_cand.sh {config-file} {candidate-file} {output-file} {aux-axis}
where the first command line argument is the pipeline configuration file (see above), and the second is a file containing candidates in the format configured in the configuration file. The third argument is the filename of the output file for the plot; the format of the output is deduced from the file extension (supported extensions are jpg, jpeg, gif, png, pdf and eps, implying the respective graphics formats). The optional fourth argument selects a color coding option for the plot; the following values are supported:
  • f1dot : spindown

  • dec : abs(declination), so that candidates at the (equatorial) poles will be mapped to the maximum value of pi/2 and will be color coded in red, and candidates at the equator will be color coded black

  • LAT : abs(latitude) in ecliptic coordinates, so that candidates at the (ecliptic) poles will be mapped to the maximum value of pi/2 and will be color coded in red, and candidates in the ecliptic plane will be color coded black
While these plots do not replace a computational analysis, the color-coded plots in particular can help to detect clustering of candidates by visual inspection.

Getting the prototype software

Currently the prototype is located in a git repository that can be cloned like this :
git clone git://gitmaster.atlas.aei.uni-hannover.de/einsteinathome/genericpostprocessing.git

For convenience, the compiled files, scripts and examples are stored in a tarball in prototype/latest_build/EAH_PP_proto.tar.gz in this repository. Unpack to a location you prefer for your experiments.

Some scripts require Java 1.5 or higher. It appears that the Java installation on some ATLAS head nodes is broken; however, atlas3 at least seems to work. Also required are xmllint, GNU awk and bash.

Licence info

The prototype currently makes use of the following (open source) Java packages

-- HeinzBerndEggenstein - 20 Sep 2012

Topic revision: r13 - 28 Apr 2017, HeinzBerndEggenstein