Directory hierarchy for LSC files
Storage structure for S4/S5/S6 data (past)
In the past we used paths like these:
H/H1/RDS/C03/L1/H-H1_RDS_C03_L1-822092472-60.gwf
H/97161/H-H1_RDS_R_L1_971619968-64.gwf
SFT/B/H-1_H1_1800SFT_C03hoftCorrectSegsFmin38Hz-874533574-1800.sft
CIT uses this structure in its archive:
S5/strain-L1/LLO/L-L1_RDS_C03_L1-8193/L-L1_RDS_C03_L1-819365103-128.gwf
As things stand now, it's a mess.
Future proposed hierarchy for both HSM and file server storage
Time based data products
The following hierarchy has been proposed to meet these requirements:
- standardized storage hierarchy for file servers as well as our archive on the HSM; as a beneficial side effect, we can use autofs to transparently serve files from the HSM if a file server is unreachable or under too much load
- not too many files/directory entries per directory (max should be of the order of 1000)
- files should also be findable even if we don't have access to LDR's database
- based on discussions with Dan, Stuart, Ed, Jeff, Scott and Martin, we adjusted the layout a bit to also include the "Frame Type"
Thus we came up with the following for time based data products:
Observatory/Type/GPSlead/GPSrem/file
where
- Observatory
- Any Observatory abbreviation (H/I/K/L/V/...) and any combination there-of, e.g. GHLV
- Type
- Type of file, e.g. H1_RDS_R_L1, H1_RDS_C03_L1, 1_H1_1800SFT_C03hoftCorrectSegsFmin38Hz, ...
- GPSlead
- first 4 digits of the GPS time after padding it with leading zeros to 10 digits, e.g. 822092472 -> 0822
- GPSrem
- remaining 6 digits of the 10 digit GPS time, rounded down to a full 1000, e.g. 1234567890 -> 1234 / 567000
- file
- Full regular file name according to convention
Thus the examples from above would now have the following structure:
H/H1_RDS_C03_L1/0822/092000/H-H1_RDS_C03_L1-822092472-60.gwf
H/H1_RDS_R_L1/0971/619000/H-H1_RDS_R_L1_971619968-64.gwf
H/1_H1_1800SFT_C03hoftCorrectSegsFmin38Hz/T/0874/533000/H-1_H1_1800SFT_C03hoftCorrectSegsFmin38Hz-874533574-1800.sft
L/L1_RDS_C03_L1/0819/365000/L-L1_RDS_C03_L1-819365103-128.gwf
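The mapping from file name to directory can be sketched in plain POSIX shell. This is only an illustration, not an existing tool: the function name time_based_path is made up here, and it assumes file names of the form Observatory-Type-GPSstart-Duration.ext with a GPS start second that has no leading zeros (a leading zero would be read as octal by shell arithmetic).

```shell
# Hypothetical helper: print Observatory/Type/GPSlead/GPSrem/file
# for a file named Observatory-Type-GPSstart-Duration.ext
time_based_path() (
    file=$1
    obs=${file%%-*}      # text before the first '-'
    rest=${file#*-}
    type=${rest%%-*}     # text between the first and second '-'
    rest=${rest#*-}
    gps=${rest%%-*}      # GPS start second (assumed: no leading zeros)
    lead=$((gps / 1000000))                  # digits in front of the last six
    rem=$(( (gps % 1000000) / 1000 * 1000 )) # last six digits, rounded down to a full 1000
    printf '%s/%s/%04d/%06d/%s\n' "$obs" "$type" "$lead" "$rem" "$file"
)

time_based_path H-H1_RDS_C03_L1-822092472-60.gwf
# prints H/H1_RDS_C03_L1/0822/092000/H-H1_RDS_C03_L1-822092472-60.gwf
```

Because the path is derived from the file name alone, a file can be located without a lookup in LDR's database.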
Frequency based data products
Mostly the same as before, but as GPS times are rather static, we simply put the files into 100 Hz frequency bands (and add the F directory level). The proposed structure is:
Observatory/Type/Subtype/GPSstart/band/file
where
- Observatory
- Any Observatory abbreviation (H/I/K/L/V/...) and any combination there-of, e.g. GHLV
- Type
- the data product type (SFT)
- Subtype
- F for frequency based data products
- GPSstart
- full 10 digit GPS start second of the data product (as we expect many files with the exact same starting time, this can be used as a reference to S5a, S5b, S6a, ...)
- band
- 4 digit, zero prefixed 100 Hz band, e.g. 0200
For example:
H/SFT/F/1234567890/0200/H-H1_SFT_0245Hz-1234567890-17284566.sft
Note: this might change once a community standard has been proposed.
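A minimal sketch of the frequency based layout, again in POSIX shell. The function name freq_based_path and its argument order are assumptions made for this illustration. The frequency is passed separately as a plain integer in Hz without leading zeros, since this page does not define how it is encoded in the file name. Padding the GPS start second to 10 digits is our reading of "full 10 digit" above.

```shell
# Hypothetical helper: print Observatory/SFT/F/GPSstart/band/file
# usage: freq_based_path OBSERVATORY GPSSTART FREQ_HZ FILE
freq_based_path() (
    obs=$1
    gps=$2    # GPS start second, no leading zeros
    freq=$3   # frequency in Hz as a plain integer, e.g. 245 (not 0245)
    file=$4
    band=$((freq / 100 * 100))   # round down to the 100 Hz band
    printf '%s/SFT/F/%010d/%04d/%s\n' "$obs" "$gps" "$band" "$file"
)

freq_based_path H 1234567890 245 H-H1_SFT_0245Hz-1234567890-17284566.sft
# prints H/SFT/F/1234567890/0200/H-H1_SFT_0245Hz-1234567890-17284566.sft
```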
Exceptions et al.
- The subtype for the SFT type should be T for time based files, i.e. full bandwidth files, and F for frequency based files, i.e. long duration but very limited frequency band.
Distribution
At the time of writing (2013) we have 37 data servers in operation, across which the data is striped. The target server is found by taking the GPS time, dividing by 1000, truncating the result, taking the modulus with respect to 37 and adding 1, i.e.:
target_server=$(printf "d%02d" $((1 + (GPSTIME / 1000) % 37)))
Example:
echo H-H1_NINJA2_GAUSSIAN-875698108-4096.gwf | gawk -F"-" '{FRONT=int($3/1e6); printf "d%02d:/data/LSC/%s/%s/%04d/%06d/%s-%s-%s-%s\n",1+int($3/1000)%37,$1,$2,FRONT,1000*int(($3-1e6*FRONT)/1000),$1,$2,$3,$4}'
d20:/data/LSC/H/H1_NINJA2_GAUSSIAN/0875/698000/H-H1_NINJA2_GAUSSIAN-875698108-4096.gwf
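The server selection on its own can also be sketched without gawk. The function name target_server is made up for this illustration; the 37 servers and the dNN naming are taken from the text above. Since only GPSTIME/1000 enters the formula, all files that share one GPSlead/GPSrem directory end up on the same server, and consecutive 1000 second blocks move on to the next server.

```shell
# Hypothetical helper: print the data server for a GPS start second
# (assumed: no leading zeros, 37 servers named d01..d37)
target_server() (
    gps=$1
    printf 'd%02d\n' $((1 + (gps / 1000) % 37))
)

target_server 875698108   # prints d20
target_server 875698999   # prints d20 (same 1000 second block)
target_server 875699000   # prints d21 (next block)
```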
-- CarstenAulbert - 10 Apr 2012