Principle

The number of the node a particular file lives on is determined by its filename: take the first six hex digits of the md5sum of the name, interpret them as a hexadecimal number and take this modulo the number of nodes. Here is a line of unix commands that does this:
echo filename | md5sum | gawk '{printf "n%04d\n", ( strtonum ( "0x" substr($1,1,6) ) % 1342 ) + 1 }'
Note that
  • the "echo" command adds a linefeed to the filename that the md5sum is calculated from
  • on ATLAS the first node is n0001, hence the "+ 1" at the end
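The same calculation can be wrapped in a small shell function that takes the number of nodes as a parameter. This is merely a convenience sketch of the command above; the function name is made up here:
# sketch: print the node a file name maps to (1342 nodes and the +1 offset are ATLAS-specific)
node_for_name () {
    name="$1" ; nodes="${2:-1342}"
    # as noted above, echo appends a linefeed that enters the md5 sum
    echo "$name" | md5sum | \
        gawk -v n="$nodes" '{printf "n%04d\n", ( strtonum ( "0x" substr($1,1,6) ) % n ) + 1 }'
}
# example: node_for_name some_file_name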

The S5R3 results of Einstein@home have been distributed slightly differently: the md5 sum is taken not of the name of the whole file, but of the name of the "datafile" of that "resultfile" (i.e. the part before the double underscore "__"). This way the result files of a frequency band are kept together on one node.
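As an illustration, the node for an S5R3 result file could then be determined like this. This is only a sketch: the result file name below is made up, and it is assumed that the trailing linefeed added by "echo" enters the md5 sum here as well:
# hypothetical S5R3 result file name; hash only the datafile part before "__"
resultfile="h1_0123.45_S5R3__some_result_0"
echo "$resultfile" | sed 's/__.*//' | md5sum | gawk '{printf "n%04d\n", ( strtonum ( "0x" substr($1,1,6) ) % 1342 ) + 1 }'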

For a little bit of redundancy, the same file is also put on the node "halfway round" the cluster, i.e. half of the total number of nodes away. On an ATLAS node the name of this partner node can be printed with
hostname | awk '{ printf "n%0.4d\n", ( ( substr($1,length($1)-3) + 670 ) % 1342 ) + 1 }'

As the location of a file is derived from its name and there is no such thing as a single directory anymore, it's necessary to keep track of which files are actually there. Here this is done by keeping "indexfiles" and "cachefiles".

Usage

Access to data distributed this way is provided by means of two programs: "cache_length" and "get_file". For a given "cachefile", first call "cache_length" to determine the number of files n in this cache. Then call "get_file" successively with indices 0...n-1 to get an absolute path to each file in the cache. The cachefiles should be generated by the admin, who is then responsible for publishing where they can be accessed.

For Einstein@home results on ATLAS the cachefiles are grouped by the base frequency of the data files of the results and named "cache_" followed by that frequency, e.g. "cache_100.0" (for 100 Hz), "cache_998.5" (for 998.5 Hz) and so forth. The cachefiles for S5R1 results are on every node in /local/distributed/clone/spray/EatH/S5R1/ (and on d03 in /atlas/data/d03/EatH/S5R1_dist/index). The cache files for S5R3 are currently also on d03, accessible from the nodes in /atlas/data/d03/EatH/S5R3_dist/caches, and will be distributed to the nodes in /local/distributed/clone/spray/EatH/S5R3 .

The following sample sh script collects the md5 sums of all files in all caches of the Einstein@home S5R1 results:
# loop through all local S5R1 cache files
for cache in /local/distributed/clone/spray/EatH/S5R1/cache_* ; do
  files=`/home/bema/bin/cache_length "$cache"`
  # loop through all files in that cache, starting with 0
  file=0
  while [ $file -lt $files ]; do
    # that's the get_file call 
    if path=`/home/bema/bin/get_file -m "EatH/S5R1" -c "$cache" -i "$file"`; then
      md5sum "$path" >>/local/bema/S5R1_md5.txt
    else
      echo "get_file failed on path '$path' for cache '$cache' file '$file'" >&2
    fi
    file=`expr $file + 1`
  done 
done

During development the tools mentioned can be found in /home/bema/bin; they will be distributed over the nodes once development has settled a bit.

The source code is available from my personal git repository
git-clone git+ssh://n0.aei.uni-hannover.de/srv/git/public/bema/atlas.git

Distributing Data (for Admins)

  • find a distribution pattern that fits the access pattern of the application
  • generate "cache files", i.e. groups of data files that will be accessed in sequence. The syntax is one line for each datafile in the cache:
    data_file_name;/mount/path/to/first/instance;/mount/path/to/second/instance;/mount/path/to/fallback
    • for the EatH/S5R3 distribution scheme, pipe a list of filenames (even with full paths) into the sort_caches program (git repository). For each input line it writes out a line filename;firstnode;secondnode;fallbackdir which can easily be extended to a cachefile line, e.g. by piping the output through
      awk -F\; '{print $1 ";/atlas/node/" $2 "/distributed/spray/data/EatH/S5R3n;/atlas/node/" $3 "/distributed/spray/data/EatH/S5R3n;/atlas/data/d04/distributed/fallback/EatH/S5R3/" $4}'
    • for a new distribution it's probably easiest to use the node name of the first instance of the file for the fallback directory (for S5R3 the directory name is calculated in sort_caches only because the fallback directories were already there from previous obsolete distributions)
  • create the fallback directories on the data server and (hard-) link the files that belong there according to the entries in the cache files (see the first sketch below this list)
  • make sure the target directories for files and caches exist on all nodes. These would be /local/distributed/spray/data/EatH/S5R1 and /local/distributed/clone/spray/EatH/S5R1, respectively, if EatH/S5R1 is the module.
  • let the nodes fetch the data files from the fallback directories
    • if you followed the advice above for new distributions, nodes below n0672 can simply fetch them from the fallback directories corresponding to their hostname; the other nodes should try to fetch them from their partners in a second step, and must subtract 671 from their node number to get the name of the fallback directory on the data server to fetch from if the partners aren't available (see the second sketch below this list for a simplified version)
  • distribute the cache files to all nodes
  • some inspiration for command lines to do this can be found in Bernds Sandbox
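A minimal sketch of the hard-linking step above, assuming the data files currently sit in one master directory on the data server and that the fourth field of each cache line is the fallback directory as in the syntax above; all paths and the example cache line are placeholders:
# sketch of the hard-link step; all paths are placeholders
# a cache file line is assumed to look like (hypothetical values):
# h1_0100.0_S5R3__some_result;/atlas/node/n0123/distributed/spray/data/EatH/S5R3n;/atlas/node/n0794/distributed/spray/data/EatH/S5R3n;/atlas/data/d04/distributed/fallback/EatH/S5R3/n0123
srcdir=/path/to/master/copy   # where the data files currently live
for cache in /path/to/caches/cache_* ; do
  while IFS=";" read name first second fallback ; do
    # create the fallback directory and hard-link the data file into it
    # (hard links require source and target on the same filesystem)
    mkdir -p "$fallback"
    ln "$srcdir/$name" "$fallback/$name"
  done < "$cache"
done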
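The remaining per-node steps could look roughly like the following, using dsh as in the maintenance scripts below. This is a simplified sketch: it fetches straight from the fallback directories (skipping the "try the partner first" step), and the module name, the dsh group "nodes" and the paths are assumptions taken from the examples above:
#!/bin/sh
# sketch of the per-node steps; module name, dsh group and paths are assumptions
module=EatH/S5R3
files=/local/distributed/spray/data/$module
caches=/local/distributed/clone/spray/$module
fallback=/atlas/data/d04/distributed/fallback/$module

# make sure the target directories for files and caches exist on all nodes
dsh -g nodes -c -M "mkdir -p $files $caches"

# let every node fetch its data files from the fallback directory named after itself;
# nodes above n0671 subtract 671 from their node number first (see above)
dsh -g nodes -c -M "n=\`hostname | sed 's/^n0*//'\`; if [ \$n -gt 671 ]; then n=\`expr \$n - 671\`; fi; cp \`printf '$fallback/n%04d' \$n\`/* $files/"

# distribute the cache files to all nodes
dsh -g nodes -c -M "cp /atlas/data/d03/EatH/S5R3_dist/caches/cache_* $caches/"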

Useful scripts for maintenance

The scripts are located in
git+ssh://name@n0.aei.uni-hannover.de/srv/git/shared/cluster/scripts.git

in the directory DataDistribution

The most useful are:
check_complete_S5R1.sh
It checks whether the S5R1 result files are complete by comparing what is written in the cache files with the files physically present on the hard drive.
#!/bin/sh

cache=/local/distributed/clone/spray/EatH/S5R1
files=/local/distributed/spray/data/EatH/S5R1n
log_s=/root/log/check_compleateS5R1_s.log
log_h=/root/log/check_compleateS5R1_h.log

# number of result files actually present on disk, per node
dsh -g nodes -c -M "cd $files; ls | wc -l" | tee $log_h
# number of result files expected according to the cache files, per node
dsh -g nodes -c -M "cd $cache; grep -h \"node/\`hostname\`\" *| cut -d \";\" -f 1| sort| uniq|wc -l" | tee $log_s

sort $log_s > /tmp/foo
mv /tmp/foo $log_s
sort $log_h > /tmp/foo
mv /tmp/foo $log_h
check_complete_S5R3.sh
Does almost the same as check_complete_S5R1.sh, but the output format is slightly different.
#!/bin/sh

cache=/local/distributed/clone/spray/EatH/S5R3
files=/local/distributed/spray/data/EatH/S5R3
log=/root/log/check_compleateS5R3.log

# pick the cache files whose first line mentions this node (nodes above n0671
# substitute their partner, i.e. their node number minus 671), sum up their line
# counts and print this next to the number of result files actually on disk
dsh -g nodes -c -M "if [ \"\`hostname| sed \\\"s/n//\\\"| bc\`\" -gt 671 ]; then sn=\`echo \"\$HOSTNAME-671\"| sed \"s/n/1/\"| bc| sed \"s/^1/n/\"\` ;else sn=\$HOSTNAME ;fi;  cd $cache ;j=0;for i in \`head -n 1 *| grep -B1 \$sn|grep cach| cut -d \" \" -f 2\`;do let j=\$j+\`wc -l \$i| cut -d \" \" -f 1\`;done; r=\`ls $files | wc -l | cut -d \" \" -f 1\`; echo \$j-\$r"| tee $log

validate_S5R1.sh
Validates all result files on the nodes in parallel and reports any errors. /usr/local/bin/validator_test has to exist on the nodes; it is not part of the FAI installation.
#!/bin/sh

files=/local/distributed/spray/data/EatH/S5R1n
log=/root/log/validate_S5R1.log

dsh -g nodes -c -M  "cd $files;for i in \`ls \`;do /usr/local/bin/validator_test \$i || echo \$i;done 2>/dev/null"| tee $log
get_missing_files_S5R1.sh
It logs which result files are actually missing on a list of nodes.
#!/bin/sh

nodes=list
cache=/local/distributed/clone/spray/EatH/S5R1
files=/local/distributed/spray/data/EatH/S5R1n
log=/root/log/missingfiles_S5R1.log

# files that should be on this node according to the cache files
dsh -g $nodes -c -M "cd $cache; grep -h \"node/\`hostname\`\" *| cut -d \";\" -f 1| sort| uniq > /tmp/list1"
# files actually present on disk
dsh -g $nodes -c -M "cd $files; ls > /tmp/list2"
# report the files listed in the caches but missing on disk
dsh -g $nodes -c -M "cd /tmp; diff list1 list2| grep \"<\"| cut -d \" \" -f 2"| tee $log
