Cleaning your $HOME
Quick steps for those who have no time to spare
- Locate a directory which you don't need right now but which produced a lot of small Condor log files. For now, assume this directory is named ProjectA.
- Run
screen -d -m tar --remove-files -czf ProjectA.tar.gz ProjectA
(this will start screen in the background).
- That's all: once tar is done, the screen session will terminate. In case you want to attach to the screen session, run screen -r.
After a while all files will be in the tarball and the directory should vanish, saving precious space and making your home file server faster (a tiny bit at least).
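If you want to check whether the job is still running, or make sure the tarball is readable once it is done, a minimal sketch (assuming the archive is called ProjectA.tar.gz as above) could look like this:
# List running screen sessions; the archiving job is done once its session is gone
screen -ls
# After it has finished, check that the tarball can be read back without errors
tar -tzf ProjectA.tar.gz > /dev/null && echo "archive looks fine"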
Longer story
After working on several projects at the same time and over several months, it's inevitable that you leave behind a lot of files you don't really need anymore but want to keep "just in case". This, however, can cause quite a bit of pain for the admins, especially when moving a home file server to a new venue (which can take days, up to more than a week!).
To alleviate this problem, which is mostly caused by a very large number of small files, please consider putting your files into an "archive", e.g. run tar on a directory you don't need anymore. For many more details, look at the final section at the bottom for a rough idea about the potential costs/benefits.
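If you are not sure which directories are the worst offenders, a small sketch like the following (assuming bash and GNU find; the loop and its output format are just an illustration) lists the top-level directories in your $HOME with the most files:
# Count the files per top-level directory and show the largest counts first
for d in "$HOME"/*/; do
    printf '%10d %s\n' "$(find "$d" -type f | wc -l)" "$d"
done | sort -rn | head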
Compressing old directories
If you have such old analysis directories you don't want to lose, but you want to help to make the world a happier place, you can just run (assuming the directory of the old analysis is called oldbreakthru)
tar czf oldbreakthru.tar.gz oldbreakthru
and afterwards delete the directory with
rm -rf oldbreakthru
(but make sure to NOT delete the tar.gz file).
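To be on the safe side, you may want to check that the archive is readable before removing anything; a minimal sketch that only deletes the directory if tar can list the tarball without errors:
# Only remove the directory if the tarball can be read back completely
tar -tzf oldbreakthru.tar.gz > /dev/null && rm -rf oldbreakthru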
tar can do all the same in one go with the following command (now with long arguments; please have a look at tar's man page for more information and for how to use other compression algorithms):
tar --create --gzip --remove-files --file oldbreakthru.tar.gz oldbreakthru
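Should you need the files again later, the archive can be inspected or unpacked with the matching tar options, for example:
# Show what is inside the archive
tar --list --gzip --file oldbreakthru.tar.gz
# Unpack everything into the current directory again
tar --extract --gzip --file oldbreakthru.tar.gz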
If you happen to have one directory full of directories of old analyses, you can do the same for every directory there with a little bit of shell magic (using bash style; assuming all directories start with OLD):
for dir in OLD*; do tar --create --gzip --remove-files --file "$dir.tar.gz" "$dir"; done
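If some tarballs already exist, or if plain files also happen to match OLD*, a slightly more defensive variant of the same loop (again bash style; the skip logic is just a suggestion) only archives directories that have not been handled yet:
for dir in OLD*/; do
    dir=${dir%/}                      # the trailing slash in the glob ensures we only see directories
    [ -e "$dir.tar.gz" ] && continue  # skip directories that already have a tarball
    tar --create --gzip --remove-files --file "$dir.tar.gz" "$dir"
done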
Why should you, the admins, or just anyone care?
You might wonder why we (the admins) care about this so much, and the answer is simply performance and the costs involved. If you take the trial data set considered below, the "real" size of this data set is 729.57MB, which you get if you just add up all file sizes as shown by ls. However, this is only partially true. The underlying file system stores data in multiples of some block size, which speeds up access overall but wastes space. Thus, on a file system with 512 byte blocks this particular data set would use 736.34MB, on a Sun Thumper (s01..s13) with a block size of 32k it is already 906.75MB, and finally on the HSM (where $HOME is on a file system with qfs in the name) the block size is already at 256k, as the data is striped across many disk sets. Here, the same data set suddenly consumes 5443.5MB(!).
Thus, just by copying data from one server to another, we increase the data size by a factor of about 6. This is only due to rounding to the different block sizes governed by the underlying file systems.
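You can observe this effect yourself by comparing the sum of the plain file sizes with the space actually allocated on disk; assuming GNU du is available, the two numbers for a directory such as the ProjectA example above can be obtained with:
# Sum of the plain file sizes, i.e. what ls adds up to
du -sh --apparent-size ProjectA
# Space actually allocated by the file system (grows with the block size)
du -sh ProjectA
Since each file is rounded up to a whole number of blocks, the gap between the two numbers grows with the number of small files.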
Trial data set
The trial data set was involuntarily provided by an anonymous user (i.e. the user did not know about it ;)), contains 74484 files and has a total size of 929 MByte (mostly Condor logs). The following table shows the results obtained when compressing this data set into a tarball. To exclude file system related effects, the data set resided on a ramdisk and the tarball was also written to a ramdisk: