A Cloud-Hating Curmudgeon’s Unofficial Manual for a Grid-less Workflow on UCLA’s Hoffman2 Cluster
Gabriel
A couple years ago UCLA’s pop center migrated our statistical computing from our own server to the university’s Hoffman2 cluster. When this happened I tried out the cluster and hated the recommended “Grid” browser-based GUI, with the single biggest aggravation being that it requires you to transfer files one at a time through a clunky upload/download wizard. As such, I paid for my own Stata MP license (which even as part of a lab volume purchase wasn’t cheap) and since the migration I’ve just done all my statistics locally on my MacBook.
I’ve recently given Hoffman2 another try and realized that I can just ignore “Grid” and do my regular workflow when dealing with a server:
- write code with a good local text editor (preferably one that is SFTP compatible)
- sync scripts, data, and output between the local and remote file systems with an SFTP client
- batch jobs on the server through SSH
- (as a last resort) run GUI apps through X11
Pretty plain vanilla stuff but it’s actually much simpler in practice than a (broken) browser-based GUI.
Now that I’ve gotten this worked out I’m a big fan of Hoffman2 for big jobs because it’s extremely fast. For instance, a simulation that takes Stata MP about seven hours on my MacBook took just an hour and twenty minutes on Hoffman2. As such I’m writing up some notes on how I use it, in part so I remember and in part so I can recommend the cluster to colleagues and students.
File management. Use a dedicated SFTP client like FileZilla or Cyberduck. (For some reason the Finder/Pathfinder “Connect to Server” command doesn’t work with Hoffman2.) The connection type should be “SFTP”. The URL is “hoffman2.idre.ucla.edu”. Your username and password are the same ones you use for an SSH terminal session (or as the documentation calls it, a “node” session). Use your SFTP client to upload data and scripts (which you will probably write locally in a text editor) and download output. Here’s what my configuration window looks like in Cyberduck.
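If you would rather stay in the terminal, the same upload/download cycle can probably be scripted with the stock OpenSSH “sftp” command in batch mode. The sketch below only writes the batch file and prints the command to run. The remote folder “project”, the file names, and “username” are all made-up placeholders, and batch mode generally assumes you have key-based (non-interactive) login set up, which I have not confirmed for Hoffman2.

```shell
#!/bin/sh
# Sketch: build an sftp batch file that uploads a script and data,
# then fetches the log. Every file and folder name here is hypothetical.
REMOTE_USER="username"                 # placeholder: your Hoffman2 login
REMOTE_HOST="hoffman2.idre.ucla.edu"   # host name given in the post
BATCH_FILE="hoffman2_sync.sftp"

cat > "$BATCH_FILE" <<'EOF'
cd project
put foo.do
put mydata.dta
-get foo.log
EOF
# The leading "-" on the last command tells sftp not to abort the batch
# if foo.log does not exist yet (for example, before the job has run).

# Print the command instead of running it, so nothing connects by accident.
echo "sftp -b $BATCH_FILE $REMOTE_USER@$REMOTE_HOST"
```

Once the printed command looks right, paste it into your terminal to run the transfer.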
Coding. Either do this locally and sync it through SFTP (see above) or use a text editor with integrated SFTP. On a Mac, TextWrangler/BBEdit has great SFTP support (in addition to other notable features such as really good regular expressions support and Stata syntax highlighting). I can also recommend the cross-platform program Komodo Edit. Or if you’re into that sort of thing you can use Vim or emacs through SSH.
Connecting to SSH. Open your “Terminal” (on Mac/Linux) or an SSH client (on Windows). Type “ssh hoffman2.idre.ucla.edu”. If it didn’t guess your username correctly you need to write “ssh hoffman2.idre.ucla.edu -l username”. You now have a bash session. You can do all the usual stuff, but mostly you’re just going to batch jobs.
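To save retyping the host name and the “-l username” flag, OpenSSH lets you define a short alias in “~/.ssh/config”. A minimal sketch, where the alias “h2” and the login “username” are placeholders of my own choosing rather than anything Hoffman2 requires. It writes the stanza to a scratch file so you can look it over before touching your real config.

```shell
#!/bin/sh
# Sketch: write an SSH config stanza to a scratch file for review.
# "h2" and "username" are hypothetical; substitute your own.
SNIPPET="hoffman2_ssh_config.txt"

cat > "$SNIPPET" <<'EOF'
Host h2
    HostName hoffman2.idre.ucla.edu
    User username
EOF

cat "$SNIPPET"
# After checking it, append it to your real config yourself with:
#   cat hoffman2_ssh_config.txt >> ~/.ssh/config
```

With that stanza in place, “ssh h2” should be equivalent to “ssh hoffman2.idre.ucla.edu -l username”, and “sftp h2” should pick up the same settings.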
Batching a Job. If you just want to put a job in the queue you simply type “program.q script”. For instance, to do the Stata script “foo.do” you’d make sure you’re in the right directory and type:
stata.q foo.do
The documentation makes it sound much more complicated than this, but 9 times out of 10 that’s all you need to do. The system will email you when your job starts and finishes and you can use SFTP to retrieve the output and log. However if you want to kill a job or something, you just type program.q without arguments and then follow the instructions.
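Since a typo in the file name or being in the wrong directory is the easiest way to waste a queue slot, a small wrapper can check that the do-file exists before submitting it. This is only a sketch: the function name “submit_do” and the “QUEUE_CMD” variable are my own inventions, and it assumes “stata.q” accepts the do-file as its only argument, as described above.

```shell
#!/bin/sh
# Sketch: refuse to submit a do-file that is not in the current directory.
# QUEUE_CMD defaults to stata.q; set QUEUE_CMD=echo for a dry run
# on a machine that does not have the cluster's queue scripts.
QUEUE_CMD="${QUEUE_CMD:-stata.q}"

submit_do() {
    dofile="$1"
    if [ ! -f "$dofile" ]; then
        echo "no such do-file: $dofile" >&2
        return 1
    fi
    # Hand the script to the queue command, e.g. "stata.q foo.do".
    "$QUEUE_CMD" "$dofile"
}
```

After sourcing this in your session on the cluster, “submit_do foo.do” would either run “stata.q foo.do” or complain that the file is missing.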
Importing your Stata ado-files
Unlike R (where you have to put “library()” at the start of your source files), Stata’s use of libraries is so transparent that you can forget they’re not part of the stock Stata installation. (My first batch crashed twice because I forgot to install some of my commands). On your own computer, remind yourself what ado-files you have installed with these Stata commands.
disp "`c(sysdir_plus)'"
disp "`c(sysdir_personal)'"
On a Mac, both of these folders are in “~/Library/Application Support/Stata/ado”
Once you remember what ado-files you want, write yourself a do-file that will install them and batch it. For instance, I did:
ssc install fs
ssc install fsx
ssc install gllamm
ssc install estout
ssc install stata2pajek
ssc install shufflevar
The ado files go in “~/ado” which has the practical upshot that you don’t need admin permission to install them and they persist between sessions.
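If your list of ado-files is long, you could generate the install do-file from a plain list of package names rather than typing each line. A rough sketch, using the package names from my own list above. The output file name “install_ados.do” is arbitrary.

```shell
#!/bin/sh
# Sketch: turn a space-separated list of SSC package names into a do-file.
PACKAGES="fs fsx gllamm estout stata2pajek shufflevar"
DOFILE="install_ados.do"   # hypothetical file name

: > "$DOFILE"              # create the file, or empty it if it exists
for pkg in $PACKAGES; do
    echo "ssc install $pkg" >> "$DOFILE"
done

cat "$DOFILE"
```

Upload the resulting file and batch it like any other do-file.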
Interactive GUI Usage. Do it on your own computer. If that’s not possible (perhaps because you don’t have a personal license for a particular piece of software) use X11 rather than Grid. When I experimented with Grid’s browser-based VNC session it took forever to load the Java Virtual Machine, it refreshed at about 10 frames per second, and worst of all it wouldn’t capture keyboard input.
The results are much better if you use a real X11 client rather than Grid’s JVM. To do this you first connect through X11 (on a Mac this means using X11.app rather than Terminal.app) and add the flag “-X” to your ssh session (e.g., “ssh hoffman2.idre.ucla.edu -l rossman -X”). As always you can test it with the “xeyes” command. You then type “xstata” and follow the instructions carefully. (It bounces you to an interactive node and makes you type back a fairly lengthy command to actually launch the session.) It’s a pretty fair amount of work to get an X11 session but unlike the browser version it is usable. (For more instructions on X11 sessions for Stata and other software see the links labeled “How to run on ATS-Hosted Clusters” in this table.) Try to avoid this though as it’s faster and less work to just script and batch it with the “stata.q” command described above.
Finally, if you just want an interactive command-line session you can use ssh and issue the “qrsh” command. This actually works really well. Remember that you don’t need to see a graph to make a graph but can use the “graph export” command in Stata and the “pdf()” function in R to write graphs to disk and then retrieve them through your SFTP client.
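To make the “graph export” idea concrete, here is a sketch that writes a tiny do-file which draws a scatterplot from Stata’s bundled “auto” dataset and saves it to a file without ever displaying it. The file names are hypothetical, and I’ve picked the EPS format on the assumption that it is the safest choice for a Stata session with no display. Other formats may or may not be available there.

```shell
#!/bin/sh
# Sketch: write a do-file that saves a graph to disk instead of showing it.
# File names are hypothetical; EPS is chosen as a conservative format.
DOFILE="plot_batch.do"

cat > "$DOFILE" <<'EOF'
sysuse auto, clear
scatter mpg weight
graph export "mpg_weight.eps", replace
EOF

cat "$DOFILE"
```

Batch the do-file on the cluster and then download “mpg_weight.eps” the same way you retrieve any other output.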