Sydney Clusters

Madsen Cluster

If you would like access to the Madsen cluster, send your public SSH key to Tim by email, Skype or IM (details here). The username for logging in to any machine in this cluster is cluster.

This cluster consists of 39 machines, each with a dual-core 3GHz CPU, 4GB RAM, and 7GB local disk space. Each machine is accessible via

$ ssh cluster@<machine name>.ug.it.usyd.edu.au

The cluster account has a folder ~/mnt which is network mounted via CIFS to an external account with more disk space (around 30GB). Consequently, any access to data within this folder is pretty slow, especially if you are reading and writing at the same time (extracting a tar file to and from somewhere in ~/mnt, for example). When working with data, you should therefore work in /tmp so that you get local disk access. This might involve copying files from ~/mnt, or extracting a tar file into /tmp.
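As a rough illustration of the workflow above, the following sketch extracts a .tgz archive from the network mount straight into a fresh directory under /tmp, so the slow mount is read once and all writes go to local disk. The function name extract_to_local and the archive name in the usage note are hypothetical, not part of the cluster setup.

```python
import tarfile
import tempfile
from pathlib import Path


def extract_to_local(archive, local_root="/tmp"):
    """Extract a gzipped tar archive into a new directory under local_root.

    The archive is read once from wherever it lives (for example ~/mnt),
    and every extracted file is written to the local disk.
    """
    archive = Path(archive).expanduser()
    # A fresh directory avoids clobbering other files in the shared /tmp.
    target = Path(tempfile.mkdtemp(prefix="work-", dir=local_root))
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(target)
    return target
```

For example, extract_to_local("~/mnt/data/example.tgz") would return the path of the new working directory, where example.tgz stands in for whichever archive you actually need.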

Since each machine has only a measly 7GB of local disk space, we will assign each person in our group one machine so that everyone has up to 7GB to work with. However, our distributed training and feature extraction runs require around 2.5GB of space on each node, which leaves about 4.5GB for personal use while a run is in progress. Before we start a new run on the cluster, an email will be sent out warning people to move anything they want kept from /tmp into somewhere in ~/mnt (preferably ~/mnt/personal/<your name>) if they are currently using more than 4.5GB of space on the local drive.
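To check whether you are over the 4.5GB mark before a run, something like the sketch below could be used. It simply sums the sizes of the files under a directory. The helper names are hypothetical, and treating 1GB as 1024**3 bytes is an assumption.

```python
import os

# 7GB local disk minus roughly 2.5GB needed by a cluster run.
# Assumes 1GB = 1024**3 bytes.
LIMIT_BYTES = 4.5 * 1024**3


def directory_size(root):
    """Return the total size in bytes of the regular files under root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # Skip symlinks so linked files are not counted.
                if not os.path.islink(path):
                    total += os.path.getsize(path)
            except OSError:
                # Files can vanish or be unreadable in a shared /tmp.
                continue
    return total


def over_limit(root="/tmp", limit=LIMIT_BYTES):
    """Return True if the files under root take up more than limit bytes."""
    return directory_size(root) > limit
```

If over_limit() returns True on your machine, move what you want to keep into ~/mnt before the run starts.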

Machine       Human
pc-lg32-10    Byung-Gyu
pc-lg32-11    James H
pc-lg32-12    Aurelie
pc-lg32-13    Yue
pc-lg32-14    Jessi
pc-lg32-15    Jono
pc-lg32-16    Tim
pc-lg32-17    Curt

The complete list of Madsen cluster machines in MPI machinefile format is in ~/mnt/madsen.hosts. Note that at the moment we are unable to ssh to pc-lg31-5, pc-lg31-6, pc-lg31-8, pc-lg31-9 and pc-lg32-15.
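When building a host list for a run, the unreachable machines need to be skipped. The sketch below filters the text of a machinefile, assuming the usual layout of one host per line, optionally followed by a slot count (host:N or host slots=N) and with # starting a comment. The exact contents of madsen.hosts are not shown here, so treat the parsing as an assumption. The function name usable_hosts is hypothetical.

```python
# Machines that currently cannot be reached via ssh (from the note above).
UNREACHABLE = {"pc-lg31-5", "pc-lg31-6", "pc-lg31-8", "pc-lg31-9", "pc-lg32-15"}


def usable_hosts(machinefile_text, unreachable=UNREACHABLE):
    """Return the hosts listed in a machinefile, minus the unreachable ones."""
    hosts = []
    for line in machinefile_text.splitlines():
        # Strip comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        # The host is the first field; drop any ":N" slot suffix.
        host = line.split()[0].split(":", 1)[0]
        # Compare on the short name so fully qualified names also match.
        short = host.split(".", 1)[0]
        if short not in unreachable:
            hosts.append(host)
    return hosts
```

The input text could be read with Path("~/mnt/madsen.hosts").expanduser().read_text() from pathlib.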

Current Usage

Go here:

http://www.ug.it.usyd.edu.au/~jkum0593/machines/usage.cgi

Select the machines or machine groups you are about to start using, then enter your name and hit submit. Once you are done, do the same to remove your name from the machines.

Data

Various versions of Wikipedia are stored in ~/mnt/data, all in .tgz form. At the moment you can find:

  • Raw Wikipedia text, in a folder with one file per page
  • CCGbank format parses
  • Grammatical Relations

Also, for convenience when doing further processing, there is a directory /tmp/data on each machine. The Wikipedia data has been divided into 78 roughly equal parts and stored here in the following forms:

  • Raw, tarred and gzipped
  • Tokenised, plain text, using the NLTK tokeniser mentioned elsewhere in the wiki
  • POS-tagged, plain text, using the candc-1.00 POS tagger

Also in /tmp/data is the models directory that is available from this website.

An example script that iterates through each machine+section pair and prints the first tokenised sentence can be found here.

Student Cluster

This cluster consists of various honours, PhD and postdoc students' personal desktop computers within the School of IT. There are 10 more machines with dual-core 3GHz CPUs, 4GB RAM, and around 120GB of free HDD space. It is preferable not to use these machines for heavy processing during the daytime (Sydney time), because they are students' personal desktop machines.

The complete list of Student cluster machines in MPI machinefile format is in ~/mnt/student.hosts.

Carslaw Cluster

This cluster does not currently exist, but it should eventually consist of 60 machines with exactly the same specs as the Madsen cluster machines. Hopefully it will exist sometime before the end of the workshop.

NLP Cluster

This is the "old" 8-node cluster used for all experiments before the start of the workshop. It is currently accessible only to Sydney people, but can be set up so that others can access it too if need be. This cluster has access to more resources, such as many corpora.