Harvard RC server
=======================
The cluster uses the SLURM scheduler; see the `official tutorial `_.
- `User ID for Harvard RC `_
- log in to RC

  .. code-block:: none

      ssh {username}@login.rc.fas.harvard.edu
- machine partition option (``-p``): use ``cox``, ``seas_dgx1``, or ``gpu_requeue``
- check machine availability in a partition: ``showq -o -p {partition}``, e.g. ``showq -o -p cox``
- script to find the best partition to use given your sbatch file: `link `_
- useful ``slurm`` commands

  - ``squeue -u ${username}``: check job status
  - ``scancel -u ${username}``: cancel all your jobs
  - ``scancel ${jobid}``: cancel a specific job
  - ``srun``: get an interactive bash shell on a machine for debugging

    - parameters: ``${1}`` = memory in MB, ``${2}`` = # of CPUs, ``${3}`` = # of GPUs
    - request a CPU machine::

        srun --pty -p cox -t 7-00:00 --mem ${1} -n ${2} /bin/bash

    - request a GPU machine::

        srun --pty -p cox -t 7-00:00 --mem ${1} -n ${2} --gres=gpu:${3} /bin/bash
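    - for example, an interactive session with 16 GB of memory, 4 cores, and 1 GPU (illustrative values, adjust to your job)::

        srun --pty -p cox -t 7-00:00 --mem 16000 -n 4 --gres=gpu:1 /bin/bash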
  - ``sbatch``: submit a batch of jobs to run in the background

    .. code-block:: none

        #!/bin/bash
        #SBATCH -n 1                       # Number of cores
        #SBATCH -N 1                       # Ensure that all cores are on one machine
        #SBATCH -t 2-00:00                 # Runtime in D-HH:MM, minimum of 10 minutes
        #SBATCH -p cox                     # Partition to submit to
        #SBATCH --gres=gpu:1               # Number of GPUs
        #SBATCH --mem=16000                # Memory pool for all cores (see also --mem-per-cpu)
        #SBATCH -o OUTPUT_FILENAME_%j.out  # %j inserts jobid
        #SBATCH -e OUTPUT_FILENAME_%j.err  # %j inserts jobid

        YOUR COMMAND HERE
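    If the script above is saved as, say, ``job.sbatch`` (hypothetical filename), submit it and check on it with::

        sbatch job.sbatch
        squeue -u ${username}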
  - ``sbatch``: use Python to generate and submit many jobs

    .. code-block:: python

        import sys

        opt = sys.argv[1]   # which job set to generate
        Do = 'db/slurm/'    # output folder for the generated scripts

        def get_pref(mem=10000, do_gpu=False):
            # common #SBATCH header shared by all generated scripts
            pref = '#!/bin/bash\n'
            pref += '#SBATCH -N 1 # number of nodes\n'
            pref += '#SBATCH -p cox\n'
            pref += '#SBATCH -n 1 # number of cores\n'
            pref += '#SBATCH --mem ' + str(mem) + ' # memory pool for all cores\n'
            if do_gpu:
                pref += '#SBATCH --gres=gpu:1 # number of GPUs\n'
            pref += '#SBATCH -t 4-00:00 # time (D-HH:MM)\n'
            return pref

        cmd = []
        mem = 10000
        do_gpu = False
        if opt == '0':  # job set 0
            fn = 'aff'  # output file name
            suf = ' \n'
            num = 25
            cn = 'classify4-jwr_20um.py'
            cmd += ['source activate pipeline \n']  # activate your conda env
            cmd += ['python ' + cn + ' %d ' + str(num) + suf]

        pref = get_pref(mem, do_gpu) + """
        #SBATCH -o """ + Do + """slurm.%N.%j.out # STDOUT
        #SBATCH -e """ + Do + """slurm.%N.%j.err # STDERR
        """

        # write one job script per task id
        for i in range(num):
            a = open(Do + fn + '_%d.sh' % (i), 'w')
            a.write(pref)
            for cc in cmd:
                if '%' in cc:
                    a.write(cc % i)
                else:
                    a.write(cc)
            a.close()

        # command to run in bash to submit all generated scripts
        print(('for i in {0..%d};do sbatch ' + Do + '%s_${i}.sh && sleep 1;done') % (num - 1, fn))
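    If this generator is saved as, e.g., ``submit_jobs.py`` (hypothetical filename), ``python submit_jobs.py 0`` writes 25 job scripts into ``db/slurm/`` and prints the submission loop, which for this configuration is::

        for i in {0..24};do sbatch db/slurm/aff_${i}.sh && sleep 1;done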
- ssh tunnel for port forwarding (e.g. tensorboard display); a worked example follows the list

  - parameters

    - p1: port number you want to use on localhost
    - p2: port number on the RC login server
    - p3: port number on the RC compute server (6006 for tensorboard)
    - m1: compute server name, e.g. ``coxgpu06``

  - local machine -> RC login server::

      ssh -L p1:localhost:p2 {username}@login.rc.fas.harvard.edu

  - RC login server -> RC compute server::

      ssh -L p2:localhost:p3 m1

  - on the RC compute server::

      tensorboard --logdir OUTPUT_FOLDER
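  Putting the three hops together with concrete values (8899 is an arbitrary free port chosen for illustration; ``coxgpu06`` is the example compute node above)::

      # on your local machine
      ssh -L 8899:localhost:8899 {username}@login.rc.fas.harvard.edu
      # on the login node
      ssh -L 8899:localhost:6006 coxgpu06
      # on the compute node
      tensorboard --logdir OUTPUT_FOLDER
      # then open http://localhost:8899 in a local browser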
- load CUDA on the RC cluster::

      module load cuda/9.0-fasrc02 cudnn/7.0_cuda9.0-fasrc01
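  To confirm the toolkit is on your path after loading the module (``nvidia-smi`` only reports GPUs when run on a GPU node)::

      nvcc --version
      nvidia-smi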
- `Harvard VPN `_