# Harvard RC server
The server uses the SLURM scheduler; see the official tutorial.
Log in to RC:

```bash
ssh {username}@login.rc.fas.harvard.edu
```
- Machine partition option (`-p`): use `cox`, `seas_dgx1`, or `gpu_requeue`.
- Check machine availability in a partition: use `showq -o -p <partition>` (example after this list).
- Script to find the best partition to use given your sbatch file: link
- Useful SLURM commands (see the examples after this list):
  - `squeue -u ${username}`: check job status
  - `scancel -u ${username}`: cancel all your jobs
  - `scancel ${jobid}`: cancel a specific job
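A few concrete invocations of the commands above, as a sketch; the partition name and job id are example values, and `$USER` is assumed to expand to your RC username:

```bash
# See what is free on a partition (here: cox).
showq -o -p cox

# List your own queued and running jobs.
squeue -u $USER

# Cancel one specific job, or everything you have queued.
scancel 12345678    # example job id
scancel -u $USER
```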
- `srun`: get an interactive bash shell on a machine for debugging (a wrapper-script sketch follows the two commands below). Parameters: `${1}` = memory in MB, `${2}` = number of CPUs, `${3}` = number of GPUs.

Request a CPU machine:

```bash
srun --pty -p cox -t 7-00:00 --mem ${1} -n ${2} /bin/bash
```

Request a GPU machine:

```bash
srun --pty -p cox -t 7-00:00 --mem ${1} -n ${2} --gres=gpu:${3} /bin/bash
```
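The `${1}`/`${2}`/`${3}` placeholders suggest these commands live in a small wrapper script. A minimal sketch of such a wrapper, assuming the file name `interactive.sh` (hypothetical) and the `cox` partition:

```bash
#!/bin/bash
# interactive.sh (hypothetical name): open an interactive shell on a node.
# Usage: ./interactive.sh <mem_MB> <num_cpus> <num_gpus>
# e.g.:  ./interactive.sh 16000 4 1
if [ "${3:-0}" -gt 0 ]; then
    # GPU request: add --gres for the desired number of GPUs.
    srun --pty -p cox -t 7-00:00 --mem "$1" -n "$2" --gres=gpu:"$3" /bin/bash
else
    # CPU-only request.
    srun --pty -p cox -t 7-00:00 --mem "$1" -n "$2" /bin/bash
fi
```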
- `sbatch`: submit a batch of jobs to run in the background.

```bash
#!/bin/bash
#SBATCH -n 1                        # Number of cores
#SBATCH -N 1                        # Ensure that all cores are on one machine
#SBATCH -t 2-00:00                  # Runtime in D-HH:MM, minimum of 10 minutes
#SBATCH -p cox                      # Partition to submit to
#SBATCH --gres=gpu:1                # Number of GPUs
#SBATCH --mem=16000                 # Memory pool for all cores (see also --mem-per-cpu)
#SBATCH -o OUTPUT_FILENAME_%j.out   # %j inserts jobid
#SBATCH -e OUTPUT_FILENAME_%j.err   # %j inserts jobid

YOUR COMMAND HERE .........
```
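A sketch of how the script above might be submitted and monitored, assuming it is saved as `job.sbatch` (the file name is arbitrary):

```bash
sbatch job.sbatch        # prints: Submitted batch job <jobid>
squeue -u $USER          # PD = pending, R = running
tail -f OUTPUT_FILENAME_<jobid>.out    # follow stdout once the job starts
```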
- `sbatch`: use Python to generate and submit jobs.

```python
import sys

opt = sys.argv[1]
Do = 'db/slurm/'  # folder where the generated .sh scripts are written

def get_pref(mem=10000, do_gpu=False):
    # Build the common #SBATCH header shared by all generated scripts.
    pref = '#!/bin/bash\n'
    pref += '#SBATCH -N 1  # number of nodes\n'
    pref += '#SBATCH -p cox\n'
    pref += '#SBATCH -n 1  # number of cores\n'
    pref += '#SBATCH --mem ' + str(mem) + '  # memory pool for all cores\n'
    if do_gpu:
        pref += '#SBATCH --gres=gpu:1  # number of GPUs\n'
    pref += '#SBATCH -t 4-00:00  # time (D-HH:MM)\n'
    return pref

cmd = []
mem = 10000
do_gpu = False
if opt == '0':
    fn = 'aff'  # output file name
    suf = ' \n'
    num = 25
    cn = 'classify4-jwr_20um.py'
    cmd += ['source activate pipeline \n']  # activate your conda env
    cmd += ['python ' + cn + ' %d ' + str(num) + suf]

    pref = get_pref(mem, do_gpu) + """
#SBATCH -o """ + Do + """slurm.%N.%j.out  # STDOUT
#SBATCH -e """ + Do + """slurm.%N.%j.err  # STDERR
"""
    for i in range(num):
        a = open(Do + fn + '_%d.sh' % i, 'w')
        a.write(pref)
        for cc in cmd:
            # Substitute the job index into commands with a '%' placeholder.
            if '%' in cc:
                a.write(cc % i)
            else:
                a.write(cc)
        a.close()

    # One-liner to paste into bash to submit all generated jobs.
    print(('for i in {0..%d}; do sbatch ' + Do
           + '%s_${i}.sh && sleep 1; done') % (num - 1, fn))
```
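A sketch of how the generator above is used, assuming it is saved as `submit.py` (hypothetical name); the last line it prints is the submission loop to paste into bash:

```bash
mkdir -p db/slurm      # the script writes its .sh files into this folder
python submit.py 0     # generates db/slurm/aff_0.sh ... db/slurm/aff_24.sh

# Paste the printed loop to submit everything, e.g.:
for i in {0..24}; do sbatch db/slurm/aff_${i}.sh && sleep 1; done
```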
- SSH tunnel for port forwarding (e.g. tensorboard display). Parameters:
  - `p1`: port number you want to use on localhost
  - `p2`: port number on the RC login server
  - `p3`: port number on the RC compute server (6006 for tensorboard)
  - `m1`: compute server name, e.g. `coxgpu06`

Local machine -> RC login server:

```bash
ssh -L p1:localhost:p2 xx@login.rc.fas.harvard.edu
```

RC login server -> RC compute server:

```bash
ssh -L p2:localhost:p3 m1
```
On the RC compute server:

```bash
tensorboard --logdir OUTPUT_FOLDER
```
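Putting the three steps together with example values (port 16006 for `p1` and `p2`, tensorboard's default 6006 for `p3`, and `coxgpu06` for `m1`; all of these are placeholders):

```bash
# On your local machine: forward local port 16006 to the login server.
ssh -L 16006:localhost:16006 {username}@login.rc.fas.harvard.edu

# On the login server: forward that port on to tensorboard on coxgpu06.
ssh -L 16006:localhost:6006 coxgpu06

# On coxgpu06: start tensorboard (it listens on 6006 by default).
tensorboard --logdir OUTPUT_FOLDER

# Then open http://localhost:16006 in your local browser.
```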
Load CUDA on the RC cluster:

```bash
module load cuda/9.0-fasrc02 cudnn/7.0_cuda9.0-fasrc01
```
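A quick sanity check after loading the modules, as a sketch; `nvidia-smi` is only meaningful on a GPU node:

```bash
module list       # confirm cuda/9.0-fasrc02 and cudnn are loaded
nvcc --version    # should report the CUDA 9.0 toolkit
nvidia-smi        # on a GPU node: driver version and visible GPUs
```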