Harvard RC server
The cluster uses the Slurm scheduler (see the official tutorial).
Login into RC:
ssh {username}@login.rc.fas.harvard.edu
Machine partitions option (-p): use cox, seas_dgx1, or gpu_requeue
Check machine availability in a partition:
showq -o -p <partition>
Script to find the best partition to use given your sbatch file: link
Useful slurm commands:
squeue -u ${username}: check job status
scancel -u ${username}: cancel all your jobs
scancel ${jobid}: cancel a specific job
srun: get an interactive bash shell on a machine for debugging
Parameters: ${1} = memory in MB, ${2} = # of CPUs, ${3} = # of GPUs
Request CPU machines:
srun --pty -p cox -t 7-00:00 --mem ${1} -n ${2} /bin/bash
Request GPU machines:
srun --pty -p cox -t 7-00:00 --mem ${1} -n ${2} --gres=gpu:${3} /bin/bash
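For example, filling in the three parameters above (16 GB memory, 4 CPUs, 1 GPU; these values are illustrative, not recommendations). The command is echoed here so you can check the flags before running it on the cluster:

```shell
# Illustrative values for the placeholders ${1}, ${2}, ${3}:
MEM=16000; CPUS=4; GPUS=1
# Print the fully expanded srun command:
echo srun --pty -p cox -t 7-00:00 --mem ${MEM} -n ${CPUS} --gres=gpu:${GPUS} /bin/bash
```

Remove the leading echo to actually request the interactive session.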
sbatch: submit a batch of jobs to run in the background

```shell
#!/bin/bash
#SBATCH -n 1                        # Number of cores
#SBATCH -N 1                        # Ensure that all cores are on one machine
#SBATCH -t 2-00:00                  # Runtime in D-HH:MM, minimum of 10 minutes
#SBATCH -p cox                      # Partition to submit to
#SBATCH --gres=gpu:1                # Number of GPUs
#SBATCH --mem=16000                 # Memory pool for all cores (see also --mem-per-cpu)
#SBATCH -o OUTPUT_FILENAME_%j.out   # %j inserts jobid
#SBATCH -e OUTPUT_FILENAME_%j.err   # %j inserts jobid

YOUR COMMAND HERE .........
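Once the script is saved to a file, submit it with sbatch; a minimal sketch, assuming the script was saved as my_job.sh (a hypothetical filename):

```shell
# Submit the job; sbatch prints "Submitted batch job <jobid>"
sbatch my_job.sh
# Check that the job appears in the queue:
squeue -u ${USER}
# Once it starts, follow its stdout (%j in the -o option above becomes the jobid):
tail -f OUTPUT_FILENAME_<jobid>.out
```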
sbatch: use python to generate/submit jobs

```python
import sys

opt = sys.argv[1]
Do = 'db/slurm/'  # output folder for the generated scripts

def get_pref(mem=10000, do_gpu=False):
    # build the common #SBATCH header
    pref = '#!/bin/bash\n'
    pref += '#SBATCH -N 1 # number of nodes\n'
    pref += '#SBATCH -p cox\n'
    pref += '#SBATCH -n 1 # number of cores\n'
    pref += '#SBATCH --mem ' + str(mem) + ' # memory pool for all cores\n'
    if do_gpu:
        pref += '#SBATCH --gres=gpu:1 # number of GPUs\n'
    pref += '#SBATCH -t 4-00:00 # time (D-HH:MM)\n'
    return pref

cmd = []
mem = 10000
do_gpu = False
if opt == '0':
    fn = 'aff'  # output file name
    suf = ' \n'
    num = 25
    cn = 'classify4-jwr_20um.py'
    cmd += ['source activate pipeline \n']  # activate your conda env
    cmd += ['python ' + cn + ' %d ' + str(num) + suf]

pref = get_pref(mem, do_gpu) + """
#SBATCH -o """ + Do + """slurm.%N.%j.out # STDOUT
#SBATCH -e """ + Do + """slurm.%N.%j.err # STDERR
"""

for i in range(num):
    a = open(Do + fn + '_%d.sh' % i, 'w')
    a.write(pref)
    for cc in cmd:
        # substitute the job index into commands that take it
        a.write(cc % i if '%' in cc else cc)
    a.close()

# print the bash loop that submits the generated scripts
print(('for i in {0..%d};do sbatch ' + Do + '%s_${i}.sh && sleep 1;done') % (num - 1, fn))
```
ssh tunnel for port forwarding (e.g. tensorboard display)
Parameters:
p1: port number you want to display on localhost
p2: port number on RC login server
p3: port number on RC compute server (6006 for tensorboard)
m1: server name, e.g. coxgpu06
Local machine -> RC login server:
ssh -L p1:localhost:p2 xx@login.rc.fas.harvard.edu
RC login server -> RC compute server:
ssh -L p2:localhost:p3 m1
On the RC compute server:
tensorboard --logdir OUTPUT_FOLDER
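Putting the three steps together with concrete values (the port numbers 16006 and 8899 are illustrative; coxgpu06 is the example server name from above):

```shell
# 1. On your local machine (p1=16006, p2=8899):
ssh -L 16006:localhost:8899 xx@login.rc.fas.harvard.edu
# 2. On the RC login server (p2=8899, p3=6006, m1=coxgpu06):
ssh -L 8899:localhost:6006 coxgpu06
# 3. On the compute server, start tensorboard (it serves on port 6006 by default):
tensorboard --logdir OUTPUT_FOLDER
# 4. Open http://localhost:16006 in your local browser.
```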
Load CUDA on the RC cluster:
module load cuda/9.0-fasrc02 cudnn/7.0_cuda9.0-fasrc01