
RCAC clusters

    Servers Available

    You need an account.  This takes only seconds to set up (once I find the web page).  See Gribskov.

    Server Name             Scratch     Queue Name   Resource     Nodes  Cores/node        Cores  Memory/Node
    (.rcac.purdue.edu)      Filesystem
    ----------------------  ----------  -----------  -----------  -----  ----------------  -----  -----------
    conte                   lustreD     not avail    Conte        0
    carter                  lustreA     mgribsko-c   Carter-C     4      16 (8 dual core)  64     256G
    scholar (Agry600 only)  lustreA     scholar      Carter-B     8      16                128    64G
    hansen                  lustreC     mgribsko     Hansen-B     1      48                48     192G
    rossmann                lustreA     mgribsko-b   Rossmann-B   2      24                48     96G
    rossmann                lustreA     mgribsko-d   Rossmann-D   3      24                72     192G
    coates                  lustreA     mgribsko     Coates-E     8      16                128    128G

    General information

    • RCAC user information
    • Perl - the generic RCAC perl installation is very old, 5.8.8.  Newer installations are in /apps/groups/bioinformatics/apps (5.14.2 and 5.16.0).  You may want to make an alias for this (see the sketch below), since the location is subject to change and you do not want to rewrite the shebang line every time it changes.
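
    For example, a minimal sketch of an alias you could put in ~/.bashrc; the perl-5.16.0 subdirectory name below is an assumption, so check the actual directory first:

    $ ls /apps/groups/bioinformatics/apps                                      # find the current perl build
    $ alias newperl='/apps/groups/bioinformatics/apps/perl-5.16.0/bin/perl'    # put this line in ~/.bashrc
    $ newperl myscript.pl                                                      # run a script with the newer perl, no shebang edits needed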

    Getting Started

    All the RCAC systems run a Linux operating system; the exact revision depends on the server.  In July 2014 the current OS versions were Red Hat Enterprise Linux Server release 5.10 (Tikanga), aka RHEL5.10, on Coates, and RHEL6.5 (Santiago) on Hansen, Rossmann, and Carter.  You will probably never need to know this.  If you need to know the current version, check

    cat /etc/redhat-release

    When you log onto any one of the RCAC servers, you will be logged into a frontend node, also called a head node, not a compute node.  The frontend nodes are shared by all users of the system and are not for performing heavy computation; they are intended for editing files, copying files, saving and restoring files from Fortress, and submitting batch jobs.  If you try to run large jobs on the frontend nodes you affect everyone else who is logged in, and your job will generally fail.  The frontend nodes have a cpu time limit (30 min), so for any extensive work you will want to either run a batch job or start an interactive session on a compute node.

    The amount of space available in your home directory is also limited, so you will generally need to run big jobs using the scratch file system.  Note that the various clusters use different scratch file systems, with different limits on space and the number of files allowed.  All scratch file systems are available from the frontends, but usually only one is available from a compute node.  To find out what scratch file systems are available from the frontends, type myquota.  The scratch file system for each cluster is shown in the table above.  Files cannot remain on the scratch file system indefinitely, so learn how to transfer your files to Fortress for long-term storage (see the Fortress section below).
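
    For example, a minimal sketch of checking your quotas and moving onto scratch; the path below follows the /scratch/<cluster>/<initial>/<username> layout seen in the qstat -f output later on this page, but confirm your own path from the myquota output:

    $ myquota                           # show quota and usage for home and the scratch filesystems
    $ cd /scratch/carter/m/mgribsko     # example Carter scratch directory; substitute your own
    $ mkdir myproject && cd myproject   # work on large data here rather than in your home directory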

    Unix tutorials

     

    The following books are available on Purdue Safari

    • Unix in a Nutshell
    • Sams Teach Yourself Unix in 24 Hours
    • Unix Power Tools
    • Learning the Unix Operating System
    • A Practical Guide to Linux® Commands, Editors, and Shell Programming
    • Learning the bash Shell
    • Using csh & tcsh
    • Linux Pocket Guide

    Interactive sessions

    Interactive sessions can be obtained with a qsub command such as 

    % qsub -I -l nodes=1:ppn=1,walltime=8:00:00 -q mgribsko
    

    Note that the -I indicates an interactive session.  Use the qlist command to see what queues (the -q option in the qsub command above) are available on the server.  While you will generally need only a single processor, you can request more processors per node (ppn) or more nodes if you need more memory or are running threaded or parallel programs.  Using more processors does not automatically make your program run faster; understand what you are doing.

    % qlist
                              Current Number of Cores
    Queue                 Total    Queue      Run     Free      Max Walltime
    ===============    ====================================    ==============
    mgribsko                 48        0        8       40         720:00:00
    standby               8,928        0      624    5,690           4:00:00
    

    If you do not specify either a queue or a time limit, you will probably get a 4 hour session on the standby queue (this is good!  it means you are running on someone else's compute nodes).  If you run on one of our queues, which are typically called something like mgribsko, mgribsko-a, mgribsko-x, you can request up to 720 hours (one month) of walltime.
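
    For example (a sketch; pick the walltime and core count your job actually needs):

    % qsub -I -l nodes=1:ppn=1,walltime=4:00:00 -q standby      # short session on idle nodes, 4 hour limit
    % qsub -I -l nodes=1:ppn=4,walltime=72:00:00 -q mgribsko    # longer session on one of our queues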

    Batch Jobs

    Most larger jobs on the RCAC clusters should be run as batch jobs.  This means that you create a command file, i.e., a script, and submit it using the qsub command.  The script can be a shell script, or a perl or python script.  Commands for the PBS/Torque system can be embedded in the script as comments; for me this is usually more convenient than putting them on the qsub command line.  The Torque directives to request a queue, nodes, processors, and walltime look like this:

    #PBS -q mgribsko
    #PBS -l nodes=1:ppn=1 
    #PBS -l walltime=24:00:00
    

    A number of useful symbols are predefined in the PBS/Torque environment including

    PBS_ENVIRONMENT   This is set to PBS_BATCH for batch jobs and to PBS_INTERACTIVE for interactive jobs.
    PBS_O_HOST        The host machine on which the qsub command was run. 
    PBS_O_WORKDIR     The working directory from which the qsub was run. 
    PBS_O_LOGNAME     The login name on the machine on which the qsub was run. 
    PBS_O_HOME        The home directory from which the qsub was run. 
    PBS_O_QUEUE       The original queue to which the job was submitted. 
    PBS_JOBID         The identifier that PBS assigns to the job. 
    PBS_JOBNAME       The name of the job. 
    PBS_NODEFILE      The file containing the list of nodes assigned to a parallel job. 

    $PBS_O_WORKDIR is particularly useful for moving to the directory where the job was submitted, since your job files will typically be in the directory where you were working.  $PBS_NODEFILE can be used to obtain a list of the nodes that your job has been allocated, which is useful for writing parallel programs.  $PBS_ENVIRONMENT allows scripts to determine whether they are running as a normal UNIX job or as a PBS job, making it possible to have a single script for both environments.  For a brief summary of PBS/Torque commands, see the RCAC user information linked above.
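
    Putting these together, here is a minimal sketch of a complete batch script; the queue, walltime, job name, and the ./myprogram command are placeholders to replace with your own.

    #!/bin/bash
    #PBS -q mgribsko
    #PBS -l nodes=1:ppn=1
    #PBS -l walltime=24:00:00
    #PBS -N example_job

    # move to the directory from which qsub was run
    cd $PBS_O_WORKDIR

    # optional: behave differently when running under PBS
    if [ "$PBS_ENVIRONMENT" = "PBS_BATCH" ]; then
        echo "batch job running on $(hostname)"
    fi

    # replace with your actual command(s)
    ./myprogram > myprogram.log

    Submit the script with, e.g., qsub example.job; the stdout and stderr files are returned to the submission directory when the job finishes.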

    Checking jobs that are running

    Many times you will want to check how many jobs are currently running in order to decide which cluster and queue you want to submit your job to.  Some of the commands you can use are

    • qlist - shows the number of free, queued, and running cpus (cores) on each queue
      > qlist
                                Current Number of Cores
      Queue                 Total    Queue      Run     Free      Max Walltime
      ===============    ====================================    ==============
      mgribsko                 48       21       48        0         720:00:00
      standby               9,120      357    1,936    3,148           4:00:00
    • qstat -q <queue_name> - shows the jobs on a particular queue
      > qstat -q mgribsko
      
      server: hansen-adm.rcac.purdue.edu
      
      Queue            Memory CPU Time Walltime Node  Run Que Lm  State
      ---------------- ------ -------- -------- ----  --- --- --  -----
      mgribsko           --      --    720:00:0   --   37  21 --   E R
                                                     ----- -----
                                                        37    21
    • qstat -u <user_name> - shows your jobs on all queues (on the particular cluster)
      > qstat -u mgribsko
      
      hansen-adm.rcac.purdue.edu: 
                                                                                     Req'd    Req'd      Elap
      Job ID               Username    Queue    Jobname          SessID NDS   TSK    Memory   Time   S   Time
      -------------------- ----------- -------- ---------------- ------ ----- ------ ------ -------- - --------
      4535066.hansen-a     mgribsko    mgribsko STDIN              4532     1      1    --  24:00:00 R 00:15:08
      4535067.hansen-a     mgribsko    mgribsko STDIN              4703     1      1    --  24:00:00 R 00:15:07
      4535109.hansen-a     mgribsko    mgribsko STDIN               --      1      1    --  24:00:00 Q      -- 
      4535110.hansen-a     mgribsko    mgribsko STDIN               --      1      1    --  24:00:00 Q      -- 
    • qstat -a [optional_queue_name] - shows all jobs on all queues or a particular queue (on a particular cluster)
      > qstat -a mgribsko
      
      hansen-adm.rcac.purdue.edu: 
                                                                                     Req'd    Req'd      Elap
      Job ID               Username    Queue    Jobname          SessID NDS   TSK    Memory   Time   S   Time
      -------------------- ----------- -------- ---------------- ------ ----- ------ ------ -------- - --------
      4386055.hansen-a     jhengeni    mgribsko STDIN              8906     1     12    --  700:00:0 R 354:56:4
      4535066.hansen-a     mgribsko    mgribsko STDIN              4532     1      1    --  24:00:00 R 00:15:08
      4535067.hansen-a     mgribsko    mgribsko STDIN              4703     1      1    --  24:00:00 R 00:15:07

    If you suspect something is not working correctly with your job, it is sometimes useful to log in directly to the compute node the job is running on.  You can discover which compute node using qstat -f (see the exec_host field in the output below).  The job below is running on carter-a548, so I can log in using ssh carter-a548.  Once there, the top command gives useful information about your job.

    qstat -f scholar
    Job Id: 5540928.carter-adm.rcac.purdue.edu
        Job_Name = abyss119_monascus
        Job_Owner = mgribsko@carter-fe01.rcac.purdue.edu
        resources_used.cput = 1867:36:31
        resources_used.mem = 23523044kb
        resources_used.vmem = 28335012kb
        resources_used.walltime = 116:46:34
        job_state = R
        queue = scholar
        server = carter-adm.rcac.purdue.edu
        Checkpoint = u
        ctime = Thu Aug 21 16:32:54 2014
        Error_Path = carter-fe01.rcac.purdue.edu:/scratch/carter/m/mgribsko/monasc
            us/dna_assembly/abyss/k119/abyss119_monascus.e5540928
        exec_host = carter-a548/0+carter-a548/1+carter-a548/2+carter-a548/3+carter
            -a548/4+carter-a548/5+carter-a548/6+carter-a548/7+carter-a548/8+carter
            -a548/9+carter-a548/10+carter-a548/11+carter-a548/12+carter-a548/13+ca
            rter-a548/14+carter-a548/15
        exec_port = 15003+15003+15003+15003+15003+15003+15003+15003+15003+15003+15
            003+15003+15003+15003+15003+15003
        group_list = scholar-queue
        Hold_Types = n
        Join_Path = n
        Keep_Files = n
        Mail_Points = a
        mtime = Thu Aug 21 16:36:59 2014
        Output_Path = carter-fe01.rcac.purdue.edu:/scratch/carter/m/mgribsko/monas
            cus/dna_assembly/abyss/k119/abyss119_monascus.o5540928
        Priority = 0
        qtime = Thu Aug 21 16:32:54 2014
        Rerunable = True
        Resource_List.ncpus = 1
        Resource_List.nodect = 1
        Resource_List.nodes = 1:ppn=16
        Resource_List.walltime = 168:00:00
        session_id = 52776
        Variable_List = PBS_O_QUEUE=scholar,PBS_O_HOME=/home/mgribsko,
            ...
        etime = Thu Aug 21 16:32:54 2014
        submit_args = abyss.job
        start_time = Thu Aug 21 16:37:00 2014
        Walltime.Remaining = 184224
        start_count = 1
        fault_tolerant = False
        job_radix = 0
        submit_host = carter-fe01.rcac.purdue.edu

    Transferring files

    Your home directory is shared across the RCAC servers, so you can transfer files from any server we have nodes on, e.g., coates, rossmann, hansen, or carter.  Similarly, since our home directories are shared across the genomics servers, you can use any one of those servers, e.g., ser, pro, gln, etc.  Several methods are available:

    scp - preferred method

    Secure copy is a convenient way to copy one or many files between unix systems (such as RCAC and the genomics servers).  This method does not require using your desktop as an intermediary.  The following examples assume you are running scp while logged into an RCAC server.

    (to a remote system from local)
    $ scp sourcefilename myusername@hostname:somedirectory/destinationfilename
    $ scp trinity.pl gribskov@ser.genomics.purdue.edu:.        (dot means same filename)

    (from a remote system to local)
    $ scp myusername@hostname:somedirectory/sourcefilename destinationfilename
    $ scp gribskov@ser.genomics.purdue.edu:trinity.pl .

    (recursive directory copy to a remote system from local; note the -r flag)
    $ scp -r NGS/ gribskov@ser.genomics.purdue.edu:NGS/

    sftp

    A secure ftp client, such as SecureFX, can also be used for transferring files.  If you use SecureCRT as a terminal emulator, just open SecureFX in the normal way and connect to any one of the RCAC servers on which we have queues.  sftp is also available on our genomics servers, though I haven't tried it.

    samba

    You can remotely mount your RCAC directory on your desktop pc using samba.

    1. Open "computer" from the start menu
    2. Near the top of the window is a menu bar, select "map network drive"
    3. Choose a drive letter for the mapped drive from the pulldown, e.g. "H" or "X"
    4. Enter the address of your RCAC home directory in the "Folder" box, something like
      \\samba.rcac.purdue.edu\username, where username is your career account user name
      note that PC slashes are the opposite of unix
    5. click "finish"

    Fortress / htar / hsi

    Since your disk storage is fairly limited, and files cannot be stored indefinitely in the scratch space, you will have to use the Fortress system for archiving files.  Fortress is an archival storage system where very large datasets can be stored.  See the fortress/hpss manual.  Fortress can be accessed in several ways:

    • htar - user guide
      The maximum size of a single member file within an htar archive is 68 GB.  The workaround is to use tar and pipe the result to Fortress using hsi:
      tar cf - . | hsi [options] "put - : someTarFile"
    • hsi - user guide
    • samba (fortress-smb.rcac.purdue.edu)
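
    For example, a few sketches of moving data to and from Fortress with htar and hsi (the file and archive names below are placeholders; substitute your own):

    $ htar -cvf myproject.tar myproject/     # archive a scratch directory onto Fortress
    $ htar -tvf myproject.tar                # list the archive contents
    $ htar -xvf myproject.tar                # extract the archive back to the current directory
    $ hsi put bigfile.fastq                  # copy a single file to Fortress
    $ hsi get bigfile.fastq                  # copy it back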

    Module Command

    RCAC uses the module command to manage access to installed packages and programs.  module is used to load applications and compilers along with the necessary libraries and paths.  Many modules are available in multiple versions, so make sure you understand and load the one you need.

    Command                       Purpose
    ----------------------------  ------------------------------------------------------------
    $ module avail                list available modules
    $ module load <modulename>    load a module, e.g. module load blast
                                  some useful modules: amber, blast, comsol, eclipse, gcc, glib,
                                  gnuplot, gromacs, imagemagick, libxml, matlab, mpi, namd,
                                  pgplot, povray, python, sas, swig, tecplot
    $ module unload <modulename>  unload a module; should remove all symbols and paths
                                  associated with the module
    $ module list                 show a list of currently loaded modules
    $ module show <modulename>    show details of the symbols loaded for a module
    $ module use <app_path>       add a preconfigured list of modules to the available modules
                                  list; ls /apps/group for a list.  The ones you probably want
                                  are in /apps/group/bioinformatics/modules

    f.ex. $ module use /apps/group/bioinformatics/modules
          $ module load trimmomatic

    Putting the 'module use' statement in your .cshrc or .bash_profile may be useful (see the sketch below).
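
    For example, a sketch of the relevant lines in ~/.bash_profile (which modules to load by default is up to you):

    module use /apps/group/bioinformatics/modules   # make the bioinformatics modules visible at login
    module load blast                               # optionally preload modules you always use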

     

    Bioinformatics modules (gribskov 15 June 2012)

    454          CAP3           CASAVA          EMBOSS       GeneMark      GenomeTools 
    Glimmer      GlimmerHMM     R-bioconductor  RADtools     RepeatMasker  SHRiMP  
    SOAPaligner  SOAPdenovo     SOAPfusion      SOAPsnp      SOAPsplice    SOAPsv     
    ViennaRNA    abyss          allpathslg      amos         beagle        bfast           
    bioinfo      biopython      blast           blat         bowtie        bowtie2
    bwa          clustalo       clustalw        cross_match  cufflinks     fastagrep    
    fastqc       fastx          freec           geneid       gmap          gmp     
    htseq        icommands      igv             igvtools     java          jellyfish  
    khmer        macs           maq             mirdeep      mrbayes       muscle
    novocraft    oases          perl            phrap        picard-tools  pysam
    quake        quest          randfold        rmblast      rsem          samtools
    sparsehash   splicegrapher  squid           stacks       tlex          tophat
    trans-abyss  trf            trimmomatic     trinity      uclust        velvet
    Files

    manual_v1b.docx - Gribskov Essential Unix Manual (222.03 kB, 19 Jun 2015)