
Update README.md

Javier Jorge Cano 2 years ago
commit 9f995e3f12
1 changed files with 379 additions and 0 deletions

README.md

@@ -1,2 +1,381 @@
 # SGE6.2_Ubuntu_20.04_Installation_guide
 
+Example SGE master node installation 
+
+* Hostname: NODENAME
+
+## Pre-requisites
+
+Remove (or comment out) the ```127.0.0.1``` or ```127.0.1.1``` entry that points to the hostname in ```/etc/hosts```, so that the file looks like this:
+
+```bash
+127.0.0.1	localhost
+#127.0.1.1	NODENAME
+
+# The following lines are desirable for IPv6 capable hosts
+::1     ip6-localhost ip6-loopback
+fe00::0 ip6-localnet
+ff00::0 ip6-mcastprefix
+ff02::1 ip6-allnodes
+ff02::2 ip6-allrouters
+```
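
The edit above can also be scripted. The following is a minimal sketch, not part of this repo: `comment_loopback_host` is a hypothetical helper that prints the hosts file with the matching loopback entry commented out. It assumes the hostname contains no regex metacharacters other than dots (which are treated as wildcards), and it comments out the whole line, so check the result if the line also lists other names.

```shell
# Print HOSTS_FILE with loopback entries for HOSTNAME commented out.
# Usage: comment_loopback_host HOSTS_FILE HOSTNAME
# (hypothetical helper; the file itself is not modified)
comment_loopback_host() {
    hosts_file=$1
    node=$2
    # Prefix "#" to uncommented 127.0.0.1 / 127.0.1.1 lines that list the
    # hostname as a whole word; "127.0.0.1 localhost" is left untouched.
    sed -E "/^127\.0\.[01]\.1[[:space:]]+(.*[[:space:]])?${node}([[:space:]]|\$)/ s/^/#/" "$hosts_file"
}
```

For example, `comment_loopback_host /etc/hosts NODENAME` prints the modified file to standard output; review it before writing it back as root.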
+
+## SGE master installation
+
+Copy the contents of this repository to the host and run the following as root (for example, with ```sudo```):
+
+```bash
+mkdir /opt/sge6-2/
+cp ge62u5.tar.gz install_sge.sh /opt/sge6-2/
+cd /opt/sge6-2/
+chmod +x install_sge.sh
+./install_sge.sh
+```
+
+The folder ```/opt/sge6-2/``` now has the following structure:
+
+```bash
+.
+├── 3rd_party
+├── bin
+├── catman
+├── ckpt
+├── doc
+├── dtrace
+├── examples
+├── ge6.2u5
+├── ge62u5.tar.gz
+├── include
+├── install_execd
+├── install_qmaster
+├── install_sge.sh
+├── inst_sge
+├── lib
+├── man
+├── mpi
+├── pvm
+├── qmon
+├── start_gui_installer
+├── util
+└── utilbin
+```
+
+Now we proceed with the SGE master installation:
+
+```bash
+cd /opt/sge6-2/
+./install_qmaster
+```
+
+The sequence of prompts and answers is the following:
+
+```bash
+Do you agree with that license? (y/n) [n] >> y
+Hit <RETURN> to continue >> 
+Do you want to install Grid Engine
+under an user id other than >root< (y/n) [y] >> n
+Hit <RETURN> to continue >> 
+If this directory is not correct (e.g. it may contain an automounter
+prefix) enter the correct path to this directory or hit <RETURN>
+to use default [/opt/sge6-2] >> /opt/sge6-2
+Hit <RETURN> to continue >> 
+(default: 2) >> 2
+Hit <RETURN> to continue >> 
+(default: 2) >> 
+Hit <RETURN> to continue >> 
+Enter cell name [default] >> default
+Enter new cluster name or hit <RETURN>
+to use default [p6444] >> cluster_name      
+Hit <RETURN> to continue >> 
+Enter a qmaster spool directory [/opt/sge6-2/default/spool/qmaster] >> 
+Hit <RETURN> to continue >> 
+Are you going to install Windows Execution Hosts? (y/n) [n] >> n
+Did you install this version with >pkgadd< or did you already verify
+and set the file permissions of your distribution (enter: y) (y/n) [y] >> y
+We do not verify file permissions. Hit <RETURN> to continue >> 
+Are all hosts of your cluster in a single DNS domain (y/n) [y] >> y
+Hit <RETURN> to continue >> 
+Do you want to enable the JMX MBean server (y/n) [y] >> y
+Enter JAVA_HOME (use "none" when none available) [] >> none
+Please enter additional JVM arguments (optional, default is [-Xmx256m]) >> -Xmx256m
+Please enter an unused port number for the JMX MBean server [6446] >> 6446
+Enable JMX SSL server authentication (y/n) [y] >> y
+Enable JMX SSL client authentication (y/n) [y] >> y
+Enter JMX SSL server keystore path [/var/sgeCA/sge_qmaster/default/private/keystore] >> /var/sgeCA/sge_qmaster/default/private/keystore
+Enter JMX SSL server keystore pw (at least 6 characters) >> ******
+Using the following JMX MBean server settings.
+   libjvm_path              >jvm_missing<
+   Additional JVM arguments >-Xmx256m<
+   JMX port                 >6446<
+   JMX ssl                  >true<
+   JMX client ssl           >true<
+   JMX server keystore      >/var/sgeCA/sge_qmaster/default/private/keystore<
+   JMX server keystore pw   >******<
+Do you want to use these data (y/n) [y] >> y
+Hit <RETURN> to continue >> 
+Hit <RETURN> to continue >> 
+Please choose a spooling method (berkeleydb|classic) [berkeleydb] >> berkeleydb
+Do you want to use a Berkeley DB Spooling Server? (y/n) [n] >> n
+Hit <RETURN> to continue >> 
+Default: [/opt/sge6-2/default/spool/spooldb] >> /opt/sge6-2/default/spool/spooldb
+Hit <RETURN> to continue >> 
+Please enter a range [20000-20100] >> 20000-20100
+Using >20000-20100< as gid range. Hit <RETURN> to continue >> 
+Default: [/opt/sge6-2/default/spool] >> /opt/sge6-2/default/spool
+Default: [none] >> none
+Do you want to change the configuration parameters (y/n) [n] >> n
+Hit <RETURN> to continue >> 
+```
+
+The following errors are expected; everything is fine:
+
+```bash
+util/sgeCA/sge_ca: 1: eval: lx24-amd64=/opt/sge6-2/lib/lx24-amd64:-amd64: not found
+util/sgeCA/sge_ca: 1749: export: lx24-amd64: bad variable name
+Error: Cannot create keystore /var/sgeCA/sge_qmaster/default/private/keystore
+util/sgeCA/sge_ca: 1: eval: lx24-amd64=/opt/sge6-2/lib/lx24-amd64:-amd64: not found
+util/sgeCA/sge_ca: 1749: export: lx24-amd64: bad variable name
+./inst_sge: 1204: cannot create /var/sgeCA/sge_qmaster/default/private/keystore.password: Directory nonexistent
+chown: invalid user: 'default'
+```
+
+To use the cluster commands (```qsub```, ```qstat```, etc.), the SGE environment variables must be set. Add the following line to the ```.bashrc``` of every user who will use the cluster:
+
+```bash
+source /opt/sge6-2/default/common/settings.sh
+```
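
When several users need this line, adding it by hand gets repetitive. A minimal sketch of an idempotent helper follows; `add_sge_settings` is a hypothetical name, and the default target `~/.bashrc` is an assumption taken from the text above.

```shell
# Append the SGE settings line to a shell startup file, only if it is missing.
# Usage: add_sge_settings [RC_FILE]   (defaults to ~/.bashrc)
add_sge_settings() {
    rc_file=${1:-$HOME/.bashrc}
    line='source /opt/sge6-2/default/common/settings.sh'
    # -q: quiet, -x: match the whole line, -F: fixed string (no regex)
    grep -qxF "$line" "$rc_file" 2>/dev/null || printf '%s\n' "$line" >> "$rc_file"
}
```

Running it twice on the same file leaves a single copy of the line.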
+
+To start the sgemaster process:
+
+```bash
+$ /opt/sge6-2/default/common/sgemaster
+   starting sge_qmaster
+```
+
+Then, ```qstat -f``` should not show any error:
+
+```bash
+$ qstat -f
+$
+```
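
If ```qstat``` is reported as not found instead, the settings file has probably not been sourced in the current shell. A small sketch of a sanity check, assuming that ```settings.sh``` exports ```SGE_ROOT``` and extends ```PATH``` (`sge_env_ok` is a hypothetical name):

```shell
# Return 0 when the SGE environment looks loaded in the current shell.
sge_env_ok() {
    # SGE_ROOT must be non-empty and qstat must be reachable through PATH
    [ -n "${SGE_ROOT:-}" ] && command -v qstat >/dev/null 2>&1
}
```

It can be used as `sge_env_ok || source /opt/sge6-2/default/common/settings.sh`.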
+
+### SGE master setup
+
+Add the host as an execution host:
+
+```bash
+qconf -ae
+```
+
+In the editor that opens, set ```hostname``` to the name of the host:
+
+```bash
+hostname              NODENAME
+load_scaling          NONE
+complex_values        NONE
+user_lists            NONE
+xuser_lists           NONE
+projects              NONE
+xprojects             NONE
+usage_scaling         NONE
+report_variables      NONE
+```
+
+Now create the host group:
+
+```bash
+qconf -ahgrp
+```
+
+Define the host group ```@allhosts``` with this host as its only member:
+
+```bash
+group_name @allhosts
+hostlist NODENAME
+```
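
The interactive editor can be avoided by loading the definition from a file with ```qconf -Ahgrp```. The sketch below only writes that file; `write_hostgroup_file` and the output path used in the usage line are hypothetical, and the file format is the two-line form shown above.

```shell
# Write a host group definition that "qconf -Ahgrp FILE" can load.
# Usage: write_hostgroup_file OUT_FILE GROUP HOST [HOST...]
write_hostgroup_file() {
    out=$1
    group=$2
    shift 2
    # "$*" joins the remaining host names with single spaces
    printf 'group_name %s\nhostlist %s\n' "$group" "$*" > "$out"
}
```

On the master this would be used as `write_hostgroup_file /tmp/allhosts.grp @allhosts NODENAME && qconf -Ahgrp /tmp/allhosts.grp`.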
+
+Add a new queue:
+
+```bash
+qconf -aq
+```
+
+Set the values as needed, for example (here ```NODENAME``` gets 24 slots):
+
+```bash
+qname                 all.q
+hostlist              @allhosts
+seq_no                0
+load_thresholds       np_load_avg=1.75
+suspend_thresholds    NONE
+nsuspend              1
+suspend_interval      00:05:00
+priority              19
+min_cpu_interval      00:05:00
+processors            UNDEFINED
+qtype                 BATCH INTERACTIVE
+ckpt_list             NONE
+pe_list               make
+rerun                 FALSE
+slots                 1,[NODENAME=24]
+tmpdir                /tmp
+shell                 /bin/bash
+prolog                NONE
+epilog                NONE
+shell_start_mode      posix_compliant
+starter_method        NONE
+suspend_method        NONE
+resume_method         NONE
+terminate_method      NONE
+notify                00:00:60
+owner_list            NONE
+user_lists            NONE
+xuser_lists           NONE
+subordinate_list      NONE
+complex_values        NONE
+projects              NONE
+xprojects             NONE
+calendar              NONE
+initial_state         default
+s_rt                  INFINITY
+h_rt                  INFINITY
+s_cpu                 INFINITY
+h_cpu                 INFINITY
+s_fsize               INFINITY
+h_fsize               INFINITY
+s_data                INFINITY
+h_data                INFINITY
+s_stack               INFINITY
+h_stack               INFINITY
+s_core                INFINITY
+h_core                INFINITY
+s_rss                 INFINITY
+h_rss                 INFINITY
+s_vmem                INFINITY
+h_vmem                INFINITY
+```
+
+Add a user, in this case ```NEWUSER```:
+
+```bash
+qconf -auser
+```
+
+```
+name NEWUSER
+oticket 0
+fshare 0
+delete_time 0
+default_project NONE
+```
+
+Add ```NEWUSER``` to a new user list, ```NEWUSERS```:
+
+```bash
+qconf -au NEWUSER NEWUSERS
+```
+
+Modify the queue to include this user list:
+
+```bash
+qconf -mq all.q
+```
+
+```bash
+...
+notify                00:00:60
+owner_list            NONE
+user_lists            NEWUSERS
+xuser_lists           NONE
+subordinate_list      NONE
+...
+```
+
+Launch the execution daemon (```sgeexecd```):
+
+```bash
+$ /opt/sge6-2/default/common/sgeexecd
+   starting sge_execd
+```
+
+If the queue ends up in an error state, you can clear it with:
+
+```bash
+qmod -c all.q@NODENAME
+```
+
+
+We can restart the services, just in case:
+
+```bash
+$ cd /opt/sge6-2/default/common
+$ ./sgeexecd stop
+   Shutting down Grid Engine execution daemon
+$ ./sgemaster stop
+   shutting down Grid Engine qmaster
+$ ./sgemaster start
+   starting sge_qmaster
+$ ./sgeexecd start
+   starting sge_execd
+```
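
The same stop/start order can be wrapped in one function. This is a minimal sketch, assuming the startup scripts live in ```/opt/sge6-2/default/common``` as above; `restart_sge` is a hypothetical name, and the chain stops at the first step that fails.

```shell
# Restart both SGE daemons: execd down, qmaster down, qmaster up, execd up.
# Usage: restart_sge [COMMON_DIR]   (defaults to /opt/sge6-2/default/common)
restart_sge() {
    common_dir=${1:-/opt/sge6-2/default/common}
    # Each step runs only if the previous one succeeded
    "$common_dir/sgeexecd" stop &&
    "$common_dir/sgemaster" stop &&
    "$common_dir/sgemaster" start &&
    "$common_dir/sgeexecd" start
}
```

Run it as root, the same way as the individual commands.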
+
+Add the host to the submit hosts' list:
+
+```bash
+qconf -as NODENAME
+NODENAME added to submit host list
+```
+
+## Useful commands
+
+Cluster overview
+
+```bash
+qstat -f
+```
+
+Overview of the jobs of a given user
+
+```bash
+qstat -u NEWUSER
+```
+
+Get job information
+
+```bash
+qstat -j JOB_ID
+```
+
+Get information from a finished job
+
+```bash
+qacct -j JOB_ID
+```
+
+Cancel job
+
+```bash
+qdel JOB_ID
+```
+
+Hold a queued job
+
+```bash
+qhold JOB_ID
+```
+
+Release a job in hold state
+
+```bash
+qrls JOB_ID
+```
+
+Modify the resource requirements of a queued job
+
+```bash
+# The job was submitted with cpu_slots=2; change it to 1
+qalter -l 'cpu_slots=1,h_vmem=infinity,virtual_free=51200M' JOB_ID
+```