|
@@ -1,2 +1,381 @@
|
|
|
# SGE6.2_Ubuntu_20.04_Installation_guide
|
|
|
|
|
|
+Example SGE master node installation
|
|
|
+
|
|
|
+* Hostname: NODENAME
|
|
|
+
|
|
|
+## Pre-requisites
|
|
|
+
|
|
|
+Remove (or comment out) the entry that maps 127.0.0.1 (or 127.0.1.1) to the hostname in ```/etc/hosts```:
|
|
|
+
|
|
|
+```bash
|
|
|
+127.0.0.1 localhost
|
|
|
+#127.0.1.1 NODENAME
|
|
|
+
|
|
|
+# The following lines are desirable for IPv6 capable hosts
|
|
|
+::1 ip6-localhost ip6-loopback
|
|
|
+fe00::0 ip6-localnet
|
|
|
+ff00::0 ip6-mcastprefix
|
|
|
+ff02::1 ip6-allnodes
|
|
|
+ff02::2 ip6-allrouters
|
|
|
+```
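The same edit can be scripted. The following is a minimal sketch, assuming GNU `sed` (as shipped with Ubuntu) and that the hostname is the first name on its own loopback line, as in the example above. The function name `comment_out_loopback_alias` is hypothetical, and the hostname is used as a regular expression, so a dot in it matches any character. Try it on a copy of the file first.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of SGE): comment out a 127.0.0.1 / 127.0.1.1
# line whose first name is the given host. The plain "localhost" line is
# left untouched because it does not start with that hostname.
comment_out_loopback_alias() {
    local hosts_file="$1" node="$2"
    # Match an uncommented loopback line followed by the hostname as a
    # whole word, and prefix that line with '#'.
    sed -E -i "/^127\.0\.[01]\.1[[:space:]]+${node}([[:space:]]|\$)/ s/^/#/" "$hosts_file"
}

# Example (on a copy, not on the live file):
#   cp /etc/hosts /tmp/hosts.test
#   comment_out_loopback_alias /tmp/hosts.test NODENAME
```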
|
|
|
+
|
|
|
+## SGE master installation
|
|
|
+
|
|
|
+Copy the contents of this repo to the host and, with ```sudo```, run:
|
|
|
+
|
|
|
+```bash
|
|
|
+mkdir /opt/sge6-2/
|
|
|
+cp ge62u5.tar.gz install_sge.sh /opt/sge6-2/
|
|
|
+cd /opt/sge6-2/
|
|
|
+chmod +x install_sge.sh
|
|
|
+./install_sge.sh
|
|
|
+```
|
|
|
+
|
|
|
+The folder should now have the following structure:
|
|
|
+
|
|
|
+```bash
|
|
|
+.
|
|
|
+├── 3rd_party
|
|
|
+├── bin
|
|
|
+├── catman
|
|
|
+├── ckpt
|
|
|
+├── doc
|
|
|
+├── dtrace
|
|
|
+├── examples
|
|
|
+├── ge6.2u5
|
|
|
+├── ge62u5.tar.gz
|
|
|
+├── include
|
|
|
+├── install_execd
|
|
|
+├── install_qmaster
|
|
|
+├── install_sge.sh
|
|
|
+├── inst_sge
|
|
|
+├── lib
|
|
|
+├── man
|
|
|
+├── mpi
|
|
|
+├── pvm
|
|
|
+├── qmon
|
|
|
+├── start_gui_installer
|
|
|
+├── util
|
|
|
+└── utilbin
|
|
|
+```
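Before running the installer, it may help to confirm that the entries it relies on are present. This is a minimal sketch: the function name `check_sge_root` and the choice of which entries to check are assumptions, taken from the listing above rather than from any SGE tool.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report which of the expected entries are missing
# from the SGE root directory. Returns non-zero if any entry is missing.
check_sge_root() {
    local sge_root="$1" missing=0 entry
    for entry in bin lib util utilbin inst_sge install_qmaster install_execd; do
        if [ ! -e "$sge_root/$entry" ]; then
            echo "missing: $entry"
            missing=1
        fi
    done
    return "$missing"
}

# Example:
#   check_sge_root /opt/sge6-2 && echo "layout looks complete"
```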
|
|
|
+
|
|
|
+Now we proceed with the SGE master installation:
|
|
|
+
|
|
|
+```bash
|
|
|
+cd /opt/sge6-2/
|
|
|
+./install_qmaster
|
|
|
+```
|
|
|
+
|
|
|
+The sequence of questions and answers is the following:
|
|
|
+
|
|
|
+```bash
|
|
|
+Do you agree with that license? (y/n) [n] >> y
|
|
|
+Hit <RETURN> to continue >>
|
|
|
+Do you want to install Grid Engine
|
|
|
+under an user id other than >root< (y/n) [y] >> n
|
|
|
+Hit <RETURN> to continue >>
|
|
|
+If this directory is not correct (e.g. it may contain an automounter
|
|
|
+prefix) enter the correct path to this directory or hit <RETURN>
|
|
|
+to use default [/opt/sge6-2] >> /opt/sge6-2
|
|
|
+Hit <RETURN> to continue >>
|
|
|
+(default: 2) >> 2
|
|
|
+Hit <RETURN> to continue >>
|
|
|
+(default: 2) >>
|
|
|
+Hit <RETURN> to continue >>
|
|
|
+Enter cell name [default] >> default
|
|
|
+Enter new cluster name or hit <RETURN>
|
|
|
+to use default [p6444] >> cluster_name
|
|
|
+Hit <RETURN> to continue >>
|
|
|
+Enter a qmaster spool directory [/opt/sge6-2/default/spool/qmaster] >>
|
|
|
+Hit <RETURN> to continue >>
|
|
|
+Are you going to install Windows Execution Hosts? (y/n) [n] >> n
|
|
|
+Did you install this version with >pkgadd< or did you already verify
|
|
|
+and set the file permissions of your distribution (enter: y) (y/n) [y] >> y
|
|
|
+We do not verify file permissions. Hit <RETURN> to continue >>
|
|
|
+Are all hosts of your cluster in a single DNS domain (y/n) [y] >> y
|
|
|
+Hit <RETURN> to continue >>
|
|
|
+Do you want to enable the JMX MBean server (y/n) [y] >> y
|
|
|
+Enter JAVA_HOME (use "none" when none available) [] >> none
|
|
|
+Please enter additional JVM arguments (optional, default is [-Xmx256m]) >> -Xmx256m
|
|
|
+Please enter an unused port number for the JMX MBean server [6446] >> 6446
|
|
|
+Enable JMX SSL server authentication (y/n) [y] >> y
|
|
|
+Enable JMX SSL client authentication (y/n) [y] >> y
|
|
|
+Enter JMX SSL server keystore path [/var/sgeCA/sge_qmaster/default/private/keystore] >> /var/sgeCA/sge_qmaster/default/private/keystore
|
|
|
+Enter JMX SSL server keystore pw (at least 6 characters) >> ******
|
|
|
+Using the following JMX MBean server settings.
|
|
|
+ libjvm_path >jvm_missing<
|
|
|
+ Additional JVM arguments >-Xmx256m<
|
|
|
+ JMX port >6446<
|
|
|
+ JMX ssl >true<
|
|
|
+ JMX client ssl >true<
|
|
|
+ JMX server keystore >/var/sgeCA/sge_qmaster/default/private/keystore<
|
|
|
+ JMX server keystore pw >******<
|
|
|
+Do you want to use these data (y/n) [y] >> y
|
|
|
+Hit <RETURN> to continue >>
|
|
|
+Hit <RETURN> to continue >>
|
|
|
+Please choose a spooling method (berkeleydb|classic) [berkeleydb] >> berkeleydb
|
|
|
+Do you want to use a Berkeley DB Spooling Server? (y/n) [n] >> n
|
|
|
+Hit <RETURN> to continue >>
|
|
|
+Default: [/opt/sge6-2/default/spool/spooldb] >> /opt/sge6-2/default/spool/spooldb
|
|
|
+Hit <RETURN> to continue >>
|
|
|
+Please enter a range [20000-20100] >> 20000-20100
|
|
|
+Using >20000-20100< as gid range. Hit <RETURN> to continue >>
|
|
|
+Default: [/opt/sge6-2/default/spool] >> /opt/sge6-2/default/spool
|
|
|
+Default: [none] >> none
|
|
|
+Do you want to change the configuration parameters (y/n) [n] >> n
|
|
|
+Hit <RETURN> to continue >>
|
|
|
+```
|
|
|
+
|
|
|
+The following errors are expected and can be ignored:
|
|
|
+
|
|
|
+```bash
|
|
|
+util/sgeCA/sge_ca: 1: eval: lx24-amd64=/opt/sge6-2/lib/lx24-amd64:-amd64: not found
|
|
|
+util/sgeCA/sge_ca: 1749: export: lx24-amd64: bad variable name
|
|
|
+Error: Cannot create keystore /var/sgeCA/sge_qmaster/default/private/keystore
|
|
|
+util/sgeCA/sge_ca: 1: eval: lx24-amd64=/opt/sge6-2/lib/lx24-amd64:-amd64: not found
|
|
|
+util/sgeCA/sge_ca: 1749: export: lx24-amd64: bad variable name
|
|
|
+./inst_sge: 1204: cannot create /var/sgeCA/sge_qmaster/default/private/keystore.password: Directory nonexistent
|
|
|
+chown: invalid user: 'default'
|
|
|
+```
|
|
|
+
|
|
|
+To use the cluster commands (qsub, qstat, etc.), the SGE environment variables must be set. Add the following line to the ```.bashrc``` of each user who will use the cluster:
|
|
|
+
|
|
|
+```bash
|
|
|
+source /opt/sge6-2/default/common/settings.sh
|
|
|
+```
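If this step is automated, the line should be appended only once. A minimal sketch follows. The function name `add_sge_settings` is hypothetical, and the settings path is the one used in this guide.

```shell
#!/usr/bin/env bash
# Hypothetical helper: append the "source" line to a bashrc file only if
# it is not already there, so repeated runs do not duplicate it.
add_sge_settings() {
    local bashrc="$1"
    local line="source /opt/sge6-2/default/common/settings.sh"
    touch "$bashrc"
    # -q: quiet, -x: match the whole line, -F: fixed string (no regex)
    grep -qxF "$line" "$bashrc" || echo "$line" >> "$bashrc"
}

# Example:
#   add_sge_settings "$HOME/.bashrc"
```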
|
|
|
+
|
|
|
+To start the sgemaster process:
|
|
|
+
|
|
|
+```bash
|
|
|
+$: /opt/sge6-2/default/common/sgemaster
|
|
|
+ starting sge_qmaster
|
|
|
+```
|
|
|
+
|
|
|
+Then, ```qstat -f``` should not show any error:
|
|
|
+
|
|
|
+```bash
|
|
|
+$ qstat -f
|
|
|
+$
|
|
|
+```
|
|
|
+
|
|
|
+### SGE master setup
|
|
|
+
|
|
|
+We will add the host as an execution host:
|
|
|
+
|
|
|
+```bash
|
|
|
+qconf -ae
|
|
|
+```
|
|
|
+
|
|
|
+Set the hostname in the editor that opens:
|
|
|
+
|
|
|
+```bash
|
|
|
+hostname NODENAME
|
|
|
+load_scaling NONE
|
|
|
+complex_values NONE
|
|
|
+user_lists NONE
|
|
|
+xuser_lists NONE
|
|
|
+projects NONE
|
|
|
+xprojects NONE
|
|
|
+usage_scaling NONE
|
|
|
+report_variables NONE
|
|
|
+```
|
|
|
+
|
|
|
+Now create the host group:
|
|
|
+
|
|
|
+```bash
|
|
|
+qconf -ahgrp
|
|
|
+```
|
|
|
+
|
|
|
+Add the host list ```@allhosts``` with this host:
|
|
|
+
|
|
|
+```bash
|
|
|
+group_name @allhosts
|
|
|
+hostlist NODENAME
|
|
|
+```
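The interactive editor can be avoided by loading the definition from a file with `qconf -Ahgrp`. The sketch below only writes that file. The function name `write_hostgroup_file` and the `/tmp` path are hypothetical, and the `qconf` call is left as a comment because it needs a running qmaster and an admin user.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write a host group definition with the same two
# fields that the qconf editor shows.
write_hostgroup_file() {
    local group="$1" node="$2" out="$3"
    printf 'group_name %s\nhostlist %s\n' "$group" "$node" > "$out"
}

# Example (requires a running qmaster):
#   write_hostgroup_file @allhosts NODENAME /tmp/allhosts.hgrp
#   qconf -Ahgrp /tmp/allhosts.hgrp
```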
|
|
|
+
|
|
|
+Add a new queue:
|
|
|
+
|
|
|
+```bash
|
|
|
+qconf -aq
|
|
|
+```
|
|
|
+
|
|
|
+Set the values as follows:
|
|
|
+
|
|
|
+```bash
|
|
|
+qname all.q
|
|
|
+hostlist @allhosts
|
|
|
+seq_no 0
|
|
|
+load_thresholds np_load_avg=1.75
|
|
|
+suspend_thresholds NONE
|
|
|
+nsuspend 1
|
|
|
+suspend_interval 00:05:00
|
|
|
+priority 19
|
|
|
+min_cpu_interval 00:05:00
|
|
|
+processors UNDEFINED
|
|
|
+qtype BATCH INTERACTIVE
|
|
|
+ckpt_list NONE
|
|
|
+pe_list make
|
|
|
+rerun FALSE
|
|
|
+slots 1,[NODENAME=24]
|
|
|
+tmpdir /tmp
|
|
|
+shell /bin/bash
|
|
|
+prolog NONE
|
|
|
+epilog NONE
|
|
|
+shell_start_mode posix_compliant
|
|
|
+starter_method NONE
|
|
|
+suspend_method NONE
|
|
|
+resume_method NONE
|
|
|
+terminate_method NONE
|
|
|
+notify 00:00:60
|
|
|
+owner_list NONE
|
|
|
+user_lists NONE
|
|
|
+xuser_lists NONE
|
|
|
+subordinate_list NONE
|
|
|
+complex_values NONE
|
|
|
+projects NONE
|
|
|
+xprojects NONE
|
|
|
+calendar NONE
|
|
|
+initial_state default
|
|
|
+s_rt INFINITY
|
|
|
+h_rt INFINITY
|
|
|
+s_cpu INFINITY
|
|
|
+h_cpu INFINITY
|
|
|
+s_fsize INFINITY
|
|
|
+h_fsize INFINITY
|
|
|
+s_data INFINITY
|
|
|
+h_data INFINITY
|
|
|
+s_stack INFINITY
|
|
|
+h_stack INFINITY
|
|
|
+s_core INFINITY
|
|
|
+h_core INFINITY
|
|
|
+s_rss INFINITY
|
|
|
+h_rss INFINITY
|
|
|
+s_vmem INFINITY
|
|
|
+h_vmem INFINITY
|
|
|
+```
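In the `slots` field, the first value is the default for every host in the queue, and each bracketed `[host=value]` entry overrides it for one host. Here the default is 1 and `NODENAME` gets 24. A minimal sketch for building that value follows. The function name `slots_value` is hypothetical, and using the core count reported by `nproc` as the slot count is an assumption, not something this guide prescribes.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build a "slots" value with a default of 1 and a
# per-host override, in the "default,[host=value]" form used above.
slots_value() {
    local node="$1" cores="$2"
    echo "1,[${node}=${cores}]"
}

# Example: one slot per core on the local machine (assumption)
#   slots_value "$(hostname)" "$(nproc)"
```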
|
|
|
+
|
|
|
+Add a user, in this case ```NEWUSER```:
|
|
|
+
|
|
|
+```bash
|
|
|
+qconf -auser
|
|
|
+```
|
|
|
+
|
|
|
+```
|
|
|
+name NEWUSER
|
|
|
+oticket 0
|
|
|
+fshare 0
|
|
|
+delete_time 0
|
|
|
+default_project NONE
|
|
|
+```
|
|
|
+
|
|
|
+Add ```NEWUSER``` to a new user list, ```NEWUSERS```:
|
|
|
+
|
|
|
+```bash
|
|
|
+qconf -au NEWUSER NEWUSERS
|
|
|
+```
|
|
|
+
|
|
|
+We should modify the queue to include this user list:
|
|
|
+
|
|
|
+```bash
|
|
|
+qconf -mq all.q
|
|
|
+```
|
|
|
+
|
|
|
+```bash
|
|
|
+...
|
|
|
+notify 00:00:60
|
|
|
+owner_list NONE
|
|
|
+user_lists NEWUSERS
|
|
|
+xuser_lists NONE
|
|
|
+subordinate_list NONE
|
|
|
+...
|
|
|
+```
|
|
|
+
|
|
|
+Launch the execution daemon (```sgeexecd```):
|
|
|
+
|
|
|
+```bash
|
|
|
+$:/opt/sge6-2/default/common# ./sgeexecd
|
|
|
+ starting sge_execd
|
|
|
+```
|
|
|
+
|
|
|
+If the queue is in an error state, you can clear it with:
|
|
|
+
|
|
|
+```bash
|
|
|
+qmod -c all.q@NODENAME
|
|
|
+```
|
|
|
+
|
|
|
+
|
|
|
+We can restart the services, just in case:
|
|
|
+
|
|
|
+```bash
|
|
|
+$: cd /opt/sge6-2/default/common
|
|
|
+$: ./sgeexecd stop
|
|
|
+ Shutting down Grid Engine execution daemon
|
|
|
+$: cd /opt/sge6-2/default/common
|
|
|
+$: ./sgemaster stop
|
|
|
+ shutting down Grid Engine qmaster
|
|
|
+$: cd /opt/sge6-2/default/common
|
|
|
+$: ./sgemaster start
|
|
|
+ starting sge_qmaster
|
|
|
+$: cd /opt/sge6-2/default/common
|
|
|
+$: ./sgeexecd start
|
|
|
+ starting sge_execd
|
|
|
+```
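The restart sequence can be wrapped in a small function. This is a minimal sketch: the name `restart_sge` is hypothetical, the default directory is the one used in this guide, and the order (stop the execution daemon, stop the qmaster, start the qmaster, start the execution daemon) follows the commands above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: restart both daemons in the order shown above.
# The directory holding the sgemaster/sgeexecd scripts is a parameter.
restart_sge() {
    local common_dir="${1:-/opt/sge6-2/default/common}"
    "$common_dir/sgeexecd" stop                  # stop the execution daemon first
    "$common_dir/sgemaster" stop                 # then the qmaster
    "$common_dir/sgemaster" start || return 1    # give up if the qmaster does not start
    "$common_dir/sgeexecd" start
}

# Example (as root):
#   restart_sge
```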
|
|
|
+
|
|
|
+Add the host to the submit host list:
|
|
|
+
|
|
|
+```bash
|
|
|
+qconf -as NODENAME
|
|
|
+NODENAME added to submit host list
|
|
|
+```
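Once the host can submit, a small test job is a quick way to check the queue end to end. The script below is a minimal sketch. The job name `hello_sge` and the file name `hello.sh` are hypothetical, and it assumes the submitting user belongs to the `NEWUSERS` list configured above. Lines starting with `#$` are read by `qsub` as options and are ordinary comments to bash.

```shell
#!/bin/bash
# Minimal test job (hypothetical example, not part of the repo).
# Job name:
#$ -N hello_sge
# Run in the directory the job was submitted from:
#$ -cwd
# Use the queue created above:
#$ -q all.q
# Merge stderr into stdout:
#$ -j y

hello_job() {
    echo "Hello from $(uname -n)"
}
hello_job
```

Saved as `hello.sh`, it could be submitted with `qsub hello.sh` and followed with `qstat -u NEWUSER`.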
|
|
|
+
|
|
|
+## Useful commands
|
|
|
+
|
|
|
+Cluster overview
|
|
|
+
|
|
|
+```bash
|
|
|
+qstat -f
|
|
|
+```
|
|
|
+
|
|
|
+Overview of the jobs of a given user
|
|
|
+
|
|
|
+```bash
|
|
|
+qstat -u NEWUSER
|
|
|
+```
|
|
|
+
|
|
|
+Get job information
|
|
|
+
|
|
|
+```bash
|
|
|
+qstat -j JOB_ID
|
|
|
+```
|
|
|
+
|
|
|
+Get information from a finished job
|
|
|
+
|
|
|
+```bash
|
|
|
+qacct -j JOB_ID
|
|
|
+```
|
|
|
+
|
|
|
+Cancel job
|
|
|
+
|
|
|
+```bash
|
|
|
+qdel JOB_ID
|
|
|
+```
|
|
|
+
|
|
|
+Hold a queued job
|
|
|
+
|
|
|
+```bash
|
|
|
+qhold JOB_ID
|
|
|
+```
|
|
|
+
|
|
|
+Release a job in hold state
|
|
|
+
|
|
|
+```bash
|
|
|
+qrls JOB_ID
|
|
|
+```
|
|
|
+
|
|
|
+Modify the resource requirements of a queued job
|
|
|
+
|
|
|
+```bash
|
|
|
+# The job was submitted with cpu_slots=2; change it to 1
|
|
|
+qalter -l 'cpu_slots=1,h_vmem=infinity,virtual_free=51200M' JOB_ID
|
|
|
+```
|