Javier Jorge Cano 081ac89247 Minor change in README.md 2 years ago

SGE6.2_Ubuntu_20.04_Installation_guide

Example SGE master node installation

  • Hostname: NODENAME

Pre-requisites

Remove (or comment out) the 127.0.0.1 or 127.0.1.1 entry that maps to the hostname in /etc/hosts:

127.0.0.1	localhost
#127.0.1.1	NODENAME

# The following lines are desirable for IPv6 capable hosts
::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
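
The edit above can also be scripted. The following is a minimal sketch, not part of this repo: `comment_out_loopback_hostname` is a hypothetical helper that prints a copy of a hosts file with every 127.x.x.x line that names the given host commented out. It writes to standard output, so the original file is left untouched.

```shell
#!/bin/sh
# Hypothetical helper (an assumption, not shipped with SGE or this repo):
# print the hosts file with loopback entries for the given hostname commented out.
comment_out_loopback_hostname() {
    hosts_file=$1
    node_name=$2
    awk -v host="$node_name" '
        # Only uncommented lines whose address starts with 127. are candidates.
        /^127\./ {
            for (i = 2; i <= NF; i++) {
                if ($i == host) { print "#" $0; next }
            }
        }
        # Everything else is printed unchanged.
        { print }
    ' "$hosts_file"
}

# Example (review the result before replacing /etc/hosts):
#   comment_out_loopback_hostname /etc/hosts NODENAME > /tmp/hosts.new
```

Writing to a separate file first makes it easy to compare the result with the original before copying it back with sudo.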

SGE master installation

Copy the contents of this repo to the host and, as root (or with sudo), run:

mkdir /opt/sge6-2/
cp ge62u5.tar.gz install_sge.sh /opt/sge6-2/
cd /opt/sge6-2/
chmod +x install_sge.sh
./install_sge.sh

After the script finishes, /opt/sge6-2/ has the following structure:

.
├── 3rd_party
├── bin
├── catman
├── ckpt
├── doc
├── dtrace
├── examples
├── ge6.2u5
├── ge62u5.tar.gz
├── include
├── install_execd
├── install_qmaster
├── install_sge.sh
├── inst_sge
├── lib
├── man
├── mpi
├── pvm
├── qmon
├── start_gui_installer
├── util
└── utilbin

Now we proceed with the SGE master installation:

cd /opt/sge6-2/
./install_qmaster

The sequence of prompts and answers is the following:

Do you agree with that license? (y/n) [n] >> y
Hit <RETURN> to continue >> 
Do you want to install Grid Engine
under an user id other than >root< (y/n) [y] >> n
Hit <RETURN> to continue >> 
If this directory is not correct (e.g. it may contain an automounter
prefix) enter the correct path to this directory or hit <RETURN>
to use default [/opt/sge6-2] >> /opt/sge6-2
Hit <RETURN> to continue >> 
(default: 2) >> 2
Hit <RETURN> to continue >> 
(default: 2) >> 
Hit <RETURN> to continue >> 
Enter cell name [default] >> default
Enter new cluster name or hit <RETURN>
to use default [p6444] >> cluster_name      
Hit <RETURN> to continue >> 
Enter a qmaster spool directory [/opt/sge6-2/default/spool/qmaster] >> 
Hit <RETURN> to continue >> 
Are you going to install Windows Execution Hosts? (y/n) [n] >> n
Did you install this version with >pkgadd< or did you already verify
and set the file permissions of your distribution (enter: y) (y/n) [y] >> y
We do not verify file permissions. Hit <RETURN> to continue >> 
Are all hosts of your cluster in a single DNS domain (y/n) [y] >> y
Hit <RETURN> to continue >> 
Do you want to enable the JMX MBean server (y/n) [y] >> y
Enter JAVA_HOME (use "none" when none available) [] >> none
Please enter additional JVM arguments (optional, default is [-Xmx256m]) >> -Xmx256m
Please enter an unused port number for the JMX MBean server [6446] >> 6446
Enable JMX SSL server authentication (y/n) [y] >> y
Enable JMX SSL client authentication (y/n) [y] >> y
Enter JMX SSL server keystore path [/var/sgeCA/sge_qmaster/default/private/keystore] >> /var/sgeCA/sge_qmaster/default/private/keystore
Enter JMX SSL server keystore pw (at least 6 characters) >> ******
Using the following JMX MBean server settings.
   libjvm_path              >jvm_missing<
   Additional JVM arguments >-Xmx256m<
   JMX port                 >6446<
   JMX ssl                  >true<
   JMX client ssl           >true<
   JMX server keystore      >/var/sgeCA/sge_qmaster/default/private/keystore<
   JMX server keystore pw   >******<
Do you want to use these data (y/n) [y] >> y
Hit <RETURN> to continue >> 
Hit <RETURN> to continue >> 
Please choose a spooling method (berkeleydb|classic) [berkeleydb] >> berkeleydb
Do you want to use a Berkeley DB Spooling Server? (y/n) [n] >> n
Hit <RETURN> to continue >> 
Default: [/opt/sge6-2/default/spool/spooldb] >> /opt/sge6-2/default/spool/spooldb
Hit <RETURN> to continue >> 
Please enter a range [20000-20100] >> 20000-20100
Using >20000-20100< as gid range. Hit <RETURN> to continue >> 
Default: [/opt/sge6-2/default/spool] >> /opt/sge6-2/default/spool
Default: [none] >> none
Do you want to change the configuration parameters (y/n) [n] >> n
Hit <RETURN> to continue >> 

The following errors are expected and can be ignored:

util/sgeCA/sge_ca: 1: eval: lx24-amd64=/opt/sge6-2/lib/lx24-amd64:-amd64: not found
util/sgeCA/sge_ca: 1749: export: lx24-amd64: bad variable name
Error: Cannot create keystore /var/sgeCA/sge_qmaster/default/private/keystore
util/sgeCA/sge_ca: 1: eval: lx24-amd64=/opt/sge6-2/lib/lx24-amd64:-amd64: not found
util/sgeCA/sge_ca: 1749: export: lx24-amd64: bad variable name
./inst_sge: 1204: cannot create /var/sgeCA/sge_qmaster/default/private/keystore.password: Directory nonexistent
chown: invalid user: 'default'

To use the cluster commands (qsub, qstat, etc.), the SGE environment variables must be set. Add the following line to the .bashrc of each user who will use the cluster:

source /opt/sge6-2/default/common/settings.sh
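
If the line is added from a script, it helps to avoid appending it twice. A minimal sketch follows, assuming the installation path used in this guide; `add_sge_settings` is a hypothetical helper name, not something SGE provides.

```shell
#!/bin/sh
# Hypothetical helper: append the settings line to a .bashrc only if it is
# not already there, so the function can be run repeatedly.
add_sge_settings() {
    bashrc=$1
    line='source /opt/sge6-2/default/common/settings.sh'
    touch "$bashrc"
    # -x matches whole lines, -F treats the pattern as a fixed string.
    grep -qxF "$line" "$bashrc" || printf '%s\n' "$line" >> "$bashrc"
}

# Example:
#   add_sge_settings "$HOME/.bashrc"
```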

To start the sgemaster process:

$ /opt/sge6-2/default/common/sgemaster
   starting sge_qmaster

Then, qstat -f should not show any error:

$ qstat -f
$

SGE master setup

Add the host as an execution host:

qconf -ae

In the editor that opens, set the hostname:

hostname              NODENAME
load_scaling          NONE
complex_values        NONE
user_lists            NONE
xuser_lists           NONE
projects              NONE
xprojects             NONE
usage_scaling         NONE
report_variables      NONE
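
As a non-interactive alternative, qconf can read the same fields from a file with `qconf -Ae`. The sketch below only generates that file content; `exec_host_config` is a hypothetical helper, and the field values are the ones shown above.

```shell
#!/bin/sh
# Hypothetical helper: print an execution host definition for the given
# hostname, using the same fields and defaults as the editor template above.
exec_host_config() {
    printf '%-21s %s\n' hostname "$1"
    for field in load_scaling complex_values user_lists xuser_lists \
                 projects xprojects usage_scaling report_variables; do
        printf '%-21s %s\n' "$field" NONE
    done
}

# Example (run on the master, as an SGE manager):
#   exec_host_config NODENAME > /tmp/exec_host.conf
#   qconf -Ae /tmp/exec_host.conf
```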

Now create the host group:

qconf -ahgrp

Define the host group @allhosts with this host:

group_name @allhosts
hostlist NODENAME

Add a new queue:

qconf -aq

Set the values as follows, replacing NODENAME and the slot count (24 here) with your own:

qname                 all.q
hostlist              @allhosts
seq_no                0
load_thresholds       np_load_avg=1.75
suspend_thresholds    NONE
nsuspend              1
suspend_interval      00:05:00
priority              19
min_cpu_interval      00:05:00
processors            UNDEFINED
qtype                 BATCH INTERACTIVE
ckpt_list             NONE
pe_list               make
rerun                 FALSE
slots                 1,[NODENAME=24]
tmpdir                /tmp
shell                 /bin/bash
prolog                NONE
epilog                NONE
shell_start_mode      posix_compliant
starter_method        NONE
suspend_method        NONE
resume_method         NONE
terminate_method      NONE
notify                00:00:60
owner_list            NONE
user_lists            NONE
xuser_lists           NONE
subordinate_list      NONE
complex_values        NONE
projects              NONE
xprojects             NONE
calendar              NONE
initial_state         default
s_rt                  INFINITY
h_rt                  INFINITY
s_cpu                 INFINITY
h_cpu                 INFINITY
s_fsize               INFINITY
h_fsize               INFINITY
s_data                INFINITY
h_data                INFINITY
s_stack               INFINITY
h_stack               INFINITY
s_core                INFINITY
h_core                INFINITY
s_rss                 INFINITY
h_rss                 INFINITY
s_vmem                INFINITY
h_vmem                INFINITY
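
The slots value above combines a default (1) with a per-host override in brackets. As a small sketch, the value can be built from a hostname and a slot count; `slots_value` is a hypothetical helper, and using one slot per core is an assumption, not something this guide prescribes.

```shell
#!/bin/sh
# Hypothetical helper: build a slots value with a default of 1 and a
# host-specific override, in the form used in the queue definition above.
slots_value() {
    host=$1
    count=$2
    printf '1,[%s=%s]\n' "$host" "$count"
}

# Example, assuming one slot per core on the local machine:
#   slots_value "$(uname -n)" "$(nproc)"
```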

Add a user, in this case NEWUSER:

qconf -auser

In the editor that opens, fill in the user entry:

name NEWUSER
oticket 0
fshare 0
delete_time 0
default_project NONE

Add NEWUSER to a new user list, NEWUSERS:

qconf -au NEWUSER NEWUSERS

Modify the queue so that it includes this user list:

qconf -mq all.q
...
notify                00:00:60
owner_list            NONE
user_lists            NEWUSERS
xuser_lists           NONE
subordinate_list      NONE
...
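
The same change can be made without the editor by dumping the queue with `qconf -sq`, editing the dump, and loading it back with `qconf -Mq`. The sketch below covers only the editing step; `set_queue_attr` is a hypothetical helper and assumes the attribute fits on a single line (no backslash continuation).

```shell
#!/bin/sh
# Hypothetical helper: print a queue configuration with one attribute
# replaced. Assumes the attribute occupies a single line in the file.
set_queue_attr() {
    conf_file=$1
    attr=$2
    value=$3
    awk -v attr="$attr" -v value="$value" '
        # Rewrite only the line whose first field is exactly the attribute name.
        $1 == attr { printf "%-21s %s\n", attr, value; next }
        { print }
    ' "$conf_file"
}

# Example:
#   qconf -sq all.q > /tmp/all.q.conf
#   set_queue_attr /tmp/all.q.conf user_lists NEWUSERS > /tmp/all.q.new
#   qconf -Mq /tmp/all.q.new
```

For a single attribute, `qconf -mattr queue user_lists NEWUSERS all.q` should achieve the same result in one step.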

Launch the execution daemon (sgeexecd):

$ /opt/sge6-2/default/common/sgeexecd
   starting sge_execd

If the queue is in an error state, clear it with:

qmod -c all.q@NODENAME

If needed, restart both services. Stop the execution daemon before the master, and start the master before the execution daemon:

$ cd /opt/sge6-2/default/common
$ ./sgeexecd stop
   Shutting down Grid Engine execution daemon
$ ./sgemaster stop
   shutting down Grid Engine qmaster
$ ./sgemaster start
   starting sge_qmaster
$ ./sgeexecd start
   starting sge_execd
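
The restart sequence can be wrapped in a small function so the order is always the same. This is a sketch under stated assumptions: `restart_sge` is a hypothetical name, the default directory is the one used in this guide, and failures of the stop commands are ignored on the assumption that a daemon may already be stopped.

```shell
#!/bin/sh
# Hypothetical helper: restart both daemons in the order used above.
restart_sge() {
    common_dir=${1:-/opt/sge6-2/default/common}
    # Stop the execution daemon first, then the master. Errors are ignored
    # here (assumption: a daemon that is not running is not a problem).
    "$common_dir/sgeexecd" stop || true
    "$common_dir/sgemaster" stop || true
    # Start the master first; start the execution daemon only if that worked.
    "$common_dir/sgemaster" start && "$common_dir/sgeexecd" start
}

# Example (as root):
#   restart_sge
```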

Add the host to the list of submit hosts:

qconf -as NODENAME
NODENAME added to submit host list
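
Once the host is a submit host, a trivial job is one way to check that the queue accepts and runs work. The sketch below writes such a job script; `write_smoke_job` and the job name `sge_smoke_test` are hypothetical, and the job must be submitted by a user in the NEWUSERS list configured above.

```shell
#!/bin/sh
# Hypothetical helper: write a minimal job script to the given path.
# Lines starting with "#$" are options read by qsub:
#   -N sets the job name, -cwd runs the job in the submission directory,
#   -j y merges stderr into stdout.
write_smoke_job() {
    cat > "$1" <<'EOF'
#!/bin/sh
#$ -N sge_smoke_test
#$ -cwd
#$ -j y
echo "Running on $(uname -n)"
EOF
}

# Example (as NEWUSER, with settings.sh sourced):
#   write_smoke_job smoke.sh
#   qsub smoke.sh
#   qstat -u NEWUSER
```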

Useful commands

Cluster overview

qstat -f

Overview of the jobs of a given user

qstat -u NEWUSER

Get job information

qstat -j JOB_ID

Get accounting information for a finished job

qacct -j JOB_ID

Cancel job

qdel JOB_ID

Hold a queued job

qhold JOB_ID

Release a job in hold state

qrls JOB_ID

Modify the resource requests of a queued job

# The job was submitted with cpu_slots=2; change it to 1
qalter -l 'cpu_slots=1,h_vmem=infinity,virtual_free=51200M' JOB_ID