Javier Jorge Cano 7c5336a9a5 Uploaded config files, binaries and scripts | 2 years ago | |
---|---|---|
LICENSE | 2 years ago | |
README.md | 2 years ago | |
install_sge.sh | 2 years ago |
Example SGE master node installation
Remove (or comment out) the 127.0.0.1 or 127.0.1.1 entry that points to the hostname in /etc/hosts:
127.0.0.1 localhost
#127.0.1.1 NODENAME
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
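The same edit can be scripted. The following is a minimal sketch, not part of this repo: `comment_out_hostname_loopback` is a hypothetical helper name, and it assumes `awk` and `mktemp` are available. It comments out any 127.0.0.1 or 127.0.1.1 line that lists the given hostname and leaves every other line untouched. Try it on a copy of the file first.

```shell
#!/bin/sh
# Hypothetical helper (not part of this repo): comment out loopback entries
# in a hosts file that list the given hostname.
# Usage: comment_out_hostname_loopback FILE HOSTNAME
# Caveat: if the hostname shares a line with "localhost", the whole line is
# commented out, so edit that case by hand instead.
comment_out_hostname_loopback() {
    hosts_file=$1
    node=$2
    tmp=$(mktemp) || return 1
    awk -v node="$node" '
        ($1 == "127.0.0.1" || $1 == "127.0.1.1") {
            # Look for the hostname among the names on this line.
            for (i = 2; i <= NF; i++) {
                if ($i == node) { print "#" $0; next }
            }
        }
        { print }
    ' "$hosts_file" > "$tmp" && cat "$tmp" > "$hosts_file"
    rm -f "$tmp"
}
```

For example, `sudo cp /etc/hosts /etc/hosts.bak` followed by running the function as root on `/etc/hosts` with the node's hostname. Lines that are already commented out are not matched again, so running it twice is harmless.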
Copy the contents of this repo to the host and, with sudo, run:
mkdir /opt/sge6-2/
cp ge62u5.tar.gz install_sge.sh /opt/sge6-2/
cd /opt/sge6-2/
chmod +x install_sge.sh
./install_sge.sh
The folder now has the following structure:
.
├── 3rd_party
├── bin
├── catman
├── ckpt
├── doc
├── dtrace
├── examples
├── ge6.2u5
├── ge62u5.tar.gz
├── include
├── install_execd
├── install_qmaster
├── install_sge.sh
├── inst_sge
├── lib
├── man
├── mpi
├── pvm
├── qmon
├── start_gui_installer
├── util
└── utilbin
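Before starting the installer, it can help to confirm that the extraction produced the entries it needs. This is a small sketch under assumptions: `check_sge_tree` is a hypothetical helper name, and the list of entries it checks is just a subset of the tree shown above, not an official requirement list.

```shell
#!/bin/sh
# Hypothetical helper: report which of the installer entries listed above
# are missing from the given directory. Prints nothing if all are present.
# Usage: check_sge_tree [DIRECTORY]
check_sge_tree() {
    root=${1:-/opt/sge6-2}
    missing=0
    for entry in install_qmaster install_execd inst_sge util utilbin; do
        if [ ! -e "$root/$entry" ]; then
            echo "missing: $root/$entry"
            missing=1
        fi
    done
    # Exit status 0 when everything was found, 1 otherwise.
    return "$missing"
}
```

Running `check_sge_tree /opt/sge6-2 && echo "tree looks complete"` prints the confirmation only when none of the checked entries is missing.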
Now we proceed with the SGE master installation:
cd /opt/sge6-2/
./install_qmaster
The sequence of questions and answers is the following:
Do you agree with that license? (y/n) [n] >> y
Hit <RETURN> to continue >>
Do you want to install Grid Engine
under an user id other than >root< (y/n) [y] >> n
Hit <RETURN> to continue >>
If this directory is not correct (e.g. it may contain an automounter
prefix) enter the correct path to this directory or hit <RETURN>
to use default [/opt/sge6-2] >> /opt/sge6-2
Hit <RETURN> to continue >>
(default: 2) >> 2
Hit <RETURN> to continue >>
(default: 2) >>
Hit <RETURN> to continue >>
Enter cell name [default] >> default
Enter new cluster name or hit <RETURN>
to use default [p6444] >> cluster_name
Hit <RETURN> to continue >>
Enter a qmaster spool directory [/opt/sge6-2/default/spool/qmaster] >>
Hit <RETURN> to continue >>
Are you going to install Windows Execution Hosts? (y/n) [n] >> n
Did you install this version with >pkgadd< or did you already verify
and set the file permissions of your distribution (enter: y) (y/n) [y] >> y
We do not verify file permissions. Hit <RETURN> to continue >>
Are all hosts of your cluster in a single DNS domain (y/n) [y] >> y
Hit <RETURN> to continue >>
Do you want to enable the JMX MBean server (y/n) [y] >> y
Enter JAVA_HOME (use "none" when none available) [] >> none
Please enter additional JVM arguments (optional, default is [-Xmx256m]) >> -Xmx256m
Please enter an unused port number for the JMX MBean server [6446] >> 6446
Enable JMX SSL server authentication (y/n) [y] >> y
Enable JMX SSL client authentication (y/n) [y] >> y
Enter JMX SSL server keystore path [/var/sgeCA/sge_qmaster/default/private/keystore] >> /var/sgeCA/sge_qmaster/default/private/keystore
Enter JMX SSL server keystore pw (at least 6 characters) >> ******
Using the following JMX MBean server settings.
libjvm_path >jvm_missing<
Additional JVM arguments >-Xmx256m<
JMX port >6446<
JMX ssl >true<
JMX client ssl >true<
JMX server keystore >/var/sgeCA/sge_qmaster/default/private/keystore<
JMX server keystore pw >******<
Do you want to use these data (y/n) [y] >> y
Hit <RETURN> to continue >>
Hit <RETURN> to continue >>
Please choose a spooling method (berkeleydb|classic) [berkeleydb] >> berkeleydb
Do you want to use a Berkeley DB Spooling Server? (y/n) [n] >> n
Hit <RETURN> to continue >>
Default: [/opt/sge6-2/default/spool/spooldb] >> /opt/sge6-2/default/spool/spooldb
Hit <RETURN> to continue >>
Please enter a range [20000-20100] >> 20000-20100
Using >20000-20100< as gid range. Hit <RETURN> to continue >>
Default: [/opt/sge6-2/default/spool] >> /opt/sge6-2/default/spool
Default: [none] >> none
Do you want to change the configuration parameters (y/n) [n] >> n
Hit <RETURN> to continue >>
The following errors are expected; the installation is still fine:
util/sgeCA/sge_ca: 1: eval: lx24-amd64=/opt/sge6-2/lib/lx24-amd64:-amd64: not found
util/sgeCA/sge_ca: 1749: export: lx24-amd64: bad variable name
Error: Cannot create keystore /var/sgeCA/sge_qmaster/default/private/keystore
util/sgeCA/sge_ca: 1: eval: lx24-amd64=/opt/sge6-2/lib/lx24-amd64:-amd64: not found
util/sgeCA/sge_ca: 1749: export: lx24-amd64: bad variable name
./inst_sge: 1204: cannot create /var/sgeCA/sge_qmaster/default/private/keystore.password: Directory nonexistent
chown: invalid user: 'default'
To use the cluster commands (qsub, qstat, etc.), some variables must be set in the environment. Add the following line to the .bashrc of each user that will use the cluster:
source /opt/sge6-2/default/common/settings.sh
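If several users need this line, appending it by hand gets repetitive. Below is a minimal sketch, assuming the settings file lives at the path shown above; `add_sge_settings` is a hypothetical helper name. It appends the line only when it is not already present, so it is safe to run more than once.

```shell
#!/bin/sh
# Hypothetical helper: add the SGE settings line to a shell startup file
# only if it is not already there.
# Usage: add_sge_settings [STARTUP_FILE] [SETTINGS_FILE]
add_sge_settings() {
    rc_file=${1:-"$HOME/.bashrc"}
    settings=${2:-/opt/sge6-2/default/common/settings.sh}
    line="source $settings"
    # Make sure the startup file exists before searching it.
    touch "$rc_file"
    # -x: match the whole line, -F: fixed string, -q: no output
    grep -qxF "$line" "$rc_file" || printf '%s\n' "$line" >> "$rc_file"
}
```

Called without arguments, it edits the current user's `~/.bashrc`; the change takes effect in new shells.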
To start the sgemaster process:
$:/opt/sge6-2/default/common/sgemaster
starting sge_qmaster
Then, qstat -f should not show any errors:
$ qstat -f
$
Next, add the host as an execution host:
qconf -ae
Fill in the hostname:
hostname NODENAME
load_scaling NONE
complex_values NONE
user_lists NONE
xuser_lists NONE
projects NONE
xprojects NONE
usage_scaling NONE
report_variables NONE
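`qconf -ae` opens an editor. For a scripted setup, qconf can also read the same definition from a file with `-Ae`. The sketch below only generates that file; `write_exec_host_conf` is a hypothetical helper name, and the field values are copied from the template above.

```shell
#!/bin/sh
# Hypothetical helper: print an execution host definition with the same
# fields as the template above.
# Usage: write_exec_host_conf HOSTNAME > FILE
write_exec_host_conf() {
    cat <<EOF
hostname              $1
load_scaling          NONE
complex_values        NONE
user_lists            NONE
xuser_lists           NONE
projects              NONE
xprojects             NONE
usage_scaling         NONE
report_variables      NONE
EOF
}

# Example (needs a running qmaster and admin rights):
#   write_exec_host_conf NODENAME > /tmp/exec_host.conf
#   qconf -Ae /tmp/exec_host.conf
```

Keeping the definition in a file also makes it easy to review or version the host configuration.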
Now create the host list:
qconf -ahgrp
Add the host list @allhosts with this host:
group_name @allhosts
hostlist NODENAME
Add a new queue:
qconf -aq
Adjust the values as needed:
qname all.q
hostlist @allhosts
seq_no 0
load_thresholds np_load_avg=1.75
suspend_thresholds NONE
nsuspend 1
suspend_interval 00:05:00
priority 19
min_cpu_interval 00:05:00
processors UNDEFINED
qtype BATCH INTERACTIVE
ckpt_list NONE
pe_list make
rerun FALSE
slots 1,[NODENAME=24]
tmpdir /tmp
shell /bin/bash
prolog NONE
epilog NONE
shell_start_mode posix_compliant
starter_method NONE
suspend_method NONE
resume_method NONE
terminate_method NONE
notify 00:00:60
owner_list NONE
user_lists NONE
xuser_lists NONE
subordinate_list NONE
complex_values NONE
projects NONE
xprojects NONE
calendar NONE
initial_state default
s_rt INFINITY
h_rt INFINITY
s_cpu INFINITY
h_cpu INFINITY
s_fsize INFINITY
h_fsize INFINITY
s_data INFINITY
h_data INFINITY
s_stack INFINITY
h_stack INFINITY
s_core INFINITY
h_core INFINITY
s_rss INFINITY
h_rss INFINITY
s_vmem INFINITY
h_vmem INFINITY
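The `slots` entry above sets a default of 1 slot and overrides it to 24 for NODENAME. As a sketch of how that value could be built for another host, the hypothetical helper `slots_value` below formats the entry; falling back to the number of online processors when no count is given is an assumption made here, not something the original setup prescribes.

```shell
#!/bin/sh
# Hypothetical helper: build the "slots" value for a queue definition,
# using a default of 1 and a per-host override.
# Usage: slots_value HOSTNAME [SLOT_COUNT]
slots_value() {
    host=$1
    # Assumption: use the number of online processors if no count is given.
    count=${2:-$(getconf _NPROCESSORS_ONLN)}
    printf '1,[%s=%s]\n' "$host" "$count"
}
```

`slots_value NODENAME 24` prints `1,[NODENAME=24]`, which can be pasted into the `slots` line of the queue definition.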
Add a user, in this case NEWUSER:
qconf -auser
name NEWUSER
oticket 0
fshare 0
delete_time 0
default_project NONE
Add a new user list, NEWUSERS, containing NEWUSER:
qconf -au NEWUSER NEWUSERS
We should modify the queue to include this user list:
qconf -mq all.q
...
notify 00:00:60
owner_list NONE
user_lists NEWUSERS
xuser_lists NONE
subordinate_list NONE
...
Launch the execution daemon (sgeexecd):
$:/opt/sge6-2/default/common# ./sgeexecd
starting sge_execd
If the queue ends up in an unexpected state (for example, an error state), you can clear it with:
qmod -c all.q@NODENAME
We can restart the services, just in case:
$: cd /opt/sge6-2/default/common
$: ./sgeexecd stop
Shutting down Grid Engine execution daemon
$: cd /opt/sge6-2/default/common
$: ./sgemaster stop
shutting down Grid Engine qmaster
$: cd /opt/sge6-2/default/common
$: ./sgemaster start
starting sge_qmaster
$: cd /opt/sge6-2/default/common
$: ./sgeexecd start
starting sge_execd
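The restart sequence above can be wrapped in one function. This is a minimal sketch: `restart_sge` and the `SGE_COMMON` variable are hypothetical names introduced here, and the default path is the one used throughout this guide.

```shell
#!/bin/sh
# Hypothetical helper: restart both daemons in the order used above
# (execd stop, qmaster stop, qmaster start, execd start).
# SGE_COMMON is an assumed variable name for the directory holding the scripts.
restart_sge() {
    common=${SGE_COMMON:-/opt/sge6-2/default/common}
    "$common/sgeexecd" stop
    "$common/sgemaster" stop
    # Only start the execution daemon if the qmaster came up.
    "$common/sgemaster" start &&
    "$common/sgeexecd" start
}
```

Run it as the same user that started the daemons; a non-zero exit status means one of the start steps failed.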
Add the host to the submit host list:
qconf -as NODENAME
NODENAME added to submit host list
Cluster overview
qstat -f
Overview of the jobs of a given user
qstat -u NEWUSER
Get job information
qstat -j JOB_ID
Get information about a finished job
qacct -j JOB_ID
Cancel job
qdel JOB_ID
Hold a queued job
qhold JOB_ID
Release a job in hold state
qrls JOB_ID
Modify the resource requests of a job in the queue
# The job was submitted with cpu_slots=2; change it to 1
qalter -l 'cpu_slots=1,h_vmem=infinity,virtual_free=51200M' JOB_ID
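To try the commands above, a job has to be in the queue first. The following is a minimal sketch of a test job for the all.q queue defined earlier; `write_test_job` and the job name `test_job` are hypothetical. Lines starting with `#$` are embedded qsub options: `-N` sets the job name, `-q` the queue, `-cwd` runs the job in the submission directory, `-j y` merges stderr into stdout and `-S` selects the shell.

```shell
#!/bin/sh
# Hypothetical helper: print a minimal job script for the all.q queue.
# Lines starting with "#$" are read by qsub as embedded options.
write_test_job() {
    cat <<'EOF'
#!/bin/bash
#$ -N test_job
#$ -q all.q
#$ -cwd
#$ -j y
#$ -S /bin/bash
echo "Running on $(hostname)"
EOF
}

# Example (needs the running cluster configured above):
#   write_test_job > test_job.sh
#   qsub test_job.sh
#   qstat -u "$USER"
```

Once the job finishes, `qacct -j JOB_ID` should show its accounting record.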