Install Slurm on Gentoo OS

High Performance Computing

Posted by JXLIU on May 19, 2022

Installing Slurm on Gentoo

Install Slurm

On every node, install sys-cluster/slurm with emerge. emerge automatically pulls in munge and creates the munge and slurm users with fixed UIDs and GIDs.

emerge -av sys-cluster/slurm

Configure and test munged

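Start the munged service on every node and enable it at boot, then copy /etc/munge/munge.key from the management node to the same location, /etc/munge, on each of the other nodes.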

rc-service munged start
rc-update add munged default
rsync -auvz --delete /etc/munge/munge.key io01:/etc/munge
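
With more than one additional node, the rsync step can be wrapped in a loop. The following is only a sketch: push_munge_key and the DRY_RUN variable are names made up for illustration, and it assumes root SSH access from the management node to every node listed.

```shell
# Hypothetical helper: copy the munge key to every node passed as an argument.
# Set DRY_RUN to any non-empty value to print the commands instead of running them.
push_munge_key() {
    for node in "$@"; do
        if [ -n "${DRY_RUN:-}" ]; then
            printf 'rsync -a /etc/munge/munge.key %s:/etc/munge/\n' "$node"
        else
            # -a preserves the ownership and permissions of the key file
            rsync -a /etc/munge/munge.key "${node}:/etc/munge/" || return 1
        fi
    done
}
```

For example, DRY_RUN=1 push_munge_key io01 cu201 prints the two rsync commands without touching the nodes. If munged was already running on a node with a different key, it most likely needs a restart (rc-service munged restart) before it picks up the new one.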

To check that munged is configured correctly, run the following commands on the management node:

munge -n | unmunge  # test the management node
munge -n | ssh cu201 unmunge  # test another node (cu201)

The output should look similar to the following:

STATUS:           Success (0)
ENCODE_HOST:      cu204 (127.0.0.1)
ENCODE_TIME:      2022-05-19 00:54:32 +0800 (1652892872)
DECODE_TIME:      2022-05-19 00:54:32 +0800 (1652892872)
TTL:              300
CIPHER:           aes128 (4)
MAC:              sha256 (5)
ZIP:              none (0)
UID:              root (0)
GID:              root (0)
LENGTH:           0
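
Reading this output by eye does not scale to many nodes, so the check can be scripted. This is a minimal sketch: munge_ok and check_munge_nodes are hypothetical helper names, and the second function assumes SSH access to each node.

```shell
# Hypothetical helper: read unmunge output on stdin and succeed only if
# the STATUS line reports "Success (0)".
munge_ok() {
    grep -Eq '^STATUS:[[:space:]]+Success \(0\)'
}

# Hypothetical wrapper: encode a credential locally, decode it on each
# node given as an argument, and report the result per node.
check_munge_nodes() {
    for node in "$@"; do
        if munge -n | ssh "$node" unmunge | munge_ok; then
            echo "$node: ok"
        else
            echo "$node: FAILED"
        fi
    done
}
```

Calling check_munge_nodes cu201 cu202 would then print one line per node.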

Configure slurm.conf

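Use the web-based configurator shipped with the package, /usr/share/doc/slurm-20.11.0.1-r104/html/configurator.html, to generate slurm.conf: fill in the form according to your actual setup, then copy the result to /etc/slurm/slurm.conf. The bundled /etc/slurm/slurm.conf.example can serve as a reference while filling it in.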

If slurm.conf sets ProctrackType=proctrack/cgroup, you also need to create /etc/slurm/cgroup.conf; the default one is sufficient.

I then ran into the following error:

heq /var/log/slurm # sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
regular*     up   infinite      4  drain cu[201-204]

Looking further shows:

heq /var/log/slurm # sinfo -R
REASON               USER      TIMESTAMP           NODELIST
Low RealMemory       root      2022-05-18T17:20:20 cu201
Low RealMemory       root      2022-05-18T17:22:33 cu202
Low RealMemory       root      2022-05-18T17:22:44 cu203
Low RealMemory       root      2022-05-18T17:22:54 cu204
heq /var/log/slurm # scontrol show node
NodeName=cu201 Arch=x86_64 CoresPerSocket=24
   CPUAlloc=0 CPUTot=48 CPULoad=0.00
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=(null)
   NodeAddr=cu201 NodeHostName=cu201 Version=20.11.0
   OS=Linux 5.15.32-gentoo-r1-x86_64 #1 SMP Tue May 10 16:55:58 CST 2022
   RealMemory=186000 AllocMem=0 FreeMem=186649 Sockets=2 Boards=1
   State=IDLE+DRAIN ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=regular
   BootTime=2022-05-10T20:04:13 SlurmdStartTime=2022-05-18T19:48:08
   CfgTRES=cpu=48,mem=186000M,billing=48
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   Reason=Low RealMemory [root@2022-05-18T17:20:20]
   Comment=(null)
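
The Reason field points at RealMemory, so it can help to compare the configured value with what the node itself detects. slurmd -C prints the hardware configuration that slurmd finds on the machine where it runs, including a RealMemory= field. The sketch below uses hypothetical helper names (real_memory, check_real_memory) and assumes SSH access to the node and slurmd in its PATH.

```shell
# Hypothetical helper: print the first RealMemory=<n> value found on stdin.
real_memory() {
    awk '!found && match($0, /RealMemory=[0-9]+/) {
        # "RealMemory=" is 11 characters long; print only the digits
        print substr($0, RSTART + 11, RLENGTH - 11)
        found = 1
    }'
}

# Hypothetical wrapper: warn if the configured value exceeds the detected one.
check_real_memory() {
    node=$1
    configured=$(scontrol show node "$node" | real_memory)
    detected=$(ssh "$node" slurmd -C | real_memory)
    if [ -z "$configured" ] || [ -z "$detected" ]; then
        echo "$node: could not read RealMemory" >&2
        return 1
    fi
    if [ "$configured" -gt "$detected" ]; then
        echo "$node: configured RealMemory=$configured exceeds detected $detected"
    fi
}
```

Running check_real_memory cu201 prints a warning only when the configured value is the larger of the two.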

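The cause is that the RealMemory value set in slurm.conf is larger than the memory actually available on the compute nodes. Lower the RealMemory value in slurm.conf and apply the change. To apply it, run the following commands: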

scontrol reconfigure
## Update the compute node state
scontrol update nodename=cu201 state=resume
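
The resume command above handles a single node. When several nodes are drained for the same reason, the node names can be taken from sinfo -R instead of being typed one by one. This is a sketch under the assumption that sinfo -R prints the default column layout shown earlier; low_mem_nodes and resume_low_mem_nodes are hypothetical names.

```shell
# Hypothetical helper: read `sinfo -R` output on stdin and print the
# NODELIST column of every row whose reason is "Low RealMemory".
low_mem_nodes() {
    awk 'NR > 1 && $1 == "Low" && $2 == "RealMemory" { print $NF }'
}

# Hypothetical wrapper: resume every node (or node range) found that way.
resume_low_mem_nodes() {
    sinfo -R | low_mem_nodes | while read -r node; do
        scontrol update nodename="$node" state=resume
    done
}
```

Run scontrol reconfigure first, then resume_low_mem_nodes; nodes drained for any other reason are left alone.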

References

SouthGreen: https://southgreenplatform.github.io/trainings/hpc/slurminstallation/#part-1
Official Slurm tutorial: https://slurm.schedmd.com/quickstart_admin.html
slurm dev: https://slurm-dev.schedmd.narkive.com/lnoGic15/unknown
stackoverflow: https://stackoverflow.com/questions/68132982/slurm-says-drained-low-realmemory
