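After Installing Slurm, Part 5 (Continued): slurmstepd: error

Introduction

In the previous entry, I tried running a container through sbatch using a Container Bundle.

The error below showed up at that time. I have since identified the cause, so I am recording it here.

The situation was as follows.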

[john@master ~]$ sbatch --container /mnt/share/john/centos_image --wrap 'grep ^NAME /etc/os-release'
Submitted batch job 779

[john@node1 ~]$ cat slurm-779.out
NAME="CentOS Linux"
slurmstepd: error: _try_parse: JSON parsing error 71 bytes: boolean expected

The job finished and printed the expected result, but the output also contained slurmstepd: error: _try_parse: JSON parsing error 71 bytes: boolean expected.

Previous entry

taqqu.hatenablog.com

About the error

slurmd.log

Here is an excerpt from the slurmd log, /etc/slurm/slurmd.log, on the node where the job ran (Node1).

According to the log, Node1 wrote the following when the job ran.

[2022-11-23T08:49:26.728] debug2: Finish processing RPC: REQUEST_TERMINATE_JOB
[2022-11-23T08:49:26.747] [871.batch] debug:  _get_container_state: RunTimeQuery rc:256 output:time="2022-11-23T08:49:26Z" level=error msg="container does not exist"
[2022-11-23T08:49:26.748] [871.batch] error: _try_parse: JSON parsing error 71 bytes: boolean expected
[2022-11-23T08:49:26.748] [871.batch] debug:  container already dead

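The job output slurm-779.out shows "...JSON parsing error..." at error level. In slurmd.log (the excerpt is from step 871.batch rather than job 779, but it shows the same error), the line just before it is a debug-level line containing runc's message msg="container does not exist", and the last line of the excerpt is another debug-level message, "container already dead".

* Set SlurmdDebug in /etc/slurm/slurm.conf to debug or higher so that these messages are logged.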
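When the log gets long, it helps to pull out only the error-level lines together with their neighbours from the same job step. The following is a minimal sketch in Python, assuming the log lines have the shape seen in the excerpt above ([timestamp] [step] level: message). The names LINE_RE and context_for_errors are my own and are not part of Slurm.

```python
import re

# Sample lines copied from the slurmd.log excerpt above.
SAMPLE_LOG = """\
[2022-11-23T08:49:26.728] debug2: Finish processing RPC: REQUEST_TERMINATE_JOB
[2022-11-23T08:49:26.747] [871.batch] debug:  _get_container_state: RunTimeQuery rc:256 output:time="2022-11-23T08:49:26Z" level=error msg="container does not exist"
[2022-11-23T08:49:26.748] [871.batch] error: _try_parse: JSON parsing error 71 bytes: boolean expected
[2022-11-23T08:49:26.748] [871.batch] debug:  container already dead
"""

# Assumed line shape: [timestamp] [step] level: message (the [step] part is optional).
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] (?:\[(?P<step>[^\]]+)\] )?(?P<level>\w+): +(?P<msg>.*)$"
)


def context_for_errors(log_text, before=1, after=1):
    """For each error-level line, return the nearby lines that belong to the same step."""
    parsed = [m.groupdict() for m in map(LINE_RE.match, log_text.splitlines()) if m]
    results = []
    for i, entry in enumerate(parsed):
        if entry["level"] != "error":
            continue
        window = parsed[max(0, i - before): i + after + 1]
        results.append([e for e in window if e["step"] == entry["step"]])
    return results


for group in context_for_errors(SAMPLE_LOG):
    for e in group:
        print(f'{e["step"]} {e["level"]:<6} {e["msg"]}')
```

For the sample above this prints the three 871.batch lines, that is, the error together with the debug lines immediately before and after it.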

Kill Container

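I looked for where this "container already dead" message comes from in the source code. The relevant place is in src/slurmd/slurmstepd/container.c: _get_container_state(), which is called from _kill_container().

This code appears to be what implements RunTimeKill, which I had set in /etc/slurm/oci.conf.

(Reposted from the previous entry)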

ContainerPath=/home/john/local_image
CreateEnvFile=False
RunTimeQuery="runc --rootless=true --root=/tmp/ state %n.%u.%j.%s.%t"
RunTimeKill="runc --rootless=true --root=/tmp/ kill -a %n.%u.%j.%s.%t"
RunTimeDelete="runc --rootless=true --root=/tmp/ delete --force %n.%u.%j.%s.%t"
RunTimeRun="runc --rootless=true --root=/tmp/ run %n.%u.%j.%s.%t -b %b"

What this tells us is quite simple:

Slurm tried to stop the container (runc...kill...), but the container had already stopped (container already dead).

That is all there is to it.
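As a side note, the wording of the error itself seems consistent with this reading. The RunTimeQuery reply recorded in slurmd.log is runc's plain-text error rather than a JSON state document, so any JSON parser would reject it. The sketch below illustrates this with Python's json module, not with the parser Slurm actually uses. The trailing newline on the reply is my assumption (the log does not show it), and the idea that "boolean expected" comes from the text starting with "t", like true, is my guess and not something I confirmed in the Slurm source.

```python
import json

# Text that runc returned for the state query, copied from the slurmd.log excerpt.
# The trailing newline is an assumption; the log line does not show it.
RUNC_OUTPUT = 'time="2022-11-23T08:49:26Z" level=error msg="container does not exist"\n'


def parse_state(output):
    """Try to read a container state reply as JSON; return None when it is not JSON."""
    try:
        return json.loads(output)
    except json.JSONDecodeError:
        return None


print(len(RUNC_OUTPUT.encode()))             # 71: size of the reply in bytes, newline included
print(parse_state(RUNC_OUTPUT))              # None: not JSON, so no state can be extracted
print(parse_state('{"status": "running"}'))  # {'status': 'running'}: a JSON reply parses normally
```

With that assumed newline, the reply is exactly 71 bytes, which matches the "71 bytes" in the error message.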

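The sample Container bundle is created and started with runc ...run... inside the Slurm job, but it is a container that exits right away. It was originally made by pulling the Docker image centos:latest and turning it into a Container bundle, and all it does is run grep ^NAME /etc/os-release.

Reproducing by hand with the runc command what happened under Slurm looks roughly like this.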

[john@master john]$ runc run test --bundle ./centos_image
NAME="CentOS Linux"
[john@master john]$ runc list
ID          PID         STATUS      BUNDLE      CREATED     OWNER
[john@master john]$ runc kill test
ERRO[0000] container does not exist

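Running runc list after runc run shows no container information because the container has already terminated, and runc kill then naturally prints an error message. The container does not exist text also matches what appeared in /etc/slurm/slurmd.log earlier.

...and that was the cause.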

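Whether you run into this kind of issue depends partly on how the container you turn into a Container bundle is built, for example:

Is it a container that exits immediately?
Is it a container that keeps running?
How are the RunTime commands specified in oci.conf?

It seems wise to check these settings against the Container Runtime you actually use with Slurm.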
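The checklist above can be turned into a quick review of oci.conf. This is a minimal sketch, assuming a runc-style configuration like the one reposted earlier and a simple Key=Value file layout. The functions parse_oci_conf and review are hypothetical helpers of mine, not Slurm tools.

```python
import shlex

# The oci.conf shown earlier in this entry.
OCI_CONF = '''\
ContainerPath=/home/john/local_image
CreateEnvFile=False
RunTimeQuery="runc --rootless=true --root=/tmp/ state %n.%u.%j.%s.%t"
RunTimeKill="runc --rootless=true --root=/tmp/ kill -a %n.%u.%j.%s.%t"
RunTimeDelete="runc --rootless=true --root=/tmp/ delete --force %n.%u.%j.%s.%t"
RunTimeRun="runc --rootless=true --root=/tmp/ run %n.%u.%j.%s.%t -b %b"
'''


def parse_oci_conf(text):
    """Parse simple Key=Value lines, ignoring blank lines and # comments."""
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        conf[key.strip()] = value.strip().strip('"')
    return conf


def review(conf):
    """Summarise the points from the checklist for a runc-style configuration."""
    run_args = shlex.split(conf.get("RunTimeRun", ""))
    return {
        "kills_after_job": "RunTimeKill" in conf,
        "queries_state": "RunTimeQuery" in conf,
        "run_is_detached": "-d" in run_args or "--detach" in run_args,
    }


print(review(parse_oci_conf(OCI_CONF)))
```

For the configuration above, this reports that a kill command and a state query are configured while the run command is not detached, which is the combination that produced the error with a container that exits immediately.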

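Trying it out

Now let's confirm that the error goes away.

Changing oci.conf

I wanted to keep this simple and reuse the sample as it is, so I add the detach option (-d) to the runc run command as shown below. This leaves the container running in the background.

/etc/slurm/oci.conf (the first line is the setting before the change, the second line is the setting after it)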

RunTimeRun="runc --rootless=true --root=/tmp/ run %n.%u.%j.%s.%t -b %b"

RunTimeRun="runc --rootless=true --root=/tmp/ run %n.%u.%j.%s.%t -d -b %b"
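If you maintain several runc-based RunTimeRun settings, the same edit can be scripted. The following is a minimal sketch, assuming the command has the form "... run <container id> ..." as in this entry. add_detach is a hypothetical helper of mine, and it rejoins the arguments with single spaces, so it is only suitable for commands without quoted arguments.

```python
import shlex


def add_detach(run_command):
    """Return the RunTimeRun command with -d added right after the container id."""
    args = shlex.split(run_command)
    if "-d" in args or "--detach" in args:
        return run_command            # already detached, leave it untouched
    run_index = args.index("run")     # raises ValueError if this is not a `run` command
    # args[run_index + 1] is the container id pattern (%n.%u.%j.%s.%t here).
    args.insert(run_index + 2, "-d")
    return " ".join(args)


before = "runc --rootless=true --root=/tmp/ run %n.%u.%j.%s.%t -b %b"
print(add_detach(before))
```

Applied to the original setting, this yields the same command as the changed line above, and applying it a second time leaves the command unchanged.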

Applying the change

Once oci.conf has been edited, distribute it to the nodes (example for Node1).

scp -p /etc/slurm/oci.conf node1:/etc/slurm

Don't forget to restart slurmd and slurmctld. * This time only Node1 was involved.

sudo systemctl restart slurmd

sudo systemctl restart slurmctld

Running the job

I submit the same sbatch as in the previous entry.

[john@master ~]$ sbatch --container /mnt/share/john/centos_image --wrap 'grep ^NAME /etc/os-release'
Submitted batch job 875

The job appears to have reached COMPLETED.

[slurm@master ~]$ sacct --allusers --job 875
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
875                wrap partition+  chemistry          1  COMPLETED      0:0

Next, I check the job output written on Node1.

[john@node1 ~]$ cat slurm-875.out 
NAME="CentOS Linux"
[john@node1 ~]$

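→ Only the expected output is there. This confirms that slurmstepd: error: _try_parse: JSON parsing error 71 bytes: boolean expected no longer appears.

So that resolves the error from the previous entry.

Closing

When I first saw only "slurmstepd: error: _try_parse: JSON parsing error 71 bytes: boolean expected", I was baffled.

If you go through the error logs calmly, though, it may not be such a hard problem after all.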
