The monitor account is used for automated gathering of performance logs on the master, the compute nodes, and the scheduler (the queue). The logs, which are available to all cluster users, are stored under ~monitor/logs/.
The ~monitor/cron/master/c4-status script is run regularly and logs to ~monitor/logs/<YYYYMMDD>/master/c4-status.log. For instance, assume that node n13 went down recently. Then you can scan these log files for when it was first reported down, e.g.
$ grep -B 5 -E ".*13.*[ ]+down" ~monitor/logs/20160829/master/c4-status.log
[...]
TIMESTAMP: 20160829-09:03:01
# Cluster status
## Nodes
Node(s) Status Mode User Group
13,19,24-25 down ---------- root root
--
TIMESTAMP: 20160829-18:33:01
# Cluster status
## Nodes
Node(s) Status Mode User Group
13,19,24-25 down ---------- root root
--
TIMESTAMP: 20160829-19:03:02
# Cluster status
## Nodes
Node(s) Status Mode User Group
13,19,24-25 down ---------- root root
--
[...]
This tells us that n13 was “down” on 2016-08-29 at 09:03, went down again at 18:33 the same day, and remained down from then on.
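Note that the grep pattern above also matches any line that merely contains "13" (e.g. node 113) and misses a node that is hidden inside a range such as 10-15. The following is a minimal sketch of a stricter lookup, assuming the log layout shown above: a "TIMESTAMP: ..." line followed by rows whose first field is a node list and whose second field is the status. The function name first_down is hypothetical and not part of the monitor scripts.

```shell
# first_down NODE_NUMBER LOGFILE   (hypothetical helper, not part of ~monitor)
# Print the first TIMESTAMP whose status block lists NODE_NUMBER as "down".
# Assumes node lists look like "13,19,24-25" (single numbers and ranges).
first_down() {
  awk -v node="$1" '
    # Remember the timestamp of the block currently being read
    $1 == "TIMESTAMP:" { ts = $2; next }
    # Rows with status "down": expand the node list and look for our node
    $2 == "down" {
      n = split($1, parts, ",")
      for (i = 1; i <= n; i++) {
        m = split(parts[i], r, "-")
        lo = r[1] + 0
        hi = (m == 2) ? r[2] + 0 : lo
        if (node + 0 >= lo && node + 0 <= hi) { print ts; exit }
      }
    }
  ' "$2"
}
```

For example, first_down 13 ~monitor/logs/20160829/master/c4-status.log would print the timestamp of the first block in that day's log that reports node 13 as down, or nothing if there is no such block.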
The ~monitor/cron/queue/qstat script runs qstat -t -n -1 regularly and logs to ~monitor/logs/<YYYYMMDD>/queue/qstat.log.
Say node n13 went down around 18:30 on 2016-08-29 (example above). Then we can extract everything in that day's log from the first entry in the 18:00-18:59 hour onward using:
$ grep -A 10000 -E "20160829_1[8]" ~monitor/logs/20160829/queue/qstat.log > /tmp/qstat.log
and then search for what jobs were running on n13 during this time:
$ grep -F n13 /tmp/qstat.log
Looking at this output, we can see which jobs were running, and by whom, before and at the time the node went down. There is also some information on some of the qsub parameters used, e.g.
                                                                                   Req'd     Req'd        Elap
Job ID                  Username    Queue    Jobname          SessID   NDS    TSK Memory      Time S      Time
----------------------- ----------- -------- ---------------- ------ ----- ------ ------ --------- - ---------
[...]
507988.cclc01.som.ucsf  johndoe     batch    rna.rsem_count   276528     1     12    1gb  72:00:00 R  15:11:15
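Because the log contains many consecutive qstat snapshots, the same job shows up repeatedly in the grep output. The sketch below condenses it to one line per job and user. It assumes that each job line starts with a numeric job ID followed by the username, as in the listing above, and that the node list sits on the same line (which is what the -n -1 options are meant to produce). The name jobs_on_node is hypothetical.

```shell
# jobs_on_node NODE QSTAT_LOG   (hypothetical helper, not part of ~monitor)
# Print "job-id username" once for every job whose qstat line mentions NODE.
jobs_on_node() {
  # -w matches NODE as a whole word, so "n13" does not also match "n130";
  # -F treats NODE as a fixed string rather than a regular expression.
  grep -w -F -- "$1" "$2" |
    awk '$1 ~ /^[0-9]+/ { print $1, $2 }' |   # keep job lines only
    sort -u                                   # collapse repeated snapshots
}
```

For example, jobs_on_node n13 /tmp/qstat.log lists each job that was seen on n13 in the extracted log, together with its owner.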