Idea #8102

[Tools] Monitor running jobs and report when they're running poorly

Added by Brett Smith about 10 years ago. Updated about 6 years ago.

Status: Closed
Priority: Normal
Assigned To: -
Category: -
Target version: -
Start date: 01/05/2016
Due date:
Story points: -

Description

"Poorly" might mean they've gone unresponsive, they're not making effective use of their compute resources, etc.

This might be a big story that needs to get split up into multiple implementation stories.

Actions #1

Updated by Bryan Cosca about 10 years ago

A couple ideas:

If the job tempfailed, send an email to the person who ran the job saying something like "Hey, your job temp-failed because of [insert log here]." That way, I would know whether I need to cancel the job or let it keep running.

If the job is taking significantly longer than previous attempts, that should be reported as well. For example, wx7k5-8i9sb-nwzha570z5r07zf took 10.5 hours, and wx7k5-8i9sb-t6d3a3c8d98xwgt has been running for 20+ hours so far.

The only difference between the two jobs is a small one-line change to the output file name string.

bcosc2@shell:~/hartwig$ git diff 97d0..d6d4
diff --git a/crunch_scripts/sambambamerge.py b/crunch_scripts/sambambamerge.py
index e8605d7..feacc12 100755
--- a/crunch_scripts/sambambamerge.py
+++ b/crunch_scripts/sambambamerge.py
@@ -31,7 +31,7 @@ sambamba_path = "/sambamba_v0.5.8" 
 # 16 cores 60 GB RAM

 outdir = arvados.crunch.TaskOutputDir()
-merge_out = os.path.join(outdir.path, job_key+'_merge.bam')
+merge_out = os.path.join(outdir.path, job_key+'.merge.bam')

 sam_args = [sambamba_path, 'merge', '-t', '16', merge_out]
 #sam_args = [samtools_path,'merge','-@','16',os.path.join(tmpdir,'NA12878D_HiSeqX_R1.fastq.sam')]
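
The runtime comparison suggested above (10.5 hours for the earlier job versus 20+ hours for a nearly identical one) could be sketched roughly like this. This is only an illustration: the function name `is_running_long` and the 1.5x threshold are assumptions made for the example, not anything Arvados provides, and the list of previous runtimes would have to come from wherever job history is stored.

```python
from statistics import median


def is_running_long(elapsed_hours, previous_hours, factor=1.5):
    """Return True if a job has run longer than `factor` times the
    median runtime of earlier attempts.

    `factor` is an arbitrary illustrative threshold, not a recommended value.
    """
    if not previous_hours:
        # No earlier attempts to compare against, so nothing to report.
        return False
    return elapsed_hours > factor * median(previous_hours)


# The two jobs from this comment: one earlier run of 10.5 hours,
# and a current run that has passed 20 hours.
print(is_running_long(20.0, [10.5]))
```

Using the median rather than the mean keeps one unusually slow earlier attempt from masking a regression.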

A couple of other things to report: Is the job using swap? Is it leaving some of the available cores idle? (These would tell the user whether to downsize or upgrade their node specifications.)
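
As a rough sketch of how those two checks might turn into user-facing hints: the function below takes measurements that a monitor would already have collected (how they are collected is out of scope here). The name `resource_warnings`, its parameters, and the 50% CPU threshold are all hypothetical choices for this example.

```python
def resource_warnings(cpu_busy_fraction, cores_allocated, swap_used_bytes,
                      low_cpu=0.5):
    """Return a list of human-readable hints about node sizing.

    cpu_busy_fraction: average fraction (0.0 to 1.0) of allocated CPU in use.
    low_cpu: illustrative threshold below which the node looks oversized.
    """
    warnings = []
    if swap_used_bytes > 0:
        # Any swap use suggests the job does not fit in RAM.
        warnings.append("job is using swap; consider a node with more RAM")
    if cpu_busy_fraction < low_cpu:
        cores_used = cpu_busy_fraction * cores_allocated
        warnings.append(
            f"only ~{cores_used:.1f} of {cores_allocated} cores busy; "
            "consider a smaller node"
        )
    return warnings


# Example: a 16-core node averaging 25% CPU with no swap in use.
for hint in resource_warnings(0.25, 16, 0):
    print(hint)
```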

Actions #2

Updated by Peter Amstutz about 6 years ago

  • Status changed from New to Closed