Idea #8102: [Tools] Monitor running jobs and report when they're running poorly - Arvados

Added by Brett Smith about 10 years ago. Updated about 6 years ago.

Status: Closed

Priority: Normal

Start date: 01/05/2016

Description

"Poorly" might mean they've gone unresponsive, they're not making effective use of their compute resources, etc.

This might be a big story that needs to get split up into multiple implementation stories.

Updated by Bryan Cosca about 10 years ago

A couple ideas:

If the job tempfailed, send an email to the person who ran it saying something like "Hey, your job temp-failed because of [insert log here]." That way, I would know whether I need to cancel the job or let it keep running.

If the job is taking significantly longer than previous attempts, that should be reported as well. For example, wx7k5-8i9sb-nwzha570z5r07zf took 10.5 hours, and wx7k5-8i9sb-t6d3a3c8d98xwgt is currently taking 20+ hours.

The difference between the two jobs is a small one-line change to the output file name string:

bcosc2@shell:~/hartwig$ git diff 97d0..d6d4
diff --git a/crunch_scripts/sambambamerge.py b/crunch_scripts/sambambamerge.py
index e8605d7..feacc12 100755
--- a/crunch_scripts/sambambamerge.py
+++ b/crunch_scripts/sambambamerge.py
@@ -31,7 +31,7 @@ sambamba_path = "/sambamba_v0.5.8" 
 # 16 cores 60 GB RAM

 outdir = arvados.crunch.TaskOutputDir()
-merge_out = os.path.join(outdir.path, job_key+'_merge.bam')
+merge_out = os.path.join(outdir.path, job_key+'.merge.bam')

 sam_args = [sambamba_path, 'merge', '-t', '16', merge_out]
 #sam_args = [samtools_path,'merge','-@','16',os.path.join(tmpdir,'NA12878D_HiSeqX_R1.fastq.sam')]
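
The runtime comparison suggested above could be sketched roughly as follows. This is only an illustration, not an existing Arvados feature: the function name `is_running_long`, the `factor` threshold, and the idea of passing runtimes in as plain hour values are all assumptions, and fetching the runtimes of previous attempts is left out entirely.

```python
from statistics import median


def is_running_long(elapsed_hours, previous_hours, factor=1.5):
    """Return True when a running job has exceeded `factor` times the
    median runtime of its previous attempts.

    All names here are hypothetical; how the previous runtimes are
    looked up is outside the scope of this sketch.
    """
    if not previous_hours:
        # No history to compare against, so never flag the job.
        return False
    return elapsed_hours > factor * median(previous_hours)


# The two jobs from this comment: 10.5 hours before, 20+ hours now.
print(is_running_long(20.0, [10.5]))
```

Using the median rather than the mean keeps one unusually slow earlier attempt from hiding a regression, but the choice of statistic and threshold would need tuning against real job history.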

A couple of other things to report: are we using swap? Are we failing to use all of the available cores? Either answer would tell the user whether to downsize or upgrade their node specifications.
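
A minimal sketch of those two checks, assuming the monitor already has periodic samples of busy cores and swap usage for the job. The function `resource_report`, its parameters, and the `low_cpu` threshold are hypothetical, and the sample format is made up for illustration rather than taken from any Arvados log format.

```python
def resource_report(cpu_busy_cores, allocated_cores, swap_bytes, low_cpu=0.5):
    """Return a list of warnings about how a job used its node.

    cpu_busy_cores: samples of how many cores were busy (hypothetical input)
    allocated_cores: cores on the node, e.g. 16 for the job above
    swap_bytes: samples of swap in use, in bytes (hypothetical input)
    low_cpu: utilization fraction below which the job is flagged (assumed)
    """
    warnings = []
    if cpu_busy_cores:
        mean_busy = sum(cpu_busy_cores) / len(cpu_busy_cores)
        utilization = mean_busy / allocated_cores
        if utilization < low_cpu:
            warnings.append(
                "average CPU utilization %.0f%% of %d cores; a smaller node may suffice"
                % (utilization * 100, allocated_cores)
            )
    if any(b > 0 for b in swap_bytes):
        warnings.append("swap was used; a node with more RAM may help")
    return warnings


# A 16-core job that only kept about 4 cores busy and never swapped.
for line in resource_report([4.0, 4.0], 16, [0, 0]):
    print(line)
```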

Updated by Peter Amstutz about 6 years ago

Status changed from New to Closed
