Idea #8102
[Tools] Monitor running jobs and report when they're running poorly
Status: closed
Description
"Poorly" might mean they've gone unresponsive, they're not making effective use of their compute resources, etc.
This might be a big story that needs to get split up into multiple implementation stories.
Updated by Bryan Cosca about 10 years ago
A couple ideas:
If the job tempfailed, send an email to the person who ran the job saying something like "Hey, your job tempfailed because of [insert log here]." That way, I would know whether I need to cancel the job or let it keep running.
If the job is taking significantly longer than previous attempts, that should be reported as well. For example, wx7k5-8i9sb-nwzha570z5r07zf took 10.5 hours, and wx7k5-8i9sb-t6d3a3c8d98xwgt is currently taking 20+ hours.
The only difference between the two jobs is a small one-line change to the output file string.
bcosc2@shell:~/hartwig$ git diff 97d0..d6d4
diff --git a/crunch_scripts/sambambamerge.py b/crunch_scripts/sambambamerge.py
index e8605d7..feacc12 100755
--- a/crunch_scripts/sambambamerge.py
+++ b/crunch_scripts/sambambamerge.py
@@ -31,7 +31,7 @@ sambamba_path = "/sambamba_v0.5.8"
 # 16 cores 60 GB RAM
 outdir = arvados.crunch.TaskOutputDir()
-merge_out = os.path.join(outdir.path, job_key+'_merge.bam')
+merge_out = os.path.join(outdir.path, job_key+'.merge.bam')
 sam_args = [sambamba_path, 'merge', '-t', '16', merge_out]
 #sam_args = [samtools_path,'merge','-@','16',os.path.join(tmpdir,'NA12878D_HiSeqX_R1.fastq.sam')]
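The "significantly longer than previous attempts" check could be sketched roughly as below. This is only an illustration: the function name `is_running_long` and the 1.5x threshold are assumptions made for the example, not anything Arvados provides, and how previous runtimes would be looked up is left open.

```python
from statistics import median


def is_running_long(elapsed_hours, previous_hours, factor=1.5):
    """Return True when the elapsed time exceeds `factor` times the
    median runtime of earlier comparable jobs.

    `factor` is a hypothetical threshold chosen for illustration.
    """
    if not previous_hours:
        return False  # nothing to compare against, so do not report
    return elapsed_hours > factor * median(previous_hours)


# Figures from the comment above: the earlier job took 10.5 hours and
# the current one has been running for more than 20.
print(is_running_long(20.0, [10.5]))  # True
```

Using the median rather than the mean keeps a single unusually slow earlier run from hiding a new slowdown.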
A couple of other things to report: is the job using swap? Is it leaving some of the available cores idle? (These would let the user decide whether to downsize or upgrade their node specifications.)
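One possible shape for the swap and core-utilization report is sketched below. Everything here is hypothetical: the function name `sizing_hints`, both thresholds, and the message wording are assumptions for illustration. The inputs are plain numbers, so the samples could come from whatever the compute node already records (for example, per-core CPU percentages and bytes of swap in use).

```python
def sizing_hints(cpu_percent_per_core, swap_used_bytes,
                 idle_below=10.0, min_idle_fraction=0.5):
    """Turn one resource sample into human-readable sizing hints.

    cpu_percent_per_core: list of per-core utilization percentages.
    swap_used_bytes: bytes of swap currently in use on the node.
    idle_below / min_idle_fraction: hypothetical thresholds.
    """
    hints = []
    if swap_used_bytes > 0:
        hints.append("job is using swap; consider a node with more RAM")
    # Count the cores whose utilization is under the idle threshold.
    idle = [p for p in cpu_percent_per_core if p < idle_below]
    total = len(cpu_percent_per_core)
    if total and len(idle) / total >= min_idle_fraction:
        hints.append(
            f"{len(idle)} of {total} cores are mostly idle; "
            "consider a smaller node"
        )
    return hints


# A 16-core node where only 4 cores are busy and no swap is in use.
for hint in sizing_hints([95.0] * 4 + [2.0] * 12, swap_used_bytes=0):
    print(hint)
```

A single sample can be misleading, so a real monitor would probably want to average several samples over the life of the job before reporting anything.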