Bug #6684

Pipeline run is very slow

Added by Abram Connelly over 10 years ago. Updated about 6 years ago.

Status: Closed
Priority: Normal
Assigned To: -
Category: -
Target version: -
Story points: -

Description

Pipeline instance su92l-d1hrv-lnwv2waq5s55upr ran for 1 day and 3 hours. I believe this should have taken an hour or two. The pipeline greps over two large collections, collects statistics and reports them at the end of its run. The statistics it generates are stored as relatively small text files.

Actions #1

Updated by Abram Connelly over 10 years ago

Running a simpler pipeline which does a zgrep and wc on the contents has failed. I believe the issue to be related to these lines in the log file:

2015-07-21_23:55:01 su92l-8i9sb-m4m1girz122q3f1 4686 5 stderr Traceback (most recent call last):
2015-07-21_23:55:01 su92l-8i9sb-m4m1girz122q3f1 4686 5 stderr   File "/usr/local/lib/python2.7/dist-packages/arvados_fuse/__init__.py", line 254, in catch_exceptions_wrapper
2015-07-21_23:55:01 su92l-8i9sb-m4m1girz122q3f1 4686 5 stderr     return orig_func(self, *args, **kwargs)
2015-07-21_23:55:01 su92l-8i9sb-m4m1girz122q3f1 4686 5 stderr   File "/usr/local/lib/python2.7/dist-packages/arvados_fuse/__init__.py", line 443, in forget
2015-07-21_23:55:01 su92l-8i9sb-m4m1girz122q3f1 4686 5 stderr     ent = self.inodes[inode]
2015-07-21_23:55:01 su92l-8i9sb-m4m1girz122q3f1 4686 5 stderr TypeError: 'NoneType' object has no attribute '__getitem__'
2015-07-21_23:55:01 su92l-8i9sb-m4m1girz122q3f1 4686 5 stderr srun: error: compute8: task 0: Terminated

Pipeline instance su92l-d1hrv-2gpgptkqx7fc962.

Actions #2

Updated by Abram Connelly over 10 years ago

Pipeline instance su92l-d1hrv-2gpgptkqx7fc962 looks to have failed because of a fault in my script. From the logs:

2015-07-21_23:54:55 su92l-8i9sb-m4m1girz122q3f1 4686 10 stderr /tmp/crunch-job/src/crunch_scripts/stress-test-keep: line 51: /bin/zgrep: Argument list too long
2015-07-21_23:54:55 su92l-8i9sb-m4m1girz122q3f1 4686 4 stderr /tmp/crunch-job/src/crunch_scripts/stress-test-keep: line 51: /bin/zgrep: Argument list too long

Maybe the fuse errors are related, I don't know, but pipeline instance su92l-d1hrv-2gpgptkqx7fc962 failed because of an Argument list too long error.

Actions #3

Updated by Brett Smith over 10 years ago

Abram Connelly wrote:

Maybe the fuse errors are related, I don't know, but pipeline instance su92l-d1hrv-2gpgptkqx7fc962 failed because of an Argument list too long error.

Abram,

This is a Unix error that means you constructed a command line with more arguments than the kernel can handle. See, e.g., this StackOverflow. Consider using xargs or another strategy to break up the argument list.
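As a rough illustration of the "break up the argument list" idea, here is a minimal Python sketch that does by hand what xargs does: it splits a long list of paths into batches and runs the command once per batch. The names `batch_args` and `run_in_batches` and the 128 KiB budget are assumptions made for this example; they are not taken from the stress-test-keep script, and the real kernel limit depends on the system and on the size of the environment.

```python
import os
import subprocess

# Conservative per-command byte budget (an assumption for this sketch; the
# real limit is system-dependent and also has to hold the environment).
DEFAULT_MAX_BYTES = 128 * 1024


def batch_args(paths, fixed_args, max_bytes=DEFAULT_MAX_BYTES):
    """Split paths into batches so each command line stays under max_bytes.

    fixed_args is the part of the command repeated in every batch
    (program name, options, pattern).
    """
    # Each argument costs its encoded length plus one NUL terminator.
    base = sum(len(os.fsencode(arg)) + 1 for arg in fixed_args)
    batches, current, size = [], [], base
    for path in paths:
        cost = len(os.fsencode(path)) + 1
        # Start a new batch when adding this path would exceed the budget.
        if current and size + cost > max_bytes:
            batches.append(current)
            current, size = [], base
        current.append(path)
        size += cost
    if current:
        batches.append(current)
    return batches


def run_in_batches(fixed_args, paths, max_bytes=DEFAULT_MAX_BYTES):
    """Run fixed_args + batch once per batch; return the exit codes."""
    codes = []
    for batch in batch_args(paths, fixed_args, max_bytes):
        # No check=True: grep-like tools exit 1 when nothing matches.
        result = subprocess.run(list(fixed_args) + batch)
        codes.append(result.returncode)
    return codes
```

For example, `run_in_batches(["zgrep", "-c", "-e", "PATTERN"], paths)` would invoke zgrep several times with shorter argument lists instead of once with all of them. In a shell script, piping the file names to xargs achieves the same thing with less code.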

Actions #4

Updated by Nico César over 10 years ago

Looking at

https://workbench.su92l.arvadosapi.com/collections/56440ab56454b4bbbe79ec1633b5ef61+89/su92l-8i9sb-m4m1girz122q3f1.log.txt?disposition=inline&size=1798803

As Abram mentioned before, there are FUSE errors, but not in all tasks, just in 10 tasks:

$ grep "ERROR: Unhandled exception during FUSE operation" su92l-8i9sb-m4m1girz122q3f1.log.txt | cut -d" " -f4 | sort -u -n
5
24
41
68
76
90
114
116
121
129

No specific compute node has the problem:

$ grep 'exit 15 success=' su92l-8i9sb-m4m1girz122q3f1.log.txt  
2015-07-21_23:55:02 su92l-8i9sb-m4m1girz122q3f1 4686 5 child 13586 on compute8.1 exit 15 success=
2015-07-21_23:55:18 su92l-8i9sb-m4m1girz122q3f1 4686 41 child 13971 on compute16.3 exit 15 success=
2015-07-21_23:55:21 su92l-8i9sb-m4m1girz122q3f1 4686 24 child 13800 on compute15.2 exit 15 success=
2015-07-21_23:55:30 su92l-8i9sb-m4m1girz122q3f1 4686 76 child 14333 on compute20.5 exit 15 success=
2015-07-21_23:55:31 su92l-8i9sb-m4m1girz122q3f1 4686 68 child 14251 on compute6.5 exit 15 success=
2015-07-21_23:55:35 su92l-8i9sb-m4m1girz122q3f1 4686 90 child 14475 on compute18.6 exit 15 success=
2015-07-21_23:55:39 su92l-8i9sb-m4m1girz122q3f1 4686 114 child 14874 on compute3.8 exit 15 success=
2015-07-21_23:55:40 su92l-8i9sb-m4m1girz122q3f1 4686 116 child 14895 on compute6.8 exit 15 success=
2015-07-21_23:55:40 su92l-8i9sb-m4m1girz122q3f1 4686 121 child 14983 on compute16.8 exit 15 success=
2015-07-21_23:55:55 su92l-8i9sb-m4m1girz122q3f1 4686 129 child 16601 on compute16.1 exit 15 success=
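The same check can be sketched in Python by grouping the `exit 15` lines by compute node. This is only an illustration: the field layout (timestamp, job UUID, PID, task number) is inferred from the log lines quoted above, the reading of the `.N` suffix on the node name as a slot number is an assumption, and `failed_tasks_by_node` is a name made up for this example.

```python
import re
from collections import defaultdict

# Assumed layout, inferred from the quoted lines:
#   <timestamp> <job uuid> <pid> <task> child <child pid> on <node>.<slot> exit <code> ...
EXIT_LINE = re.compile(
    r"^\S+ \S+ \d+ (?P<task>\d+) child \d+ on "
    r"(?P<node>[^.\s]+)\.(?P<slot>\d+) exit (?P<code>\d+) "
)


def failed_tasks_by_node(lines, exit_code=15):
    """Map each compute node to the task numbers that exited with exit_code."""
    by_node = defaultdict(list)
    for line in lines:
        match = EXIT_LINE.match(line)
        if match and int(match.group("code")) == exit_code:
            by_node[match.group("node")].append(int(match.group("task")))
    return dict(by_node)
```

Run over the ten lines quoted above, this groups the failed tasks across seven different nodes.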

Actions #5

Updated by Abram Connelly over 10 years ago

Brett Smith wrote:

Abram Connelly wrote:

Maybe the fuse errors are related, I don't know, but pipeline instance su92l-d1hrv-2gpgptkqx7fc962 failed because of an Argument list too long error.

Abram,

This is a Unix error that means you constructed a command line with more arguments than the kernel can handle. See, e.g., this StackOverflow. Consider using xargs or another strategy to break up the argument list.

Yes, I know what an Argument list too long error is. I mentioned it to point out that the pipeline failed because of a fault in my script, not because of another issue.

Actions #6

Updated by Peter Amstutz about 6 years ago

  • Status changed from New to Closed