Bug #6684
Closed
Pipeline run is very slow
Description
Pipeline instance su92l-d1hrv-lnwv2waq5s55upr ran for 1 day and 3 hours. I believe this should have taken an hour or two. The pipeline greps over two large collections, collects statistics and reports them at the end of its run. The statistics it generates are stored as relatively small text files.
Updated by Abram Connelly over 10 years ago
Running a simpler pipeline which does a zgrep and wc on the contents has failed. I believe the issue to be related to these lines in the log file:
2015-07-21_23:55:01 su92l-8i9sb-m4m1girz122q3f1 4686 5 stderr Traceback (most recent call last):
2015-07-21_23:55:01 su92l-8i9sb-m4m1girz122q3f1 4686 5 stderr File "/usr/local/lib/python2.7/dist-packages/arvados_fuse/__init__.py", line 254, in catch_exceptions_wrapper
2015-07-21_23:55:01 su92l-8i9sb-m4m1girz122q3f1 4686 5 stderr return orig_func(self, *args, **kwargs)
2015-07-21_23:55:01 su92l-8i9sb-m4m1girz122q3f1 4686 5 stderr File "/usr/local/lib/python2.7/dist-packages/arvados_fuse/__init__.py", line 443, in forget
2015-07-21_23:55:01 su92l-8i9sb-m4m1girz122q3f1 4686 5 stderr ent = self.inodes[inode]
2015-07-21_23:55:01 su92l-8i9sb-m4m1girz122q3f1 4686 5 stderr TypeError: 'NoneType' object has no attribute '__getitem__'
2015-07-21_23:55:01 su92l-8i9sb-m4m1girz122q3f1 4686 5 stderr srun: error: compute8: task 0: Terminated
Pipeline instance su92l-d1hrv-2gpgptkqx7fc962.
Updated by Abram Connelly over 10 years ago
Pipeline instance su92l-d1hrv-2gpgptkqx7fc962 looks to have failed because of a fault in my script. From the logs:
2015-07-21_23:54:55 su92l-8i9sb-m4m1girz122q3f1 4686 10 stderr /tmp/crunch-job/src/crunch_scripts/stress-test-keep: line 51: /bin/zgrep: Argument list too long
2015-07-21_23:54:55 su92l-8i9sb-m4m1girz122q3f1 4686 4 stderr /tmp/crunch-job/src/crunch_scripts/stress-test-keep: line 51: /bin/zgrep: Argument list too long
Maybe the fuse errors are related, I don't know, but pipeline instance su92l-d1hrv-2gpgptkqx7fc962 failed because of an Argument list too long error.
Updated by Brett Smith over 10 years ago
Abram Connelly wrote:
Maybe the fuse errors are related, I don't know, but pipeline instance su92l-d1hrv-2gpgptkqx7fc962 failed because of an Argument list too long error.
Abram,
This is a Unix error that means you constructed a command line with more arguments than the kernel can handle. See, e.g., this StackOverflow. Consider using xargs or another strategy to break up the argument list.
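A minimal sketch of the xargs approach, for illustration only. The function name `count_matches`, the demo directory, and the file names are hypothetical and not taken from the stress-test-keep script. It assumes a `find` and `xargs` that support `-print0` and `-0` (GNU and BSD versions do; these flags are not in older POSIX).

```shell
#!/bin/sh
# count_matches DIR PATTERN (hypothetical helper)
# Prints the number of lines matching PATTERN across all *.gz files in DIR.
# find streams the file names and xargs splits them into batches that fit
# within the kernel's argument-size limit, so zgrep never receives an
# oversized argument list the way `zgrep PATTERN DIR/*.gz` can.
count_matches() {
    find "$1" -name '*.gz' -print0 |
        xargs -0 zgrep -h -- "$2" |
        wc -l |
        tr -d ' '
}

# Demo data (made up): three small gzipped files in a temp directory.
workdir=$(mktemp -d)
for i in 1 2 3; do
    printf 'alpha\nbeta %s\n' "$i" | gzip > "$workdir/part$i.txt.gz"
done

echo "matching lines: $(count_matches "$workdir" beta)"
rm -rf "$workdir"
```

Because xargs may run zgrep several times, `-h` is used so the output looks the same regardless of how many files end up in each batch, and the final `wc -l` aggregates across batches.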
Updated by Nico César over 10 years ago
Looking at the log: as Abram mentioned before, there are FUSE errors, but not in all tasks, just in these 10:
$ grep "ERROR: Unhandled exception during FUSE operation" su92l-8i9sb-m4m1girz122q3f1.log.txt | cut -d" " -f4 | sort -u -n
5
24
41
68
76
90
114
116
121
129
No specific compute node has the problem:
$ grep 'exit 15 success=' su92l-8i9sb-m4m1girz122q3f1.log.txt
2015-07-21_23:55:02 su92l-8i9sb-m4m1girz122q3f1 4686 5 child 13586 on compute8.1 exit 15 success=
2015-07-21_23:55:18 su92l-8i9sb-m4m1girz122q3f1 4686 41 child 13971 on compute16.3 exit 15 success=
2015-07-21_23:55:21 su92l-8i9sb-m4m1girz122q3f1 4686 24 child 13800 on compute15.2 exit 15 success=
2015-07-21_23:55:30 su92l-8i9sb-m4m1girz122q3f1 4686 76 child 14333 on compute20.5 exit 15 success=
2015-07-21_23:55:31 su92l-8i9sb-m4m1girz122q3f1 4686 68 child 14251 on compute6.5 exit 15 success=
2015-07-21_23:55:35 su92l-8i9sb-m4m1girz122q3f1 4686 90 child 14475 on compute18.6 exit 15 success=
2015-07-21_23:55:39 su92l-8i9sb-m4m1girz122q3f1 4686 114 child 14874 on compute3.8 exit 15 success=
2015-07-21_23:55:40 su92l-8i9sb-m4m1girz122q3f1 4686 116 child 14895 on compute6.8 exit 15 success=
2015-07-21_23:55:40 su92l-8i9sb-m4m1girz122q3f1 4686 121 child 14983 on compute16.8 exit 15 success=
2015-07-21_23:55:55 su92l-8i9sb-m4m1girz122q3f1 4686 129 child 16601 on compute16.1 exit 15 success=
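To check the spread across nodes more directly, the same lines can be tallied per node. This is only a sketch: the function name `failures_per_node` is made up, and it assumes that field 8 of these log lines is the node name followed by a `.N` suffix (which looks like a slot number, but that is a guess from the excerpt). The sample file holds three of the lines above so the snippet runs on its own.

```shell
#!/bin/sh
# failures_per_node LOGFILE (hypothetical helper)
# Counts 'exit 15' child lines per compute node.
# Assumption: field 8 looks like "compute16.3"; the trailing ".N" is
# stripped so that all entries for one node are counted together.
failures_per_node() {
    grep 'exit 15 success=' "$1" |
        awk '{ sub(/\.[0-9]+$/, "", $8); n[$8]++ }
             END { for (k in n) print k, n[k] }' |
        LC_ALL=C sort
}

# Sample input: three lines copied from the excerpt above.
log=$(mktemp)
cat > "$log" <<'EOF'
2015-07-21_23:55:02 su92l-8i9sb-m4m1girz122q3f1 4686 5 child 13586 on compute8.1 exit 15 success=
2015-07-21_23:55:18 su92l-8i9sb-m4m1girz122q3f1 4686 41 child 13971 on compute16.3 exit 15 success=
2015-07-21_23:55:40 su92l-8i9sb-m4m1girz122q3f1 4686 121 child 14983 on compute16.8 exit 15 success=
EOF

failures_per_node "$log"
rm -f "$log"
```

Run against the full log file, this would print one line per node with its failure count, which makes it easier to see whether any single node stands out.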
Updated by Abram Connelly over 10 years ago
Brett Smith wrote:
Abram Connelly wrote:
Maybe the fuse errors are related, I don't know, but pipeline instance su92l-d1hrv-2gpgptkqx7fc962 failed because of an Argument list too long error.
Abram,
This is a Unix error that means you constructed a command line with more arguments than the kernel can handle. See, e.g., this StackOverflow. Consider using xargs or another strategy to break up the argument list.
Yes, I know what an Argument list too long error is. I mentioned it to point out that the pipeline failed because of a fault in my script, not because of another issue.