Project

General

Profile

Actions

Bug #2934

closed

Limit impact of chatty crunch job on websocket/Log mechanism

Added by Tom Clegg over 10 years ago. Updated over 10 years ago.

Status:
Resolved
Priority:
Normal
Assigned To:
Tim Pierce
Category:
-
Start date:
06/10/2014
Due date:
% Done:

100%

Estimated time:
(Total: 1.00 h)
Story points:
1.0

Description

Current code limits each job to one Log entry per second. Additional limits to impose:
  • 64K Log entries per job
  • 64Mbytes of log data per job

When one of these limits is reached, add one more entry like "Server configured limit reached (crunch_limit_log_events_per_job: 65536). Subsequent logs truncated".

Subsequent logs will still get stored in Keep, but won't be available through event bus.

These limits (and the current LOG_BUFFER_SIZE constant) should be configurable. Add to services/api/config/application.default.yml in the common section:

  crunch_limit_log_events_per_job: 65536
  crunch_limit_log_event_bytes_per_job: 67108864
  crunch_limit_log_bytes_per_event: 4096

Then use Rails.configuration.crunch_limit_log_bytes_per_event instead of LOG_BUFFER_SIZE, etc.

Optional: Add crunch_limit_seconds_between_log_events (this is a minimum, and currently hard coded at 1, but we can probably support 0, 0.5, 5, etc., easily enough).


Subtasks 1 (0 open1 closed)

Task #3012: Review 2934-limit-crunch-logsResolvedPeter Amstutz06/10/2014

Actions
Actions #1

Updated by Tom Clegg over 10 years ago

  • Description updated (diff)
Actions #2

Updated by Tom Clegg over 10 years ago

  • Assigned To set to Tim Pierce
Actions #3

Updated by Tom Clegg over 10 years ago

  • Status changed from New to In Progress
Actions #4

Updated by Tim Pierce over 10 years ago

As of d7e23d9, crunch-dispatch.rb keeps event counts and byte counts in memory instead of reading them from disk.

Per discussion on IRC: we will almost certainly need to tune and tweak these parameters to find good settings. The ones that are defined in the story should be considered a starting point and not definitive. They're all configurable parameters.

We could implement byte limits per event rather than across all events (i.e. limit to 1Kb per event rather than 64MB total for all logged events), but I don't see the benefit. Applying the limit as a cap on total bytes generated gives jobs the flexibility to generate a few log events that are longer than the average without truncating them, and still prevents jobs from running away and filling the entire disk.

Actions #5

Updated by Peter Amstutz over 10 years ago

  • Suggest changing the read size on read_nonblock() in #read_pipes to Rails.configuration.crunch_log_bytes_per_event instead of hardcoded at 2**20, otherwise if something sufficiently spammy could write out (a lot) more than Rails.configuration.crunch_log_bytes_per_event bytes in a single event.
Actions #6

Updated by Tim Pierce over 10 years ago

Peter Amstutz wrote:

  • Suggest changing the read size on read_nonblock() in #read_pipes to Rails.configuration.crunch_log_bytes_per_event instead of hardcoded at 2**20, otherwise if something sufficiently spammy could write out (a lot) more than Rails.configuration.crunch_log_bytes_per_event bytes in a single event.

crunch_log_bytes_per_event isn't an upper limit, it's a lower one (which is why I didn't add "limit" to the name) -- it's the lower threshold for flushing the event to the Log table, if events are getting created more often than once per second. So we expect and allow that some events will actually produce more than crunch_log_bytes_per_event bytes at a time. (I'm absolutely open to changing the name: crunch_log_min_bytes_per_event? crunch_log_byte_threshold_for_fast_events? I dunno.)

Actions #7

Updated by Peter Amstutz over 10 years ago

My understanding of the current logic:
  • If it's been at least N seconds (where currently N=1) then flush the buffer
  • If there are at least M bytes (where M is from the configuration file) then flush the buffer
  • There's also some line buffering going on, so I think it does generally force events to break evenly across lines.

My suggestion is simply that if you only read M bytes at a time, then you'll produce records that are at most (2*M-1) bytes instead of the current 1 MiB read size which means you could get a 1 MiB record, which seems really unwieldy and unfriendly to downstream log consumers. However, I'm inclined to let this go, and see how it performs in testing.

Actions #8

Updated by Tim Pierce over 10 years ago

OK. I agree that sounds like a good idea, but would just as soon confirm whether it really is a problem first. (It's Been Working Okay So Far [tm])

Changes at 46ab868 per our offline discussion to add a new test fixture repository to test/test.git.tar rather than mucking with the repos that are already there. This still requires a small change to commits_controller_test.rb to account for the fact that there's a whole new repository in there.

Actions #9

Updated by Peter Amstutz over 10 years ago

Not a fatal error but still disconcerting:

[ 40/196] Arvados::V1::CommitsControllerTest#test_test_find_commit_rangefatal: bad object 31ce37fe365b3dc204300a3e4c396ad333ed0556
fatal: bad object 077ba2ad3ea24a929091a9e6ce545c93199b8e57
fatal: ambiguous argument 'tag1': unknown revision or path not in the working tree.
Use '--' to separate paths from revisions, like this:
'git <command> [<revision>...] -- [<file>...]'
fatal: bad object 077ba2ad3ea24a929091a9e6ce545c93199b8e57
fatal: bad object 077ba2ad3ea24a929091a9e6ce545c93199b8e57
fatal: bad object 077ba2ad3ea24a929091a9e6ce545c93199b8e57
fatal: bad object 077ba2ad3ea24a929091a9e6ce545c93199b8e57
Actions #10

Updated by Tim Pierce over 10 years ago

Peter Amstutz wrote:

Not a fatal error but still disconcerting:

[...]

Yes, these are expected -- the tests are attempting to look up the specific revision 077ba2ad3ea24a929091a9e6ce545c93199b8e57 on every repository readable to users(:active), and now there are additional test fixture repositories readable to :active that do not include that commit. It's definitely disconcerting. Added documentation in 0143d26 to explain this output.

Actions #11

Updated by Tim Pierce over 10 years ago

  • Status changed from In Progress to Resolved
  • % Done changed from 0 to 100

Applied in changeset arvados|commit:95811b3a901df3ceab701fe18e9ec1bb614bef90.

Actions

Also available in: Atom PDF