Bug #8398
closed[API] Handle (temporarily) unavailable database more gracefully
Description
When the Arvados API server has a problem connecting to postgres, it seems that this results in the passenger worker dying and nginx sending a 502 response (a result of the PG::ConnectionBad exception).
There are a few scenarios in which the postgres database could reasonably be temporarily unavailable, so it would be good if the API server could catch the exception and implement some sort of reasonable retry behavior before finally giving up and returning an error to the requesting client.
The PG::ConnectionBad error occurs when attempting to get a connection while the postgres server is maxed out (i.e. due to the settings of max_connections and superuser_reserved_connections), or when the database is restarting. When crunch jobs are running, unfortunately it seems that there are at least some situations in which having a temporarily unavailable database results in a 502 error being returned to the client, and the job failing as a result. It would be better if the API server did something other than crash when a failure to connect to the database occurs.
Some typical error message (in /var/log/nginx/passenger.log) are:
[ 2016-02-08 15:15:57.2912 41706/7f9fe6ffd700 age/Cor/Con/InternalUtils.cpp:105 ]: [Client 24-3757] Sending 502 response: application did not send a complete response App 3373 stderr: [ 2016-02-08 15:16:01.4867 13630/0x007f20e329fa30(Worker 1) utils.rb:87 ]: *** Exception PG::ConnectionBad in Rack application object (FATAL: remaining connection slots are reserved for non-replication superuser connections App 3373 stderr: ) (process 13630, thread 0x007f20e329fa30(Worker 1)): App 3373 stderr: from /var/www/arvados-api/shared/vendor_bundle/ruby/2.1.0/gems/activerecord-3.2.17/lib/active_record/connection_adapters/postgresql_adapter.rb:1222:in `initialize' App 3373 stderr: from /var/www/arvados-api/shared/vendor_bundle/ruby/2.1.0/gems/activerecord-3.2.17/lib/active_record/connection_adapters/postgresql_adapter.rb:1222:in `new' App 3373 stderr: from /var/www/arvados-api/shared/vendor_bundle/ruby/2.1.0/gems/activerecord-3.2.17/lib/active_record/connection_adapters/postgresql_adapter.rb:1222:in `connect' App 3373 stderr: from /var/www/arvados-api/shared/vendor_bundle/ruby/2.1.0/gems/activerecord-3.2.17/lib/active_record/connection_adapters/postgresql_adapter.rb:324:in `initialize' App 3373 stderr: from /var/www/arvados-api/shared/vendor_bundle/ruby/2.1.0/gems/activerecord-3.2.17/lib/active_record/connection_adapters/postgresql_adapter.rb:28:in `new' App 3373 stderr: from /var/www/arvados-api/shared/vendor_bundle/ruby/2.1.0/gems/activerecord-3.2.17/lib/active_record/connection_adapters/postgresql_adapter.rb:28:in `postgresql_connection' App 3373 stderr: from /var/www/arvados-api/shared/vendor_bundle/ruby/2.1.0/gems/activerecord-3.2.17/lib/active_record/connection_adapters/abstract/connection_pool.rb:315:in `new_connection' App 3373 stderr: from /var/www/arvados-api/shared/vendor_bundle/ruby/2.1.0/gems/activerecord-3.2.17/lib/active_record/connection_adapters/abstract/connection_pool.rb:325:in `checkout_new_connection' App 3373 stderr: from /var/www/arvados-api/shared/vendor_bundle/ruby/2.1.0/gems/activerecord-3.2.17/lib/active_record/connection_adapters/abstract/connection_pool.rb:247:in `block (2 levels) in checkout' App 3373 stderr: from /var/www/arvados-api/shared/vendor_bundle/ruby/2.1.0/gems/activerecord-3.2.17/lib/active_record/connection_adapters/abstract/connection_pool.rb:242:in `loop' App 3373 stderr: from /var/www/arvados-api/shared/vendor_bundle/ruby/2.1.0/gems/activerecord-3.2.17/lib/active_record/connection_adapters/abstract/connection_pool.rb:242:in `block in checkout' App 3373 stderr: from /usr/local/lib/ruby/2.1.0/monitor.rb:211:in `mon_synchronize' App 3373 stderr: from /var/www/arvados-api/shared/vendor_bundle/ruby/2.1.0/gems/activerecord-3.2.17/lib/active_record/connection_adapters/abstract/connection_pool.rb:239:in `checkout' App 3373 stderr: from /var/www/arvados-api/shared/vendor_bundle/ruby/2.1.0/gems/activerecord-3.2.17/lib/active_record/connection_adapters/abstract/connection_pool.rb:102:in `block in connection' App 3373 stderr: from /usr/local/lib/ruby/2.1.0/monitor.rb:211:in `mon_synchronize' App 3373 stderr: from /var/www/arvados-api/shared/vendor_bundle/ruby/2.1.0/gems/activerecord-3.2.17/lib/active_record/connection_adapters/abstract/connection_pool.rb:101:in `connection' App 3373 stderr: from /var/www/arvados-api/shared/vendor_bundle/ruby/2.1.0/gems/activerecord-3.2.17/lib/active_record/connection_adapters/abstract/connection_pool.rb:410:in `retrieve_connection' App 3373 stderr: from /var/www/arvados-api/shared/vendor_bundle/ruby/2.1.0/gems/activerecord-3.2.17/lib/active_record/connection_adapters/abstract/connection_specification.rb:171:in `retrieve_connection' App 3373 stderr: from /var/www/arvados-api/shared/vendor_bundle/ruby/2.1.0/gems/activerecord-3.2.17/lib/active_record/connection_adapters/abstract/connection_specification.rb:145:in `connection' App 3373 stderr: from /var/www/arvados-api/shared/vendor_bundle/ruby/2.1.0/gems/activerecord-3.2.17/lib/active_record/sanitization.rb:130:in `replace_bind_variables' App 3373 stderr: from /var/www/arvados-api/shared/vendor_bundle/ruby/2.1.0/gems/activerecord-3.2.17/lib/active_record/sanitization.rb:117:in `sanitize_sql_array' App 3373 stderr: from /var/www/arvados-api/shared/vendor_bundle/ruby/2.1.0/gems/activerecord-3.2.17/lib/active_record/sanitization.rb:28:in `sanitize_sql_for_conditions' App 3373 stderr: from /var/www/arvados-api/shared/vendor_bundle/ruby/2.1.0/gems/activerecord-3.2.17/lib/active_record/relation/query_methods.rb:324:in `build_where' App 3373 stderr: from /var/www/arvados-api/shared/vendor_bundle/ruby/2.1.0/gems/activerecord-3.2.17/lib/active_record/relation/query_methods.rb:136:in `where' App 3373 stderr: from /var/ App 3373 stderr: www/arvados-api/current/app/middlewares/arvados_api_token.rb:38:in `call' App 3373 stderr: from /var/www/arvados-api/shared/vendor_bundle/ruby/2.1.0/gems/actionpack-3.2.17/lib/action_dispatch/middleware/remote_ip.rb:31:in `call' App 3373 stderr: from /var/www/arvados-api/shared/vendor_bundle/ruby/2.1.0/gems/railties-3.2.17/lib/rails/engine.rb:484:in `call' App 3373 stderr: from /var/www/arvados-api/shared/vendor_bundle/ruby/2.1.0/gems/railties-3.2.17/lib/rails/application.rb:231:in `call' App 3373 stderr: from /var/www/arvados-api/shared/vendor_bundle/ruby/2.1.0/gems/railties-3.2.17/lib/rails/railtie/configurable.rb:30:in `method_missing' App 3373 stderr: from /usr/lib/ruby/vendor_ruby/phusion_passenger/rack/thread_handler_extension.rb:97:in `process_request' App 3373 stderr: from /usr/lib/ruby/vendor_ruby/phusion_passenger/request_handler/thread_handler.rb:160:in `accept_and_process_next_request' App 3373 stderr: from /usr/lib/ruby/vendor_ruby/phusion_passenger/request_handler/thread_handler.rb:113:in `main_loop' App 3373 stderr: from /usr/lib/ruby/vendor_ruby/phusion_passenger/request_handler.rb:416:in `block (3 levels) in start_threads' App 3373 stderr: from /usr/lib/ruby/vendor_ruby/phusion_passenger/utils.rb:113:in `block in create_thread_and_abort_on_exception'
or
App 58856 stderr: [ 2016-02-08 16:33:49.9153 58949/0x007f173e5ce1b0(Worker 1) utils.rb:87 ]: *** Exception PG::ConnectionBad in Rack application object (FATAL: the database system is starting up App 58856 stderr: FATAL: the database system is starting up App 58856 stderr: ) (process 58949, thread 0x007f173e5ce1b0(Worker 1)): App 58856 stderr: from /var/www/arvados-api/shared/vendor_bundle/ruby/2.1.0/gems/activerecord-3.2.17/lib/active_record/connection_adapters/postgresql_adapter.rb:1222:in `initialize' App 58856 stderr: from /var/www/arvados-api/shared/vendor_bundle/ruby/2.1.0/gems/activerecord-3.2.17/lib/active_record/connection_adapters/postgresql_adapter.rb:1222:in `new' App 58856 stderr: from /var/www/arvados-api/shared/vendor_bundle/ruby/2.1.0/gems/activerecord-3.2.17/lib/active_record/connection_adapters/postgresql_adapter.rb:1222:in `connect' App 58856 stderr: from /var/www/arvados-api/shared/vendor_bundle/ruby/2.1.0/gems/activerecord-3.2.17/lib/active_record/connection_adapters/postgresql_adapter.rb:324:in `initialize' [ 2016-02-08 16:33:49.9155 58751/7fc9577fe700 age/Cor/Con/InternalUtils.cpp:105 ]: [Client 5-354] Sending 502 response: application did not send a complete response App 58856 stderr: from /var/www/arvados-api/shared/vendor_bundle/ruby/2.1.0/gems/activerecord-3.2.17/lib/active_record/connection_adapters/postgresql_adapter.rb:28:in `new' App 58856 stderr: from /var/www/arvados-api/shared/vendor_bundle/ruby/2.1.0/gems/activerecord-3.2.17/lib/active_record/connection_adapters/postgresql_adapter.rb:28:in `postgresql_connection' App 58856 stderr: from /var/www/arvados-api/shared/vendor_bundle/ruby/2.1.0/gems/activerecord-3.2.17/lib/active_record/connection_adapters/abstract/connection_pool.rb:315:in `new_connection' App 58856 stderr: from /var/www/arvados-api/shared/vendor_bundle/ruby/2.1.0/gems/activerecord-3.2.17/lib/active_record/connection_adapters/abstract/connection_pool.rb:325:in `checkout_new_connection' App 58856 stderr: from /var/www/arvados-api/shared/vendor_bundle/ruby/2.1.0/gems/activerecord-3.2.17/lib/active_record/connection_adapters/abstract/connection_pool.rb:247:in `block (2 levels) in checkout' App 58856 stderr: from /var/www/arvados-api/shared/vendor_bundle/ruby/2.1.0/gems/activerecord-3.2.17/lib/active_record/connection_adapters/abstract/connection_pool.rb:242:in `loop' App 58856 stderr: from /var/www/arvados-api/shared/vendor_bundle/ruby/2.1.0/gems/activerecord-3.2.17/lib/active_record/connection_adapters/abstract/connection_pool.rb:242:in `block in checkout' App 58856 stderr: from /usr/local/lib/ruby/2.1.0/monitor.rb:211:in `mon_synchronize' App 58856 stderr: from /var/www/arvados-api/shared/vendor_bundle/ruby/2.1.0/gems/activerecord-3.2.17/lib/active_record/connection_adapters/abstract/connection_pool.rb:239:in `checkout' App 58856 stderr: from /var/www/arvados-api/shared/vendor_bundle/ruby/2.1.0/gems/activerecord-3.2.17/lib/active_record/connection_adapters/abstract/connection_pool.rb:102:in `block in connection' App 58856 stderr: from /usr/local/lib/ruby/2.1.0/monitor.rb:211:in `mon_synchronize' App 58856 stderr: from /var/www/arvados-api/shared/vendor_bundle/ruby/2.1.0/gems/activerecord-3.2.17/lib/active_record/connection_adapters/abstract/connection_pool.rb:101:in `connection' App 58856 stderr: from /var/www/arvados-api/shared/vendor_bundle/ruby/2.1.0/gems/activerecord-3.2.17/lib/active_record/connection_adapters/abstract/connection_pool.rb:410:in `retrieve_connection' App 58856 stderr: from /var/www/arvados-api/shared/vendor_bundle/ruby/2.1.0/gems/activerecord-3.2.17/lib/active_record/connection_adapters/abstract/connection_specification.rb:171:in `retrieve_connection' App 58856 stderr: from /var/www/arvados-api/shared/vendor_bundle/ruby/2.1.0/gems/activerecord-3.2.17/lib/active_record/connection_adapters/abstract/connection_specification.rb:145:in `connection' App 58856 stderr: from /var/www/arvados-api/shared/vendor_bundle/ruby/2.1.0/gems/activerecord-3.2.17/lib/active_record/sanitization.rb:130:in `replace_bind_variables' App 58856 stderr: from /var/www/arvados-api/shared/vendor_bundle/ruby/2.1.0/gems/activerecord-3.2.17/lib/active_record/sanitization.rb:117:in `sanitize_sql_array' App 58856 stderr: from /var/www/arvados-api/shared/vendor_bundle/ruby/2.1.0/gems/activerecord-3.2.17/lib/active_record/sanitization.rb:28:in `sanitize_sql_for_conditions' App 58856 stderr: from /var/www/arvados-api/shared/vendor_bundle/ruby/2.1.0/gems/activerecord-3.2.17/lib/active_record/relation/query_methods.rb:324:in `build_where' App 58856 stderr: from /var/www/arvados-api/shared/vendor_bundle/ruby/2.1.0/gems/activerecord-3.2.17/lib/active_record/relation/query_methods.rb:136:in `where' App 58856 stderr: from /var/www/arvados-api/current/app/middlewares/arvados_api_token.rb:38:in `call' App 58856 stderr: from /var/www/arvados-api/shared/vendor_bundle/ruby/2.1.0/gems/actionpack-3.2.17/lib/action_dispatch/middleware/remote_ip.rb:31:in `call' App 58856 stderr: from /var/www/arvados-api/shared/vendor_bundle/ruby/2.1.0/gems/railties-3.2.17/lib/rails/engine.rb:484:in `call' App 58856 stderr: from /var/www/arvados-api/shared/vendor_bundle/ruby/2.1.0/gems/railties-3.2.17/lib/rails/application.rb:231:in `call' App 58856 stderr: from /var/www/arvados-api/shared/vendor_bundle/ruby/2.1.0/gems/railties-3.2.17/lib/rails/railtie/configurable.rb:30:in `method_missing' App 58856 stderr: from /usr/lib/ruby/vendor_ruby/phusion_passenger/rack/thread_handler_extension.rb:97:in `process_request' App 58856 stderr: from /usr/lib/ruby/vendor_ruby/phusion_passenger/request_handler/thread_handler.rb:152:in `accept_and_process_next_request' App 58856 stderr: from /usr/lib/ruby/vendor_ruby/phusion_passenger/request_handler/thread_handler.rb:113:in `main_loop' App 58856 stderr: from /usr/lib/ruby/vendor_ruby/phusion_passenger/request_handler.rb:416:in `block (3 levels) in start_threads' App 58856 stderr: from /usr/lib/ruby/vendor_ruby/phusion_passenger/utils.rb:113:in `block in create_thread_and_abort_on_exception'
Updated by Brett Smith about 10 years ago
- Subject changed from Handle (temporarily) unavailable database more gracefully to [API] Handle (temporarily) unavailable database more gracefully
Updated by Brett Smith about 10 years ago
Right now it seems like we can more bang for the buck by setting this retries at the Nginx level. Nginx can be configured to retry actions when the proxy upstream returns designated error codes. With Nginx appropriately reconfigured, we can retry a wider variety of failures with less effort. See, e.g., http://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_next_upstream
We probably will have database-specific retry code in the API server, but right now our thinking is to investigate and document Nginx configuration changes first.