Summary:
I am proposing to have the ability to configure the worker threads in a way that allows two pools:
Reserve the Number of workers for only where known query duration is <= x seconds.
Allow any query on the rest of the workers.
EXAMPLE: Set 2 of the worker threads to only operate on Known queries that take less than 2 seconds to complete.
This may help when long running queries completely block all other queries. This tactic ensures small effecient queries (which should be 99% of all queries) always have a chance of getting run quickly.
Interesting idea! Seems like you’d need a tag of some sort for these queries. Or rock solid heuristic for how long a query takes to execute. On our SaaS instance it would be hard to find a query that “usually” runs for 2 seconds but “never” takes longer than 10. Database timeframes can fluctuate.
A different solution would be a query runner with a timeout set at < 5 seconds so that queries taking longer than this are immediately killed. The worker would memorize which ID’s it killed in the past so that it leaves jobs related to those queries for another worker. Then you could monitor that list of “killed queries” and tweak the timeout settings.
FYI: starting in the forthcoming V9 beta we’ve replaced Celery with RQ. So if your issues still exist in the V9 version the solution would happen there. We’re done with Celery.