
No Customer Left Behind: A Self-balancing Task Scheduler

Randeep Singh

Software Engineer at Lucio AI

Delhi, India

Lucio handles thousands of documents and queries every day. As we scale to support larger enterprises with hundreds of legal professionals, that number keeps growing. More requests put increasing pressure on our document AI and processing pipelines, all competing for limited processing capacity.

Some features, like Briefcase, push through huge batches of documents at once. Others, like Assistant, depend on fast, real-time responses. These two use cases pull in opposite directions. One needs high throughput. The other demands low latency.

As usage ramped up, we started to feel the impact. Heavy workloads from a few users were slowing things down for everyone else. Smaller or newer customers, in particular, were getting left behind. That wasn’t acceptable.

So we built a custom task scheduler that keeps things fair while still delivering a consistent experience, even under load. In this post, I’ll walk through what we built, the thinking behind it, and how a weighted, time-decayed model helped us strike the right balance.

The Old Scheduler (or lack thereof)

For a long time, we didn’t manage task scheduling ourselves. We relied on Celery, which handled task dispatching with a basic queueing model. It was simple, effective, and required minimal tuning. Perfect for the early days.

After a while, we layered on a naive form of prioritization: if a task had too many documents, it was pushed to a “low priority” queue; smaller tasks stayed in “high priority.” That was it. No notion of users, no concept of fairness, just a rough heuristic based on payload size.

def get_queue_name(task):
    if task.document_count > 1000:
        return "low_priority"
    return "high_priority"

It worked well enough when load was light and usage patterns were predictable. But as larger customers came on board and started using features like Briefcase with thousands of documents at once, this simplicity started breaking down.

The problem was that tasks that looked “light” in isolation could easily overwhelm the platform when submitted in bulk. A single user could submit hundreds of tasks at once, all marked as high priority. Celery would happily keep picking them up from the queue while smaller users waited. Real-time features like Assistant became collateral damage, stuck behind a wall of bulk processing. My colleagues and I often had to intervene manually at odd hours to keep things moving. Not a fun experience.

We patched things as best we could: separate queues, concurrency tuning, retries. But the underlying problem remained: The system didn’t know, or care, who the work was for.

It was time for a rethink.

Fairness

When we talk about scheduling, it’s usually in terms of tasks: which job runs next, how to prioritize workloads, and how to keep workers busy. But that task-centric view isn’t the kind of fairness we needed.

For us, fairness had to operate at the user level. Instead of giving every task an equal chance, we wanted to ensure that every user got a fair share of the platform’s capacity.

If there are N active users submitting work, each should get roughly 1/N of the total capacity. That share should shift as users come and go, and there should be a gradual rebalancing to reflect those changes.

In other words, if one user submits a burst of tasks, they shouldn’t dominate the system forever. Their tasks should be scheduled, but not at the expense of others. And if someone hasn’t had their tasks scheduled in a while, they should get boosted until they catch up.

This user-level fairness is what we wanted to encode in our scheduler.

Representing Fairness

To convert this concept of fairness into something tangible, we came up with a scoring system for tasks. This score decides which tasks get scheduled and when. But instead of just looking at the task itself, the score also takes into account who submitted it and how recently they’ve been served.

At a high level, the score has two components: a task’s intrinsic weight, and a time-based factor that reflects when the user last had a task scheduled.

Task Weight

Each task type has a static weight that reflects its importance. Some tasks are latency-sensitive, others are bulk background jobs. The weight allows us to encode this difference directly.

For example, a small summarization task may get a higher weight than a massive document processing job. This ensures that lightweight interactive features don’t get drowned out by heavy workloads.

This weight acts like a multiplier for how quickly a task’s score grows over time.
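
As a rough sketch, the weights could live in a simple lookup keyed by task type. The task names and numbers below are illustrative, not our actual configuration:

# Illustrative weights: a higher weight makes a task's score grow faster
# for every second its user has been waiting.
TASK_WEIGHTS = {
    "assistant_query": 10,       # latency-sensitive, interactive
    "document_summary": 5,       # small, user-facing
    "briefcase_bulk_ingest": 1,  # heavy background processing
}

def get_task_weight(task_type):
    # Unknown task types fall back to the lowest weight.
    return TASK_WEIGHTS.get(task_type, 1)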

Time-based Factor

The second component is time-based. For every user, we track when their last task was scheduled. The longer they’ve been waiting, the higher the fairness score for their new tasks.

This introduces a sharp penalty for high-volume users. As soon as one of their tasks is scheduled, their fairness score resets, dropping the rank of their remaining tasks immediately.

That reset makes it nearly impossible for any user to monopolize the processing capacity indefinitely. On the other hand, users who haven’t had tasks scheduled for a while see their tasks’ scores continue to grow, pushing them further up in the queue until they’re selected.

The Scoring Formula

The final score is a product of task weight and waiting time.

priority_score = task_weight * (current_time - user_last_scheduled_time)

Tasks are scored and re-ranked every scheduling cycle. The top few are picked for execution.
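
To make that concrete, here is a minimal sketch of what one round of scoring and selection could look like. The data structures and attribute names (task.weight, task.user_id, task.submitted_at) are illustrative, not our production code:

import time

def run_scheduling_cycle(pending_tasks, last_scheduled, top_n):
    # Score every pending task, pick the top_n highest scorers (top_n is
    # the budget described in the next section), and reset the
    # last-scheduled timestamp for each user who got a task picked.
    now = time.time()

    def score(task):
        # Users who have never been scheduled are treated as waiting
        # since their task was submitted.
        last = last_scheduled.get(task.user_id, task.submitted_at)
        return task.weight * (now - last)

    ranked = sorted(pending_tasks, key=score, reverse=True)
    selected = ranked[:top_n]

    for task in selected:
        # Resetting the timestamp immediately drops the rank of this
        # user's remaining tasks in the next cycle.
        last_scheduled[task.user_id] = now

    return selected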

Managing Throughput

Fairness ensures that users get a proportional share of the available capacity. But that’s not enough on its own. We also need to manage how much work the system takes on at once.

Many of our upstream services (OCR engines, embedding models, third-party APIs) have rate limits or strict concurrency caps. Sending too many requests at once can cause cascading failures, timeouts, or throttling. To avoid this, we need a way to control how many tasks the scheduler can select in a single cycle.

TopN

TopN is a dynamic cap on how many tasks the scheduler is allowed to schedule in one cycle. Think of it as a budget: at any given moment, we can only schedule up to N tasks from the pending queue.

This prevents the scheduler from flooding the system during a busy spike, while still allowing it to be responsive as soon as resources free up.

Initializing TopN

When the system starts, TopN is set to a conservative value based on our concurrency limits across various upstream services. For example, if we know we can safely process 100 tasks in parallel without hitting upstream rate limits, we initialize TopN to 100.

We also take a few other operational constraints into account when choosing this starting value.

This starting point keeps things stable during ramp-up.

Each scheduled task decreases TopN by one. This means that as tasks are picked up, the budget shrinks, preventing the scheduler from overcommitting resources.

Replenishment

As tasks complete, the budget is replenished. For every finished task, we increment TopN by one. This makes the scheduler self-regulating: if tasks are finishing quickly, we get to schedule more in the next round. If they’re slow, the scheduler naturally backs off.
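
As a sketch, the budget can be held in a shared atomic counter. The snippet below assumes a Redis-backed counter, which is one possible implementation rather than a description of our exact setup; the key name and helpers (including the atomic_incr_topN referenced in the next snippet) are named for illustration:

import redis

r = redis.Redis()
TOPN_KEY = "scheduler:topn"  # hypothetical key name
INITIAL_TOPN = 100           # conservative start from upstream concurrency limits

def init_topN():
    # Seed the budget once at startup; nx=True leaves an existing value alone.
    r.set(TOPN_KEY, INITIAL_TOPN, nx=True)

def atomic_decr_topN():
    # Spend one unit of budget when the scheduler picks a task.
    return r.decr(TOPN_KEY)

def atomic_incr_topN():
    # Return one unit of budget when a task completes or fails.
    return r.incr(TOPN_KEY)

def current_topN():
    # How many tasks the scheduler may select in the current cycle.
    value = r.get(TOPN_KEY)
    return int(value) if value is not None else 0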

In practice, this is implemented by monkey-patching the Celery Task class to update TopN whenever a task completes or fails. This way, the scheduler always has an up-to-date view of how many tasks it can safely select.

import celery


class FairTask(celery.Task):
    # Returns one unit of TopN budget whenever a task finishes,
    # whether it succeeded or failed.

    def on_success(self, retval, task_id, args, kwargs):
        self._return_capacity()
        return super().on_success(retval, task_id, args, kwargs)

    def on_failure(self, exc, task_id, args, kwargs, einfo):
        self._return_capacity()
        return super().on_failure(exc, task_id, args, kwargs, einfo)

    def _return_capacity(self):
        # Atomically increment the shared TopN counter.
        atomic_incr_topN()


# Monkey-patch the Celery Task base class so completions and failures
# return capacity to the scheduler.
celery.Task = FairTask

This also keeps the scheduling logic decoupled from execution. The scheduler doesn’t assume anything about how long tasks will take. It only reacts to completions and adjusts capacity accordingly.

Evaluating the Scheduler

To validate our new scheduler, we ran several scenarios that simulate real-world usage patterns. The goal was to see how well it maintains fairness and responsiveness under different load conditions.

I present here two of the most important scenarios we tested:

  1. A single user submitting a burst of tasks, followed by multiple users submitting smaller numbers of tasks
  2. Same as 1, but the single heavy user submits after the multiple users have submitted their tasks

Setup (for both runs)

Both runs use the same workload: a single high-volume user submitting a burst of 200 tasks, and 10 low-volume users submitting 100 tasks between them. The only difference between the two scenarios is the order in which these submissions arrive.

Scenario 1: High-Volume User Goes First

Blue tasks are from the high-volume user. The scheduler starts with their tasks, but quickly interleaves low-volume users' tasks to maintain balance.

In this setup, the high-volume user submits all 200 tasks first. The scheduler initially processes only their work, as no other tasks are in the system yet.

But once the low-volume users start submitting, you can see the scheduler begin to rebalance. Their tasks quickly rise in priority due to lack of recent attention. The scheduler starts interleaving tasks from multiple users, gradually correcting the imbalance.

Even though one user started with all the momentum, the system clamps their dominance and restores fairness. By the end, the waterfall is multicolored and well-distributed: no user is left behind.

Scenario 2: Low-Volume Users Go First

Green tasks are from the high-volume user. The scheduler starts with a burst of low-volume users, but gradually introduces the high-volume user's tasks without disrupting the flow.

Here, the 10 low-volume users submit their 100 tasks first, followed by the high-volume user.

The scheduler starts off serving the early users exclusively. But once the heavy user’s tasks enter the queue, they’re gradually introduced into the schedule without disrupting the flow.

Despite the volume disparity, the scheduler never lets one user dominate. It dynamically adjusts to give all users time on the system, and over time, you get a fair, alternating pattern that respects both recency and load.

Takeaway

In both cases, the scheduler self-regulates. Early advantage doesn’t guarantee dominance. High volume doesn’t mean priority. All that matters is: has this user had a fair chance recently?

That simple principle keeps the system balanced, responsive, and most importantly, fair.

Closing Thoughts

What started as a response to late-night alerts ended up teaching us a valuable lesson. The most effective solution wasn’t about adding more infrastructure, but about defining the problem correctly: we needed fairness for users, not for tasks.

By encoding that principle in a simple, time-weighted score, we built a scheduler that is largely self-managing. It adapts to load, prevents any single user from dominating, and, because a waiting user’s score only grows until they are served, no one’s tasks can be starved indefinitely.

It’s a powerful reminder that sometimes the most robust systems come from getting one or two core principles right, rather than from adding layers of complexity.

The scheduler is in the last stages of integration, and we’re excited to see how it performs in production.

