I'll talk about three load balancing algorithms: random, round robin, and least loaded. The first assigns incoming jobs to random servers; the second assigns incoming jobs to servers sequentially, running through all the servers then starting over; the third assigns jobs to servers based on a load heuristic, trying to assign jobs to the lightest loaded server.
To analyze the algorithms, I make use of the following statistical moments to evaluate load. The mean load of the system at a given point is the average load over all workers. The variance of the load of a system is a rough measurement of how tightly the loads are clustered (how big is the spread). The skewness of the system is a rough measurement of whether workers trend towards higher than average, or lower than average. And finally, kurtosis is a rough measurement of whether variations from the mean come in the form of many servers with small deviations, or a few servers with large deviations.
Observation 1: While mean and variance provide good differentiation between balancing algorithms, skewness and kurtosis are not helpful. As such, we will present only mean and variance while discussing performance.
To analyze the algorithms, I make use of the following statistical moments to evaluate load. The mean load of the system at a given point is the average load over all workers. The variance of the load of a system is a rough measurement of how tightly the loads are clustered (how big is the spread). The skewness of the system is a rough measurement of whether workers trend towards higher than average, or lower than average. And finally, kurtosis is a rough measurement of whether variations from the mean come in the form of many servers with small deviations, or a few servers with large deviations.
Observation 1: While mean and variance provide good differentiation between balancing algorithms, skewness and kurtosis are not helpful. As such, we will present only mean and variance while discussing performance.
Trace 1: Constant Duration / Constant Arrival Rate
Result 1: Under uniform load, with sufficient compute to avoid backlog in an optimal assignment, random < round robin $\approx$ least loaded
This should not be particularly surprising. That a randomized approach should fail spectacularly is almost intuitive, as random assignment does a very poor job distributing load. Since the load is uniform (e.g. tasks take the same time to complete, and arrive uniformly), round robin and least loaded are more or less equivalent. This is born out in the following graphs, which show the average load and the variance of the load of the system over the course of a run, on a trace with a constant job duration and constant arrival rate. Takeaway: In situations where load is constant in duration and arrival rate, either of round-robin or least-loaded will work well.
This should not be particularly surprising. That a randomized approach should fail spectacularly is almost intuitive, as random assignment does a very poor job distributing load. Since the load is uniform (e.g. tasks take the same time to complete, and arrive uniformly), round robin and least loaded are more or less equivalent. This is born out in the following graphs, which show the average load and the variance of the load of the system over the course of a run, on a trace with a constant job duration and constant arrival rate. Takeaway: In situations where load is constant in duration and arrival rate, either of round-robin or least-loaded will work well.
Takeaway: In situations where load is constant in duration and arrival rate, either of round-robin or least-loaded will work well.
Trace 2: Modal Duration / Constant Arrival Rate
To analyze the load under slightly less stable conditions, I created a modal trace. Jobs take either 100, 500, or 1000ms, chosen uniformly at random for each job. Jobs arrive at a constant rate.
Result 2: With constant arrival rate and trimodal duration, random < round robin < least loaded.
Again, this should seem fairly intuitive. When the duration of jobs are not constant, round robin may assign more longer jobs to the same server, rather than packing them in optimally. We use a naive heuristic for least loaded (examining the number of outstanding jobs, rather than trying to estimate time to completion for jobs), and even so the least loaded algorithm achieves substantially better performance. In particular, note that the variance is very, very low, relative to the other two algorithms.
Result 2: With constant arrival rate and trimodal duration, random < round robin < least loaded.
Again, this should seem fairly intuitive. When the duration of jobs are not constant, round robin may assign more longer jobs to the same server, rather than packing them in optimally. We use a naive heuristic for least loaded (examining the number of outstanding jobs, rather than trying to estimate time to completion for jobs), and even so the least loaded algorithm achieves substantially better performance. In particular, note that the variance is very, very low, relative to the other two algorithms.
Takeaway: In situations where jobs arrive regularly, but duration is variable, use a least-loaded routing algorithm.
Trace 3: Constant Duration / Linearly Increasing Arrival Rate
To analyze the load under degenerate conditions, i.e. when the system becomes overloaded, I created a trace with constant job duration, and linearly decreasing interarrival rates, up to a maximum arrival rate threshold. The trace repeats twice.
Result 3: Under degenerate conditions, random < round robin $\approx$ least loaded.
Note that all three algorithms produce very similar average loads. I find this to be quite surprising. The variance shows that random assignment does poorly, assigning jobs badly, while round robin and least loaded both keep variance very low. This suggests that if the trace were to have a higher cutoff for the interarrival rate, the mean would suffer for the randomized algorithm, but round robin and least loaded would stay more or less in lockstep.
Result 3: Under degenerate conditions, random < round robin $\approx$ least loaded.
Note that all three algorithms produce very similar average loads. I find this to be quite surprising. The variance shows that random assignment does poorly, assigning jobs badly, while round robin and least loaded both keep variance very low. This suggests that if the trace were to have a higher cutoff for the interarrival rate, the mean would suffer for the randomized algorithm, but round robin and least loaded would stay more or less in lockstep.
Takeaway: If you are concerned about degenerate conditions when dealing with homogeneous jobs of about constant duration, use either the round-robin or the least-loaded routing algorithm.
Trace 4: Modal Duration / Linearly Increasing Arrival Rate
Finally, to simulate somewhat more realistic degeneracy conditions, I combined the constant arrival modal trace with the linear increasing trace. The result is a trace with linearly decreasing interarrival rates (up to a threshold), with job duration sampled from $\{100, 500, 1000\}$ms, uniformly at random.
Result 4: Under realistic degenerate conditions, random < round robin < least loaded
Again, all three have similar means, but the variance tells. Here, we see that depending on the particular scenario, round robin may perform as well as least loaded (as in the first spike), which will occur when the optimal assignment happens to be the round robin assignment. But when the optimal assignment is different, round robin may perform quite poorly, as in the second spike, when the variance for round robin spikes. Our conclusion is that under heavy load, round robin would not degrade gracefully (and definitely not as gracefully as the least loaded balancer).
Result 4: Under realistic degenerate conditions, random < round robin < least loaded
Again, all three have similar means, but the variance tells. Here, we see that depending on the particular scenario, round robin may perform as well as least loaded (as in the first spike), which will occur when the optimal assignment happens to be the round robin assignment. But when the optimal assignment is different, round robin may perform quite poorly, as in the second spike, when the variance for round robin spikes. Our conclusion is that under heavy load, round robin would not degrade gracefully (and definitely not as gracefully as the least loaded balancer).
Takeaway: If you are concerned about degenerate conditions when dealing with heterogeneous jobs of varying duration, use the least-loaded routing algorithm.