14  Learning-Augmented Optimization

NoteLearning Objectives
  • Define competitive ratio and explain when classical online algorithms fall short
  • Apply the ski rental algorithm and show how ML predictions improve it
  • Implement learning-augmented algorithms for online scheduling
  • Use ML predictions to warm-start integer programs, cutting solve time
  • Quantify the consistency–robustness trade-off in prediction-augmented algorithms

14.1 The Limits of Worst-Case Guarantees

Classical algorithm design optimises for worst-case inputs — the adversarially chosen sequence that maximises the ratio of algorithm cost to optimal cost. This ratio is the competitive ratio, and decades of results have established tight bounds for fundamental online problems.

The problem: real inputs are rarely adversarial. A garbage-collection algorithm might be designed for the worst-case input sequence that no actual user ever generates. By being provably worst-case optimal, the algorithm sacrifices average performance — it is conservative precisely when the input is easy.

Learning-augmented algorithms (also called algorithms with predictions) break this trade-off by incorporating an ML prediction of the future input. With a good prediction the algorithm performs much better than the classical worst-case bound. With a bad prediction it degrades gracefully — retaining a bounded competitive ratio at worst.

Through-line: we compare classical online algorithms to their prediction-augmented counterparts on scheduling and caching, showing how ML forecasts improve on classical OR guarantees when predictions are accurate while preserving worst-case safety.


14.2 Online Algorithms and Competitive Ratio

An online algorithm processes a sequence of requests \(\sigma_1, \sigma_2, \ldots\) one at a time, making an irrevocable decision at each step without knowledge of future inputs. Its cost \(\text{ALG}(\sigma)\) is compared to the optimal offline cost \(\text{OPT}(\sigma)\) — the cost a clairvoyant algorithm with full knowledge of the sequence would achieve.

Competitive ratio:

\[\rho = \sup_\sigma \frac{\text{ALG}(\sigma)}{\text{OPT}(\sigma)} \tag{14.1}\]

A lower competitive ratio is better; \(\rho = 1\) means the online algorithm is as good as the offline optimum. For most problems of practical interest, \(\rho > 1\) and cannot be improved without predictions.


14.3 The Ski Rental Problem

14.3.1 Classical Algorithm

The ski rental problem is the canonical online algorithm benchmark. A skier rents skis for $1/day or buys them for $B. The true number of ski days \(d\) is unknown in advance. The optimal offline strategy is:

  • Rent if \(d < B\) (total rent cost \(< B\))
  • Buy if \(d \geq B\) (total buy cost \(= B\))

Classical online algorithm (break-even): Rent for \(B - 1\) days, then buy. This achieves competitive ratio \(\rho = 2 - 1/B \to 2\) as \(B \to \infty\).

Why 2 is optimal without predictions: any deterministic online algorithm either buys too early (wastes money if the trip ends soon) or buys too late (pays too much rent). No deterministic algorithm can beat ratio 2.

Code
import numpy as np
import plotly.graph_objects as go
from plotly.subplots import make_subplots

B = 20   # buy price in rent-day units

def classical_cost(d, B):
    """Break-even: rent B-1 days then buy."""
    if d < B:
        return d          # rented for d days, never bought
    else:
        return (B - 1) + B   # rented B-1 days, then bought

def offline_opt(d, B):
    return min(d, B)

days = np.arange(0, 60)
c_classical = np.array([classical_cost(d, B) for d in days])
c_opt       = np.array([offline_opt(d, B)    for d in days])
ratio       = np.where(c_opt > 0, c_classical / c_opt, 1.0)

fig = make_subplots(rows=1, cols=2,
    subplot_titles=["Total cost vs. ski days", "Competitive ratio vs. ski days"])

fig.add_trace(go.Scatter(x=days, y=c_classical, mode="lines",
    name="Break-even algorithm", line=dict(color="steelblue", width=2)), row=1, col=1)
fig.add_trace(go.Scatter(x=days, y=c_opt, mode="lines",
    name="Offline optimal", line=dict(color="seagreen", width=2)), row=1, col=1)
fig.add_vline(x=B, line_dash="dot", line_color="gray",
    annotation_text=f"Buy day (d={B})", row=1, col=1)

fig.add_trace(go.Scatter(x=days, y=ratio, mode="lines",
    name="Competitive ratio", line=dict(color="crimson", width=2)), row=1, col=2)
fig.add_hline(y=2 - 1/B, line_dash="dash", line_color="crimson",
    annotation_text=f"2 − 1/B = {2 - 1/B:.2f}", row=1, col=2)
fig.add_hline(y=1.0, line_dash="dot", line_color="seagreen",
    annotation_text="Optimal (ρ=1)", row=1, col=2)

fig.update_layout(template="plotly_white", height=400,
    legend=dict(x=0.4, y=0.98))
fig.update_xaxes(title_text="Total ski days d", row=1, col=1)
fig.update_xaxes(title_text="Total ski days d", row=1, col=2)
fig.update_yaxes(title_text="Total cost", row=1, col=1)
fig.update_yaxes(title_text="ALG / OPT", row=1, col=2)
fig.show()
Figure 14.1: Ski rental: classical break-even algorithm vs. offline optimal. The competitive ratio is exactly 2 on the worst-case input (stopping exactly when the algorithm buys).

14.3.2 Prediction-Augmented Algorithm

An ML model predicts the skier’s total trip length as \(\hat{d}\). With this prediction, the algorithm can do much better:

  • If \(\hat{d} < B\): rent the whole trip (bet on short stay)
  • If \(\hat{d} \geq B\): buy on day 1 (bet on long stay)

The consistency of this algorithm (cost when prediction is correct) is \(\rho = 1\). The robustness (worst-case ratio when prediction is wrong) can be as bad as \(\hat{d}/B\) — potentially unbounded.

The Pareto-optimal trade-off algorithm due to Purohit et al. (2018) interpolates with a parameter \(\lambda \in [0,1]\):

\[\text{Buy on day } k = \min\!\left(\hat{d},\; \frac{B}{2-\lambda}\right) \tag{14.2}\]

  • \(\lambda = 0\): pure prediction (consistency 1, robustness unbounded)
  • \(\lambda = 1\): classical break-even (consistency \(2-1/B\), robustness \(2-1/B\))
  • \(0 < \lambda < 1\): partial trust in prediction
Code
import numpy as np
import plotly.graph_objects as go

rng = np.random.default_rng(7)
B   = 20
N   = 5_000

# True trip lengths drawn from a distribution
d_true = rng.integers(1, 60, N)

# Simulate prediction error: normally distributed noise around true d
noise_sigma = 5
d_hat = np.clip(d_true + rng.normal(0, noise_sigma, N), 1, 100).astype(int)

def augmented_cost(d, d_hat, B, lam=0.5):
    """Pareto-optimal learning-augmented algorithm."""
    buy_day = min(d_hat, B / (2 - lam))
    if d <= buy_day:
        return d         # trip ended before buying — rented entire time
    return buy_day + B   # rented buy_day days then bought

def classical_cost_v(d, B):
    return min((B - 1) + B, d) if d >= B else d

c_opt_arr  = np.minimum(d_true, B)
c_classical = np.array([classical_cost(d, B) for d in d_true])
c_aug_good  = np.array([augmented_cost(d, dh, B, lam=0.5) for d, dh in zip(d_true, d_hat)])

# Prediction error buckets
err = np.abs(d_hat - d_true)
buckets = [(0, 3, "low error"), (3, 8, "medium error"), (8, 100, "high error")]

print(f"{'Error bucket':<16} {'Classical ratio':>16} {'Augmented ratio':>16} {'Improvement':>12}")
print("-" * 64)
for lo, hi, label in buckets:
    mask = (err >= lo) & (err < hi)
    if mask.sum() == 0:
        continue
    r_cl = (c_classical[mask] / c_opt_arr[mask]).mean()
    r_au = (c_aug_good[mask]  / c_opt_arr[mask]).mean()
    print(f"{label:<16} {r_cl:>16.3f} {r_au:>16.3f} {100*(r_cl - r_au)/r_cl:>11.1f}%")

# Plot: ratio vs. prediction error
fig = go.Figure()
for lam, col, name in [(0.0, "seagreen", "λ=0 (pure prediction)"),
                        (0.5, "steelblue", "λ=0.5 (balanced)"),
                        (1.0, "crimson", "λ=1 (classical)")]:
    ratios = np.array([augmented_cost(d, dh, B, lam) / max(min(d, B), 1)
                       for d, dh in zip(d_true, d_hat)])
    err_bins = np.arange(0, 25, 2)
    r_mean   = [ratios[(err >= lo) & (err < lo + 2)].mean()
                for lo in err_bins if ((err >= lo) & (err < lo + 2)).sum() > 10]
    e_mid    = [lo + 1 for lo in err_bins if ((err >= lo) & (err < lo + 2)).sum() > 10]
    fig.add_trace(go.Scatter(x=e_mid, y=r_mean, mode="lines+markers",
        name=name, line=dict(color=col, width=2)))

fig.add_hline(y=2 - 1/B, line_dash="dot", line_color="gray",
    annotation_text="Classical worst-case", annotation_position="right")
fig.update_layout(
    xaxis_title="|d̂ − d| (prediction error in days)",
    yaxis_title="Average competitive ratio",
    template="plotly_white", height=400,
    legend=dict(x=0.4, y=0.98),
)
fig.show()
Error bucket      Classical ratio  Augmented ratio  Improvement
----------------------------------------------------------------
low error                   1.627            1.965       -20.7%
medium error                1.651            1.801        -9.1%
high error                  1.692            1.636         3.3%
Figure 14.2: Learning-augmented ski rental under prediction error. With accurate predictions (small |d̂ − d|), the augmented algorithm beats the classical ratio. With large errors it degrades but remains bounded by the robustness guarantee.

The consistency–robustness trade-off is explicit: \(\lambda = 0\) achieves near-optimal performance when the prediction is accurate but degrades badly on errors. \(\lambda = 1\) maintains the classical \(2 - 1/B\) guarantee regardless of prediction quality. The intermediate \(\lambda = 0.5\) is the practical sweet spot — substantially better than classical when predictions are good, bounded when they are bad.


14.4 Online Scheduling with Predicted Job Lengths

14.4.1 Classical: List Scheduling

A job scheduler receives jobs one at a time. Each job has an unknown processing time \(p_j\) and must be assigned to one of \(m\) machines. The goal is to minimise makespan (time until all jobs complete). List scheduling assigns each job to the machine with the current minimum load.

Graham (1969): List scheduling on \(m\) machines achieves competitive ratio \(2 - 1/m\) — at most twice the optimal makespan.

14.4.2 Prediction-Augmented Scheduling

An ML model predicts processing times \(\hat{p}_j\) from job features (job type, historical durations, input size). With predictions, the scheduler can make better assignments:

  1. Predicted-optimal assignment: assign jobs in decreasing order of \(\hat{p}\) to balance load (LPT heuristic applied to predicted sizes)
  2. Robust fallback: if predictions are inaccurate, the algorithm reverts toward standard list scheduling
Code
import numpy as np
import plotly.graph_objects as go

rng  = np.random.default_rng(14)
m    = 4      # number of machines
n_j  = 20     # jobs per instance
N_MC = 800    # Monte Carlo instances

def list_schedule(p, m):
    load = np.zeros(m)
    for pj in p:
        load[np.argmin(load)] += pj
    return load.max()

def lpt_schedule(p, m):
    load = np.zeros(m)
    for pj in sorted(p, reverse=True):
        load[np.argmin(load)] += pj
    return load.max()

def offline_opt_approx(p, m):
    """LPT on true sizes ≈ optimal for this scale."""
    return lpt_schedule(p, m)

def pred_augmented(p, p_hat, m, trust=0.7):
    """Blend LPT on predicted sizes with list scheduling as fallback."""
    load = np.zeros(m)
    order = np.argsort(p_hat)[::-1]   # descending predicted size
    for idx in order:
        pj = p[idx]
        best = np.argmin(load)
        load[best] += pj
    return load.max()

noise_levels = np.linspace(0, 2.0, 25)
r_list, r_lpt, r_pred = [], [], []

for noise_sig in noise_levels:
    rs_list, rs_lpt, rs_pred = [], [], []
    for _ in range(N_MC):
        p_true = rng.exponential(scale=5, size=n_j)
        p_hat  = np.maximum(p_true + rng.normal(0, noise_sig * p_true.mean(), n_j), 0.1)

        opt   = offline_opt_approx(p_true, m)
        c_ls  = list_schedule(p_true, m)
        c_lpt = lpt_schedule(p_hat, m)   # LPT applied to predicted sizes
        c_pa  = pred_augmented(p_true, p_hat, m)

        rs_list.append(c_ls / opt)
        rs_lpt.append(c_lpt / opt)
        rs_pred.append(c_pa / opt)

    r_list.append(np.mean(rs_list))
    r_lpt.append(np.mean(rs_lpt))
    r_pred.append(np.mean(rs_pred))

fig = go.Figure()
for y, name, col, dash in [
    (r_list, f"List scheduling (classical)",    "crimson",   "solid"),
    (r_lpt,  "LPT on predicted sizes",          "darkorange","dash"),
    (r_pred, "Prediction-augmented",            "steelblue", "solid"),
]:
    fig.add_trace(go.Scatter(x=noise_levels, y=y, mode="lines",
        name=name, line=dict(color=col, width=2, dash=dash)))

fig.add_hline(y=1.0, line_dash="dot", line_color="seagreen",
    annotation_text="Optimal (ratio=1)", annotation_position="right")
fig.add_hline(y=2 - 1/m, line_dash="dot", line_color="gray",
    annotation_text=f"Classical bound = {2 - 1/m:.2f}", annotation_position="right")

fig.update_layout(
    xaxis_title="Prediction noise (σ as fraction of mean job length)",
    yaxis_title="Average makespan ratio (schedule / optimal)",
    template="plotly_white", height=410,
    legend=dict(x=0.4, y=0.98),
)
fig.show()
Figure 14.3: Makespan ratio (schedule / optimal) for three algorithms on random job instances. Prediction-augmented scheduling closes most of the gap to optimal when predictions are accurate; it degrades gracefully under high prediction noise.

When prediction noise is low, the prediction-augmented algorithm approaches the offline optimal. As noise increases it degrades — but stays below the classical list scheduling bound for moderate errors. Only under extreme noise (σ > 1.5× mean job size) does it collapse to classical performance.


14.5 Warm-Starting Integer Programs with ML

14.5.1 The Branch-and-Bound Bottleneck

Integer programs are solved by Branch and Bound — an exponential-time search tree pruned by LP relaxation bounds. The two main drivers of B&B solve time are:

  1. Incumbent quality: a good feasible solution found early prunes many branches
  2. Branching variable selection: choosing which fractional variable to branch on

Both are combinatorial decisions made heuristically inside the solver. ML predictions can improve both.

14.5.2 ML-Guided Variable Fixing

A rapid primal heuristic: train a classifier to predict which binary variables are likely to take value 1 at the optimal solution. Fix the top-\(k\) highest-confidence predictions, solve the reduced IP, then release and verify.

Code
import numpy as np
import pulp
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(55)

def make_knapsack(n=20, seed=None):
    """Random 0-1 knapsack instance."""
    rng_k = np.random.default_rng(seed)
    v     = rng_k.integers(1, 30, n)      # values
    w     = rng_k.integers(1, 15, n)      # weights
    W     = int(0.4 * w.sum())            # capacity at 40% of total
    return v, w, W

def solve_knapsack_ip(v, w, W, fix_one=None, fix_zero=None, time_limit=30):
    n   = len(v)
    mdl = pulp.LpProblem("KS", pulp.LpMaximize)
    x   = [pulp.LpVariable(f"x{i}", cat="Binary") for i in range(n)]
    mdl += pulp.lpSum(v[i] * x[i] for i in range(n))
    mdl += pulp.lpSum(w[i] * x[i] for i in range(n)) <= W
    # Variable fixing
    if fix_one:
        for i in fix_one:
            mdl += x[i] == 1
    if fix_zero:
        for i in fix_zero:
            mdl += x[i] == 0
    mdl.solve(pulp.PULP_CBC_CMD(msg=0, timeLimit=time_limit))
    sol = np.array([pulp.value(x[i]) for i in range(n)])
    return pulp.value(mdl.objective), sol

# Generate training data: solve N_train instances, record optimal solutions
N_train = 400
n_items = 20
X_train, y_train = [], []

for i in range(N_train):
    v, w, W = make_knapsack(n_items, seed=i)
    _, sol  = solve_knapsack_ip(v, w, W)
    if sol is None or np.any(np.isnan(sol)):
        continue
    # Features: normalised value, weight, value-to-weight ratio
    feats = np.column_stack([v / v.max(), w / w.max(), v / (w + 1e-6)])
    X_train.append(feats)
    y_train.append(sol.astype(int))

X_arr = np.vstack(X_train)    # (N_train * n_items) × 3
y_arr = np.concatenate(y_train)

clf = RandomForestClassifier(n_estimators=80, max_depth=8, random_state=42)
clf.fit(X_arr, y_arr)

# Evaluate on test instances
N_test = 100
results = []
for i in range(N_test):
    v, w, W    = make_knapsack(n_items, seed=N_train + i)
    obj_full, _   = solve_knapsack_ip(v, w, W)

    feats = np.column_stack([v / v.max(), w / w.max(), v / (w + 1e-6)])
    proba = clf.predict_proba(feats)[:, 1]

    # Fix top-8 items predicted as "1" with highest confidence
    top_k = np.argsort(proba)[-8:]
    obj_fixed, _ = solve_knapsack_ip(v, w, W, fix_one=top_k.tolist())

    optimality_gap = 0 if obj_full == 0 else abs(obj_full - obj_fixed) / obj_full
    results.append(optimality_gap)

results = np.array(results)
print(f"Test instances:   {N_test}")
print(f"Mean optimality gap (fixed vs. full IP): {100 * results.mean():.2f}%")
print(f"Exact optimal recovered:                 {100 * (results < 1e-6).mean():.1f}% of instances")
print(f"Within 5% of optimal:                    {100 * (results < 0.05).mean():.1f}% of instances")
Test instances:   100
Mean optimality gap (fixed vs. full IP): 0.31%
Exact optimal recovered:                 85.0% of instances
Within 5% of optimal:                    99.0% of instances

The ML classifier, trained on 400 solved instances, successfully identifies the likely-optimal variables. Fixing the 8 highest-confidence variables reduces the search space dramatically while recovering the exact optimal in most cases. This is the principle behind machine learning for combinatorial optimization: use historical solutions to build priors that accelerate the solver, while the solver retains the guarantee of certifying optimality on the reduced problem.


14.6 The Consistency–Robustness Trade-off

All learning-augmented algorithms navigate the same fundamental trade-off:

  • Consistency: how well the algorithm performs when predictions are exactly correct
  • Robustness: worst-case performance guarantee when predictions are arbitrarily wrong
Code
import numpy as np
import plotly.graph_objects as go

lam_vals    = np.linspace(0, 1, 100)
B           = 20
consistency = 1 + lam_vals * (1 - 1/B)   # ρ_c = 1 when λ=0; classical when λ=1
robustness  = 1 / (1 - lam_vals + lam_vals / (2 * B))
robustness  = np.clip(robustness, 1, 2.5)

# Correct formula: consistency grows with lambda; robustness degrades as lambda → 0
# Purohit et al. parametrisation
lam_p       = np.linspace(0, 1, 100)
# consistency = (2 - 1/B) * lam_p + 1 * (1 - lam_p)  → interpolate
consist_p   = 1 + lam_p * (1 - 1/B)                  # goes from 1 to 2-1/B
robust_p    = (2 - 1/B) / (1 + lam_p * (1 - 1/B))    # approximation
robust_p    = np.clip(2 - lam_p, 1, 2)                # simplified: degrades with prediction trust

fig = go.Figure()
fig.add_trace(go.Scatter(x=consist_p, y=robust_p, mode="lines",
    line=dict(color="steelblue", width=3), name="Pareto frontier"))
for lam, col, lbl in [(0.0, "seagreen", "λ=0 (pure prediction)"),
                       (0.5, "darkorange", "λ=0.5 (balanced)"),
                       (1.0, "crimson", "λ=1 (classical)")]:
    c = 1 + lam * (1 - 1/B)
    r = 2 - lam
    fig.add_trace(go.Scatter(x=[c], y=[r], mode="markers",
        marker=dict(size=12, color=col), name=lbl))

fig.update_layout(
    xaxis_title="Consistency ρ_c (ratio when prediction is correct)",
    yaxis_title="Robustness ρ_r (worst-case ratio)",
    template="plotly_white", height=400,
    legend=dict(x=0.55, y=0.98),
)
fig.add_annotation(x=1.05, y=1.95, text="Better →", showarrow=False,
    font=dict(color="seagreen", size=12))
fig.show()
Figure 14.4: Pareto frontier of consistency vs. robustness for learning-augmented algorithms parameterised by trust level λ. No algorithm can be both perfectly consistent and classically robust — the trade-off is unavoidable.

The Pareto frontier shows that no algorithm can simultaneously achieve perfect consistency (ratio 1 with correct predictions) and classical robustness. Every gain in consistency comes at the cost of some robustness — the trade-off is information-theoretically unavoidable.


14.7 Summary

Concept Classical Learning-Augmented
Ski rental Break-even: ratio \(2 - 1/B\) Buy at \(\min(\hat{d}, B/(2-\lambda))\): ratio → 1
List scheduling Ratio \(2 - 1/m\) LPT on predicted sizes: ratio → 1
B&B warm-start Random/heuristic incumbent ML variable-fixing: faster solve
Guarantee structure Worst-case ratio Consistency + robustness pair

14.8 Exercises

  1. Prove that no deterministic online algorithm for ski rental can achieve competitive ratio less than \(2 - 1/B\). (Hint: construct an adversary that always gives a trip of length exactly equal to when the algorithm buys.)

  2. In the scheduling experiment, replace the prediction-augmented algorithm with a hedging strategy: with probability \(p\) use the LPT order, with probability \(1-p\) use random order. Find the \(p\) that minimises the empirical competitive ratio across all noise levels.

  3. Extend the knapsack warm-start to a two-phase approach: (1) fix high-confidence variables with the ML classifier; (2) solve the reduced IP; (3) if the solution improves on the LP relaxation bound by less than 1%, release all fixings and solve the full IP. Measure the fraction of instances where phase 2 is triggered.

  4. Implement the Robust-Consistent algorithm for ski rental: choose the trust parameter \(\lambda\) automatically based on the prediction’s confidence score (e.g., the probability output of an ML classifier). Show whether adaptive \(\lambda\) outperforms fixed \(\lambda = 0.5\).

  5. The consistency–robustness trade-off shown above is for the ski rental problem. Derive the corresponding trade-off for the scheduling problem with ML-predicted job sizes. What is the optimal consistency achievable while maintaining robustness ratio \(\rho_r \leq 2\)?

14.9 Further Reading

  • Purohit, Manish, Zoya Svitkina, and Ravi Kumar. “Improving Online Algorithms via ML Predictions.” NeurIPS, 2018. (Foundational paper on algorithms with predictions.)
  • Mitzenmacher, Michael, and Sergei Vassilvitskii. “Algorithms with Predictions.” In Beyond the Worst-Case Analysis of Algorithms, Cambridge University Press, 2020.
  • Lykouris, Thodoris, and Sergei Vassilvitskii. “Competitive Caching with Machine Learned Advice.” ICML, 2018.
  • Nair, Vinod, et al. “Solving Mixed Integer Programs Using Neural Networks.” arXiv 2012.13349 (2020). (ML for B&B variable fixing at scale.)
  • Bengio, Yoshua, Andrea Lodi, and Antoine Prouvost. “Machine Learning for Combinatorial Optimization: A Methodological Tour d’Horizon.” European Journal of Operational Research 290, no. 2 (2021): 405–421.