Overview
Cloud computing’s carbon footprint grows as demand for compute scales. The UK has committed to net-zero greenhouse-gas emissions by 2050, a target that puts pressure on every energy-consuming sector, datacentres included. One practical mechanism for reducing cloud carbon emissions is carbon-aware temporal workload shifting: delaying batch jobs so that they run during periods of lower carbon intensity on the electricity grid. Several scheduling policies for this approach exist in the research literature, ranging from simple lowest-slot selection to threshold-based deferral and window-optimal search. Before this project, however, there was no systematic basis for choosing among them: a cloud engineer had no evidence-based starting point for picking a policy given their workload composition, season, or grid context.
This project built a discrete-event simulation framework that ingests real GB carbon-intensity traces (NESO 5-minute resolution data) and HPC workload logs (RICC-2010-2 from the Parallel Workloads Archive), models four scheduling policies under a common interface, and runs every (workload profile × seasonal carbon-trace window × policy) combination. The result is a recommendation matrix that maps each (workload profile × seasonal window) cell to the policy with the lowest operational carbon emissions in simulation.
The work sits at the intersection of cloud systems and practical sustainability. The research gap, the absence of a systematic multi-dimensional policy comparison, was identified by surveying the literature: GAIA’s three-way cost/carbon/performance treatment stood in contrast with the narrower per-policy analyses in other works. Building a simulator rather than running an empirical study was motivated by the need for controlled, reproducible comparisons across several variables: running a real HPC workload against five seasonal carbon traces under four policies is not feasible without simulation.
The primary audience is cloud engineers scheduling deferrable batch workloads on the GB grid who want evidence-based guidance on which temporal shifting policy suits their specific operational context. The secondary audience is researchers who may extend the simulator — hence the modular, plugin-extensible architecture.
Approach & Architecture
The simulator uses a tick-based discrete-event model: 1 tick = 5 seconds, so each 5-minute carbon-intensity interval spans 60 ticks and each hour spans 720. This resolution was chosen so that every carbon-trace observation maps to exactly 60 ticks with no fractional remainder. It also preserves the integral invariant: dividing an hourly intensity by 720 ticks per hour and summing the per-tick values gives back the same total carbon cost, so tasks that start or finish mid-interval are accounted for accurately. Continuous-time simulation was considered but rejected because event queues and floating-point time arithmetic would have complicated both determinism and cross-experiment comparison.
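The tick arithmetic above can be illustrated with a minimal sketch. The function names (`expand_to_ticks`, `task_carbon_kg`) and the unit convention (per-tick values are hourly intensity divided by 720) are assumptions made for illustration, not the simulator's actual API.

```python
import math

TICK_SECONDS = 5
TICKS_PER_INTERVAL = 300 // TICK_SECONDS   # 60 ticks per 5-minute observation
TICKS_PER_HOUR = 3600 // TICK_SECONDS      # 720 ticks per hour


def expand_to_ticks(interval_intensities):
    """Expand 5-minute observations (kgCO2/kWh) into per-tick values.

    Each per-tick value is the hourly-scale intensity divided by 720, so
    summing the ticks of one hour gives back the hourly-scale figure.
    """
    ticks = []
    for value in interval_intensities:
        ticks.extend([value / TICKS_PER_HOUR] * TICKS_PER_INTERVAL)
    return ticks


def task_carbon_kg(ticks, start_tick, duration_ticks, power_kw):
    """Carbon for a constant-power task: power times the per-tick sum."""
    window = ticks[start_tick:start_tick + duration_ticks]
    return power_kw * math.fsum(window)


# One hour of data: six intervals at 0.12 then six at 0.24 kgCO2/kWh.
hour = expand_to_ticks([0.12] * 6 + [0.24] * 6)
full_hour = task_carbon_kg(hour, 0, TICKS_PER_HOUR, power_kw=1.0)
```

Under these assumptions a 1 kW task running for the whole hour emits the mean of the twelve observations (0.18 kg), and a task that starts mid-interval is simply a different slice of the same tick array.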
Seven principal components make up the modular pipeline: Carbon Data Module, Workload Model, Simulation Engine, Policy layer, Accounting Engine, Metrics & Evaluation Engine, and Visualisation Layer. Each has a single responsibility and communicates through typed data contracts: frozen (immutable) dataclasses for Task and CarbonTrace. As a result, the simulation core is identical across all experiments; only the policy and configuration change, which keeps comparisons valid.
Policy extensibility is handled by a plugin registry. A string key in POLICY_REGISTRY bridges the YAML config to policy classes; all four policies implement a single abstract interface (schedule(task, current_tick, trace) → ScheduleDecision) enforced by Python ABCs. Adding a new policy requires implementing one abstract class and adding one registry entry — no core simulation file is touched. The four policies span a range of lookahead strategies: Immediate (no deferral), LowestSlot (lowest-intensity slot within a waiting window), ThresholdDeferral (defer until intensity drops below a threshold), and WindowOptimal (minimum carbon integral over the window).
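A minimal sketch of the registry pattern described above follows. Only `POLICY_REGISTRY`, `ScheduleDecision`, `Task`, and the `schedule(task, current_tick, trace)` signature come from the text; the field names, the registry keys, and the `build_policy` helper are hypothetical, and only two of the four policies are shown.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Task:
    task_id: int
    duration_ticks: int
    waiting_window_ticks: int  # hypothetical field: maximum deferral


@dataclass(frozen=True)
class ScheduleDecision:
    start_tick: int  # hypothetical field


class Policy(ABC):
    @abstractmethod
    def schedule(self, task: Task, current_tick: int,
                 trace: Sequence[float]) -> ScheduleDecision:
        """Return the tick at which the task should start."""


class Immediate(Policy):
    def schedule(self, task, current_tick, trace):
        return ScheduleDecision(start_tick=current_tick)


class LowestSlot(Policy):
    def schedule(self, task, current_tick, trace):
        # Assumes current_tick lies inside the trace; ties go to the
        # earliest tick because min() returns the first minimum.
        last = min(current_tick + task.waiting_window_ticks, len(trace) - 1)
        best = min(range(current_tick, last + 1), key=lambda t: trace[t])
        return ScheduleDecision(start_tick=best)


# String keys as they might appear in the YAML config (assumed names).
POLICY_REGISTRY = {"immediate": Immediate, "lowest_slot": LowestSlot}


def build_policy(name: str) -> Policy:
    try:
        return POLICY_REGISTRY[name]()
    except KeyError:
        known = sorted(POLICY_REGISTRY)
        raise ValueError(f"unknown policy {name!r}; known: {known}") from None
```

In this sketch, adding a policy means writing one subclass and one dictionary entry. Because `schedule` is abstract, a subclass that forgets to implement it fails at instantiation rather than mid-simulation.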
CarbonWindow is an intermediate object that wraps the relevant trace slice for each policy call, returns a fallback intensity for out-of-range ticks, and centralises boundary logic. This decouples policies from trace length. The design was motivated by the observation that the policies need very different views of the trace: Immediate needs none, ThresholdDeferral needs only the current tick’s value, and WindowOptimal needs a slice of width waiting_window + task_duration. A single interface that passed the full trace would have forced each policy to handle its own boundary arithmetic.
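One way such a window object could look is sketched below. The field names, the `intensity` and `values` methods, and the `window_optimal_offset` helper are assumptions for illustration; only the name CarbonWindow, the fallback behaviour, and the waiting_window + task_duration width come from the description above.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class CarbonWindow:
    trace: Sequence[float]
    start_tick: int
    width: int        # e.g. waiting_window + task_duration, in ticks
    fallback: float   # intensity reported for ticks outside the trace

    def intensity(self, offset: int) -> float:
        """Intensity at start_tick + offset, or the fallback if off-trace."""
        tick = self.start_tick + offset
        if 0 <= tick < len(self.trace):
            return self.trace[tick]
        return self.fallback

    def values(self) -> List[float]:
        return [self.intensity(i) for i in range(self.width)]


def window_optimal_offset(window: CarbonWindow, task_duration: int) -> int:
    """Start offset minimising summed intensity over the task's run.

    Brute force over every feasible offset; assumes
    task_duration <= window.width.
    """
    values = window.values()
    candidates = range(len(values) - task_duration + 1)
    return min(candidates,
               key=lambda off: sum(values[off:off + task_duration]))


trace = [5.0, 4.0, 1.0, 1.0, 6.0]
inside = CarbonWindow(trace, start_tick=0, width=5, fallback=9.0)
at_edge = CarbonWindow(trace, start_tick=3, width=5, fallback=9.0)
```

The policy code never indexes the trace directly, so a window that runs past the end of the data (`at_edge`) is padded with the fallback value instead of raising an index error.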
The experiment runner is YAML-driven and generates every combination in the Cartesian product of the axes defined in a single config file. The headline matrix covers 60 runs (3 workload profiles × 5 seasonal windows × 4 policies). Each run writes its output independently, so a partial failure does not require re-running completed cells. A two-path simulation dispatch distinguishes unlimited capacity (the headline matrix) from a constrained three-tier CapacityPool with real CPU budgets and spot-first allocation. The capacity layer is implemented and tested, but the constrained matrix sweep is left as future work.
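The run expansion can be sketched with `itertools.product`. The dictionary below stands in for the parsed YAML file; its keys, the placeholder profile and window names, the policy key strings, and the `run_id` format are all hypothetical.

```python
from itertools import product

# Stand-in for the parsed YAML config (all names are placeholders).
config = {
    "workload_profiles": ["profile_a", "profile_b", "profile_c"],
    "seasonal_windows": ["window_1", "window_2", "window_3",
                         "window_4", "window_5"],
    "policies": ["immediate", "lowest_slot",
                 "threshold_deferral", "window_optimal"],
}


def expand_runs(cfg):
    """Return one run description per point in the Cartesian product."""
    runs = []
    for workload, window, policy in product(cfg["workload_profiles"],
                                            cfg["seasonal_windows"],
                                            cfg["policies"]):
        runs.append({
            "run_id": f"{workload}__{window}__{policy}",
            "workload": workload,
            "window": window,
            "policy": policy,
        })
    return runs


def pending_runs(runs, completed_ids):
    """Skip cells whose output already exists, so reruns are incremental."""
    return [run for run in runs if run["run_id"] not in completed_ids]


all_runs = expand_runs(config)
todo = pending_runs(all_runs, completed_ids={"profile_a__window_1__immediate"})
```

Giving every cell a stable identifier is what lets a failed sweep resume where it stopped: only the cells without recorded output are scheduled again.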
Carbon traces are stored as NumPy arrays rather than Python lists, a change made after profiling showed that per-tick Python loops in total_carbon() were the dominant runtime cost on the full RICC trace. Determinism is enforced by seeding each task’s RNG independently from a Knuth hash of the global seed and the task ID, so spot-preemption outcomes are fixed per task regardless of execution order.
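A hedged sketch of per-task seeding follows. The constant 2654435761 is the multiplier commonly cited for Knuth's multiplicative hash; the exact way the global seed and task ID are mixed here, and all function names, are assumptions and may differ from the project's implementation.

```python
import random

KNUTH_MULTIPLIER = 2654435761  # commonly cited multiplicative-hash constant
MASK32 = 0xFFFFFFFF


def task_seed(global_seed: int, task_id: int) -> int:
    """Derive a 32-bit per-task seed (one possible mixing, assumed)."""
    mixed = (global_seed * KNUTH_MULTIPLIER + task_id) & MASK32
    return (mixed * KNUTH_MULTIPLIER) & MASK32


def spot_preempted(global_seed: int, task_id: int, probability: float) -> bool:
    """Draw the preemption outcome from an RNG owned by this task alone."""
    rng = random.Random(task_seed(global_seed, task_id))
    return rng.random() < probability


def preemption_outcomes(global_seed, task_ids, probability):
    return {tid: spot_preempted(global_seed, tid, probability)
            for tid in task_ids}


forward = preemption_outcomes(7, [1, 2, 3], probability=0.3)
shuffled = preemption_outcomes(7, [3, 1, 2], probability=0.3)
```

Because no RNG state is shared between tasks, `forward` and `shuffled` hold the same outcome for every task ID, whichever order the tasks were processed in.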
Development & Learning
Seven of eight project objectives were fully achieved across two semesters: Semester 1 for architecture and core engine, Semester 2 for experiments, evaluation, and dissertation. The simulator ingests NESO GB carbon-intensity traces and the RICC-2010-2 HPC workload trace, implements all four scheduling policies, models three instance types with stochastic spot preemption, enforces deterministic reproducibility, and produces a colour-coded recommendation matrix heatmap. 665 tests pass across 18 test files at project completion.
Non-deterministic output with identical seeds was the most consequential technical problem encountered. The simulator produced different output values on different runs despite an identical seed, even though unit tests and golden-trace regression tests all passed. Tracing through three call-stack layers revealed the cause: Python’s set iteration order is not guaranteed and, for hash-randomised elements such as strings, can differ between interpreter runs. (Dict iteration follows insertion order since Python 3.7, so a dict is only as deterministic as the order in which it was filled.) The fix was to replace all set-based collections in the hot path with sorted lists. Determinism was then confirmed by test_determinism.py: running the same seed twice produces byte-identical output. The lesson is that reproducibility requires ordering every collection the seeded code iterates over, not just seeding the RNG. This bug would have been invisible without both a dedicated determinism test suite and real-data runs; synthetic golden traces passed throughout.
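The failure mode can be reproduced in miniature. In this sketch (function names are hypothetical) a single seeded RNG hands out one draw per task, so the result depends on the order in which the collection is iterated; sorting first removes that dependence.

```python
import random


def draws_in_iteration_order(task_ids, seed):
    """Order-sensitive: the n-th task iterated receives the n-th draw."""
    rng = random.Random(seed)
    return {task_id: rng.random() for task_id in task_ids}


def draws_in_sorted_order(task_ids, seed):
    """Order-independent: iterate a sorted list, whatever the input order."""
    rng = random.Random(seed)
    return {task_id: rng.random() for task_id in sorted(task_ids)}


# Same seed, same tasks, different iteration order.
a_first = draws_in_iteration_order(["job-a", "job-b"], seed=42)
b_first = draws_in_iteration_order(["job-b", "job-a"], seed=42)

# The sorted variant gives the same mapping for a set or any list ordering.
from_set = draws_in_sorted_order({"job-b", "job-a", "job-c"}, seed=42)
from_list = draws_in_sorted_order(["job-c", "job-a", "job-b"], seed=42)
```

Here `a_first` and `b_first` disagree although the seed is identical, which is the symptom described above. `from_set` and `from_list` agree, which is the property a determinism test can assert.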
ThresholdDeferral returning 0% savings in all 15 matrix cells was a subtler problem that surfaced only during the Wave 5 matrix sweep, with no errors or test failures. The policy was comparing hourly-scale threshold values (kgCO₂/kWh) against per-tick intensity values (1/720th of the hourly scale), which made the threshold effectively infinite: every tick was below it, so the policy always scheduled immediately. The fix was applied at the policy level by bringing both values onto the same scale before comparing them, and a regression test was added using a carbon trace whose tick scale and hourly scale differ in known ways. After the fix, ThresholdDeferral correctly produces a 10.1% saving at 1.4h mean delay on the canonical week-27 configuration.
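The unit mismatch is easy to show in isolation. The sketch below assumes the fix rescales the per-tick value to the hourly scale; the real fix may equally have scaled the threshold down, and both function names are hypothetical.

```python
TICKS_PER_HOUR = 720


def runs_now_buggy(tick_intensity: float, threshold_hourly: float) -> bool:
    # Bug: per-tick value (hourly / 720) compared with an hourly threshold.
    return tick_intensity < threshold_hourly


def runs_now_fixed(tick_intensity: float, threshold_hourly: float) -> bool:
    # Bring the tick value back to kgCO2/kWh before comparing.
    return tick_intensity * TICKS_PER_HOUR < threshold_hourly


threshold = 0.100                      # kgCO2/kWh, hourly scale
dirty_tick = 0.250 / TICKS_PER_HOUR    # grid well above the threshold
clean_tick = 0.050 / TICKS_PER_HOUR    # grid below the threshold
```

With the buggy comparison both ticks look clean, so nothing is ever deferred. With matching units only `clean_tick` passes, and a regression test can pin that behaviour.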
Designing a unified policy interface for policies with fundamentally different lookahead requirements was the third significant challenge. The resolution — the CarbonWindow intermediate object — was confirmed correct when ThresholdDeferral was added as a fourth policy without requiring changes to any of the existing three. In retrospect, this intermediate object should have been designed before implementing any policy; retrofitting it onto LowestSlot and Immediate after the fact required re-testing both.
The central finding of the 60-run sweep is counter-intuitive: seasonal carbon-trace shape matters more than workload type in determining which policy wins. WindowOptimal is recommended in 14 of 15 matrix cells, with mean savings ranging from 6.7% to 24.5% across seasons. ThresholdDeferral’s 10.1% saving at only 1.4h mean delay, against WindowOptimal’s 27.0% at 9.9h, marks a tunable point on the carbon/delay Pareto trade-off that may be more practical for latency-sensitive workloads. The threshold sensitivity sweep reveals a monotonic carbon/delay trade-off with a notable discontinuity at 0.100 kgCO₂/kWh, corresponding to business-hours task clustering in the real workload trace.