Building a neural operator-based surrogate for indoor temperature field reconstruction from sparse wireless sensor data.
All experiments use data collected from Lab 1, a real indoor office environment instrumented with a wireless sensor network of 13 IoT nodes. Each node logs temperature, relative humidity, and RSSI at approximately 10-second intervals. The dataset covers 40 days of continuous measurements, capturing both short-term dynamics and longer-term drift.
Standard neural networks map one fixed-size vector to another. An operator network goes one level higher: it learns a mapping between functions. In our case, the input function is the temperature field observed at 12 of the 13 sensor locations at a given time (one sensor is withheld for evaluation, as explained below), and the output function is the full spatial temperature field — so we can query it at any (x, y) coordinate in the room. This means the model generalises to locations where we have no sensors, effectively reconstructing the continuous temperature field from sparse measurements.
DeepONet (Deep Operator Network, Lu et al. 2021) implements this idea with two coupled sub-networks:
Branch net: encodes the input function — the sensor readings at a fixed set of measurement locations. It produces a latent vector that summarises the current state of the room.
Trunk net: encodes the query location — the (x, y) coordinate where we want to predict the temperature. It produces a basis vector for that position.
The final prediction is the dot product of the branch and trunk outputs (plus a bias), giving a single temperature value at the queried location. By sweeping the query over a dense grid, we reconstruct the entire 2D temperature field.
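The two-network decomposition can be sketched in a few lines of NumPy. This is an illustrative, untrained forward pass with assumed layer sizes (one hidden layer per sub-network, latent dimension 64), not the exact trained architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_init(sizes):
    """Random weights for a small MLP (illustrative, untrained)."""
    return [(rng.standard_normal((m, n)) / np.sqrt(m), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp_forward(params, x):
    for i, (W, b) in enumerate(params):
        x = x @ W + b
        if i < len(params) - 1:
            x = np.tanh(x)
    return x

P = 64                                  # latent (basis) dimension
branch = mlp_init([12, 64, P])          # 12 observed sensor temperatures
trunk = mlp_init([2, 64, P])            # (x, y) query coordinate
bias = 0.0

def deeponet(sensor_values, query_xy):
    """T(query) = <branch(u), trunk(xy)> + bias."""
    b = mlp_forward(branch, sensor_values)   # (P,)
    t = mlp_forward(trunk, query_xy)         # (..., P)
    return t @ b + bias

u = rng.standard_normal(12)                  # one sensor snapshot
grid = np.stack(np.meshgrid(np.linspace(0, 1, 32),
                            np.linspace(0, 1, 32)), axis=-1).reshape(-1, 2)
field = deeponet(u, grid)                    # dense field from one snapshot
print(field.shape)                           # (1024,)
```

Sweeping the same branch vector against a dense grid of trunk queries is exactly the "reconstruct the entire 2D field" step described above.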
Sensor 7 is withheld from the branch input in all of the experiments in this part of the study. The model sees only the other 12 sensors during inference and must reconstruct the temperature at Sensor 7's (x, y) location. This simulates a real-world scenario where a sensor goes offline or is not deployed.
As a first experiment we trained two identical DeepONet models — one on the raw 10-second data and one on 5-minute mean-resampled data — to understand whether input noise level affects reconstruction accuracy.
To better understand the effect of noise, we compared the raw 5-minute data with a smoothed version produced by a Savitzky-Golay (SG) filter, then trained a DeepONet on the filtered signal.
The Savitzky-Golay filter is a digital smoothing technique that fits a low-degree polynomial to a sliding window of data points using least squares, then replaces each point with the polynomial's value at that position. Unlike a simple moving average, it preserves the shape of peaks and edges — important for temperature data where ramp-up and cool-down dynamics carry physical meaning. We use a window of 21 points (~105 minutes) and polynomial order 3.
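The filter can be written down directly from this definition. The sketch below implements the centred-window least-squares fit in NumPy and leaves the edges unfiltered; in practice `scipy.signal.savgol_filter` does the same job. The signal here is a synthetic noisy ramp, not Lab 1 data:

```python
import numpy as np

def savgol_smooth(y, window=21, order=3):
    """Minimal Savitzky-Golay: least-squares polynomial fit per sliding
    window, evaluated at the window centre (edges kept unfiltered)."""
    half = window // 2
    t = np.arange(-half, half + 1)
    # Design matrix with columns [1, t, t^2, t^3]; evaluating the fitted
    # polynomial at t = 0 just reads off the intercept coefficient.
    A = np.vander(t, order + 1, increasing=True)
    coeffs = np.linalg.pinv(A)[0]        # fixed convolution coefficients
    out = y.astype(float).copy()
    for i in range(half, len(y) - half):
        out[i] = coeffs @ y[i - half:i + half + 1]
    return out

# Example: noisy warming ramp at 5-minute resolution (~105-minute window)
rng = np.random.default_rng(1)
steps = np.arange(500)
raw = 22.0 + 0.01 * steps + 0.2 * rng.standard_normal(500)
smooth = savgol_smooth(raw)
```

Because the coefficients come from a polynomial fit rather than a flat average, a ramp like this passes through nearly unchanged while the high-frequency noise is suppressed.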
Temperature and humidity in a room are physically coupled — HVAC systems control both, and occupancy affects both. We investigated whether humidity carries additional information that could improve temperature reconstruction.
Principal Component Analysis (PCA) is a linear dimensionality reduction technique. Given N correlated signals (here: 13 humidity time series), PCA finds a set of orthogonal directions — principal components — ordered by the amount of variance they explain. PC1 is the first principal component: a single time series that is the linear combination of all 13 humidity channels that explains the most variance. Because all sensors are in the same room and strongly correlated, PC1 alone captures the dominant room-wide humidity pattern, compressing 13 channels into 1 without losing much information.
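PC1 extraction is a few lines of linear algebra. The humidity matrix below is a synthetic stand-in for the 13 real channels — one shared room-wide signal plus small per-sensor noise — so PC1 should capture nearly all the variance:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic stand-in for 13 correlated humidity channels
T = 2000
shared = np.cumsum(rng.standard_normal(T)) * 0.1   # room-wide drift
H = shared[:, None] + 0.05 * rng.standard_normal((T, 13))

# PC1 via SVD of the mean-centred data matrix
Hc = H - H.mean(axis=0)
U, S, Vt = np.linalg.svd(Hc, full_matrices=False)
pc1 = Hc @ Vt[0]                  # 13 channels compressed into 1 series

explained = S[0]**2 / np.sum(S**2)
print(round(explained, 3))        # close to 1.0 for strongly correlated sensors
```

The fraction of variance explained by PC1 is the quantitative justification for feeding the branch network a single humidity feature instead of all 13.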
The branch network input is extended from 12 to 13 features by appending the humidity PC1 value at each timestep alongside the 12 sensor temperatures. Everything else (trunk net, training procedure, withheld sensor) stays the same.
The baseline DeepONet branch network processes a single timestep snapshot of the sensor array. It has no memory of past states. To exploit temporal dependencies in the temperature signal, we replace the MLP branch with a Long Short-Term Memory (LSTM) network that encodes a sliding window of recent history.
LSTM (Long Short-Term Memory) is a type of recurrent neural network designed to learn long-range dependencies in sequential data. Unlike a simple RNN, an LSTM cell maintains two state vectors: a cell state (long-term memory) and a hidden state (short-term output). Three gates — input, forget, and output — learn to write, erase, and read from the cell state, preventing the gradient from vanishing over long sequences. This makes LSTMs well-suited for temperature time series, where a room's thermal inertia means the current temperature depends on what happened in the past hour or more.
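The gate equations can be made concrete with a single-cell NumPy sketch (random, untrained weights; the window of 48 five-minute steps matches the 4-hour history used later):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b, hidden):
    """One LSTM cell update. W, U, b stack the parameters of the input (i),
    forget (f), output (o) gates and the candidate memory (g)."""
    z = x @ W + h @ U + b                 # all four pre-activations at once
    i = sigmoid(z[:hidden])               # write gate
    f = sigmoid(z[hidden:2*hidden])       # erase gate
    o = sigmoid(z[2*hidden:3*hidden])     # read gate
    g = np.tanh(z[3*hidden:])             # candidate memory
    c = f * c + i * g                     # long-term cell state
    h = o * np.tanh(c)                    # short-term hidden state
    return h, c

rng = np.random.default_rng(3)
n_in, hidden = 12, 8                      # 12 sensors -> small latent size
W = rng.standard_normal((n_in, 4 * hidden)) * 0.1
U = rng.standard_normal((hidden, 4 * hidden)) * 0.1
b = np.zeros(4 * hidden)

h = c = np.zeros(hidden)
window = rng.standard_normal((48, n_in))  # 4 h of 5-minute snapshots
for x in window:
    h, c = lstm_step(x, h, c, W, U, b, hidden)
print(h.shape)                            # final summary of the history
```

In the actual model the final hidden state plays the role of the branch latent vector that is dotted with the trunk output.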
A purely data-driven model can fit sensor observations accurately but has no guarantee that its spatial temperature field obeys the governing equations of heat transfer. To constrain the solution to physically plausible dynamics, we add a PDE residual loss derived from the advection-diffusion-source equation for indoor thermal fields. The key departure from an RC-circuit approach is that this formulation is fully continuous in space — it reasons about the temperature field at every point in the room, not just at sensor locations.
An RC network is a lumped model: it treats the room as a small number of discrete zones connected by resistors. It cannot represent how heat spreads continuously across the floor plan. The advection-diffusion PDE operates on the continuous spatial field T(x, y, t) that the DeepONet trunk already represents, so the physics loss directly penalises the network output at any queried coordinate — including the withheld Sensor 7 location.
The temperature field satisfies the following advection-diffusion-source PDE at every interior point of the room:

∂T/∂t + v · ∇T = α ∇²T + S(x, y)

where v is the air-velocity field, α the thermal diffusivity, and S the spatial heat-source term.
The time derivative ∂T/∂t is approximated by a finite difference between consecutive 5-minute timesteps: (model(u_{t+1}, xy) − model(u_t, xy)) / Δt. Both evaluations share the same spatial coordinates; only the branch input changes.
The Laplacian ∇²T is computed via automatic differentiation through the SIREN trunk with respect to the input coordinates (x, y). The SIREN activation sin(ω₀ · x) guarantees non-vanishing second derivatives everywhere, keeping ∇²T meaningful. We use ω₀ = 5 (reduced from the standard 30) since thermal fields are spatially smooth; a large ω₀ causes an ω₀²-scale blow-up of ∇²T in the PDE residual.
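The ω₀² scaling is easy to verify on a single SIREN unit: the second derivative of sin(ω₀(wx + b)) with respect to x is −(ω₀w)² sin(ω₀(wx + b)), which a central finite difference reproduces:

```python
import numpy as np

def siren_unit(x, w, b, omega0=5.0):
    """A single SIREN unit: sin(omega0 * (w*x + b))."""
    return np.sin(omega0 * (w * x + b))

w, b, x = 0.7, 0.1, 0.3
eps = 1e-4

# Central finite difference for the second derivative w.r.t. x
d2 = (siren_unit(x + eps, w, b) - 2 * siren_unit(x, w, b)
      + siren_unit(x - eps, w, b)) / eps**2

# Analytic second derivative scales with (omega0 * w)**2
analytic = -(5.0 * w)**2 * np.sin(5.0 * (w * x + b))
```

The (ω₀w)² factor in the analytic expression is exactly why ω₀ = 30 would inflate the Laplacian term of the residual by roughly 36× relative to ω₀ = 5.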
The PDE residual is evaluated at 512 collocation points per training step, split equally between:
Sensor-anchored points (256): randomly sampled from the 12 training sensor coordinates. The data loss anchors the field here, so the PDE is better conditioned at these points.
Interior points (256): uniformly sampled from the room interior [0, 1]². These enforce the PDE in regions with no sensor, including the neighbourhood of withheld Sensor 7.
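The residual that the collocation loss penalises can be sanity-checked with plain finite differences. The sketch below is not the training code (which differentiates through the SIREN trunk with autograd); it evaluates r = ∂T/∂t + v·∇T − α∇²T − S on a synthetic field chosen so that r ≈ 0, with an assumed α:

```python
import numpy as np

alpha = 0.1     # assumed thermal diffusivity for this sketch
h = 1e-3        # finite-difference step

def T(x, y, t):
    """Synthetic steady field: Laplacian is exactly 4, so the constant
    source S = -4*alpha balances diffusion and the residual vanishes."""
    return x**2 + y**2

def residual(x, y, t, vx, vy, S):
    """r = dT/dt + v.grad(T) - alpha*Laplacian(T) - S, via central FD."""
    dTdt = (T(x, y, t + h) - T(x, y, t - h)) / (2 * h)
    dTdx = (T(x + h, y, t) - T(x - h, y, t)) / (2 * h)
    dTdy = (T(x, y + h, t) - T(x, y - h, t)) / (2 * h)
    lap = ((T(x + h, y, t) - 2 * T(x, y, t) + T(x - h, y, t)) / h**2
           + (T(x, y + h, t) - 2 * T(x, y, t) + T(x, y - h, t)) / h**2)
    return dTdt + vx * dTdx + vy * dTdy - alpha * lap - S

# 512 collocation points in the unit room, as in the training loop
rng = np.random.default_rng(4)
pts = rng.uniform(0, 1, size=(512, 2))
r = residual(pts[:, 0], pts[:, 1], 0.0, vx=0.0, vy=0.0, S=-4 * alpha)
print(np.max(np.abs(r)))    # ~0: the synthetic field satisfies the PDE
```

In training, the mean of r² over these 512 points is the physics loss that is added to the data loss.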
Without geometric constraints, spatially-varying MLPs can learn degenerate solutions — for example, inventing wind that blows through solid walls, or room-wide heat sinks that absorb the entire PDE residual. Four additional loss terms enforce real-world geometry:
L_noslip = ‖v(x_wall)‖²

Air velocity must be zero at solid walls. 64 points are sampled on each of the four walls at every training step, and the v-net output is penalised directly. This prevents the learned circulation from passing through walls.
L_sparse = mean(|S(x, y)|)

Real heat sources are localised (AC vents, windows). An L1 penalty on the source output pushes S to exactly zero over most of the room, forcing the network to concentrate heating/cooling into a small, distinct region rather than spreading a diffuse bias across the entire field.
v-net and source-net are deliberately shallow (1 hidden layer) and narrow (width 16 and 32 respectively). A deeper, wider network has enough capacity to memorise the PDE residual as a function of its inputs — effectively becoming another fudge factor. Reducing capacity forces the networks to learn smooth, generalisable spatial patterns instead.
L_neumann = (∂T/∂n)² at walls

Concrete walls have low thermal conductivity, so the normal heat flux at the perimeter is approximately zero: ∂T/∂n ≈ 0. The normal gradient of T is computed via autograd at 64 wall points per wall and penalised. This prevents the temperature field from developing sharp gradients at the room boundary.
The exact AC vent locations, airflow rates, and occupancy patterns are unknown. Rather than prescribing these from engineering drawings, the model discovers them from data subject to physical constraints. The no-slip and Neumann boundary conditions act as geometric priors that rule out unphysical solutions, while the L1 sparsity and reduced capacity prevent the physics sub-networks from finding degenerate mathematical shortcuts. The result is a spatial field that is not only accurate at sensor locations but physically consistent everywhere in the room.
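Two of the geometric penalties can be sketched directly; the Neumann term is omitted here because it needs automatic differentiation of the temperature trunk. The network shapes follow the capacity argument above (v-net 2→16→2, source-net 2→32→1), but the weights are random stand-ins, not trained values:

```python
import numpy as np

rng = np.random.default_rng(5)

def tiny_net(params, x):
    """Deliberately shallow 1-hidden-layer MLP (see the capacity argument)."""
    (W1, b1), (W2, b2) = params
    return np.tanh(x @ W1 + b1) @ W2 + b2

def init(n_in, width, n_out):
    return [(rng.standard_normal((n_in, width)) * 0.1, np.zeros(width)),
            (rng.standard_normal((width, n_out)) * 0.1, np.zeros(n_out))]

v_params = init(2, 16, 2)    # velocity net: (x, y) -> (vx, vy)
s_params = init(2, 32, 1)    # source net:   (x, y) -> S

# 64 points on each of the four walls of the unit room
t = rng.uniform(0, 1, 64)
walls = np.concatenate([np.stack([t, np.zeros(64)], 1),   # bottom
                        np.stack([t, np.ones(64)], 1),    # top
                        np.stack([np.zeros(64), t], 1),   # left
                        np.stack([np.ones(64), t], 1)])   # right
interior = rng.uniform(0, 1, size=(256, 2))

# No-slip: squared velocity magnitude on the walls
L_noslip = np.mean(np.sum(tiny_net(v_params, walls)**2, axis=1))
# Sparsity: L1 on the source field over the interior
L_sparse = np.mean(np.abs(tiny_net(s_params, interior)))
print(walls.shape, float(L_noslip), float(L_sparse))
```

Both terms are simple means over freshly sampled points, so they add almost no cost to a training step.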
The previous physics section used a snapshot MLP branch — it treated every timestep independently, with no memory of how the room arrived at its current state. This section upgrades the branch network to an LSTM encoder, giving the model a 4-hour sliding window of sensor history, and combines it with a richer Advection-Diffusion-Source PDE that simultaneously learns the air-velocity field, thermal diffusivity, and spatial heat sources from data.
A plain MLP branch encodes only the current sensor snapshot — it has no memory of recent occupancy ramps, AC cycles, or thermal inertia. An LSTM branch encodes the full trajectory of how the room got to its current thermal state. This matters for two reasons:
Instead of the simple heat equation used earlier, this model enforces a full advection-diffusion-source PDE at a set of collocation points every training step:
Each term has a concrete physical role:
v_net is a shallow MLP (2 → 16 → 2) that outputs (vx, vy) at any room location. A soft divergence-free penalty ‖∇·v‖² ≈ 0 encourages mass conservation.

Four physics-motivated boundary conditions are enforced at 64 sampled wall points per edge (256 total) each training iteration:
Each training step draws 512 collocation points using a mixed strategy: half (256) from the actual sensor locations (high-information zones), half (256) uniformly random from the room interior [0,1]². This ensures the PDE residual is enforced both where data is dense and in sparsely covered regions.
Sensor 7 (location: x=1.92 m, y=3.91 m) is completely excluded from training — it does not appear in the branch input, in the data loss, or in the collocation set. After training, the model is queried at the sensor 7 coordinates for every valid timestep and the prediction is compared against the true sensor 7 readings. This tests spatial generalisation: the model must reconstruct temperature at an unobserved location purely from the physics constraints and the 12 surrounding sensors.
Inference uses 512-sample chunks over the full time series (first 48 timesteps are excluded as LSTM warm-up). Predictions are de-normalised with the global mean and standard deviation from training.
The 6-panel result figure provides a complete diagnostic view of the trained model. After training, the model also reports the learned PDE parameters.
The LSTM-PINN model differs from the Section 7 model in three fundamental ways.
Knowing how well the model knows the temperature at each room location is as important as the prediction itself. A model that is confident everywhere — even where sensor coverage is sparse — cannot be trusted for decision making (e.g. HVAC placement, fault detection). We extend DeepONet to output both a mean prediction and a predictive variance at every queried location.
Uncertainty quantification (UQ) means estimating not just a single prediction value but also a confidence measure for that prediction. In a probabilistic neural network, the model outputs the parameters of a probability distribution — here a Gaussian with mean μ and variance σ² — instead of a single number. A high σ² at a location signals that the model lacks confidence there, typically because no training sensor is nearby. This spatial uncertainty can guide decisions such as where to add new sensors to maximally reduce overall uncertainty.
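A common way to train such a probabilistic head is the heteroscedastic Gaussian negative log-likelihood. Here is a minimal sketch, assuming the network predicts log σ² rather than σ² directly (a standard trick that keeps the variance positive without constraints):

```python
import numpy as np

def gaussian_nll(y, mu, log_var):
    """Per-sample Gaussian negative log-likelihood (constant term dropped):
    0.5 * (log sigma^2 + (y - mu)^2 / sigma^2), averaged over samples."""
    return 0.5 * np.mean(log_var + (y - mu)**2 / np.exp(log_var))

# Same 1-degree errors under two different confidence levels
y = np.array([23.0, 23.5, 23.8])
mu = np.array([22.0, 22.5, 22.8])

overconfident = gaussian_nll(y, mu, log_var=np.full(3, -4.0))  # sigma^2 = e^-4
calibrated = gaussian_nll(y, mu, log_var=np.full(3, 0.0))      # sigma^2 = 1
print(float(overconfident), float(calibrated))
```

Claiming tiny σ² while making a 1 °C error is punished far more than admitting the uncertainty, which is exactly the incentive that makes the predicted variance map trustworthy.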
In the first experiment all 13 sensors are used as branch inputs. After training, we query the model on a dense grid spanning the entire room and plot the predicted variance σ² at each grid point as a heatmap. Regions with low σ² are well-constrained by nearby sensors; regions with high σ² represent locations the model is uncertain about.
In the second experiment, 4 sensors are removed from the branch input, simulating sensor failure or a sparser deployment. The model receives only 9 sensor readings at inference time. We expect σ² to spike at — or near — the withheld sensor locations, since the model now has no direct information about those zones.
The uncertainty map answers the question: "If I can only afford N sensors, where should I place them?" By iteratively adding a virtual sensor at the highest-σ² grid point and re-evaluating, one can derive a greedy optimal sensor placement strategy purely from the trained probabilistic model — no additional data collection required.
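The greedy loop can be sketched as follows. The damping step is a crude stand-in for re-running the probabilistic model after each virtual placement, which is what the actual procedure described above would do:

```python
import numpy as np

def greedy_placement(sigma2, n_new):
    """Greedy: repeatedly place a virtual sensor at the most uncertain
    grid point, then suppress uncertainty near it. The damping kernel is
    a stand-in for re-evaluating the trained probabilistic model."""
    sigma2 = sigma2.copy()
    H, W = sigma2.shape
    yy, xx = np.mgrid[0:H, 0:W]
    chosen = []
    for _ in range(n_new):
        i, j = np.unravel_index(np.argmax(sigma2), sigma2.shape)
        chosen.append((i, j))
        dist2 = (yy - i)**2 + (xx - j)**2
        sigma2 *= 1.0 - np.exp(-dist2 / 25.0)   # new sensor kills local sigma^2
    return chosen

rng = np.random.default_rng(6)
sigma2_map = rng.uniform(0.1, 1.0, size=(32, 32))  # stand-in variance heatmap
placements = greedy_placement(sigma2_map, n_new=3)
print(placements)
```

Because the damping drives σ² to zero at each chosen point, the loop never selects the same location twice.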
After the early single-sensor experiments above, the project was refactored into a cleaner masked-reconstruction pipeline focused on Lab 1 at 5-minute resolution. The newer versions move from a temperature-only baseline to a joint temperature + humidity model, then to targeted sparsity tests and finally to a minimal active sensor study. Across all newer versions, the branch input uses an explicit observed/missing mask, test windows remain blocked in time, and evaluation is reported under controlled masking protocols.
Data files live under ./data, model code under ./deepOnet, and run outputs under ./deepOnet/runs. On GitHub, the multi-output run is stored under lab1data_v0.2.2, the targeted sparsity run for DeepONet v0.3.2 is stored under lab1data_v0.3.1, and the sensor-importance study is stored under lab1data_v0.4.1.
The original single-target experiments were consolidated into a single reproducible workflow. Missing sensors are no longer represented only by zero-fill; instead the model also receives an explicit observed/missing mask. Later versions add joint temperature-humidity outputs, early stopping, structured masking protocols, and sensor-subset search.
Every evaluation masks the target sensor and varies how many additional sensors are hidden. The analysis now goes beyond random masking: we also test central removal, perimeter removal, spread-out removal, and leave-only-k-active subsets to understand not just how many sensors matter, but also which ones matter most.
The first consolidated baseline predicts temperature only and evaluates how well the model reconstructs masked sensors from the remaining Lab 1 measurements. This version established the new train/validation/test workflow and showed that the model was already strong at low masking levels, but still fragile when too many sensors were removed.
| Masked Total | Variable | RMSE | MAE | R² |
|---|---|---|---|---|
| 1 | Temperature | 0.1253 | 0.1000 | 0.9596 |
| 3 | Temperature | 0.1699 | 0.1306 | 0.8965 |
| 6 | Temperature | 0.4045 | 0.3283 | 0.3795 |
The baseline worked well when only a few sensors were missing, but performance collapsed at 6 masked sensors. This motivated the next redesign: joint temperature-humidity training, wider masking during training, and stronger control over evaluation protocols.
The next major upgrade predicts temperature and relative humidity together in a single model. This version also introduced early stopping and a wider masking distribution during training. The result was a large jump in robustness: the joint model remained accurate not only at low masking, but even in the severe-sparsity regime.
| Masked Total | Temp RMSE | Temp R² | Humidity RMSE | Humidity R² |
|---|---|---|---|---|
| 1 | 0.0975 | 0.9616 | 0.4214 | 0.9845 |
| 3 | 0.1067 | 0.9564 | 0.4651 | 0.9809 |
| 6 | 0.1334 | 0.9454 | 0.5709 | 0.9725 |
| 9 | 0.2182 | 0.8659 | 0.9856 | 0.9241 |
| 12 | 0.5103 | 0.3652 | 2.8315 | 0.4482 |
The multi-output model stayed strong all the way to 9 masked sensors, and only really broke down at 12 masked sensors. This was the point where the question shifted from “can the model reconstruct missing sensors?” to “which remaining sensors are the most valuable?”
To probe the sparsity boundary more carefully, v0.3.2 evaluated structured masking protocols for 8, 9, 10, 11, and 12 masked sensors. Instead of only using random masks, we also tested central removal, perimeter removal, spatially spread removal, and leave-only-k-active subsets. This reveals whether performance depends only on the number of active sensors or also on their spatial arrangement.
| Masked Total | Best Protocol | Temp R² | Humidity R² | Weakest Protocol | Temp R² | Humidity R² |
|---|---|---|---|---|---|---|
| 8 | Spread removed / Perimeter removed | 0.9332 | 0.9674 | Central removed | 0.9019 | 0.9551 |
| 9 | Spread removed / Perimeter removed | 0.9255 | 0.9621 | Central removed | 0.8897 | 0.9520 |
| 10 | Spread removed | 0.9239 | 0.9553 | Central removed / Random | 0.9000 | 0.9523 |
| 11 | Active central diagonal / Perimeter removed | 0.9042 | 0.9535 | Corner diagonal only / Random | 0.8715 | 0.9399 |
| 12 | Spread removed | 0.8643 | 0.9289 | Center one only | 0.6350 | 0.8103 |
Even with 11 masked sensors, the model still performs fairly well under multiple protocols. The real performance cliff appears at 12 masked sensors, where the exact spatial arrangement of the remaining sensors becomes decisive. Spread-out coverage is much more informative than relying on a single central point.
Once the sparsity boundary was understood, the next step was to determine which sensors are actually necessary. Version v0.4.1 adds three analyses: a one-at-a-time drop test, a greedy backward elimination path, and an exhaustive search over all active subsets of size 1, 2, and 3. The results show that the room contains a large amount of redundancy and that only a small, well-placed subset of sensors is enough for strong reconstruction.
In the single-sensor drop study, removing sensors 9, 7, 6, or 10 had the smallest effect on the overall score. This suggests these locations are more redundant than the others.
Sensors 5, 8, and 3 are the hardest to remove without hurting performance. Sensor 5 is especially important: it survives until the very end of the backward elimination path and is also the best single active sensor in the exhaustive search.
The best 2-sensor set already delivers strong reconstruction, and the best 3-sensor set comes surprisingly close to the full 13-sensor reference. In other words: the minimal useful set is much smaller than expected.
| Active Count | Best Active Sensors | Composite Score | Temp R² | Humidity R² |
|---|---|---|---|---|
| 1 | [5] | 0.8391 | 0.8140 | 0.8642 |
| 2 | [5, 8] | 0.9334 | 0.9157 | 0.9512 |
| 3 | [2, 5, 8] | 0.9445 | 0.9271 | 0.9618 |
The newer DeepONet pipeline shows that Lab 1 is highly redundant spatially. Strong temperature-humidity reconstruction is possible with only 2–3 strategically placed sensors, and even a single well-chosen sensor can recover much of the room state. The next natural step is transfer: testing whether a model trained on Lab 1 can adapt to Lab 2.
The Step-5 architecture benchmark compared six closely related models: the compact comparison baseline (v0.5.1), a wider snapshot model (v0.5.2), a deeper snapshot model (v0.5.3), a short-history LSTM branch model (v0.5.4), a deep+wide model (v0.5.5), and a deep+GELU variant (v0.5.6). The strongest average score was achieved by the LSTM model v0.5.4, but only by a tiny margin over v0.5.3. Because the gain was negligible while the architecture was more complex, the project standardised on v0.5.3: the deep snapshot DeepONet with explicit masking.
| Model | Main Change | Mean R² (all comparison protocols) | Decision |
|---|---|---|---|
| v0.5.1 | Compact benchmark reference | 0.9195 | Baseline only |
| v0.5.2 | Wider snapshot MLP | 0.9227 | Small gain |
| v0.5.3 | Deeper snapshot MLP (depth = 6) | 0.9301 | Chosen final model |
| v0.5.4 | Short-window LSTM branch | 0.9310 | Tiny edge, but more complex |
| v0.5.5 | Deep + wide | No clear gain | Rejected |
| v0.5.6 | Deep + GELU | No clear gain | Rejected |
Step 5 showed that depth mattered more than width, and that moving from tanh to GELU did not unlock a better model. Because v0.5.3 stayed essentially tied with v0.5.4 while keeping a simpler operator structure and a cheaper transfer pipeline, it became the model carried forward into the Lab1→Lab2 transfer study.
The trunk sees only normalised coordinates. That is intentional: all temporal and sensor-state information lives in the branch, while the trunk learns a purely spatial basis. This keeps the operator decomposition clean and makes transfer to a new room easier to interpret because geometry enters only through the coordinate system.
The LSTM branch gave only a tiny average gain. The project therefore chose the deeper snapshot branch instead: it captures the spatial state very well, remains fast, and is easier to fine-tune in transfer experiments because there is no recurrent state to adapt.
The data pipeline is deliberately conservative to avoid leakage. For each run, the code reads the common time file Time_5min.xlsx plus one file per variable (Temperature_5min.xlsx and Humidity_5min.xlsx), extracts the 13 sensor columns, and stacks them into a tensor of shape [time, sensor, variable]. Every raw value is divided by 100 before modelling so that the two channels live on similar numeric scales. Any timestep containing a NaN anywhere in the full sensor-variable tensor is dropped completely, which guarantees that the operator never sees partially missing raw data before masking is applied on purpose.
The mean and standard deviation are computed across all training timesteps and all 13 sensors for each variable. This means the model is not trying to learn separate sensor-wise calibration offsets inside the network. Instead, the network sees a common standardised variable space and must learn spatial structure from the coordinates and masking pattern.
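The NaN-drop and global per-variable standardisation can be sketched on a toy tensor of the same [time, sensor, variable] shape (the values here are synthetic, not Lab 1 data):

```python
import numpy as np

rng = np.random.default_rng(7)
# Toy stand-in for the stacked tensor: 13 sensors, variable 0 = temperature,
# variable 1 = humidity, already divided by 100 as in the pipeline.
data = rng.uniform(0.18, 0.30, size=(1000, 13, 2))
data[5, 3, 0] = np.nan               # one corrupted reading

# Drop any timestep with a NaN anywhere in the sensor-variable tensor
valid = ~np.isnan(data).any(axis=(1, 2))
data = data[valid]

# Global per-variable z-score over all timesteps and all 13 sensors
mean = data.mean(axis=(0, 1), keepdims=True)   # shape (1, 1, 2)
std = data.std(axis=(0, 1), keepdims=True)
z = (data - mean) / std
print(data.shape)                    # timestep 5 dropped entirely
```

One shared mean/std per variable (rather than per sensor) is what forces the network to explain sensor-to-sensor differences spatially instead of via hidden calibration offsets.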
Earlier versions simply zero-filled missing sensors. That turned out to be ambiguous, because after z-score normalisation a value near zero can also mean “this sensor is close to the mean.” The final design fixes that by feeding the model both the masked values and an explicit observed/missing mask. During training, every sample chooses one target sensor and always hides it, then hides a random number of additional sensors chosen uniformly between 1 and 12. The branch therefore sees a broad sparsity distribution during training rather than only one carefully scripted masking level.
Each sample is indexed by a timestep and a target sensor. The target sensor is always masked in the branch input and becomes the supervision target.
Extra missing sensors are sampled from the remaining 12 sensors. This makes the operator robust to a wide range of sparse deployment regimes.
The explicit 13-dimensional mask tells the network which zeros are true missingness and which are genuine near-mean standardised values.
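Putting the masking rules together, one training sample might be built as follows. This is a sketch with assumed conventions (zero-fill for hidden slots, mask concatenated after the values); the real pipeline's exact tensor layout may differ:

```python
import numpy as np

rng = np.random.default_rng(8)
N_SENSORS = 13

def make_masked_sample(values, target):
    """Branch input for one sample: the target sensor is always hidden,
    plus a uniform 1..12 extra sensors, with an explicit observed mask."""
    mask = np.ones(N_SENSORS)                 # 1 = observed, 0 = missing
    mask[target] = 0.0
    others = [s for s in range(N_SENSORS) if s != target]
    n_extra = int(rng.integers(1, 13))        # uniform over 1..12
    hidden = rng.choice(others, size=n_extra, replace=False)
    mask[hidden] = 0.0
    masked_values = values * mask             # zero-fill the hidden slots
    return np.concatenate([masked_values, mask])

values = rng.standard_normal(N_SENSORS)       # one standardised snapshot
x = make_masked_sample(values, target=6)      # hide sensor index 6
print(x.shape)                                # (26,): 13 values + 13 mask bits
```

The appended mask is what lets the network distinguish "this sensor is missing" from "this sensor happens to read near the mean" after z-scoring.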
The project uses two different blocked split policies for two different questions. In the architecture benchmark (v0.5.x), the model kept the earlier blocked-window split so that all candidate architectures were compared under exactly the same conditions. Once the architecture was chosen, the transfer study upgraded the split to a larger blocked design that is closer to a real held-out deployment scenario.
| Stage | Purpose | Split Logic | Why |
|---|---|---|---|
| v0.5.x benchmark | Compare architectures fairly | Compact blocked validation and test windows inherited from the earlier pipeline | Preserves strict comparability with all previous Lab 1 runs |
| v0.6.1 transfer | Train final Lab 1 source model | Blocked 80/10/10 on Lab 1 using 6-hour windows | Provides a stronger held-out source split for transfer |
| v0.6.1 transfer | Adapt / test on Lab 2 | Blocked 16/4/10 on Lab 2 (adapt-train / adapt-val / test) | Keeps a clean unseen Lab 2 test set while reserving a small adaptation budget |
The project explicitly tested whether more width, recurrent temporal context, deeper+wide capacity, or GELU activations gave meaningful gains. They did not. The practical sweet spot was the deep tanh snapshot model. That result is useful in itself: the bottleneck is not simply “make the network bigger,” but rather choosing a masking formulation and split policy that expose the right reconstruction problem.
After selecting v0.5.3 as the final source architecture, the next question was whether the model learned anything that generalises beyond Lab 1. Lab 2 is a different room with a different aspect ratio, different sensor geometry, different equipment layout, and a different air-conditioning structure. In the source paper, LAB2 is described as having two thermal focal points due to its two AC outlets, so transfer is a genuine domain-shift test, not just another split of the same room.
Phase 1 evaluates the two transfer endpoints: zero-shot (no Lab 2 training at all) and full fine-tuning (adapt the entire network on a small blocked Lab 2 adaptation split with a lower learning rate).
Instead of only testing all 13 sensors, the transfer study also evaluates sparse deployment regimes on Lab 2: all 13 active sensors, a 5-sensor subset, and a 3-sensor subset. This separates true domain transfer from sheer sensor-count effects.
| Transfer Mode | Lab2 Protocol | Temp R² | Humidity R² | Main Interpretation |
|---|---|---|---|---|
| Zero-shot | All 13 active | 0.9029 | 0.8710 | Strong transfer without any Lab 2 training |
| Zero-shot | 5 active | 0.8929 | 0.8425 | Still meaningful under sparse deployment |
| Zero-shot | 3 active | 0.9113 | 0.8705 | Chosen 3-sensor subset was surprisingly informative |
| Full fine-tune | All 13 active | 0.9854 | 0.9721 | Closes almost the entire transfer gap |
| Full fine-tune | 5 active | 0.9816 | 0.9692 | High-quality sparse transfer |
| Full fine-tune | 3 active | 0.9818 | 0.9633 | Excellent adaptation even under strong sparsity |
The source-trained Lab 1 model transfers to Lab 2 better than expected even zero-shot, and full fine-tuning becomes extremely strong. For temperature, the fine-tuned Lab 2 model even reaches or exceeds the corresponding Lab 1 reference averages. This is strong evidence that the operator is learning a transferable representation of indoor thermo-hygrometric structure rather than only memorising one room.
A digital twin is most useful when it can be deployed to a room that was never part of training. The zero-shot result shows that the model already carries non-trivial cross-room information. The fine-tuning result then answers the practical question: if zero-shot is not enough, a relatively small amount of new Lab 2 data is sufficient to adapt the model into a high-quality room-specific surrogate.
The next transfer stage is now running. Phase 2 tests head-only and partial fine-tune adaptation across many Lab 2 sensor subsets, focusing on 5, 3, and 2 active sensors. The goal is no longer just to show transfer works; it is to determine how much adaptation capacity is needed and which sparse Lab 2 sensor layouts are most informative.
The v0.7.1 pipeline unifies the full Lab 1 → Lab 2 transfer story in one comparable benchmark. Instead of evaluating different transfer modes on different protocol sets, all modes are now measured on the same blocked Lab 2 split, the same selected sparse subsets, and the same reporting tables and figures. This makes the comparison between zero-shot, head-only, partial fine-tune, and full fine-tune directly interpretable.
One single pipeline now handles Lab 1 pretraining, Lab 2 adaptation, sparse subset search, unified re-evaluation, and figure generation. That removes the earlier mismatch between fixed v0.6.1 protocols and sampled v0.6.2 protocols.
Not only whether transfer works, but how much adaptation capacity is needed to recover high-quality Lab 2 performance under 5, 3, and 2 active-sensor regimes.
| Regime | Zero-shot Temp R² | Head-only Temp R² | Partial FT Temp R² | Full FT Temp R² | Zero-shot Humidity R² | Head-only Humidity R² | Partial FT Humidity R² | Full FT Humidity R² |
|---|---|---|---|---|---|---|---|---|
| All 13 active | 0.833 | 0.960 | 0.970 | 0.981 | 0.889 | 0.968 | 0.975 | 0.986 |
| Top 5 mean (k=5) | 0.825 | 0.956 | 0.968 | 0.978 | 0.876 | 0.965 | 0.973 | 0.984 |
| Top 5 mean (k=3) | 0.820 | 0.959 | 0.968 | 0.978 | 0.864 | 0.969 | 0.975 | 0.982 |
| Top 5 mean (k=2) | 0.808 | 0.952 | 0.963 | 0.974 | 0.865 | 0.964 | 0.969 | 0.978 |
Zero-shot transfer is meaningful, head-only adaptation recovers most of the lost performance, partial fine-tuning improves further, and full fine-tuning is the strongest overall mode. This ranking is stable across all-13 sensing and the selected sparse subsets, which is exactly what the final benchmark needed to show.
The LAB1 reference bars later on are a useful anchor, but they are not a strict upper bound for Lab 2. They come from a different room, with a different geometry and different test distribution. So when the fully adapted Lab 2 model matches or slightly exceeds the Lab 1 reference in some summaries, that should be read as very strong target-domain performance, not as a contradiction.
The final website benchmark now uses the exact v0.7.1 outputs. The benchmark re-evaluates LAB1 reference, LAB2 zero-shot, LAB2 head-only, LAB2 partial fine-tune, and LAB2 full fine-tune on a single shared protocol set: all 13 active sensors, plus the top 5 subsets for k=5, k=3, and k=2 discovered from the head-only subset search. This is the clean, directly comparable transfer benchmark that the final report and website were aiming for.
| Regime | LAB1 Ref Temp R² | LAB2 Zero-shot Temp R² | LAB2 Head-only Temp R² | LAB2 Partial FT Temp R² | LAB2 Full FT Temp R² | LAB1 Ref Humidity R² | LAB2 Zero-shot Humidity R² | LAB2 Head-only Humidity R² | LAB2 Partial FT Humidity R² | LAB2 Full FT Humidity R² |
|---|---|---|---|---|---|---|---|---|---|---|
| All 13 active | 0.951 | 0.833 | 0.960 | 0.970 | 0.981 | 0.986 | 0.889 | 0.968 | 0.975 | 0.986 |
| Top 5 mean (k=5) | 0.932 | 0.825 | 0.956 | 0.968 | 0.978 | 0.979 | 0.876 | 0.965 | 0.973 | 0.984 |
| Top 5 mean (k=3) | 0.923 | 0.820 | 0.959 | 0.968 | 0.978 | 0.976 | 0.864 | 0.969 | 0.975 | 0.982 |
| Top 5 mean (k=2) | 0.896 | 0.808 | 0.952 | 0.963 | 0.974 | 0.966 | 0.865 | 0.964 | 0.969 | 0.978 |
This benchmark finally resolves the old comparability problem. Earlier versions mixed fixed protocols and sampled subset sweeps, which made direct transfer-mode comparison awkward. In v0.7.1, every mode is scored on the same selected benchmark, so the conclusions are much more defensible.
The strongest sparse subsets were selected from the head-only search and then reused for the common benchmark. A practical pattern stands out immediately: sensor 12 appears repeatedly, often together with sensors 2, 11, 1, 10, or 9. That suggests the best sparse layouts are leveraging a few strategically informative locations rather than merely spreading sensors evenly across the room.
| k | Selected subsets |
|---|---|
| 5 | {3, 9, 10, 11, 12}; {2, 5, 8, 10, 12}; {3, 4, 6, 11, 12}; {1, 2, 6, 9, 12}; {2, 7, 8, 11, 13} |
| 3 | {2, 10, 12}; {1, 9, 11}; {9, 11, 12}; {6, 10, 12}; {1, 2, 12} |
| 2 | {2, 12}; {1, 2}; {11, 12}; {6, 8}; {1, 11} |
The all-13 maps make the spatial transfer story visible. In both temperature and humidity, the zero-shot errors are not uniform across the room; they are concentrated in a few difficult Lab 2 regions. Head-only and partial fine-tuning correct most of those local failures, and full fine-tuning produces a nearly uniform high-R² map.
The next figures show the exact selected sparse subsets for k=5, k=3, and k=2. These are not averages over all possible subsets. They are the strongest subsets discovered in the head-only search, re-evaluated under every transfer mode on the common benchmark.
These example plots compare the Lab 2 ground truth, a sparse inverse-distance-weighting (IDW) baseline, and the DeepONet reconstructions under the different transfer modes for the same sparse protocol. The IDW baseline is easy to interpret but visibly oversmooths the field. The adapted DeepONet models recover sharper, more realistic room structure from the same sparse sensor input.
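For reference, the IDW baseline is just a distance-weighted average of the active sensor readings. A minimal sketch with hypothetical sensor positions:

```python
import numpy as np

def idw(sensor_xy, sensor_vals, query_xy, power=2.0, eps=1e-9):
    """Inverse-distance weighting: each query point is a weighted average
    of the sensor readings, with weights 1 / distance**power."""
    d = np.linalg.norm(query_xy[:, None, :] - sensor_xy[None, :, :], axis=-1)
    w = 1.0 / (d**power + eps)               # eps avoids division by zero
    return (w @ sensor_vals) / w.sum(axis=1)

# Hypothetical sparse deployment: 3 active sensors in a unit room
sensor_xy = np.array([[0.2, 0.2], [0.8, 0.2], [0.5, 0.9]])
sensor_vals = np.array([21.5, 22.5, 23.0])
grid = np.stack(np.meshgrid(np.linspace(0, 1, 16),
                            np.linspace(0, 1, 16)), axis=-1).reshape(-1, 2)
field = idw(sensor_xy, sensor_vals, grid)
print(field.shape)
```

Because every output is a convex combination of the sensor readings, IDW can never produce values outside the observed range or sharp local structure between sensors, which is why it oversmooths compared with the learned operator.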
The current pipeline reconstructs dense temperature and humidity fields. A separate thermal comfort model is still the next layer to add on top of these reconstructed fields; it is not yet part of the current benchmark outputs.