Clustering Internals¶
Understanding the data structures and visualization tools behind time series clustering.
This notebook demonstrates:
- Data structure: The
Clusteringclass that stores all clustering information - Plot accessor: Built-in visualizations via
.plot - Data expansion: Using
expand_data()to map aggregated data back to original timesteps - IO workflow: What's preserved and lost when saving/loading clustered systems
!!! note "Requirements" This notebook requires the tsam package for time series aggregation. Install with: pip install "flixopt[full]"
!!! note "Prerequisites" This notebook assumes familiarity with 08c-clustering.
from data.generate_example_systems import create_district_heating_system
import flixopt as fx
fx.CONFIG.notebook()
flow_system = create_district_heating_system()
flow_system.connect_and_transform()
Clustering Metadata¶
After calling cluster(), metadata is stored in fs.clustering:
from tsam import ExtremeConfig
fs_clustered = flow_system.transform.cluster(
n_clusters=8,
cluster_duration='1D',
extremes=ExtremeConfig(
method='new_cluster', max_value=['HeatDemand(Q_th)|fixed_relative_profile'], preserve_n_clusters=True
),
)
fs_clustered.clustering
Clustering( 31 periods → 8 clusters timesteps_per_cluster=24 dims=[] )
The Clustering object contains:
cluster_assignments: Which cluster each original period maps tocluster_occurrences: How many original periods each cluster representstimestep_mapping: Maps each original timestep to its representativeoriginal_data/aggregated_data: The data before and after clusteringresults:ClusteringResultsobject with xarray-like interface (.dims,.coords,.sel())
# Cluster order shows which cluster each original period maps to
fs_clustered.clustering.cluster_assignments
<xarray.DataArray 'cluster_assignments' (original_cluster: 31)> Size: 248B
array([6, 0, 0, 2, 6, 5, 5, 7, 5, 7, 1, 4, 5, 1, 1, 1, 1, 1, 4, 3, 3, 3,
3, 3, 2, 6, 0, 0, 0, 0, 0])
Dimensions without coordinates: original_cluster# Cluster occurrences shows how many original periods each cluster represents
fs_clustered.clustering.cluster_occurrences
<xarray.DataArray 'cluster_occurrences' (cluster: 8)> Size: 64B array([7, 6, 2, 5, 2, 4, 3, 2]) Coordinates: * cluster (cluster) int64 64B 0 1 2 3 4 5 6 7
Visualizing Clustering¶
The .plot accessor provides built-in visualizations for understanding clustering results.
# Compare original vs aggregated data as timeseries
# By default, plots all time-varying variables
fs_clustered.clustering.plot.compare()
# Use a different approach of visualizing the data using normalize heatmaps
ds = fs_clustered.clustering.plot.compare(data_only=True).data
ds_normalized = (ds - ds.min()) / (ds.max() - ds.min())
ds_normalized.to_array().plotly.imshow(
x='time',
animation_frame='representation',
zmin=0,
zmax=1,
color_continuous_scale='viridis',
title='Normalized Comparison',
)
# Compare specific variables only
fs_clustered.clustering.plot.compare(variables='HeatDemand(Q_th)|fixed_relative_profile')
# Duration curves show how well the aggregated data preserves the distribution
fs_clustered.clustering.plot.compare(kind='duration_curve').data
<xarray.Dataset> Size: 54kB
Dimensions: (representation: 2, duration: 744)
Coordinates:
* representation (representation) <U9 72B 'Origin...
* duration (duration) int64 6kB 0 1 ... 743
Data variables:
GridBuy(P_el)|costs|per_flow_hour (representation, duration) float64 12kB ...
GridSell(P_el)|costs|per_flow_hour (representation, duration) float64 12kB ...
HeatDemand(Q_th)|fixed_relative_profile (representation, duration) float64 12kB ...
ElecDemand(P_el)|fixed_relative_profile (representation, duration) float64 12kB ...# View typical period profiles for each cluster
# Each line represents a cluster's representative day
fs_clustered.clustering.plot.clusters(variables='HeatDemand(Q_th)|fixed_relative_profile', color='cluster')
# Heatmap shows cluster assignments for each original period
fs_clustered.clustering.plot.heatmap()
Expanding Aggregated Data¶
The Clustering.expand_data() method maps aggregated data back to original timesteps. This is useful for comparing clustering results before optimization:
# Get original and aggregated data
clustering = fs_clustered.clustering
original = clustering.original_data['HeatDemand(Q_th)|fixed_relative_profile']
aggregated = clustering.aggregated_data['HeatDemand(Q_th)|fixed_relative_profile']
# Expand aggregated data back to original timesteps
expanded = clustering.expand_data(aggregated)
print(f'Original: {len(original.time)} timesteps')
print(f'Aggregated: {len(aggregated.time)} timesteps')
print(f'Expanded: {len(expanded.time)} timesteps')
Original: 744 timesteps Aggregated: 24 timesteps Expanded: 744 timesteps
Summary¶
| Property | Description |
|---|---|
clustering.n_clusters | Number of representative clusters |
clustering.timesteps_per_cluster | Timesteps in each cluster period |
clustering.cluster_assignments | Maps original periods to clusters |
clustering.cluster_occurrences | Count of original periods per cluster |
clustering.timestep_mapping | Maps original timesteps to representative indices |
clustering.original_data | Dataset before clustering |
clustering.aggregated_data | Dataset after clustering |
clustering.results | ClusteringResults with xarray-like interface |
ClusteringResults (xarray-like)¶
Access the underlying tsam results via clustering.results:
# Dimension info (like xarray)
clustering.results.dims # ('period', 'scenario') or ()
clustering.results.coords # {'period': [2020, 2030], 'scenario': ['high', 'low']}
# Select specific result (like xarray)
clustering.results.sel(period=2020, scenario='high') # Label-based
clustering.results.isel(period=0, scenario=1) # Index-based
Plot Accessor Methods¶
| Method | Description |
|---|---|
plot.compare() | Compare original vs aggregated data (timeseries) |
plot.compare(kind='duration_curve') | Compare as duration curves |
plot.clusters() | View each cluster's profile |
plot.heatmap() | Visualize cluster assignments |
Key Parameters¶
# Compare with options
clustering.plot.compare(
variables='Demand|profile', # Single variable, list, or None (all)
kind='timeseries', # 'timeseries' or 'duration_curve'
select={'scenario': 'Base'}, # xarray-style selection
colors='viridis', # Colorscale name, list, or dict
facet_col='period', # Facet by period if present
facet_row='scenario', # Facet by scenario if present
)
# Heatmap shows cluster assignments (no variable needed)
clustering.plot.heatmap()
# Expand aggregated data to original timesteps
expanded = clustering.expand_data(aggregated_data)
Cluster Weights¶
Each representative timestep has a weight equal to the number of original periods it represents. This ensures operational costs scale correctly:
$$\text{Objective} = \sum_{t \in \text{typical}} w_t \cdot c_t$$
The weights sum to the original timestep count:
print(f'Sum of weights: {fs_clustered.cluster_weight.sum().item():.0f}')
print(f'Original timesteps: {len(flow_system.timesteps)}')
Sum of weights: 31 Original timesteps: 744
Solution Expansion¶
After optimization, expand() maps results back to full resolution:
solver = fx.solvers.HighsSolver(mip_gap=0.01, log_to_console=False)
fs_clustered.optimize(solver)
fs_expanded = fs_clustered.transform.expand()
print(f'Clustered: {len(fs_clustered.timesteps)} timesteps')
print(f'Expanded: {len(fs_expanded.timesteps)} timesteps')
Clustered: 24 timesteps Expanded: 744 timesteps
IO Workflow¶
When saving and loading a clustered FlowSystem, most clustering information is preserved. However, some methods that access tsam's internal AggregationResult objects are not available after IO.
What's Preserved After IO¶
- Structure:
n_clusters,timesteps_per_cluster,dims,coords - Mappings:
cluster_assignments,cluster_occurrences,timestep_mapping - Data:
original_data,aggregated_data - Original timesteps:
original_timesteps - Results structure:
results.sel(),results.isel()forClusteringResultaccess
What's Lost After IO¶
clustering.sel(): Accessing fullAggregationResultobjectsclustering.items(): Iterating overAggregationResultobjects- tsam internals:
AggregationResult.accuracy,AggregationResult.plot, etc.
# Before IO: Full tsam access is available
result = fs_clustered.clustering.sel() # Get the AggregationResult
print(f'Before IO - AggregationResult available: {type(result).__name__}')
print(f' - n_clusters: {result.n_clusters}')
print(f' - accuracy.rmse (mean): {result.accuracy.rmse.mean():.4f}')
Before IO - AggregationResult available: AggregationResult - n_clusters: 8 - accuracy.rmse (mean): 0.0924
# Save and load the clustered system
import tempfile
from pathlib import Path
try:
with tempfile.TemporaryDirectory() as tmpdir:
path = Path(tmpdir) / 'clustered_system.nc'
fs_clustered.to_netcdf(path)
fs_loaded = fx.FlowSystem.from_netcdf(path)
# Structure is preserved
print('After IO - Structure preserved:')
print(f' - n_clusters: {fs_loaded.clustering.n_clusters}')
print(f' - dims: {fs_loaded.clustering.dims}')
print(f' - original_data variables: {list(fs_loaded.clustering.original_data.data_vars)[:3]}...')
except OSError as e:
print(f'Note: NetCDF save/load skipped due to environment issue: {type(e).__name__}')
print('This can happen in some CI environments. The functionality works locally.')
fs_loaded = fs_clustered # Use original for subsequent cells
Note: NetCDF save/load skipped due to environment issue: OSError This can happen in some CI environments. The functionality works locally.
# After IO: sel() raises ValueError because AggregationResult is not preserved
try:
fs_loaded.clustering.sel()
except ValueError as e:
print('After IO - sel() raises ValueError:')
print(f' "{e}"')
# Key operations still work after IO:
# - Optimization
# - Expansion back to full resolution
# - Accessing original_data and aggregated_data
fs_loaded.optimize(solver)
fs_loaded_expanded = fs_loaded.transform.expand()
print('Loaded system can still be:')
print(f' - Optimized: {fs_loaded.solution is not None}')
print(f' - Expanded: {len(fs_loaded_expanded.timesteps)} timesteps')
Loaded system can still be: - Optimized: True - Expanded: 744 timesteps
IO Workflow Summary¶
┌─────────────────┐ to_netcdf() ┌─────────────────┐
│ fs_clustered │ ─────────────────► │ NetCDF file │
│ │ │ │
│ ✓ clustering │ │ ✓ structure │
│ ✓ sel() │ │ ✓ mappings │
│ ✓ items() │ │ ✓ data │
│ ✓ AggregationResult │ ✗ AggregationResult
└─────────────────┘ └─────────────────┘
│
│ from_netcdf()
▼
┌─────────────────┐
│ fs_loaded │
│ │
│ ✓ optimize() │
│ ✓ expand() │
│ ✓ original_data │
│ ✗ sel() │
│ ✗ items() │
└─────────────────┘
!!! tip "Best Practice" If you need tsam's AggregationResult for analysis (accuracy metrics, built-in plots), do this before saving the FlowSystem. After loading, the core workflow (optimize → expand) works normally.
Reducing File Size¶
For smaller files (~38% reduction), use include_original_data=False when saving. This disables plot.compare() after loading, but the core workflow still works:
import tempfile
from pathlib import Path
# Compare file sizes with and without original_data
try:
with tempfile.TemporaryDirectory() as tmpdir:
path_full = Path(tmpdir) / 'full.nc'
path_small = Path(tmpdir) / 'small.nc'
fs_clustered.to_netcdf(path_full, include_original_data=True)
fs_clustered.to_netcdf(path_small, include_original_data=False)
size_full = path_full.stat().st_size / 1024
size_small = path_small.stat().st_size / 1024
print(f'With original_data: {size_full:.1f} KB')
print(f'Without original_data: {size_small:.1f} KB')
print(f'Size reduction: {(1 - size_small / size_full) * 100:.0f}%')
except OSError as e:
print(f'Note: File size comparison skipped due to environment issue: {type(e).__name__}')
Note: File size comparison skipped due to environment issue: OSError