Clustering Internals¶

Understanding the data structures and visualization tools behind time series clustering.

This notebook demonstrates:

Data structure: The Clustering class that stores all clustering information
Plot accessor: Built-in visualizations via .plot
Data expansion: Using expand_data() to map aggregated data back to original timesteps
IO workflow: What's preserved and lost when saving/loading clustered systems

!!! note "Requirements" This notebook requires the tsam package for time series aggregation. Install with: pip install "flixopt[full]"

!!! note "Prerequisites" This notebook assumes familiarity with 08c-clustering.

In [1]:

  Copied!     
 
from data.generate_example_systems import create_district_heating_system

import flixopt as fx

fx.CONFIG.notebook()

flow_system = create_district_heating_system()
flow_system.connect_and_transform()
from data.generate_example_systems import create_district_heating_system import flixopt as fx fx.CONFIG.notebook() flow_system = create_district_heating_system() flow_system.connect_and_transform()

Clustering Metadata¶

After calling cluster(), metadata is stored in fs.clustering:

In [2]:

  Copied!     
 
from tsam import ExtremeConfig

fs_clustered = flow_system.transform.cluster(
    n_clusters=8,
    cluster_duration='1D',
    extremes=ExtremeConfig(
        method='new_cluster', max_value=['HeatDemand(Q_th)|fixed_relative_profile'], preserve_n_clusters=True
    ),
)

fs_clustered.clustering
from tsam import ExtremeConfig fs_clustered = flow_system.transform.cluster( n_clusters=8, cluster_duration='1D', extremes=ExtremeConfig( method='new_cluster', max_value=['HeatDemand(Q_th)|fixed_relative_profile'], preserve_n_clusters=True ), ) fs_clustered.clustering

Out[2]:

Clustering(
  31 periods → 8 clusters
  timesteps_per_cluster=24
  dims=[]
)

The Clustering object contains:

cluster_assignments: Which cluster each original period maps to
cluster_occurrences: How many original periods each cluster represents
timestep_mapping: Maps each original timestep to its representative
original_data / aggregated_data: The data before and after clustering
results: ClusteringResults object with xarray-like interface (.dims, .coords, .sel())

In [3]:

  Copied!     
 
# Cluster order shows which cluster each original period maps to
fs_clustered.clustering.cluster_assignments
# Cluster order shows which cluster each original period maps to fs_clustered.clustering.cluster_assignments

Out[3]:

<xarray.DataArray 'cluster_assignments' (original_cluster: 31)> Size: 248B
array([6, 0, 0, 2, 6, 5, 5, 7, 5, 7, 1, 4, 5, 1, 1, 1, 1, 1, 4, 3, 3, 3,
       3, 3, 2, 6, 0, 0, 0, 0, 0])
Dimensions without coordinates: original_cluster

In [4]:

  Copied!     
 
# Cluster occurrences shows how many original periods each cluster represents
fs_clustered.clustering.cluster_occurrences
# Cluster occurrences shows how many original periods each cluster represents fs_clustered.clustering.cluster_occurrences

Out[4]:

<xarray.DataArray 'cluster_occurrences' (cluster: 8)> Size: 64B
array([7, 6, 2, 5, 2, 4, 3, 2])
Coordinates:
  * cluster  (cluster) int64 64B 0 1 2 3 4 5 6 7

Visualizing Clustering¶

The .plot accessor provides built-in visualizations for understanding clustering results.

In [5]:

  Copied!     
 
# Compare original vs aggregated data as timeseries
# By default, plots all time-varying variables
fs_clustered.clustering.plot.compare()
# Compare original vs aggregated data as timeseries # By default, plots all time-varying variables fs_clustered.clustering.plot.compare()

Out[5]:

In [6]:

  Copied!     
 
# Use a different approach of visualizing the data using normalize heatmaps
ds = fs_clustered.clustering.plot.compare(data_only=True).data

ds_normalized = (ds - ds.min()) / (ds.max() - ds.min())
ds_normalized.to_array().plotly.imshow(
    x='time',
    animation_frame='representation',
    zmin=0,
    zmax=1,
    color_continuous_scale='viridis',
    title='Normalized Comparison',
)
# Use a different approach of visualizing the data using normalize heatmaps ds = fs_clustered.clustering.plot.compare(data_only=True).data ds_normalized = (ds - ds.min()) / (ds.max() - ds.min()) ds_normalized.to_array().plotly.imshow( x='time', animation_frame='representation', zmin=0, zmax=1, color_continuous_scale='viridis', title='Normalized Comparison', )

In [7]:

  Copied!     
 
# Compare specific variables only
fs_clustered.clustering.plot.compare(variables='HeatDemand(Q_th)|fixed_relative_profile')
# Compare specific variables only fs_clustered.clustering.plot.compare(variables='HeatDemand(Q_th)|fixed_relative_profile')

Out[7]:

In [9]:

  Copied!     
 
# View typical period profiles for each cluster
# Each line represents a cluster's representative day
fs_clustered.clustering.plot.clusters(variables='HeatDemand(Q_th)|fixed_relative_profile', color='cluster')
# View typical period profiles for each cluster # Each line represents a cluster's representative day fs_clustered.clustering.plot.clusters(variables='HeatDemand(Q_th)|fixed_relative_profile', color='cluster')

Out[9]:

In [10]:

  Copied!     
 
# Heatmap shows cluster assignments for each original period
fs_clustered.clustering.plot.heatmap()
# Heatmap shows cluster assignments for each original period fs_clustered.clustering.plot.heatmap()

Out[10]:

Expanding Aggregated Data¶

The Clustering.expand_data() method maps aggregated data back to original timesteps. This is useful for comparing clustering results before optimization:

In [11]:

  Copied!     
 
# Get original and aggregated data
clustering = fs_clustered.clustering
original = clustering.original_data['HeatDemand(Q_th)|fixed_relative_profile']
aggregated = clustering.aggregated_data['HeatDemand(Q_th)|fixed_relative_profile']

# Expand aggregated data back to original timesteps
expanded = clustering.expand_data(aggregated)

print(f'Original:   {len(original.time)} timesteps')
print(f'Aggregated: {len(aggregated.time)} timesteps')
print(f'Expanded:   {len(expanded.time)} timesteps')
# Get original and aggregated data clustering = fs_clustered.clustering original = clustering.original_data['HeatDemand(Q_th)|fixed_relative_profile'] aggregated = clustering.aggregated_data['HeatDemand(Q_th)|fixed_relative_profile'] # Expand aggregated data back to original timesteps expanded = clustering.expand_data(aggregated) print(f'Original: {len(original.time)} timesteps') print(f'Aggregated: {len(aggregated.time)} timesteps') print(f'Expanded: {len(expanded.time)} timesteps')

Original:   744 timesteps
Aggregated: 24 timesteps
Expanded:   744 timesteps

Summary¶

Property	Description
`clustering.n_clusters`	Number of representative clusters
`clustering.timesteps_per_cluster`	Timesteps in each cluster period
`clustering.cluster_assignments`	Maps original periods to clusters
`clustering.cluster_occurrences`	Count of original periods per cluster
`clustering.timestep_mapping`	Maps original timesteps to representative indices
`clustering.original_data`	Dataset before clustering
`clustering.aggregated_data`	Dataset after clustering
`clustering.results`	`ClusteringResults` with xarray-like interface

ClusteringResults (xarray-like)¶

Access the underlying tsam results via clustering.results:

# Dimension info (like xarray)
clustering.results.dims      # ('period', 'scenario') or ()
clustering.results.coords    # {'period': [2020, 2030], 'scenario': ['high', 'low']}

# Select specific result (like xarray)
clustering.results.sel(period=2020, scenario='high')   # Label-based
clustering.results.isel(period=0, scenario=1)          # Index-based

Plot Accessor Methods¶

Method	Description
`plot.compare()`	Compare original vs aggregated data (timeseries)
`plot.compare(kind='duration_curve')`	Compare as duration curves
`plot.clusters()`	View each cluster's profile
`plot.heatmap()`	Visualize cluster assignments

Key Parameters¶

# Compare with options
clustering.plot.compare(
    variables='Demand|profile',       # Single variable, list, or None (all)
    kind='timeseries',                # 'timeseries' or 'duration_curve'
    select={'scenario': 'Base'},      # xarray-style selection
    colors='viridis',                 # Colorscale name, list, or dict
    facet_col='period',               # Facet by period if present
    facet_row='scenario',             # Facet by scenario if present
)

# Heatmap shows cluster assignments (no variable needed)
clustering.plot.heatmap()

# Expand aggregated data to original timesteps
expanded = clustering.expand_data(aggregated_data)

Cluster Weights¶

Each representative timestep has a weight equal to the number of original periods it represents. This ensures operational costs scale correctly:

$$\text{Objective} = \sum_{t \in \text{typical}} w_t \cdot c_t$$

The weights sum to the original timestep count:

In [12]:

  Copied!     
 
print(f'Sum of weights: {fs_clustered.cluster_weight.sum().item():.0f}')
print(f'Original timesteps: {len(flow_system.timesteps)}')
print(f'Sum of weights: {fs_clustered.cluster_weight.sum().item():.0f}') print(f'Original timesteps: {len(flow_system.timesteps)}')

Sum of weights: 31
Original timesteps: 744

Solution Expansion¶

After optimization, expand() maps results back to full resolution:

In [13]:

  Copied!     
 
solver = fx.solvers.HighsSolver(mip_gap=0.01, log_to_console=False)
fs_clustered.optimize(solver)

fs_expanded = fs_clustered.transform.expand()

print(f'Clustered: {len(fs_clustered.timesteps)} timesteps')
print(f'Expanded:  {len(fs_expanded.timesteps)} timesteps')
solver = fx.solvers.HighsSolver(mip_gap=0.01, log_to_console=False) fs_clustered.optimize(solver) fs_expanded = fs_clustered.transform.expand() print(f'Clustered: {len(fs_clustered.timesteps)} timesteps') print(f'Expanded: {len(fs_expanded.timesteps)} timesteps')

Clustered: 24 timesteps
Expanded:  744 timesteps

IO Workflow¶

When saving and loading a clustered FlowSystem, most clustering information is preserved. However, some methods that access tsam's internal AggregationResult objects are not available after IO.

What's Preserved After IO¶

Structure: n_clusters, timesteps_per_cluster, dims, coords
Mappings: cluster_assignments, cluster_occurrences, timestep_mapping
Data: original_data, aggregated_data
Original timesteps: original_timesteps
Results structure: results.sel(), results.isel() for ClusteringResult access

What's Lost After IO¶

clustering.sel(): Accessing full AggregationResult objects
clustering.items(): Iterating over AggregationResult objects
tsam internals: AggregationResult.accuracy, AggregationResult.plot, etc.

In [14]:

  Copied!     
 
# Before IO: Full tsam access is available
result = fs_clustered.clustering.sel()  # Get the AggregationResult
print(f'Before IO - AggregationResult available: {type(result).__name__}')
print(f'  - n_clusters: {result.n_clusters}')
print(f'  - accuracy.rmse (mean): {result.accuracy.rmse.mean():.4f}')
# Before IO: Full tsam access is available result = fs_clustered.clustering.sel() # Get the AggregationResult print(f'Before IO - AggregationResult available: {type(result).__name__}') print(f' - n_clusters: {result.n_clusters}') print(f' - accuracy.rmse (mean): {result.accuracy.rmse.mean():.4f}')

Before IO - AggregationResult available: AggregationResult
  - n_clusters: 8
  - accuracy.rmse (mean): 0.0924

In [15]:

  Copied!     
 
# Save and load the clustered system
import tempfile
from pathlib import Path

try:
    with tempfile.TemporaryDirectory() as tmpdir:
        path = Path(tmpdir) / 'clustered_system.nc'
        fs_clustered.to_netcdf(path)
        fs_loaded = fx.FlowSystem.from_netcdf(path)

    # Structure is preserved
    print('After IO - Structure preserved:')
    print(f'  - n_clusters: {fs_loaded.clustering.n_clusters}')
    print(f'  - dims: {fs_loaded.clustering.dims}')
    print(f'  - original_data variables: {list(fs_loaded.clustering.original_data.data_vars)[:3]}...')
except OSError as e:
    print(f'Note: NetCDF save/load skipped due to environment issue: {type(e).__name__}')
    print('This can happen in some CI environments. The functionality works locally.')
    fs_loaded = fs_clustered  # Use original for subsequent cells
# Save and load the clustered system import tempfile from pathlib import Path try: with tempfile.TemporaryDirectory() as tmpdir: path = Path(tmpdir) / 'clustered_system.nc' fs_clustered.to_netcdf(path) fs_loaded = fx.FlowSystem.from_netcdf(path) # Structure is preserved print('After IO - Structure preserved:') print(f' - n_clusters: {fs_loaded.clustering.n_clusters}') print(f' - dims: {fs_loaded.clustering.dims}') print(f' - original_data variables: {list(fs_loaded.clustering.original_data.data_vars)[:3]}...') except OSError as e: print(f'Note: NetCDF save/load skipped due to environment issue: {type(e).__name__}') print('This can happen in some CI environments. The functionality works locally.') fs_loaded = fs_clustered # Use original for subsequent cells

Note: NetCDF save/load skipped due to environment issue: OSError
This can happen in some CI environments. The functionality works locally.

In [16]:

  Copied!     
 
# After IO: sel() raises ValueError because AggregationResult is not preserved
try:
    fs_loaded.clustering.sel()
except ValueError as e:
    print('After IO - sel() raises ValueError:')
    print(f'  "{e}"')
# After IO: sel() raises ValueError because AggregationResult is not preserved try: fs_loaded.clustering.sel() except ValueError as e: print('After IO - sel() raises ValueError:') print(f' "{e}"')

In [17]:

  Copied!     
 
# Key operations still work after IO:
# - Optimization
# - Expansion back to full resolution
# - Accessing original_data and aggregated_data

fs_loaded.optimize(solver)
fs_loaded_expanded = fs_loaded.transform.expand()

print('Loaded system can still be:')
print(f'  - Optimized: {fs_loaded.solution is not None}')
print(f'  - Expanded: {len(fs_loaded_expanded.timesteps)} timesteps')
# Key operations still work after IO: # - Optimization # - Expansion back to full resolution # - Accessing original_data and aggregated_data fs_loaded.optimize(solver) fs_loaded_expanded = fs_loaded.transform.expand() print('Loaded system can still be:') print(f' - Optimized: {fs_loaded.solution is not None}') print(f' - Expanded: {len(fs_loaded_expanded.timesteps)} timesteps')

Loaded system can still be:
  - Optimized: True
  - Expanded: 744 timesteps

IO Workflow Summary¶

┌─────────────────┐    to_netcdf()     ┌─────────────────┐
│  fs_clustered   │ ─────────────────► │   NetCDF file   │
│                 │                    │                 │
│ ✓ clustering    │                    │ ✓ structure     │
│ ✓ sel()         │                    │ ✓ mappings      │
│ ✓ items()       │                    │ ✓ data          │
│ ✓ AggregationResult                  │ ✗ AggregationResult
└─────────────────┘                    └─────────────────┘
                                              │
                                              │ from_netcdf()
                                              ▼
                                       ┌─────────────────┐
                                       │   fs_loaded     │
                                       │                 │
                                       │ ✓ optimize()    │
                                       │ ✓ expand()      │
                                       │ ✓ original_data │
                                       │ ✗ sel()         │
                                       │ ✗ items()       │
                                       └─────────────────┘

!!! tip "Best Practice" If you need tsam's AggregationResult for analysis (accuracy metrics, built-in plots), do this before saving the FlowSystem. After loading, the core workflow (optimize → expand) works normally.

Reducing File Size¶

For smaller files (~38% reduction), use include_original_data=False when saving. This disables plot.compare() after loading, but the core workflow still works:

In [18]:

  Copied!     
 
import tempfile
from pathlib import Path

# Compare file sizes with and without original_data
try:
    with tempfile.TemporaryDirectory() as tmpdir:
        path_full = Path(tmpdir) / 'full.nc'
        path_small = Path(tmpdir) / 'small.nc'

        fs_clustered.to_netcdf(path_full, include_original_data=True)
        fs_clustered.to_netcdf(path_small, include_original_data=False)

        size_full = path_full.stat().st_size / 1024
        size_small = path_small.stat().st_size / 1024

    print(f'With original_data:    {size_full:.1f} KB')
    print(f'Without original_data: {size_small:.1f} KB')
    print(f'Size reduction: {(1 - size_small / size_full) * 100:.0f}%')
except OSError as e:
    print(f'Note: File size comparison skipped due to environment issue: {type(e).__name__}')
import tempfile from pathlib import Path # Compare file sizes with and without original_data try: with tempfile.TemporaryDirectory() as tmpdir: path_full = Path(tmpdir) / 'full.nc' path_small = Path(tmpdir) / 'small.nc' fs_clustered.to_netcdf(path_full, include_original_data=True) fs_clustered.to_netcdf(path_small, include_original_data=False) size_full = path_full.stat().st_size / 1024 size_small = path_small.stat().st_size / 1024 print(f'With original_data: {size_full:.1f} KB') print(f'Without original_data: {size_small:.1f} KB') print(f'Size reduction: {(1 - size_small / size_full) * 100:.0f}%') except OSError as e: print(f'Note: File size comparison skipped due to environment issue: {type(e).__name__}')

Note: File size comparison skipped due to environment issue: OSError