Time Series Clustering with cluster()¶
Accelerate investment optimization using typical periods (clustering).
This notebook demonstrates:
- Typical periods: Cluster similar time segments (e.g., days) and solve only representative ones
- Weighted costs: Automatically weight operational costs by cluster occurrence
- Two-stage workflow: Fast sizing with clustering, accurate dispatch at full resolution
!!! note "Requirements" This notebook requires the tsam package with ClusterConfig and ExtremeConfig support. Install with: pip install "flixopt[full]"
import timeit
import pandas as pd
import xarray as xr
import flixopt as fx
fx.CONFIG.notebook()
flixopt.config.CONFIG
Create the FlowSystem¶
We use a district heating system with real-world time series data (one month at hourly resolution):
from data.generate_example_systems import create_district_heating_system
flow_system = create_district_heating_system()
flow_system.connect_and_transform()
timesteps = flow_system.timesteps
flow_system
FlowSystem
==========
Timesteps: 744 (Hour) [2020-01-01 to 2020-01-31]
Periods: None
Scenarios: None
Status: ✓

Components (9 items)
--------------------
 * Boiler
 * CHP
 * CoalSupply
 * ElecDemand
 * GasGrid
 * GridBuy
 * GridSell
 * HeatDemand
 * Storage

Buses (4 items)
---------------
 * Coal
 * Electricity
 * Gas
 * Heat

Effects (2 items)
-----------------
 * CO2
 * costs

Flows (13 items)
----------------
 * Boiler(Q_fu)
 * Boiler(Q_th)
 * CHP(P_el)
 * CHP(Q_fu)
 * CHP(Q_th)
 * CoalSupply(Q_Coal)
 * ElecDemand(P_el)
 * GasGrid(Q_Gas)
 * GridBuy(P_el)
 * GridSell(P_el)
 ... (+3 more)
# Visualize input data
input_ds = xr.Dataset(
{
'Heat Demand': flow_system.components['HeatDemand'].inputs[0].fixed_relative_profile,
'Electricity Price': flow_system.components['GridBuy'].outputs[0].effects_per_flow_hour['costs'],
}
)
input_ds.plotly.line(x='time', facet_row='variable', title='One Month of Input Data')
Method 1: Full Optimization (Baseline)¶
First, solve the complete problem with all 744 timesteps:
solver = fx.solvers.HighsSolver(mip_gap=0.01)
start = timeit.default_timer()
fs_full = flow_system.copy()
fs_full.name = 'Full Optimization'
fs_full.optimize(solver)
time_full = timeit.default_timer() - start
Method 2: Clustering with cluster()¶
The cluster() method:
- Clusters similar days using the TSAM (Time Series Aggregation Module) package
- Reduces timesteps to only typical periods (e.g., 8 typical days = 192 timesteps)
- Weights costs by how many original days each typical day represents (see the sketch below)
- Handles storage with configurable behavior via cluster_mode
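Conceptually, the weighted operational cost can be sketched as follows (this is our own notation for illustration, not flixopt's internal formulation):

$$C_\mathrm{op} \approx \sum_{c=1}^{n_\mathrm{clusters}} w_c \, C_c$$

where $C_c$ is the operational cost over one occurrence of typical period $c$ and $w_c$ is the number of original days assigned to cluster $c$ (the cluster_occurrences described further below).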
!!! warning "Peak Forcing" Always use extremes=ExtremeConfig(max_value=[...]) to ensure extreme demand days are captured. Without this, clustering may miss peak periods, causing undersized components.
from tsam import ExtremeConfig
start = timeit.default_timer()
# IMPORTANT: Force inclusion of peak demand periods!
peak_series = ['HeatDemand(Q_th)|fixed_relative_profile']
# Create reduced FlowSystem with 8 typical days
fs_clustered = flow_system.transform.cluster(
n_clusters=8, # 8 typical days
cluster_duration='1D', # Daily clustering
extremes=ExtremeConfig(
method='new_cluster', max_value=peak_series, preserve_n_clusters=True
), # Capture peak demand day
)
fs_clustered.name = 'Clustered (8 days)'
time_clustering = timeit.default_timer() - start
# Optimize the reduced system
start = timeit.default_timer()
fs_clustered.optimize(solver)
time_clustered = timeit.default_timer() - start
Understanding the Clustering¶
The clustering algorithm groups similar days together. Access all metadata via fs.clustering:
# Access clustering metadata directly
clustering = fs_clustered.clustering.results
clustering
ClusteringResults(n_clusters=8)
# Show clustering info using __repr__
fs_clustered.clustering
Clustering( 31 periods → 8 clusters timesteps_per_cluster=24 dims=[] )
# Quality metrics - how well do the clusters represent the original data?
# Lower RMSE/MAE = better representation
fs_clustered.clustering.metrics.to_dataframe().style.format('{:.3f}')
| time_series | RMSE | MAE | RMSE_duration |
|---|---|---|---|
| ElecDemand(P_el)\|fixed_relative_profile | 0.056 | 0.016 | 0.030 |
| GasGrid(Q_Gas)\|costs\|per_flow_hour | 0.109 | 0.079 | 0.079 |
| GridBuy(P_el)\|costs\|per_flow_hour | 0.108 | 0.070 | 0.030 |
| GridSell(P_el)\|costs\|per_flow_hour | 0.108 | 0.070 | 0.029 |
| HeatDemand(Q_th)\|fixed_relative_profile | 0.081 | 0.050 | 0.017 |
# Visual comparison: original vs clustered time series
fs_clustered.clustering.plot.compare()
Inspect Clustering Input Data¶
Before clustering, you can inspect which time-varying data will be used. The clustering_data() method returns only the arrays that vary over time (constant arrays are excluded since they don't affect clustering):
# See what data will be used for clustering
clustering_data = flow_system.transform.clustering_data()
print(f'Variables used for clustering ({len(clustering_data.data_vars)} total):')
for var in clustering_data.data_vars:
print(f' - {var}')
Variables used for clustering (5 total):
 - GasGrid(Q_Gas)|costs|per_flow_hour
 - GridBuy(P_el)|costs|per_flow_hour
 - GridSell(P_el)|costs|per_flow_hour
 - HeatDemand(Q_th)|fixed_relative_profile
 - ElecDemand(P_el)|fixed_relative_profile
# Visualize the time-varying data (select a few key variables)
key_vars = [v for v in clustering_data.data_vars if 'fixed_relative_profile' in v or 'per_flow_hour' in v]
clustering_data[key_vars].plotly.line(facet_row='variable', title='Time-Varying Data Used for Clustering')
Selective Clustering with data_vars¶
By default, clustering uses all time-varying data to determine typical periods. However, you may want to cluster based on only a subset of variables while still applying the clustering to all data.
Use the data_vars parameter to specify which variables determine the clustering:
- Cluster based on subset: Only the specified variables affect which days are grouped together
- Apply to all data: The resulting clustering is applied to ALL time-varying data
This is useful when:
- You want to cluster based on demand patterns only (ignoring price variations)
- You have dominant time series that should drive the clustering
- You want to ensure certain patterns are well-represented in typical periods
# Cluster based ONLY on heat demand pattern (ignore electricity prices)
demand_var = 'HeatDemand(Q_th)|fixed_relative_profile'
fs_demand_only = flow_system.transform.cluster(
n_clusters=8,
cluster_duration='1D',
data_vars=[demand_var], # Only this variable determines clustering
extremes=ExtremeConfig(method='new_cluster', max_value=[demand_var], preserve_n_clusters=True),
)
# Verify: clustering was determined by demand but applied to all data
print(f'Clustered using: {demand_var}')
print(f'But all {len(clustering_data.data_vars)} variables are included in the result')
Clustered using: HeatDemand(Q_th)|fixed_relative_profile
But all 5 variables are included in the result
# Compare metrics of the first time series: clustering with all data vs. demand-only
pd.DataFrame(
{
'All Variables': fs_clustered.clustering.metrics.to_dataframe().iloc[0],
'Demand Only': fs_demand_only.clustering.metrics.to_dataframe().iloc[0],
}
).round(4)
|  | All Variables | Demand Only |
|---|---|---|
| RMSE | 0.0563 | 0.1262 |
| MAE | 0.0157 | 0.0447 |
| RMSE_duration | 0.0303 | 0.0295 |
Advanced Clustering Options¶
The cluster() method exposes many parameters for fine-tuning:
from tsam import ClusterConfig
# Try different clustering algorithms
fs_kmeans = flow_system.transform.cluster(
n_clusters=8,
cluster_duration='1D',
cluster=ClusterConfig(method='kmeans'), # Alternative: 'hierarchical' (default), 'kmedoids', 'averaging'
)
fs_kmeans.clustering
Clustering( 31 periods → 8 clusters timesteps_per_cluster=24 dims=[] )
# Compare quality metrics between algorithms
pd.DataFrame(
{
'hierarchical': fs_clustered.clustering.metrics.to_dataframe().iloc[0],
'kmeans': fs_kmeans.clustering.metrics.to_dataframe().iloc[0],
}
)
|  | hierarchical | kmeans |
|---|---|---|
| RMSE | 0.056259 | 0.047047 |
| MAE | 0.015673 | 0.012596 |
| RMSE_duration | 0.030335 | 0.022595 |
# Visualize cluster structure with heatmap
fs_clustered.clustering.plot.heatmap()
Apply Existing Clustering¶
When comparing design variants or performing sensitivity analysis, you often want to use the same cluster structure across different FlowSystem configurations. Use apply_clustering() to reuse a clustering from another FlowSystem:
# First, create a reference clustering
fs_reference = flow_system.transform.cluster(n_clusters=8, cluster_duration='1D')
# Modify the FlowSystem (e.g., different storage size)
flow_system_modified = flow_system.copy()
flow_system_modified.components['Storage'].capacity_in_flow_hours.maximum_size = 2000
# Apply the SAME clustering for fair comparison
fs_modified = flow_system_modified.transform.apply_clustering(fs_reference.clustering)
This ensures both systems use identical typical periods for fair comparison.
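As a quick check, both variants can then be optimized and their objective values compared. This is a minimal sketch reusing the solver and accessors from earlier cells; the printed values depend on your run:

# Optimize both variants on the identical cluster structure
fs_reference.optimize(solver)
fs_modified.optimize(solver)

# Compare objective values (costs) between the two designs
print(f"Reference:        {fs_reference.solution['costs'].item():,.0f} €")
print(f"Modified storage: {fs_modified.solution['costs'].item():,.0f} €")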
Method 3: Two-Stage Workflow (Recommended)¶
The recommended approach for investment optimization:
- Stage 1: Fast sizing with cluster()
- Stage 2: Fix sizes (with safety margin) and dispatch at full resolution
!!! tip "Safety Margin" Typical periods aggregate similar days, so individual days may have higher demand than the typical day. Adding a 5-10% margin ensures feasibility.
# Apply safety margin to sizes
SAFETY_MARGIN = 1.05 # 5% buffer
sizes_with_margin = {name: float(size.item()) * SAFETY_MARGIN for name, size in fs_clustered.stats.sizes.items()}
# Stage 2: Fix sizes and optimize at full resolution
start = timeit.default_timer()
fs_dispatch = flow_system.transform.fix_sizes(sizes_with_margin)
fs_dispatch.name = 'Two-Stage'
fs_dispatch.optimize(solver)
time_dispatch = timeit.default_timer() - start
# Total two-stage time
total_two_stage = time_clustering + time_clustered + time_dispatch
Compare Results¶
results = {
'Full (baseline)': {
'Time [s]': time_full,
'Cost [€]': fs_full.solution['costs'].item(),
'CHP': fs_full.stats.sizes['CHP(Q_th)'].item(),
'Boiler': fs_full.stats.sizes['Boiler(Q_th)'].item(),
'Storage': fs_full.stats.sizes['Storage'].item(),
},
'Clustered (8 days)': {
'Time [s]': time_clustering + time_clustered,
'Cost [€]': fs_clustered.solution['costs'].item(),
'CHP': fs_clustered.stats.sizes['CHP(Q_th)'].item(),
'Boiler': fs_clustered.stats.sizes['Boiler(Q_th)'].item(),
'Storage': fs_clustered.stats.sizes['Storage'].item(),
},
'Two-Stage': {
'Time [s]': total_two_stage,
'Cost [€]': fs_dispatch.solution['costs'].item(),
'CHP': sizes_with_margin['CHP(Q_th)'],
'Boiler': sizes_with_margin['Boiler(Q_th)'],
'Storage': sizes_with_margin['Storage'],
},
}
comparison = pd.DataFrame(results).T
baseline_cost = comparison.loc['Full (baseline)', 'Cost [€]']
baseline_time = comparison.loc['Full (baseline)', 'Time [s]']
comparison['Cost Gap [%]'] = ((comparison['Cost [€]'] - baseline_cost) / abs(baseline_cost) * 100).round(2)
comparison['Speedup'] = (baseline_time / comparison['Time [s]']).round(1)
comparison.style.format(
{
'Time [s]': '{:.1f}',
'Cost [€]': '{:,.0f}',
'CHP': '{:.1f}',
'Boiler': '{:.1f}',
'Storage': '{:.0f}',
'Cost Gap [%]': '{:.2f}',
'Speedup': '{:.1f}x',
}
)
|  | Time [s] | Cost [€] | CHP | Boiler | Storage | Cost Gap [%] | Speedup |
|---|---|---|---|---|---|---|---|
| Full (baseline) | 17.1 | -148,912 | 165.7 | 0.0 | 1000 | 0.00 | 1.0x |
| Clustered (8 days) | 6.7 | -137,579 | 171.7 | 0.0 | 1000 | 7.61 | 2.5x |
| Two-Stage | 13.7 | -150,096 | 180.3 | 0.0 | 1050 | -0.79 | 1.2x |
Expand Solution to Full Resolution¶
Use expand() to map the clustered solution back to all original timesteps. This repeats the typical period values for all days belonging to that cluster:
# Expand the clustered solution to full resolution
fs_expanded = fs_clustered.transform.expand()
# Compare heat production: Full vs Expanded
heat_flows = ['CHP(Q_th)|flow_rate', 'Boiler(Q_th)|flow_rate']
# Create comparison dataset
comparison_ds = xr.Dataset(
{
name.replace('|flow_rate', ''): xr.concat(
[fs_full.solution[name], fs_expanded.solution[name]], dim=pd.Index(['Full', 'Expanded'], name='method')
)
for name in heat_flows
}
)
comparison_ds.plotly.line(x='time', facet_col='variable', color='method', title='Heat Production Comparison')
Visualize Storage Operation¶
# Storage operation on the clustered (typical-period) system
fs_clustered.stats.plot.storage('Storage')
# Storage operation after expanding the solution to full resolution
fs_expanded.stats.plot.storage('Storage')
API Reference¶
transform.cluster() Parameters¶
| Parameter | Type | Default | Description |
|---|---|---|---|
| n_clusters | int | - | Number of typical periods (e.g., 8 typical days) |
| cluster_duration | str \| float | - | Duration per cluster ('1D', '24h') or hours |
| data_vars | list[str] | None | Variables to cluster on (applies result to all) |
| weights | dict[str, float] | None | Optional weights for time series in clustering |
| cluster | ClusterConfig | None | Clustering algorithm configuration |
| extremes | ExtremeConfig | None | Essential: Force inclusion of peak/min periods |
| **tsam_kwargs | - | - | Additional tsam parameters |
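For example, the weights parameter can emphasize certain time series when the clusters are formed. A hedged sketch using names from this notebook (the weight value and the fs_weighted name are illustrative only):

# Give the heat demand profile twice the influence of the other series when forming clusters
fs_weighted = flow_system.transform.cluster(
    n_clusters=8,
    cluster_duration='1D',
    weights={'HeatDemand(Q_th)|fixed_relative_profile': 2.0},
)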
transform.clustering_data() Method¶
Inspect which time-varying data will be used for clustering:
# Get all time-varying variables
clustering_data = flow_system.transform.clustering_data()
print(list(clustering_data.data_vars))
# Get data for a specific period (multi-period systems)
clustering_data = flow_system.transform.clustering_data(period=2024)
Clustering Object Properties¶
After clustering, access metadata via fs.clustering:
| Property | Description |
|---|---|
| n_clusters | Number of clusters |
| n_original_clusters | Number of original time segments (e.g., 365 days) |
| timesteps_per_cluster | Timesteps in each cluster (e.g., 24 for daily) |
| cluster_assignments | xr.DataArray mapping original segment → cluster ID |
| cluster_occurrences | How many original segments each cluster represents |
| metrics | xr.Dataset with RMSE, MAE per time series |
| results | ClusteringResults with xarray-like interface |
| plot.compare() | Compare original vs clustered time series |
| plot.heatmap() | Visualize cluster structure |
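A quick sketch of reading this metadata on the clustered system from this notebook (property names as listed in the table above):

meta = fs_clustered.clustering
print(meta.n_clusters)             # 8 typical days
print(meta.timesteps_per_cluster)  # 24 (daily clusters at hourly resolution)
print(meta.cluster_occurrences)    # how many original days each typical day represents
print(meta.cluster_assignments)    # cluster ID assigned to each original day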
ClusteringResults (xarray-like)¶
Access the underlying tsam results via clustering.results:
# Dimension info (like xarray)
clustering.results.dims # ('period', 'scenario') or ()
clustering.results.coords # {'period': [2020, 2030], 'scenario': ['high', 'low']}
# Select specific result (like xarray)
clustering.results.sel(period=2020, scenario='high') # Label-based
clustering.results.isel(period=0, scenario=1) # Index-based
# Apply existing clustering to new data
agg_results = clustering.results.apply(dataset) # Returns AggregationResults
Storage Behavior¶
Each Storage component has a cluster_mode parameter:
| Mode | Description |
|---|---|
| 'intercluster_cyclic' | Links storage across clusters + yearly cyclic (default) |
| 'intercluster' | Links storage across clusters, free start/end |
| 'cyclic' | Each cluster is independent but cyclic (start = end) |
| 'independent' | Each cluster is independent, free start/end |
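For example, switching the storage to independent typical periods might look like the following sketch (assuming cluster_mode can be set as a plain attribute on the existing Storage component, analogous to the size change earlier in this notebook):

# Set the storage behavior before creating the clustered system
fs_variant = flow_system.copy()
fs_variant.components['Storage'].cluster_mode = 'independent'  # assumed attribute access
fs_variant_clustered = fs_variant.transform.cluster(n_clusters=8, cluster_duration='1D')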
For a detailed comparison of storage modes, see 08c2-clustering-storage-modes.
Peak Forcing with ExtremeConfig¶
from tsam import ExtremeConfig
extremes = ExtremeConfig(
method='new_cluster', # Creates new cluster for extremes
max_value=['ComponentName(FlowName)|fixed_relative_profile'], # Capture peak demand
preserve_n_clusters=True, # Keep total cluster count unchanged
)
Recommended Workflow¶
from tsam import ExtremeConfig
# Stage 1: Fast sizing
fs_sizing = flow_system.transform.cluster(
n_clusters=8,
cluster_duration='1D',
extremes=ExtremeConfig(method='new_cluster', max_value=['Demand(Flow)|fixed_relative_profile'], preserve_n_clusters=True),
)
fs_sizing.optimize(solver)
# Apply safety margin
sizes = {k: v.item() * 1.05 for k, v in fs_sizing.stats.sizes.items()}
# Stage 2: Accurate dispatch
fs_dispatch = flow_system.transform.fix_sizes(sizes)
fs_dispatch.optimize(solver)
Summary¶
You learned how to:
- Use cluster() to reduce time series into typical periods
- Inspect clustering data with clustering_data() before clustering
- Use data_vars to cluster based on specific variables only
- Apply peak forcing with ExtremeConfig to capture extreme demand days
- Use two-stage optimization for fast yet accurate investment decisions
- Expand solutions back to full resolution with expand()
- Access clustering metadata via fs.clustering (metrics, cluster_assignments, cluster_occurrences)
- Use advanced options like different algorithms with ClusterConfig
- Apply existing clustering to other FlowSystems using apply_clustering()
Key Takeaways¶
- Always use peak forcing (extremes=ExtremeConfig(max_value=[...])) for demand time series
- Inspect data first with clustering_data() to see available variables
- Use data_vars to cluster on specific variables (e.g., demand only, ignoring prices)
- Add a safety margin (5-10%) when fixing sizes from clustering
- Two-stage is recommended: clustering for sizing, full resolution for dispatch
- Storage handling is configurable via cluster_mode
- Check metrics to evaluate clustering quality
- Use apply_clustering() to apply the same clustering to different FlowSystem variants
Next Steps¶
- 08c2-clustering-storage-modes: Compare storage modes using a seasonal storage system
- 08d-clustering-multiperiod: Clustering with multiple periods and scenarios