Performance

The Performance page provides real-time visibility into your Ceph cluster’s performance through interactive charts powered by Prometheus metrics. Monitor storage usage, IOPS, throughput, and latency to understand cluster behavior and identify potential bottlenecks.

Key Concepts

IOPS

Input/Output Operations Per Second - measures the rate of read and write operations.

Throughput

Data transfer rate measured in bytes per second (B/s, MB/s, GB/s).

Latency

Time delay for operations, measured in milliseconds (ms). Lower is better.

Recovery Throughput

Rate of data being recovered or rebalanced across OSDs.

Required Permissions

Action	Permission
View Performance Metrics	`iam:project:infrastructure:ceph:read`

This page is read-only. Performance data is collected automatically by Prometheus from the Ceph cluster.

Available Metrics

Storage Used

Total storage capacity currently in use across the cluster.

Property	Value
Query	`ceph_cluster_total_used_bytes`
Unit	Bytes (displayed as KB, MB, GB, TB)
Update	Every 30 seconds
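
The raw value of `ceph_cluster_total_used_bytes` is a plain byte count, so it has to be scaled before it is readable. The following is a minimal sketch of that conversion, not the page's actual formatter: `format_bytes` is a hypothetical helper, and the divisor of 1024 is an assumption (the page may use 1000 instead).

```python
def format_bytes(num_bytes: float, base: int = 1024) -> str:
    """Render a byte count in B, KB, MB, GB, or TB.

    `base` is an assumption: 1024 here, but a UI may equally divide by 1000.
    """
    units = ["B", "KB", "MB", "GB", "TB"]
    value = float(num_bytes)
    # Step up one unit at a time until the value drops below the base.
    for unit in units[:-1]:
        if abs(value) < base:
            return f"{value:.2f} {unit}"
        value /= base
    # Anything that is still large after GB is reported in TB.
    return f"{value:.2f} {units[-1]}"
```

For example, `format_bytes(1536)` yields `1.50 KB` with the default base.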

IOPS (Input/Output Operations Per Second)

Metric	Description	Query
Read IOPS	Rate of read operations across all pools	`sum(rate(ceph_pool_rd[1m]))`
Write IOPS	Rate of write operations across all pools	`sum(rate(ceph_pool_wr[1m]))`
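
The IOPS queries turn ever-increasing per-pool operation counters into a per-second rate and then add the pools together. The sketch below illustrates that idea in plain Python. It is a simplification, not Prometheus's implementation: the real `rate()` also extrapolates to the edges of the window, which this version skips. The function names and sample data are hypothetical.

```python
from typing import Mapping, Sequence, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, counter value)


def simple_rate(samples: Sequence[Sample]) -> float:
    """Per-second increase of a counter across the given samples (oldest first).

    A drop in value is treated as a counter reset, i.e. the counter
    restarted from zero, so the new value counts as the increase.
    """
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        increase += curr - prev if curr >= prev else curr
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


def cluster_rate(per_pool: Mapping[str, Sequence[Sample]]) -> float:
    """Sum the per-pool rates, mirroring the outer sum(...) in the query."""
    return sum(simple_rate(samples) for samples in per_pool.values())
```

With one pool going from 100 to 700 operations over 60 seconds, `simple_rate` gives 10 operations per second for that pool.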

Throughput

Metric	Description	Query
Read Throughput	Data read rate in bytes per second	`sum(rate(ceph_pool_rd_bytes[1m]))`
Write Throughput	Data write rate in bytes per second	`sum(rate(ceph_pool_wr_bytes[1m]))`

Latency

Metric	Description	Query
Apply Latency	Average time for OSDs to apply operations	`avg(avg_over_time(ceph_osd_apply_latency_ms[1m]))`
Commit Latency	Average time for OSDs to commit operations to disk	`avg(avg_over_time(ceph_osd_commit_latency_ms[1m]))`
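
The latency queries average twice: first over time for each OSD, then across OSDs. A rough Python equivalent, using a hypothetical helper and made-up sample values, shows what that nesting implies: every OSD carries equal weight, regardless of how many samples it reported.

```python
from statistics import mean
from typing import Mapping, Optional, Sequence


def avg_of_avg_over_time(per_osd_ms: Mapping[str, Sequence[float]]) -> Optional[float]:
    """Average each OSD's samples over the window, then average those results.

    Returns None when no OSD has any samples, similar to an empty query result.
    """
    per_osd_avg = [mean(values) for values in per_osd_ms.values() if values]
    return mean(per_osd_avg) if per_osd_avg else None
```

If `osd.0` reports 2 ms and 4 ms while `osd.1` reports 10 ms four times, the result is 6.5 ms (the mean of 3 and 10), not the pooled mean of all six samples.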

Recovery Throughput

Rate of data being recovered or rebalanced between OSDs.

Property	Value
Query	`sum(rate(ceph_osd_recovery_bytes[1m]))`
Unit	Bytes per second

Time Ranges

Select a time window to view historical performance data:

Range	Description
5m	Last 5 minutes - high resolution for immediate issues
15m	Last 15 minutes - short-term trend analysis
30m	Last 30 minutes - recent activity overview
1h	Last hour - operational view
3h	Last 3 hours - extended operations
6h	Last 6 hours - half-day analysis
12h	Last 12 hours - business hours view
24h	Last 24 hours - daily pattern analysis
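
Each range label corresponds to a look-back window in seconds. The sketch below shows how such a window could be turned into parameters for the Prometheus HTTP API's `/api/v1/query_range` endpoint. It is illustrative only: the helper names, the `max_points` step calculation, and the example host are assumptions, not the page's documented behavior.

```python
from urllib.parse import urlencode

# Look-back window, in seconds, for each range label.
RANGE_SECONDS = {
    "5m": 300, "15m": 900, "30m": 1800, "1h": 3600,
    "3h": 10800, "6h": 21600, "12h": 43200, "24h": 86400,
}


def range_query_params(query: str, range_label: str, end_ts: int,
                       max_points: int = 120) -> dict:
    """Build start/end/step for a range query ending at `end_ts` (unix seconds).

    `max_points` is a hypothetical cap on points per chart; the step
    grows with the window so longer ranges use coarser resolution.
    """
    window = RANGE_SECONDS[range_label]
    step = max(window // max_points, 1)
    return {"query": query, "start": end_ts - window, "end": end_ts, "step": step}


def range_query_url(base_url: str, query: str, range_label: str, end_ts: int) -> str:
    """Assemble the full request URL for the query_range endpoint."""
    params = range_query_params(query, range_label, end_ts)
    return f"{base_url.rstrip('/')}/api/v1/query_range?{urlencode(params)}"
```

Under these assumptions, the `1h` range resolves to a 3600-second window sampled every 30 seconds, and `5m` to a 300-second window sampled every 2 seconds.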

How to View Performance Metrics

Select Cluster

Choose a Ceph cluster from the cluster dropdown. Only ready (bootstrapped) clusters with Prometheus enabled will show data.

Select Time Range

Click a time range button (5m, 15m, 30m, 1h, etc.) to adjust the historical window.

Review Charts

The page displays multiple metric charts:

Storage Used: Current cluster utilization
Apply Latency: OSD operation apply time
Commit Latency: OSD commit time
IOPS: Combined read/write operations chart
Throughput: Combined read/write data rate chart
Recovery Throughput: Data recovery/rebalancing rate

Inspect Values

Hover over charts to see exact values at specific points
Current values are displayed in the chart headers
Charts auto-refresh every 30 seconds

How to Interpret the Charts

Understanding IOPS

The IOPS chart shows read (green) and write (purple) operations:

High IOPS: Indicates active workload on the cluster
Balanced Read/Write: Normal for most workloads
Write-Heavy: Expected for data ingestion workloads
Read-Heavy: Expected for serving applications

Sudden IOPS spikes may indicate application activity changes. Sustained high IOPS without corresponding client activity could indicate recovery or scrubbing operations.

Understanding Throughput

The Throughput chart shows data transfer rates:

Sustained Throughput: Normal client I/O patterns
Spikes: Large file transfers or backup operations
Zero Throughput: No active I/O (idle cluster)

Understanding Latency

Latency metrics indicate storage responsiveness:

Apply Latency: Time to apply operations in memory
Commit Latency: Time to persist operations to disk

Latency	Interpretation
< 5ms	Excellent - typically SSD/NVMe OSDs
5-20ms	Good - normal for HDDs under light load
20-50ms	Fair - HDDs under moderate load
> 50ms	Poor - possible bottleneck

High latency can impact application performance. Investigate if latency consistently exceeds 50ms for SSDs or 100ms for HDDs.
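
The latency bands in the table can be expressed as a small lookup. This is a sketch with a hypothetical function name. The table does not say which band the exact boundary values belong to, so the choices here (5 ms counts as Good, 20 ms and 50 ms count as Fair) are assumptions.

```python
def classify_latency(latency_ms: float) -> str:
    """Map a latency in milliseconds to the bands from the table above."""
    if latency_ms < 0:
        raise ValueError("latency cannot be negative")
    if latency_ms < 5:
        return "Excellent"
    if latency_ms < 20:
        return "Good"
    if latency_ms <= 50:  # assumption: exactly 50 ms still counts as Fair
        return "Fair"
    return "Poor"
```

A reading of 3.2 ms is classified as Excellent and 75 ms as Poor.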

Understanding Recovery Throughput

Recovery throughput appears when:

A failed OSD returns to service
New OSDs are added to the cluster
Placement groups are being rebalanced
CRUSH rules or pool settings have been changed

Recovery operations are normal and expected. High recovery throughput reduces capacity for client I/O but is necessary to restore data redundancy.

Chart Features

Live Indicator

A pulsing green dot indicates the page is receiving live data updates every 30 seconds.

Tooltips

Hover over any point on a chart to see:

Exact timestamp
Precise metric value
Appropriate unit formatting

Color Coding

Green: Read operations
Purple/Violet: Write operations
Cyan: Storage capacity
Amber/Yellow: Apply latency
Red: Commit latency
Orange: Recovery throughput

Troubleshooting

No data showing in charts

Verify the cluster is bootstrapped and ready
Check that Prometheus is deployed in the cluster
Ensure the node-exporter daemons are running
Wait a few minutes for initial data collection

Data gaps in charts

Prometheus may have been temporarily unavailable
Network issues between nodes may have interrupted metric collection
A monitor daemon may have restarted
Check that Prometheus storage has not run out of capacity

High latency values

Check OSD disk health and utilization
Verify no recovery operations are overwhelming OSDs
Check for slow OSD warnings in cluster health
Consider adding more OSDs to distribute load

Unexpected IOPS patterns

Review running workloads on the cluster
Check for scrubbing or deep-scrubbing operations
Verify no unexpected recovery is happening
Look for client application issues

Recovery throughput constantly high

OSDs may be flapping (repeatedly failing and recovering)
Check cluster health for failed OSDs
Verify network connectivity between nodes
Review OSD logs for recurring failures
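
One way to confirm flapping is to count how often an OSD's up/down state changes within a window. The sketch below assumes a series of 0/1 samples is available per OSD, for example from a gauge such as `ceph_osd_up` if your Prometheus setup exposes it. The function names and the threshold are hypothetical.

```python
from typing import Sequence


def count_down_transitions(states: Sequence[int]) -> int:
    """Count how many times the state goes from up (1) to down (0)."""
    return sum(1 for prev, curr in zip(states, states[1:]) if prev == 1 and curr == 0)


def is_flapping(states: Sequence[int], min_transitions: int = 2) -> bool:
    """Flag an OSD that went down at least `min_transitions` times.

    `min_transitions` is an illustrative threshold, not a Ceph default.
    """
    return count_down_transitions(states) >= min_transitions
```

For the series `[1, 1, 0, 1, 0, 0, 1, 1]`, the OSD went down twice, which this sketch would flag as flapping.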

Charts show 'No data available'

Prometheus may not be configured for the cluster
The cluster may not have completed bootstrap
Metrics may not be collected for this cluster
Try selecting a different time range

FAQ

How often is data updated?

Performance charts auto-refresh every 30 seconds. You can also click the refresh button to manually update all charts immediately.

What is the difference between Apply and Commit latency?

Apply Latency: Time for the OSD to process an operation and make it visible in memory. This is typically fast.
Commit Latency: Time for the OSD to durably write the operation to disk. This is affected by disk performance.

Both contribute to the overall operation latency experienced by clients.

Why is recovery throughput important?

Recovery throughput indicates:

Data is being restored after failures
Cluster is rebalancing after topology changes
PG migrations are in progress

High recovery throughput temporarily impacts client performance but is essential for maintaining data durability.

What causes latency spikes?

Common causes of latency spikes:

Disk I/O contention from heavy workloads
Recovery or rebalancing operations
Scrubbing or deep-scrubbing
Network congestion between OSDs
Failing or degraded disks

How do I know if IOPS is too high?

“Too high” depends on your hardware:

HDDs: typically 100-200 IOPS per drive
SSDs: typically 10,000-100,000 IOPS per drive
NVMe: can exceed 500,000 IOPS per drive

Monitor latency alongside IOPS. High IOPS with low latency is normal; high IOPS with high latency indicates saturation.
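
That rule of thumb can be written as a simple check. This is only a sketch: `looks_saturated` is a hypothetical helper, and both thresholds must come from your own hardware and workload, since the page defines no fixed cutoffs for them.

```python
def looks_saturated(iops: float, latency_ms: float,
                    iops_high: float, latency_high_ms: float) -> bool:
    """Return True only when IOPS and latency are both at or above their thresholds.

    High IOPS alone is treated as normal load; the thresholds are
    caller-supplied assumptions, not values defined by the platform.
    """
    return iops >= iops_high and latency_ms >= latency_high_ms
```

For instance, with thresholds of 1500 IOPS and 50 ms, a reading of 1800 IOPS at 4 ms is not flagged, while 1800 IOPS at 75 ms is.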

Why might read and write throughput differ significantly?

Asymmetric throughput is common:

Backup operations are write-heavy
Serving content is read-heavy
Data processing may favor one direction
Replication is write-heavy

The ratio depends on your workload characteristics.

Can I export this data?

The underlying Prometheus data can be accessed directly:

Use the Prometheus API for raw metrics
Configure Grafana dashboards for custom views
Set up alerting rules based on thresholds
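
When reading raw metrics through the Prometheus HTTP API, an instant query (`/api/v1/query`) returns JSON in which each sample is a `[timestamp, "value"]` pair with the value encoded as a string. The sketch below parses such a response body. The function name is hypothetical, and fetching the body from your own Prometheus endpoint is left out.

```python
import json
from typing import Dict, List, Tuple


def parse_instant_vector(body: str) -> List[Tuple[Dict[str, str], float, float]]:
    """Return (labels, timestamp, value) for each series in an instant-query response."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(payload.get("error", "query failed"))
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"unexpected result type: {data['resultType']}")
    rows = []
    for series in data["result"]:
        timestamp, raw_value = series["value"]
        # Sample values arrive as strings and need converting to float.
        rows.append((series["metric"], float(timestamp), float(raw_value)))
    return rows
```

An aggregated query such as `sum(rate(ceph_pool_rd[1m]))` typically returns a single series with an empty label set, so the parsed list holds one row.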

What time range should I use?

Choose based on your investigation needs:

5m-15m: Troubleshooting current issues
1h-3h: Understanding recent patterns
6h-24h: Analyzing daily trends and capacity planning
