The Performance page provides real-time visibility into your Ceph cluster’s performance through interactive charts powered by Prometheus metrics. Monitor storage usage, IOPS, throughput, and latency to understand cluster behavior and identify potential bottlenecks.

Key Concepts

IOPS

Input/Output Operations Per Second - measures the rate of read and write operations.

Throughput

Data transfer rate measured in bytes per second (B/s, MB/s, GB/s).

Latency

Time delay for operations, measured in milliseconds (ms). Lower is better.

Recovery Throughput

Rate of data being recovered or rebalanced across OSDs.

Required Permissions

| Action | Permission |
| --- | --- |
| View Performance Metrics | `iam:project:infrastructure:ceph:read` |
This page is read-only. Performance data is collected automatically by Prometheus from the Ceph cluster.

Available Metrics

Storage Used

Total storage capacity currently in use across the cluster.
| Metric | Description |
| --- | --- |
| Query | `ceph_cluster_total_used_bytes` |
| Unit | Bytes (displayed as KB, MB, GB, TB) |
| Update | Every 30 seconds |
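
If you query Prometheus directly, the same metric also supports simple capacity forecasting. A minimal sketch (the growth window and horizon are illustrative; adjust them to your workload):

```promql
# Storage currently in use (the query behind this chart)
ceph_cluster_total_used_bytes

# Rough forecast of used bytes one week from now, extrapolated from the
# last 6 hours of growth (604800 seconds = 7 days)
predict_linear(ceph_cluster_total_used_bytes[6h], 604800)
```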

IOPS (Input/Output Operations Per Second)

| Metric | Description |
| --- | --- |
| Read IOPS | Rate of read operations across all pools |
| Write IOPS | Rate of write operations across all pools |
| Query (Read) | `sum(rate(ceph_pool_rd[1m]))` |
| Query (Write) | `sum(rate(ceph_pool_wr[1m]))` |
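
The page sums IOPS across all pools. If you need to see which pool drives the load, the same metrics can be grouped per pool in the Prometheus expression browser or Grafana. A sketch, assuming the exporter attaches a `pool_id` label to `ceph_pool_rd` and `ceph_pool_wr` (verify the label name in your deployment):

```promql
# Read IOPS per pool (the pool_id label is an assumption; check your exporter)
sum by (pool_id) (rate(ceph_pool_rd[1m]))

# Top 5 pools by combined read + write IOPS
topk(5, sum by (pool_id) (rate(ceph_pool_rd[1m]) + rate(ceph_pool_wr[1m])))
```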

Throughput

| Metric | Description |
| --- | --- |
| Read Throughput | Data read rate in bytes per second |
| Write Throughput | Data write rate in bytes per second |
| Query (Read) | `sum(rate(ceph_pool_rd_bytes[1m]))` |
| Query (Write) | `sum(rate(ceph_pool_wr_bytes[1m]))` |
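
Because the read and write queries share the same shape, they can be combined into a single total-throughput expression, for example for a custom Grafana panel or a quick check in the Prometheus expression browser:

```promql
# Total cluster throughput (read + write), in bytes per second
sum(rate(ceph_pool_rd_bytes[1m])) + sum(rate(ceph_pool_wr_bytes[1m]))
```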

Latency

| Metric | Description |
| --- | --- |
| Apply Latency | Average time for OSDs to apply operations |
| Commit Latency | Average time for OSDs to commit operations to disk |
| Query (Apply) | `avg(avg_over_time(ceph_osd_apply_latency_ms[1m]))` |
| Query (Commit) | `avg(avg_over_time(ceph_osd_commit_latency_ms[1m]))` |
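
The charts average latency across all OSDs, so a single slow disk can hide behind a healthy-looking average. Querying the per-OSD series directly in Prometheus can surface outliers; a minimal sketch:

```promql
# The five OSDs with the highest commit latency over the last minute
topk(5, avg_over_time(ceph_osd_commit_latency_ms[1m]))

# Gap between the slowest OSD and the cluster average; a large gap points
# to one misbehaving disk rather than cluster-wide load
max(avg_over_time(ceph_osd_commit_latency_ms[1m]))
  - avg(avg_over_time(ceph_osd_commit_latency_ms[1m]))
```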

Recovery Throughput

Rate of data being recovered or rebalanced between OSDs.
| Metric | Description |
| --- | --- |
| Query | `sum(rate(ceph_osd_recovery_bytes[1m]))` |
| Unit | Bytes per second |
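
Recovery activity is bursty, so the 1-minute rate can look spiky. When judging whether recovery is trending down, the same query over a wider window gives a smoother view:

```promql
# Recovery throughput averaged over 5 minutes instead of 1 minute
sum(rate(ceph_osd_recovery_bytes[5m]))
```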

Time Ranges

Select a time window to view historical performance data:
| Range | Description |
| --- | --- |
| 5m | Last 5 minutes - high resolution for immediate issues |
| 15m | Last 15 minutes - short-term trend analysis |
| 30m | Last 30 minutes - recent activity overview |
| 1h | Last hour - operational view |
| 3h | Last 3 hours - extended operations |
| 6h | Last 6 hours - half-day analysis |
| 12h | Last 12 hours - business hours view |
| 24h | Last 24 hours - daily pattern analysis |

How to View Performance Metrics

1. Select Cluster

   Choose a Ceph cluster from the cluster dropdown. Only ready (bootstrapped) clusters with Prometheus enabled will show data.

2. Select Time Range

   Click a time range button (5m, 15m, 30m, 1h, etc.) to adjust the historical window.

3. Review Charts

   The page displays multiple metric charts:
   • Storage Used: Current cluster utilization
   • Apply Latency: OSD operation apply time
   • Commit Latency: OSD commit time
   • IOPS: Combined read/write operations chart
   • Throughput: Combined read/write data rate chart
   • Recovery Throughput: Data recovery/rebalancing rate

4. Inspect Values

   • Hover over charts to see exact values at specific points
   • Current values are displayed in the chart headers
   • Charts auto-refresh every 30 seconds

How to Interpret the Charts

Understanding IOPS

The IOPS chart shows read (green) and write (purple) operations:
  • High IOPS: Indicates active workload on the cluster
  • Balanced Read/Write: Normal for most workloads
  • Write-Heavy: Expected for data ingestion workloads
  • Read-Heavy: Expected for serving applications
Sudden IOPS spikes may indicate application activity changes. Sustained high IOPS without corresponding client activity could indicate recovery or scrubbing operations.

Understanding Throughput

The Throughput chart shows data transfer rates:
  • Sustained Throughput: Normal client I/O patterns
  • Spikes: Large file transfers or backup operations
  • Zero Throughput: No active I/O (idle cluster)

Understanding Latency

Latency metrics indicate storage responsiveness:
  • Apply Latency: Time to apply operations in memory
  • Commit Latency: Time to persist operations to disk
| Latency | Interpretation |
| --- | --- |
| < 5ms | Excellent - typically SSD/NVMe OSDs |
| 5-20ms | Good - normal for HDDs under light load |
| 20-50ms | Fair - HDDs under moderate load |
| > 50ms | Poor - possible bottleneck |
High latency can impact application performance. Investigate if latency consistently exceeds 50ms for SSDs or 100ms for HDDs.
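
If you run your own Prometheus alerting, the guidance above translates directly into an alert expression. A sketch using the page's commit-latency query (the threshold and window are illustrative; tune them to your hardware):

```promql
# Returns a result only while average OSD commit latency exceeds 50 ms
avg(avg_over_time(ceph_osd_commit_latency_ms[1m])) > 50
```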

Understanding Recovery Throughput

Recovery throughput appears when:
  • A failed OSD returns to service
  • New OSDs are added to the cluster
  • Placement groups are being rebalanced
  • After changing CRUSH rules or pool settings
Recovery operations are normal and expected. High recovery throughput reduces capacity for client I/O but is necessary to restore data redundancy.

Chart Features

Live Indicator

A pulsing green dot indicates the page is receiving live data updates every 30 seconds.

Tooltips

Hover over any point on a chart to see:
  • Exact timestamp
  • Precise metric value
  • Appropriate unit formatting

Color Coding

  • Green: Read operations
  • Purple/Violet: Write operations
  • Cyan: Storage capacity
  • Amber/Yellow: Apply latency
  • Red: Commit latency
  • Orange: Recovery throughput

Troubleshooting

No metrics are displayed
  • Verify the cluster is bootstrapped and ready
  • Check that Prometheus is deployed in the cluster
  • Ensure the node-exporter daemons are running
  • Wait a few minutes for initial data collection

Charts show gaps in the data
  • Prometheus may have been temporarily unavailable
  • Network issues between nodes
  • Monitor daemon restart
  • Check Prometheus storage capacity

Latency is consistently high
  • Check OSD disk health and utilization
  • Verify no recovery operations are overwhelming OSDs
  • Check for slow OSD warnings in cluster health
  • Consider adding more OSDs to distribute load

IOPS are higher than expected
  • Review running workloads on the cluster
  • Check for scrubbing or deep-scrubbing operations
  • Verify no unexpected recovery is happening
  • Look for client application issues

Recovery throughput never settles
  • OSDs may be flapping (repeatedly failing and recovering)
  • Check cluster health for failed OSDs
  • Verify network connectivity between nodes
  • Review OSD logs for recurring failures

A selected cluster shows no data
  • Prometheus may not be configured for the cluster
  • The cluster may not have completed bootstrap
  • Metrics may not be collected for this cluster
  • Try selecting a different time range

FAQ

How often do the charts refresh?
Performance charts auto-refresh every 30 seconds. You can also click the refresh button to manually update all charts immediately.

What is the difference between apply latency and commit latency?
  • Apply Latency: Time for the OSD to process an operation and make it visible in memory. This is typically fast.
  • Commit Latency: Time for the OSD to durably write the operation to disk. This is affected by disk performance.
Both contribute to the overall operation latency experienced by clients.

What does it mean when recovery throughput appears?
Recovery throughput indicates:
  • Data is being restored after failures
  • Cluster is rebalancing after topology changes
  • PG migrations are in progress
High recovery throughput temporarily impacts client performance but is essential for maintaining data durability.

What causes latency spikes?
Common causes of latency spikes:
  • Disk I/O contention from heavy workloads
  • Recovery or rebalancing operations
  • Scrubbing or deep-scrubbing
  • Network congestion between OSDs
  • Failing or degraded disks

How do I know if IOPS are too high?
“Too high” depends on your hardware:
  • HDDs: typically 100-200 IOPS per drive
  • SSDs: typically 10,000-100,000 IOPS per drive
  • NVMe: can exceed 500,000 IOPS per drive
Monitor latency alongside IOPS. High IOPS with low latency is normal; high IOPS with high latency indicates saturation.
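
As a concrete way to spot saturation, the IOPS and latency queries used by this page can be combined so the expression only returns a result when both are elevated. The 10000 IOPS and 50 ms thresholds below are placeholders; pick values that match your hardware:

```promql
# Non-empty only when total IOPS is high AND commit latency is high
(
  sum(rate(ceph_pool_rd[1m])) + sum(rate(ceph_pool_wr[1m])) > 10000
)
and
(
  avg(avg_over_time(ceph_osd_commit_latency_ms[1m])) > 50
)
```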

Why are read and write throughput different?
Asymmetric throughput is common:
  • Backup operations are write-heavy
  • Serving content is read-heavy
  • Data processing may favor one direction
  • Replication is write-heavy
The ratio depends on your workload characteristics.

Can I access the metrics outside this page?
The underlying Prometheus data can be accessed directly:
  • Use the Prometheus API for raw metrics
  • Configure Grafana dashboards for custom views
  • Set up alerting rules based on thresholds
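
As an example of an alerting rule expression, a simple capacity alert can be built from the Storage Used metric shown on this page. The sketch below assumes the exporter also exposes a `ceph_cluster_total_bytes` gauge for total capacity; confirm the metric name in your Prometheus before relying on it:

```promql
# Fires while the cluster is more than 85% full
# (ceph_cluster_total_bytes is assumed; verify it exists in your deployment)
ceph_cluster_total_used_bytes / ceph_cluster_total_bytes > 0.85
```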

Which time range should I use?
Choose based on your investigation needs:
  • 5m-15m: Troubleshooting current issues
  • 1h-3h: Understanding recent patterns
  • 6h-24h: Analyzing daily trends and capacity planning