Key Concepts
IOPS
Input/Output Operations Per Second - measures the rate of read and write operations.
Throughput
Data transfer rate measured in bytes per second (B/s, MB/s, GB/s).
Latency
Time delay for operations, measured in milliseconds (ms). Lower is better.
Recovery Throughput
Rate of data being recovered or rebalanced across OSDs.
Required Permissions
| Action | Permission |
|---|---|
| View Performance Metrics | iam:project:infrastructure:ceph:read |
This page is read-only. Performance data is collected automatically by Prometheus from the Ceph cluster.
Available Metrics
Storage Used
Total storage capacity currently in use across the cluster.
| Property | Value |
|---|---|
| Query | ceph_cluster_total_used_bytes |
| Unit | Bytes (displayed as KB, MB, GB, TB) |
| Update | Every 30 seconds |
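The unit row above says raw byte counts are displayed as KB, MB, GB, or TB. A minimal sketch of that kind of scaling follows. The function name `format_bytes` is hypothetical, and the 1024-based step and two-decimal rounding are assumptions, since the page does not state which convention it uses.

```python
def format_bytes(num_bytes: float, base: int = 1024) -> str:
    """Render a byte count using the largest unit that keeps the value readable.

    Assumption: binary steps of 1024. Pass base=1000 for decimal units.
    """
    units = ["B", "KB", "MB", "GB", "TB"]
    value = float(num_bytes)
    # Divide down until the value fits the current unit; TB is the last resort.
    for unit in units[:-1]:
        if abs(value) < base:
            return f"{value:.2f} {unit}"
        value /= base
    return f"{value:.2f} {units[-1]}"


if __name__ == "__main__":
    print(format_bytes(1536))
    print(format_bytes(3 * 1024**4))
```

Values beyond the TB range stay in TB rather than switching to a larger unit, which mirrors the four units listed in the table.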
IOPS (Input/Output Operations Per Second)
| Metric | Description | Query |
|---|---|---|
| Read IOPS | Rate of read operations across all pools | sum(rate(ceph_pool_rd[1m])) |
| Write IOPS | Rate of write operations across all pools | sum(rate(ceph_pool_wr[1m])) |
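The IOPS queries above wrap per-pool operation counters in rate() and then sum the result across pools. The sketch below illustrates that idea on plain Python lists. It is a simplification, not Prometheus's implementation: the real rate() also extrapolates to the window boundaries and returns no result when fewer than two samples exist, whereas this sketch returns 0.0. The names `simple_rate` and `total_iops` are hypothetical.

```python
from typing import Dict, List, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, counter value)


def simple_rate(samples: List[Sample]) -> float:
    """Approximate per-second increase of a counter over the given samples."""
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # A drop means the counter was reset (for example after a restart),
        # so the new value itself is counted as the increase.
        increase += (curr - prev) if curr >= prev else curr
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


def total_iops(per_pool_samples: Dict[str, List[Sample]]) -> float:
    """Sum the per-pool rates, mirroring sum(rate(...)) across all pools."""
    return sum(simple_rate(s) for s in per_pool_samples.values())


if __name__ == "__main__":
    pools = {
        "pool-a": [(0, 100), (30, 400), (60, 700)],
        "pool-b": [(0, 0), (30, 150), (60, 300)],
    }
    print(total_iops(pools))
```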
Throughput
| Metric | Description | Query |
|---|---|---|
| Read Throughput | Data read rate in bytes per second | sum(rate(ceph_pool_rd_bytes[1m])) |
| Write Throughput | Data write rate in bytes per second | sum(rate(ceph_pool_wr_bytes[1m])) |
Latency
| Metric | Description | Query |
|---|---|---|
| Apply Latency | Average time for OSDs to apply operations | avg(avg_over_time(ceph_osd_apply_latency_ms[1m])) |
| Commit Latency | Average time for OSDs to commit operations to disk | avg(avg_over_time(ceph_osd_commit_latency_ms[1m])) |
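The latency queries above average twice: avg_over_time() averages each OSD's samples within the window, and the outer avg() averages those per-OSD values. A small sketch of that two-step calculation is shown below, using hypothetical OSD names and sample values. The function name `avg_of_avg_over_time` is an illustration only.

```python
from statistics import mean
from typing import Dict, List, Optional


def avg_of_avg_over_time(per_osd_samples: Dict[str, List[float]]) -> Optional[float]:
    """Average each OSD's samples first, then average across OSDs."""
    # OSDs with no samples in the window are skipped.
    per_osd_means = [mean(values) for values in per_osd_samples.values() if values]
    return mean(per_osd_means) if per_osd_means else None


if __name__ == "__main__":
    samples = {
        "osd.0": [2.0, 4.0],            # hypothetical latencies in ms
        "osd.1": [10.0, 10.0, 10.0],
    }
    print(avg_of_avg_over_time(samples))
```

Because every OSD contributes equally to the final value, a single slow OSD in a large cluster is diluted by the healthy ones, so a normal-looking average does not rule out one degraded disk.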
Recovery Throughput
Rate of data being recovered or rebalanced between OSDs.
| Property | Value |
|---|---|
| Query | sum(rate(ceph_osd_recovery_bytes[1m])) |
| Unit | Bytes per second |
Time Ranges
Select a time window to view historical performance data:
| Range | Description |
|---|---|
| 5m | Last 5 minutes - high resolution for immediate issues |
| 15m | Last 15 minutes - short-term trend analysis |
| 30m | Last 30 minutes - recent activity overview |
| 1h | Last hour - operational view |
| 3h | Last 3 hours - extended operations |
| 6h | Last 6 hours - half-day analysis |
| 12h | Last 12 hours - business hours view |
| 24h | Last 24 hours - daily pattern analysis |
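Each range label above corresponds to a start time, an end time, and a sampling step when querying Prometheus over a window. The sketch below shows one way to derive those values. The cap of 300 points per chart and the 30-second minimum step (chosen to match the update interval stated on this page) are assumptions, and `query_window` is a hypothetical helper, not the page's actual logic.

```python
import math

# Range labels from the table above, converted to seconds.
RANGE_SECONDS = {
    "5m": 5 * 60,
    "15m": 15 * 60,
    "30m": 30 * 60,
    "1h": 60 * 60,
    "3h": 3 * 60 * 60,
    "6h": 6 * 60 * 60,
    "12h": 12 * 60 * 60,
    "24h": 24 * 60 * 60,
}


def query_window(range_label: str, now: float,
                 max_points: int = 300, min_step: int = 30) -> dict:
    """Return start, end, and step (seconds) for a range label.

    Assumption: at most max_points samples per chart and never a step
    finer than min_step seconds.
    """
    duration = RANGE_SECONDS[range_label]
    step = max(min_step, math.ceil(duration / max_points))
    return {"start": now - duration, "end": now, "step": step}


if __name__ == "__main__":
    import time
    print(query_window("1h", time.time()))
```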
How to View Performance Metrics
Select Cluster
Choose a Ceph cluster from the cluster dropdown. Only ready (bootstrapped) clusters with Prometheus enabled will show data.
Select Time Range
Click a time range button (5m, 15m, 30m, 1h, etc.) to adjust the historical window.
Review Charts
The page displays multiple metric charts:
- Storage Used: Current cluster utilization
- Apply Latency: OSD operation apply time
- Commit Latency: OSD commit time
- IOPS: Combined read/write operations chart
- Throughput: Combined read/write data rate chart
- Recovery Throughput: Data recovery/rebalancing rate
How to Interpret the Charts
Understanding IOPS
The IOPS chart shows read (green) and write (purple) operations:
- High IOPS: Indicates active workload on the cluster
- Balanced Read/Write: Normal for most workloads
- Write-Heavy: Expected for data ingestion workloads
- Read-Heavy: Expected for serving applications
Understanding Throughput
The Throughput chart shows data transfer rates:
- Sustained Throughput: Normal client I/O patterns
- Spikes: Large file transfers or backup operations
- Zero Throughput: No active I/O (idle cluster)
Understanding Latency
Latency metrics indicate storage responsiveness:
- Apply Latency: Time to apply operations in memory
- Commit Latency: Time to persist operations to disk
| Latency | Interpretation |
|---|---|
| < 5ms | Excellent - typically SSD/NVMe OSDs |
| 5-20ms | Good - normal for HDDs under light load |
| 20-50ms | Fair - HDDs under moderate load |
| > 50ms | Poor - possible bottleneck |
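The bands in the table above can be expressed as a small lookup, which is handy when scripting checks against exported latency values. The table leaves the exact boundaries ambiguous (20 ms appears in two rows), so this sketch assumes 20 ms counts as Good and 50 ms counts as Fair. `classify_latency` is a hypothetical name.

```python
def classify_latency(latency_ms: float) -> str:
    """Map a latency in milliseconds to the interpretation bands above.

    Assumption: boundary values fall into the lower (better) band,
    so 20 ms is "Good" and 50 ms is "Fair".
    """
    if latency_ms < 0:
        raise ValueError("latency cannot be negative")
    if latency_ms < 5:
        return "Excellent"
    if latency_ms <= 20:
        return "Good"
    if latency_ms <= 50:
        return "Fair"
    return "Poor"


if __name__ == "__main__":
    for value in (1.2, 12.0, 35.0, 80.0):
        print(value, classify_latency(value))
```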
Understanding Recovery Throughput
Recovery throughput appears when:
- A failed OSD returns to service
- New OSDs are added to the cluster
- Placement groups are being rebalanced
- After changing CRUSH rules or pool settings
Recovery operations are normal and expected. High recovery throughput reduces capacity for client I/O but is necessary to restore data redundancy.
Chart Features
Live Indicator
A pulsing green dot indicates the page is receiving live data updates every 30 seconds.
Tooltips
Hover over any point on a chart to see:
- Exact timestamp
- Precise metric value
- Appropriate unit formatting
Color Coding
- Green: Read operations
- Purple/Violet: Write operations
- Cyan: Storage capacity
- Amber/Yellow: Apply latency
- Red: Commit latency
- Orange: Recovery throughput
Troubleshooting
No data showing in charts
- Verify the cluster is bootstrapped and ready
- Check that Prometheus is deployed in the cluster
- Ensure the node-exporter daemons are running
- Wait a few minutes for initial data collection
Data gaps in charts
- Prometheus may have been temporarily unavailable
- Network issues between nodes
- Monitor daemon restart
- Check Prometheus storage capacity
High latency values
- Check OSD disk health and utilization
- Verify no recovery operations are overwhelming OSDs
- Check for slow OSD warnings in cluster health
- Consider adding more OSDs to distribute load
Unexpected IOPS patterns
- Review running workloads on the cluster
- Check for scrubbing or deep-scrubbing operations
- Verify no unexpected recovery is happening
- Look for client application issues
Recovery throughput constantly high
- OSDs may be flapping (repeatedly failing and recovering)
- Check cluster health for failed OSDs
- Verify network connectivity between nodes
- Review OSD logs for recurring failures
Charts show 'No data available'
- Prometheus may not be configured for the cluster
- The cluster may not have completed bootstrap
- Metrics may not be collected for this cluster
- Try selecting a different time range
FAQ
How often is data updated?
Performance charts auto-refresh every 30 seconds. You can also click the refresh button to manually update all charts immediately.
What is the difference between Apply and Commit latency?
Apply Latency: Time for the OSD to process an operation and make it visible in memory. This is typically fast.
Commit Latency: Time for the OSD to durably write the operation to disk. This is affected by disk performance.
Both contribute to overall operation latency experienced by clients.
Why is recovery throughput important?
Recovery throughput indicates:
- Data is being restored after failures
- Cluster is rebalancing after topology changes
- PG migrations are in progress
What causes latency spikes?
Common causes of latency spikes:
- Disk I/O contention from heavy workloads
- Recovery or rebalancing operations
- Scrubbing or deep-scrubbing
- Network congestion between OSDs
- Failing or degraded disks
How do I know if IOPS is too high?
“Too high” depends on your hardware. Compare the IOPS shown on the chart with the combined capability of the drives backing your OSDs:
- HDDs: typically 100-200 IOPS per drive
- SSDs: typically 10,000-100,000 IOPS per drive
- NVMe: can exceed 500,000 IOPS per drive
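The per-drive figures above can be turned into a rough cluster-wide estimate for comparison with the IOPS chart. The sketch below is a back-of-the-envelope illustration only: it uses the typical ranges from the list, treats NVMe as a single 500,000 reference point, and ignores replication overhead, CPU, and network limits. The names `estimated_iops_range` and `load_fraction` are hypothetical.

```python
from typing import Dict, Tuple

# Typical per-drive IOPS (low, high) taken from the list above.
# Assumption: NVMe is represented by one reference value, not a range.
TYPICAL_IOPS_PER_DRIVE: Dict[str, Tuple[int, int]] = {
    "hdd": (100, 200),
    "ssd": (10_000, 100_000),
    "nvme": (500_000, 500_000),
}


def estimated_iops_range(drive_counts: Dict[str, int]) -> Tuple[int, int]:
    """Sum the low and high per-drive figures over all drives."""
    low = sum(TYPICAL_IOPS_PER_DRIVE[kind][0] * n for kind, n in drive_counts.items())
    high = sum(TYPICAL_IOPS_PER_DRIVE[kind][1] * n for kind, n in drive_counts.items())
    return low, high


def load_fraction(observed_iops: float, drive_counts: Dict[str, int]) -> float:
    """Observed IOPS as a fraction of the conservative (low) estimate."""
    low, _ = estimated_iops_range(drive_counts)
    if low == 0:
        raise ValueError("no drives given")
    return observed_iops / low


if __name__ == "__main__":
    drives = {"hdd": 12}
    print(estimated_iops_range(drives))
    print(load_fraction(600, drives))
```

A fraction approaching or exceeding 1.0 against the conservative estimate suggests the workload is near what the drives can typically sustain, which is a prompt to check latency rather than a hard limit.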
Why might read and write throughput differ significantly?
Asymmetric throughput is common:
- Backup operations are write-heavy
- Serving content is read-heavy
- Data processing may favor one direction
- Replication is write-heavy
Can I export this data?
The underlying Prometheus data can be accessed directly:
- Use the Prometheus API for raw metrics
- Configure Grafana dashboards for custom views
- Set up alerting rules based on thresholds
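As an illustration of the Prometheus API option, the sketch below builds a range-query URL for the standard `/api/v1/query_range` endpoint and parses the JSON matrix it returns. The base URL `http://prometheus.example:9090` is a placeholder, since this page does not say where the cluster's Prometheus is reachable or whether authentication is required, and the helper names are hypothetical.

```python
import json
from typing import Dict, List, Tuple
from urllib.parse import urlencode


def build_query_range_url(base_url: str, query: str,
                          start: float, end: float, step: int) -> str:
    """Build a Prometheus range-query URL for the given PromQL expression."""
    params = urlencode({"query": query, "start": start, "end": end, "step": step})
    return f"{base_url.rstrip('/')}/api/v1/query_range?{params}"


def parse_matrix(body: str) -> List[Tuple[Dict[str, str], List[Tuple[float, float]]]]:
    """Turn a query_range JSON response into (labels, [(timestamp, value)]) pairs."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    series = []
    for item in payload["data"]["result"]:
        # Prometheus encodes sample values as strings; convert them to floats.
        points = [(float(ts), float(val)) for ts, val in item["values"]]
        series.append((item["metric"], points))
    return series


if __name__ == "__main__":
    # Placeholder host; replace with the address of your Prometheus instance.
    url = build_query_range_url("http://prometheus.example:9090",
                                "sum(rate(ceph_pool_rd[1m]))", 1700000000, 1700003600, 30)
    print(url)
```

The resulting URL can be fetched with any HTTP client, for example `urllib.request.urlopen`, and the response body passed to `parse_matrix`.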
What time range should I use?
Choose based on your investigation needs:
- 5m-15m: Troubleshooting current issues
- 1h-3h: Understanding recent patterns
- 6h-24h: Analyzing daily trends and capacity planning