# Performance and Scalability

This section details Khedra's performance characteristics, optimization strategies, and scalability considerations.

## Performance Tuning Guide

### Optimal Configuration for Different Use Cases

#### Light Usage (Personal Development/Testing)
```yaml
services:
  scraper:
    enabled: true
    sleep: 30       # Longer sleep for less aggressive indexing
    batchSize: 100  # Smaller batches to reduce memory usage
  monitor:
    enabled: false  # Disable if not needed
    sleep: 60
    batchSize: 50
  api:
    enabled: true
    port: 8080      # Standard configuration
  ipfs:
    enabled: false  # Disable to reduce resource usage
```

**Expected Performance:** 5-10 blocks/sec indexing, minimal resource usage
#### Standard Usage (Regular Development/Analysis)

```yaml
services:
  scraper:
    enabled: true
    sleep: 12       # Balanced sleep interval
    batchSize: 500  # Default batch size
  monitor:
    enabled: true   # Enable for address tracking
    sleep: 12
    batchSize: 100
  api:
    enabled: true
    port: 8080
  ipfs:
    enabled: true   # Enable for collaboration
    port: 8083
```

**Expected Performance:** 15-25 blocks/sec indexing, moderate resource usage
#### High-Performance Usage (Production/Heavy Analysis)

```yaml
services:
  scraper:
    enabled: true
    sleep: 5         # Aggressive indexing
    batchSize: 2000  # Large batches for efficiency
  monitor:
    enabled: true
    sleep: 5         # Fast monitoring
    batchSize: 500
  api:
    enabled: true
    port: 8080
  ipfs:
    enabled: true
    port: 8083
```

**Expected Performance:** 25-40 blocks/sec indexing, high resource usage
### Batch Size Optimization Guidelines

#### Factors Affecting Optimal Batch Size
1. **RPC Endpoint Performance:**
   - Fast/unlimited RPC: 1000-5000 blocks
   - Standard RPC: 500-1000 blocks
   - Rate-limited RPC: 50-200 blocks

2. **Available Memory:**
   - 8GB+ RAM: 1000-2000 blocks
   - 4-8GB RAM: 500-1000 blocks
   - <4GB RAM: 100-500 blocks

3. **Network Latency:**
   - Local RPC node: 2000-5000 blocks
   - Same region: 1000-2000 blocks
   - Remote/high latency: 100-500 blocks
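As a rough illustration, the following sketch turns the memory tiers above into a starting `batchSize`. It assumes a Linux system with `free` available; the chosen values are starting points drawn from the ranges above, not Khedra defaults:

```bash
#!/bin/bash
# Illustrative: pick a starting batchSize from available RAM, following
# the memory tiers above. Tune from here using the process below.
AVAIL_GB=$(free -g | awk '/^Mem:/ {print $7}')   # "available" column, in GB

if (( AVAIL_GB >= 8 )); then
  BATCH=1000   # 8GB+ RAM: 1000-2000 blocks
elif (( AVAIL_GB >= 4 )); then
  BATCH=500    # 4-8GB RAM: 500-1000 blocks
else
  BATCH=100    # <4GB RAM: 100-500 blocks
fi

echo "Suggested starting batchSize: $BATCH"
```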
#### Batch Size Tuning Process

1. Start with defaults (500 blocks)
2. Monitor performance metrics
3. Adjust based on bottlenecks:
   - If RPC timeouts: decrease batch size
   - If memory issues: decrease batch size
   - If CPU idle time: increase batch size
   - If overall progress is slow: increase batch size
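To make step 2 concrete, a small script can sample the scraper's throughput before and after each adjustment. This is a minimal sketch, assuming the metrics endpoint described later in this section and `jq` and `bc` installed:

```bash
#!/bin/bash
# Average blocks_per_second over several samples so that batch-size
# changes can be compared against a measured baseline.
METRICS_URL="http://localhost:8080/api/v1/services/scraper/metrics"
SAMPLES=6
SUM=0

for _ in $(seq "$SAMPLES"); do
  BPS=$(curl -s "$METRICS_URL" | jq -r '.blocks_per_second // 0')
  SUM=$(echo "$SUM + $BPS" | bc -l)
  sleep 10
done

echo "Average blocks/sec over $SAMPLES samples: $(echo "$SUM / $SAMPLES" | bc -l)"
```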
### Sleep Interval Recommendations

#### Based on System Resources

**High-End Systems (8+ cores, 16GB+ RAM):**

- Scraper: 5-10 seconds
- Monitor: 5-10 seconds

**Mid-Range Systems (4-8 cores, 8-16GB RAM):**

- Scraper: 10-15 seconds
- Monitor: 10-15 seconds

**Resource-Constrained Systems (<4 cores, <8GB RAM):**

- Scraper: 20-30 seconds
- Monitor: 30-60 seconds

#### Based on RPC Provider

**Unlimited/Premium RPC:**

- Scraper: 5-10 seconds
- Monitor: 5-10 seconds

**Standard RPC with rate limits:**

- Scraper: 15-30 seconds
- Monitor: 30-60 seconds

**Free/heavily limited RPC:**

- Scraper: 60-120 seconds
- Monitor: 120-300 seconds
### RPC Endpoint Optimization

#### Choosing RPC Providers

**Recommended for High Performance:**

- Local RPC node (best performance)
- Premium providers (Alchemy, Infura Pro)
- Archive nodes with trace support

**Configuration for Multiple RPC Endpoints:**
```yaml
chains:
  mainnet:
    rpcs:
      - "https://eth-mainnet.alchemyapi.io/v2/YOUR_KEY"
      - "https://mainnet.infura.io/v3/YOUR_KEY"
      - "https://rpc.ankr.com/eth"
    enabled: true
```
#### RPC Performance Testing

```bash
# Test RPC response time
time curl -X POST -H "Content-Type: application/json" \
  --data '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}' \
  https://your-rpc-endpoint

# Test batch request performance
time curl -X POST -H "Content-Type: application/json" \
  --data '[{"jsonrpc":"2.0","method":"eth_getBlockByNumber","params":["0x1000000",false],"id":1}]' \
  https://your-rpc-endpoint
```
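A single request can be noisy, so a short loop gives a steadier estimate. A sketch, assuming `bc` is installed and your own endpoint substituted for the placeholder:

```bash
#!/bin/bash
# Average the latency of 10 eth_blockNumber calls using curl's
# %{time_total} timer (reported in seconds).
RPC="https://your-rpc-endpoint"
TOTAL=0

for _ in $(seq 10); do
  T=$(curl -s -o /dev/null -w '%{time_total}' -X POST \
    -H "Content-Type: application/json" \
    --data '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}' \
    "$RPC")
  TOTAL=$(echo "$TOTAL + $T" | bc -l)
done

echo "Average response time: $(echo "$TOTAL / 10" | bc -l) seconds"
```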
### System Resource Monitoring

#### Key Metrics to Monitor

1. **CPU Usage:**

   ```bash
   top -p $(pgrep khedra)
   ```

2. **Memory Usage:**

   ```bash
   ps -o pid,vsz,rss,comm -p $(pgrep khedra)
   ```

3. **Disk I/O:**

   ```bash
   iotop -p $(pgrep khedra)
   ```

4. **Network Usage:**

   ```bash
   nethogs   # lists per-process network usage; find khedra in the output
   ```
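For longer observation windows, the one-shot commands above can be wrapped in a sampler. A sketch, assuming a single `khedra` process and GNU `ps`/`date`:

```bash
#!/bin/bash
# Append one CPU/RSS sample per interval for the khedra process to a CSV.
OUT=khedra-usage.csv
echo "timestamp,cpu_percent,rss_kb" > "$OUT"

while true; do
  PID=$(pgrep -o khedra) || { echo "khedra not running" >&2; exit 1; }
  ps -o %cpu=,rss= -p "$PID" |
    awk -v ts="$(date -Is)" '{printf "%s,%s,%s\n", ts, $1, $2}' >> "$OUT"
  sleep 30
done
```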
#### Performance Thresholds

**CPU Usage:**

- <50%: Can increase batch size or decrease sleep
- 50-80%: Optimal range
- >80%: Decrease batch size or increase sleep

**Memory Usage:**

- <2GB: Can increase batch size
- 2-4GB: Monitor for memory leaks
- >4GB: Decrease batch size

**Disk I/O:**

- High read: Index is being queried heavily (normal when serving requests)
- High write: Indexing in progress (normal)
- Very high write: May need to reduce batch size
## Scaling Considerations

### Horizontal Scaling Strategies

1. **Chain Separation:**
   - Run separate Khedra instances per blockchain
   - Distribute chains across multiple servers
   - Use a load balancer for API access

2. **Service Separation:**
   - Run the API service on separate instances
   - Dedicated IPFS nodes for data sharing
   - Centralized monitoring service

3. **Geographic Distribution:**
   - Deploy close to RPC providers
   - Regional API instances for lower latency
   - IPFS network for global data sharing
### Vertical Scaling Guidelines

**Memory Scaling:**

- 8GB: Single chain, moderate usage
- 16GB: Multiple chains or heavy usage
- 32GB+: High-performance production usage

**CPU Scaling:**

- 4 cores: Basic usage
- 8 cores: Standard production
- 16+ cores: High-performance or multiple chains

**Storage Scaling:**

- SSD required for optimal performance
- Roughly 100GB per chain per year (estimate)
- Consider compression and archival strategies
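To check the 100GB-per-chain-per-year estimate against a real deployment, disk growth can be sampled over time. A sketch; `INDEX_DIR` is a placeholder you must point at your actual index location, not a documented Khedra default:

```bash
#!/bin/bash
# Log the index size; run periodically (e.g., weekly) and compare entries
# to estimate annual growth.
INDEX_DIR="${INDEX_DIR:?set INDEX_DIR to your index directory}"
LOG=index-size.log

SIZE_KB=$(du -sk "$INDEX_DIR" | awk '{print $1}')
echo "$(date -Is) $SIZE_KB" >> "$LOG"

# With two or more entries, report growth between first and last samples.
if (( $(wc -l < "$LOG") >= 2 )); then
  read -r FIRST_TS FIRST_KB < <(head -1 "$LOG")
  read -r LAST_TS LAST_KB < <(tail -1 "$LOG")
  echo "Growth: $(( (LAST_KB - FIRST_KB) / 1024 )) MB between $FIRST_TS and $LAST_TS"
fi
```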
## Service Metrics and Monitoring

### Available Performance Metrics

Each Khedra service exposes performance metrics that can be accessed through the Control Service API. These metrics provide insight into service health, performance, and resource utilization.

#### Control Service Metrics
**Service Status Metrics:**

- `uptime`: Service runtime duration since last start
- `state`: Current service state (running, paused, stopped, etc.)
- `last_started`: Timestamp of last service start
- `restart_count`: Number of times the service has been restarted
- `health_score`: Overall service health indicator (0-100)

**System Resource Metrics:**

- `memory_usage_bytes`: Current memory consumption
- `cpu_usage_percent`: Current CPU utilization
- `goroutines_count`: Number of active goroutines
- `gc_cycles`: Garbage collection statistics
#### Scraper Service Metrics

**Indexing Performance:**

- `blocks_processed_total`: Total number of blocks indexed
- `blocks_per_second`: Current indexing throughput
- `batch_size_current`: Current batch size setting
- `batch_processing_time_ms`: Average time per batch
- `index_chunks_created`: Number of index chunks generated
- `appearances_extracted_total`: Total address appearances found

**RPC Performance:**

- `rpc_requests_total`: Total RPC requests made
- `rpc_requests_failed`: Number of failed RPC requests
- `rpc_response_time_ms`: Average RPC response time
- `rpc_rate_limit_hits`: Number of rate-limit encounters
- `rpc_endpoint_health`: Status of each configured RPC endpoint

**Processing State:**

- `current_block_number`: Latest block being processed
- `target_block_number`: Target block (chain tip)
- `blocks_behind`: Number of blocks behind the chain tip
- `indexing_progress_percent`: Overall indexing completion percentage
#### Monitor Service Metrics

**Monitoring Performance:**

- `addresses_monitored`: Number of addresses being tracked
- `monitoring_checks_total`: Total monitoring checks performed
- `activity_detected_total`: Number of activities detected
- `notifications_sent_total`: Number of notifications dispatched
- `false_positives`: Number of false-positive detections

**Detection Metrics:**

- `detection_latency_ms`: Time from block to activity detection
- `monitoring_batch_size`: Current batch size for monitoring
- `monitoring_frequency_seconds`: Current monitoring interval
#### API Service Metrics

**Request Performance:**

- `api_requests_total`: Total API requests served
- `api_requests_per_second`: Current request throughput
- `api_response_time_ms`: Average response time
- `api_errors_total`: Number of API errors
- `api_cache_hits`: Number of cache hits
- `api_cache_misses`: Number of cache misses

**Endpoint Metrics:**

- `status_endpoint_calls`: Calls to status endpoints
- `index_endpoint_calls`: Calls to index query endpoints
- `monitor_endpoint_calls`: Calls to monitor endpoints
- `admin_endpoint_calls`: Calls to admin endpoints
#### IPFS Service Metrics

**Network Performance:**

- `ipfs_peers_connected`: Number of connected IPFS peers
- `ipfs_data_uploaded_bytes`: Total data uploaded to IPFS
- `ipfs_data_downloaded_bytes`: Total data downloaded from IPFS
- `ipfs_pin_operations`: Number of pin operations performed
- `ipfs_chunks_shared`: Number of index chunks shared

**Synchronization Metrics:**

- `ipfs_sync_operations`: Number of sync operations
- `ipfs_sync_latency_ms`: Average sync operation time
- `ipfs_failed_retrievals`: Number of failed chunk retrievals
### Accessing Service Metrics

#### REST API Access

Metrics are available through the Control Service API:

```bash
# Get metrics for all services
curl http://localhost:8080/api/v1/metrics

# Get metrics for a specific service
curl http://localhost:8080/api/v1/services/scraper/metrics

# Get detailed metrics with verbose output (quote the URL so the
# shell does not interpret the "&")
curl "http://localhost:8080/api/v1/services/scraper?verbose=true&include=metrics"

# Get metrics in different formats
curl "http://localhost:8080/api/v1/metrics?format=json"
curl "http://localhost:8080/api/v1/metrics?format=prometheus"
```
#### CLI Access

```bash
# Show basic service status with key metrics
khedra status --metrics

# Show detailed metrics for all services
khedra metrics

# Show metrics for a specific service
khedra metrics --service=scraper

# Export metrics to a file
khedra metrics --output=/path/to/metrics.json

# Watch metrics in real time
khedra metrics --watch --interval=5s
```
#### Programmatic Access

```go
// Example: getting service metrics programmatically
package main

import (
	"fmt"
	"log"

	"github.com/TrueBlocks/trueblocks-khedra/v5/pkg/client"
)

func main() {
	c := client.NewKhedraClient("http://localhost:8080")
	metrics, err := c.GetServiceMetrics("scraper")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("Blocks per second: %f\n", metrics["blocks_per_second"])
	fmt.Printf("Memory usage: %d bytes\n", metrics["memory_usage_bytes"])
}
```
### Interpreting Metrics

#### Performance Health Indicators

**Scraper Service Health:**

- Healthy: `blocks_per_second > 10`, `rpc_response_time_ms < 500`, `memory_usage_bytes` stable
- Warning: `blocks_per_second < 5`, `rpc_response_time_ms > 1000`, `blocks_behind > 1000`
- Critical: `blocks_per_second < 1`, `rpc_requests_failed > 10%`, `memory_usage_bytes` increasing rapidly

**Monitor Service Health:**

- Healthy: `detection_latency_ms < 30000`, `false_positives < 5%`, all monitored addresses active
- Warning: `detection_latency_ms > 60000`, `false_positives > 10%`, some addresses not responding
- Critical: `detection_latency_ms > 300000`, `false_positives > 25%`, monitoring completely behind

**API Service Health:**

- Healthy: `api_response_time_ms < 100`, `api_errors_total < 1%`, `api_cache_hits > 80%`
- Warning: `api_response_time_ms > 500`, `api_errors_total > 5%`, `api_cache_hits < 60%`
- Critical: `api_response_time_ms > 2000`, `api_errors_total > 15%`, service unresponsive
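These thresholds can be checked mechanically. The sketch below classifies scraper health against them, assuming the metrics endpoint and field names described earlier plus `jq` and GNU `bc`:

```bash
#!/bin/bash
# Classify scraper health using the thresholds above.
M=$(curl -s http://localhost:8080/api/v1/services/scraper/metrics)
BPS=$(echo "$M" | jq -r '.blocks_per_second // 0')
RPC_MS=$(echo "$M" | jq -r '.rpc_response_time_ms // 0')

if (( $(echo "$BPS < 1" | bc -l) )); then
  echo "CRITICAL: blocks_per_second=$BPS"
elif (( $(echo "$BPS < 5 || $RPC_MS > 1000" | bc -l) )); then
  echo "WARNING: blocks_per_second=$BPS rpc_response_time_ms=$RPC_MS"
else
  echo "HEALTHY: blocks_per_second=$BPS rpc_response_time_ms=$RPC_MS"
fi
```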
#### Resource Utilization Thresholds

**Memory Usage:**

- Normal: <2GB per service
- High: 2-4GB per service (monitor for leaks)
- Critical: >4GB per service (immediate attention required)

**CPU Usage:**

- Normal: <50% average
- High: 50-80% average (acceptable under load)
- Critical: >80% sustained (performance degradation likely)
### Metrics-Based Troubleshooting

#### High Resource Usage

**High Memory Usage:**

```bash
# Check memory metrics
curl -s http://localhost:8080/api/v1/services/scraper/metrics | jq '.memory_usage_bytes'

# If memory usage is high:
# 1. Reduce batch size
# 2. Increase sleep interval
# 3. Check for memory leaks in logs
# 4. Restart the service if memory keeps growing
```

**High CPU Usage:**

```bash
# Check CPU metrics and goroutine count
curl -s http://localhost:8080/api/v1/metrics | jq '.cpu_usage_percent, .goroutines_count'

# If CPU usage is high:
# 1. Reduce batch size
# 2. Increase sleep interval
# 3. Check for infinite loops in logs
# 4. Verify RPC endpoint performance
```
#### Performance Degradation

**Slow Indexing:**

```bash
# Check indexing performance
curl -s http://localhost:8080/api/v1/services/scraper/metrics | jq '.blocks_per_second, .rpc_response_time_ms'

# Troubleshooting steps:
# 1. Check RPC response times
# 2. Verify network connectivity
# 3. Adjust batch size based on performance
# 4. Check for rate limiting
```

**API Response Delays:**

```bash
# Check API performance
curl -s http://localhost:8080/api/v1/services/api/metrics | jq '.api_response_time_ms, .api_cache_hits'

# Troubleshooting steps:
# 1. Check the cache hit ratio
# 2. Verify index integrity
# 3. Monitor concurrent request load
# 4. Check for slow database queries
```
#### Service Failures

**RPC Connection Issues:**

```bash
# Check RPC health metrics
curl -s http://localhost:8080/api/v1/services/scraper/metrics | jq '.rpc_requests_failed, .rpc_rate_limit_hits'

# Troubleshooting steps:
# 1. Test RPC endpoints directly
# 2. Increase sleep intervals if rate-limited
# 3. Switch to backup RPC endpoints
# 4. Check network connectivity
```
## Alerting and Monitoring Setup

### Prometheus Integration

Khedra can export metrics in Prometheus format for integration with monitoring systems:

```yaml
# prometheus.yml configuration
scrape_configs:
  - job_name: 'khedra'
    static_configs:
      - targets: ['localhost:8080']
    metrics_path: '/api/v1/metrics'
    params:
      format: ['prometheus']
    scrape_interval: 30s
```
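Before pointing Prometheus at the endpoint, it is worth spot-checking that the exposition format is actually emitted (this assumes the `format=prometheus` parameter shown earlier in this section):

```bash
# Spot-check the Prometheus exposition output
curl -s "http://localhost:8080/api/v1/metrics?format=prometheus" | head -20
```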
### Grafana Dashboard

Key metrics to monitor in Grafana:

**Performance Dashboard:**

- Blocks per second (Scraper)
- API response times
- Memory and CPU usage
- RPC response times

**Health Dashboard:**

- Service uptime
- Error rates
- Detection latency
- System resource utilization
### Alerting Rules

Example alerting rules for common issues:

```yaml
# Slow indexing alert
- alert: SlowIndexing
  expr: khedra_blocks_per_second < 5
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Khedra indexing is slow"
    description: "Indexing rate is {{ $value }} blocks/sec, below threshold"

# High memory usage alert
- alert: HighMemoryUsage
  expr: khedra_memory_usage_bytes > 4000000000
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "High memory usage detected"
    description: "Memory usage is {{ $value }} bytes, above the 4GB threshold"

# API response time alert
- alert: SlowAPI
  expr: khedra_api_response_time_ms > 1000
  for: 3m
  labels:
    severity: warning
  annotations:
    summary: "API responses are slow"
    description: "Average response time is {{ $value }}ms"
```
### Custom Monitoring Scripts

```bash
#!/bin/bash
# Simple monitoring script
METRICS_URL="http://localhost:8080/api/v1/metrics"

# Check blocks per second
BPS=$(curl -s "$METRICS_URL" | jq -r '.scraper.blocks_per_second // 0')
if (( $(echo "$BPS < 5" | bc -l) )); then
  echo "WARNING: Slow indexing detected: $BPS blocks/sec"
fi

# Check memory usage
MEMORY=$(curl -s "$METRICS_URL" | jq -r '.scraper.memory_usage_bytes // 0')
if (( MEMORY > 4000000000 )); then
  echo "CRITICAL: High memory usage: $((MEMORY/1024/1024))MB"
fi

# Check API health
API_TIME=$(curl -s "$METRICS_URL" | jq -r '.api.api_response_time_ms // 0')
if (( $(echo "$API_TIME > 1000" | bc -l) )); then
  echo "WARNING: Slow API responses: ${API_TIME}ms"
fi
```
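To run the script on a schedule, a crontab entry like the following (the script path is a placeholder) executes it every five minutes and appends any warnings to a log:

```
# m   h  dom mon dow  command
*/5  *   *   *   *    /usr/local/bin/check_khedra.sh >> /var/log/khedra-checks.log 2>&1
```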
## Best Practices for Metrics Monitoring

### Regular Monitoring

- **Establish Baselines:** Monitor metrics during normal operation to establish performance baselines
- **Set Appropriate Thresholds:** Configure alerts based on your specific environment and requirements
- **Monitor Trends:** Look for gradual degradation over time, not just immediate issues
- **Correlate Metrics:** Use multiple metrics together to diagnose issues accurately

### Performance Optimization

- **Use Metrics for Tuning:** Adjust batch sizes and sleep intervals based on actual performance metrics
- **Monitor Resource Efficiency:** Track resource usage to optimize system utilization
- **Identify Bottlenecks:** Use metrics to identify which component is limiting performance
- **Validate Changes:** Use metrics to verify that configuration changes improve performance

### Operational Excellence

- **Automate Monitoring:** Set up automated alerts for critical metrics
- **Create Dashboards:** Visualize key metrics for easier monitoring
- **Document Thresholds:** Maintain documentation of what constitutes healthy vs. problematic metrics
- **Regular Reviews:** Periodically review and adjust monitoring thresholds based on operational experience