Performance and Scalability

This section details Khedra's performance characteristics, optimization strategies, and scalability considerations.

Performance Tuning Guide

Optimal Configuration for Different Use Cases

Light Usage (Personal Development/Testing)

services:
  scraper:
    enabled: true
    sleep: 30        # Longer sleep for less aggressive indexing
    batchSize: 100   # Smaller batches to reduce memory usage
  monitor:
    enabled: false   # Disable if not needed
    sleep: 60
    batchSize: 50
  api:
    enabled: true
    port: 8080       # Standard configuration
  ipfs:
    enabled: false   # Disable to reduce resource usage

Expected Performance: 5-10 blocks/sec indexing, minimal resource usage

Standard Usage (Regular Development/Analysis)

services:
  scraper:
    enabled: true
    sleep: 12        # Balanced sleep interval
    batchSize: 500   # Default batch size
  monitor:
    enabled: true    # Enable for address tracking
    sleep: 12
    batchSize: 100
  api:
    enabled: true
    port: 8080
  ipfs:
    enabled: true    # Enable for collaboration
    port: 8083

Expected Performance: 15-25 blocks/sec indexing, moderate resource usage

High-Performance Usage (Production/Heavy Analysis)

services:
  scraper:
    enabled: true
    sleep: 5         # Aggressive indexing
    batchSize: 2000  # Large batches for efficiency
  monitor:
    enabled: true
    sleep: 5         # Fast monitoring
    batchSize: 500
  api:
    enabled: true
    port: 8080
  ipfs:
    enabled: true
    port: 8083

Expected Performance: 25-40 blocks/sec indexing, high resource usage

Batch Size Optimization Guidelines

Factors Affecting Optimal Batch Size

  1. RPC Endpoint Performance:

    • Fast/unlimited RPC: 1000-5000 blocks
    • Standard RPC: 500-1000 blocks
    • Rate-limited RPC: 50-200 blocks
  2. Available Memory:

    • 8GB+ RAM: 1000-2000 blocks
    • 4-8GB RAM: 500-1000 blocks
    • <4GB RAM: 100-500 blocks
  3. Network Latency:

    • Local RPC node: 2000-5000 blocks
    • Same region: 1000-2000 blocks
    • Remote/high latency: 100-500 blocks

Batch Size Tuning Process

  1. Start with defaults (500 blocks)
  2. Monitor performance metrics
  3. Adjust based on bottlenecks:
    • If RPC timeouts: decrease batch size
    • If memory issues: decrease batch size
    • If CPU idle time: increase batch size
    • If slow overall progress: increase batch size
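
The effect of step 3's adjustments can be verified empirically. A minimal sketch (assuming jq is installed and the scraper metrics endpoint described later in this section) samples blocks_per_second for one minute; run it before and after a batch-size change and compare the averages:

#!/bin/bash
# Sample indexing throughput for one minute and report the average.
METRICS_URL="http://localhost:8080/api/v1/services/scraper/metrics"

total=0
samples=6
for i in $(seq 1 $samples); do
    bps=$(curl -s "$METRICS_URL" | jq -r '.blocks_per_second // 0')
    total=$(echo "$total + $bps" | bc -l)
    sleep 10
done
echo "Average blocks/sec: $(echo "$total / $samples" | bc -l)"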

Sleep Interval Recommendations

Based on System Resources

High-End Systems (8+ cores, 16GB+ RAM):

  • Scraper: 5-10 seconds
  • Monitor: 5-10 seconds

Mid-Range Systems (4-8 cores, 8-16GB RAM):

  • Scraper: 10-15 seconds
  • Monitor: 10-15 seconds

Resource-Constrained Systems (<4 cores, <8GB RAM):

  • Scraper: 20-30 seconds
  • Monitor: 30-60 seconds

Based on RPC Provider

Unlimited/Premium RPC:

  • Scraper: 5-10 seconds
  • Monitor: 5-10 seconds

Standard RPC with rate limits:

  • Scraper: 15-30 seconds
  • Monitor: 30-60 seconds

Free/heavily limited RPC:

  • Scraper: 60-120 seconds
  • Monitor: 120-300 seconds
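
For rate-limited providers it helps to estimate the request rate a configuration produces before deploying it. A rough upper-bound sketch; CALLS_PER_BLOCK is a hypothetical placeholder (the real figure varies by chain and trace usage, and can be measured from rpc_requests_total over a known block range):

# Upper-bound estimate: one batch of requests per sleep interval
BATCH_SIZE=500
SLEEP=30
CALLS_PER_BLOCK=4   # hypothetical; measure for your own setup

echo "Approx. requests/sec: $(echo "$BATCH_SIZE * $CALLS_PER_BLOCK / $SLEEP" | bc -l)"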

RPC Endpoint Optimization

Choosing RPC Providers

Recommended for High Performance:

  1. Local RPC node (best performance)
  2. Premium providers (Alchemy, Infura Pro)
  3. Archive nodes with trace support

Configuration for Multiple RPC Endpoints:

chains:
  mainnet:
    rpcs:
      - "https://eth-mainnet.alchemyapi.io/v2/YOUR_KEY"
      - "https://mainnet.infura.io/v3/YOUR_KEY"  
      - "https://rpc.ankr.com/eth"
    enabled: true

RPC Performance Testing

# Test RPC response time
time curl -X POST -H "Content-Type: application/json" \
  --data '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}' \
  https://your-rpc-endpoint

# Test batch request performance (two calls in one JSON-RPC batch)
time curl -X POST -H "Content-Type: application/json" \
  --data '[{"jsonrpc":"2.0","method":"eth_getBlockByNumber","params":["0x1000000",false],"id":1},{"jsonrpc":"2.0","method":"eth_getBlockByNumber","params":["0x1000001",false],"id":2}]' \
  https://your-rpc-endpoint
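
To compare several configured endpoints at once, the same probe can be run in a loop (the endpoint list below is illustrative; substitute your own):

# Compare eth_blockNumber latency across endpoints
for rpc in "https://eth-mainnet.alchemyapi.io/v2/YOUR_KEY" \
           "https://mainnet.infura.io/v3/YOUR_KEY" \
           "https://rpc.ankr.com/eth"; do
    secs=$(curl -s -o /dev/null -w '%{time_total}' -X POST \
        -H "Content-Type: application/json" \
        --data '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}' \
        "$rpc")
    echo "$rpc: ${secs}s"
done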

System Resource Monitoring

Key Metrics to Monitor

  1. CPU Usage:

    top -p $(pgrep -d, khedra)   # -d, joins multiple PIDs into the list top expects
    
  2. Memory Usage:

    ps -o pid,vsz,rss,comm -p $(pgrep -d, khedra)
    
  3. Disk I/O:

    iotop -p $(pgrep -n khedra)   # iotop takes one -p per PID; -n selects the newest match
    
  4. Network Usage:

    nethogs   # nethogs has no PID filter; locate khedra in the per-process list
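
For trend analysis, periodic samples are more useful than live views. A minimal sampler (assuming a single khedra process) appends timestamped CPU and memory readings to a CSV; run it from cron to build a baseline:

#!/bin/bash
# Append one timestamped CPU%/RSS sample for the khedra process
PID=$(pgrep -n khedra)
echo "$(date -Iseconds),$(ps -o %cpu= -p "$PID"),$(ps -o rss= -p "$PID")" \
    >> khedra-usage.csv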
    

Performance Thresholds

CPU Usage:

  • <50%: Can increase batch size or decrease sleep
  • 50-80%: Optimal range
  • >80%: Decrease batch size or increase sleep

Memory Usage:

  • <2GB: Can increase batch size
  • 2-4GB: Monitor for memory leaks
  • >4GB: Decrease batch size

Disk I/O:

  • High read: index queries being served (normal during heavy API use)
  • High write: Indexing in progress (normal)
  • Very high write: May need to reduce batch size

Scaling Considerations

Horizontal Scaling Strategies

  1. Chain Separation:

    • Run separate Khedra instances per blockchain
    • Distribute chains across multiple servers
    • Use load balancer for API access
  2. Service Separation:

    • Run API service on separate instances
    • Dedicated IPFS nodes for data sharing
    • Centralized monitoring service
  3. Geographic Distribution:

    • Deploy close to RPC providers
    • Regional API instances for lower latency
    • IPFS network for global data sharing
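
A load balancer needs a health check for each instance. A minimal sketch (the hostnames and ports are hypothetical; any instance that serves the metrics endpoint counts as healthy):

#!/bin/bash
# Probe each API instance behind the load balancer
for host in "http://indexer-mainnet:8080" "http://indexer-gnosis:8080"; do
    if curl -sf "$host/api/v1/metrics" > /dev/null; then
        echo "$host: healthy"
    else
        echo "$host: UNREACHABLE"
    fi
done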

Vertical Scaling Guidelines

Memory Scaling:

  • 8GB: Single chain, moderate usage
  • 16GB: Multiple chains or heavy usage
  • 32GB+: High-performance production usage

CPU Scaling:

  • 4 cores: Basic usage
  • 8 cores: Standard production
  • 16+ cores: High-performance or multiple chains

Storage Scaling:

  • SSD required for optimal performance
  • 100GB per chain per year (estimate)
  • Consider compression and archival strategies
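
The 100GB-per-chain-per-year estimate translates directly into a storage-runway check. A sketch assuming GNU df and a placeholder data directory:

#!/bin/bash
# Estimate years of storage headroom from free space and chain count
DATA_DIR="$HOME/.khedra"   # placeholder; use your actual data directory
CHAINS=2
FREE_GB=$(df -BG --output=avail "$DATA_DIR" | tail -1 | tr -dc '0-9')
echo "Free: ${FREE_GB}GB; estimated runway: $((FREE_GB / (100 * CHAINS))) year(s)"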

Service Metrics and Monitoring

Available Performance Metrics

Each Khedra service exposes performance metrics that can be accessed through the Control Service API. These metrics provide insight into service health, performance, and resource utilization.

Control Service Metrics

Service Status Metrics:

  • uptime: Service runtime duration since last start
  • state: Current service state (running, paused, stopped, etc.)
  • last_started: Timestamp of last service start
  • restart_count: Number of times service has been restarted
  • health_score: Overall service health indicator (0-100)

System Resource Metrics:

  • memory_usage_bytes: Current memory consumption
  • cpu_usage_percent: Current CPU utilization
  • goroutines_count: Number of active goroutines
  • gc_cycles: Garbage collection statistics

Scraper Service Metrics

Indexing Performance:

  • blocks_processed_total: Total number of blocks indexed
  • blocks_per_second: Current indexing throughput
  • batch_size_current: Current batch size setting
  • batch_processing_time_ms: Average time per batch
  • index_chunks_created: Number of index chunks generated
  • appearances_extracted_total: Total address appearances found

RPC Performance:

  • rpc_requests_total: Total RPC requests made
  • rpc_requests_failed: Number of failed RPC requests
  • rpc_response_time_ms: Average RPC response time
  • rpc_rate_limit_hits: Number of rate limit encounters
  • rpc_endpoint_health: Status of each configured RPC endpoint

Processing State:

  • current_block_number: Latest block being processed
  • target_block_number: Target block (chain tip)
  • blocks_behind: Number of blocks behind chain tip
  • indexing_progress_percent: Overall indexing completion percentage
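
These values combine into a useful catch-up estimate. For example, using the per-service metrics endpoint shown later in this section:

# Estimate hours until the scraper reaches the chain tip
curl -s http://localhost:8080/api/v1/services/scraper/metrics | \
    jq -r 'if .blocks_per_second > 0
           then "~\(.blocks_behind / .blocks_per_second / 3600 | floor) hour(s) to chain tip"
           else "scraper idle" end'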

Monitor Service Metrics

Monitoring Performance:

  • addresses_monitored: Number of addresses being tracked
  • monitoring_checks_total: Total monitoring checks performed
  • activity_detected_total: Number of activities detected
  • notifications_sent_total: Number of notifications dispatched
  • false_positives: Number of false positive detections

Detection Metrics:

  • detection_latency_ms: Time from block to activity detection
  • monitoring_batch_size: Current batch size for monitoring
  • monitoring_frequency_seconds: Current monitoring interval

API Service Metrics

Request Performance:

  • api_requests_total: Total API requests served
  • api_requests_per_second: Current request throughput
  • api_response_time_ms: Average response time
  • api_errors_total: Number of API errors
  • api_cache_hits: Number of cache hits
  • api_cache_misses: Number of cache misses
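
The hit and miss counters combine into the cache hit ratio used in the health thresholds later in this section (note the expression errors out if no requests have been served yet):

# Cache hit ratio as a percentage
curl -s http://localhost:8080/api/v1/services/api/metrics | \
    jq '.api_cache_hits / (.api_cache_hits + .api_cache_misses) * 100 | floor'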

Endpoint Metrics:

  • status_endpoint_calls: Calls to status endpoints
  • index_endpoint_calls: Calls to index query endpoints
  • monitor_endpoint_calls: Calls to monitor endpoints
  • admin_endpoint_calls: Calls to admin endpoints

IPFS Service Metrics

Network Performance:

  • ipfs_peers_connected: Number of connected IPFS peers
  • ipfs_data_uploaded_bytes: Total data uploaded to IPFS
  • ipfs_data_downloaded_bytes: Total data downloaded from IPFS
  • ipfs_pin_operations: Number of pin operations performed
  • ipfs_chunks_shared: Number of index chunks shared

Synchronization Metrics:

  • ipfs_sync_operations: Number of sync operations
  • ipfs_sync_latency_ms: Average sync operation time
  • ipfs_failed_retrievals: Number of failed chunk retrievals
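
A quick spot check of IPFS health (assuming the ipfs service follows the same per-service metrics path as the other services):

# Connected peers and failed retrievals at a glance
curl -s http://localhost:8080/api/v1/services/ipfs/metrics | \
    jq '{peers: .ipfs_peers_connected, failed_retrievals: .ipfs_failed_retrievals}'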

Accessing Service Metrics

REST API Access

Metrics are available through the Control Service API:

# Get metrics for all services
curl http://localhost:8080/api/v1/metrics

# Get metrics for specific service
curl http://localhost:8080/api/v1/services/scraper/metrics

# Get detailed metrics with verbose output
curl "http://localhost:8080/api/v1/services/scraper?verbose=true&include=metrics"

# Get metrics in different formats
curl "http://localhost:8080/api/v1/metrics?format=json"
curl "http://localhost:8080/api/v1/metrics?format=prometheus"

CLI Access

# Show basic service status with key metrics
khedra status --metrics

# Show detailed metrics for all services
khedra metrics

# Show metrics for specific service
khedra metrics --service=scraper

# Export metrics to file
khedra metrics --output=/path/to/metrics.json

# Watch metrics in real-time
khedra metrics --watch --interval=5s

Programmatic Access

// Example: getting service metrics programmatically (a minimal sketch;
// assumes NewKhedraClient/GetServiceMetrics return a string-keyed metric map)
package main

import (
    "fmt"
    "log"

    "github.com/TrueBlocks/trueblocks-khedra/v5/pkg/client"
)

func main() {
    kc := client.NewKhedraClient("http://localhost:8080")
    metrics, err := kc.GetServiceMetrics("scraper")
    if err != nil {
        log.Fatal(err)
    }

    fmt.Printf("Blocks per second: %v\n", metrics["blocks_per_second"])
    fmt.Printf("Memory usage: %v bytes\n", metrics["memory_usage_bytes"])
}

Interpreting Metrics

Performance Health Indicators

Scraper Service Health:

  • Healthy: blocks_per_second > 10, rpc_response_time_ms < 500, memory_usage_bytes stable
  • Warning: blocks_per_second < 5, rpc_response_time_ms > 1000, blocks_behind > 1000
  • Critical: blocks_per_second < 1, rpc_requests_failed above 10% of rpc_requests_total, memory_usage_bytes increasing rapidly

Monitor Service Health:

  • Healthy: detection_latency_ms < 30000, false_positives < 5%, all monitored addresses active
  • Warning: detection_latency_ms > 60000, false_positives > 10%, some addresses not responding
  • Critical: detection_latency_ms > 300000, false_positives > 25%, monitoring has fallen completely behind

API Service Health:

  • Healthy: api_response_time_ms < 100, error rate (api_errors_total / api_requests_total) < 1%, cache hit ratio > 80%
  • Warning: api_response_time_ms > 500, error rate > 5%, cache hit ratio < 60%
  • Critical: api_response_time_ms > 2000, error rate > 15%, service unresponsive
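
These thresholds can be applied mechanically. A minimal sketch that classifies scraper health from the values above:

#!/bin/bash
# Classify scraper health using the documented thresholds
METRICS=$(curl -s http://localhost:8080/api/v1/services/scraper/metrics)
BPS=$(echo "$METRICS" | jq -r '.blocks_per_second // 0')
RPC_MS=$(echo "$METRICS" | jq -r '.rpc_response_time_ms // 0')

if (( $(echo "$BPS < 1" | bc -l) )); then
    echo "CRITICAL: blocks_per_second=$BPS"
elif (( $(echo "$BPS < 5" | bc -l) )) || (( $(echo "$RPC_MS > 1000" | bc -l) )); then
    echo "WARNING: blocks_per_second=$BPS rpc_response_time_ms=$RPC_MS"
else
    echo "HEALTHY"
fi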

Resource Utilization Thresholds

Memory Usage:

  • Normal: < 2GB per service
  • High: 2-4GB per service (monitor for leaks)
  • Critical: > 4GB per service (immediate attention required)

CPU Usage:

  • Normal: < 50% average
  • High: 50-80% average (acceptable under load)
  • Critical: > 80% sustained (performance degradation likely)

Metrics-Based Troubleshooting

High Resource Usage

High Memory Usage:

# Check memory metrics
curl http://localhost:8080/api/v1/services/scraper/metrics | jq '.memory_usage_bytes'

# If memory usage is high:
# 1. Reduce batch size
# 2. Increase sleep interval
# 3. Check for memory leaks in logs
# 4. Restart service if memory continues growing

High CPU Usage:

# Check CPU metrics and goroutine count
curl http://localhost:8080/api/v1/metrics | jq '.cpu_usage_percent, .goroutines_count'

# If CPU usage is high:
# 1. Reduce batch size
# 2. Increase sleep interval
# 3. Check for infinite loops in logs
# 4. Verify RPC endpoint performance

Performance Degradation

Slow Indexing:

# Check indexing performance
curl http://localhost:8080/api/v1/services/scraper/metrics | jq '.blocks_per_second, .rpc_response_time_ms'

# Troubleshooting steps:
# 1. Check RPC response times
# 2. Verify network connectivity
# 3. Adjust batch size based on performance
# 4. Check for rate limiting

API Response Delays:

# Check API performance
curl http://localhost:8080/api/v1/services/api/metrics | jq '.api_response_time_ms, .api_cache_hits'

# Troubleshooting steps:
# 1. Check cache hit ratio
# 2. Verify index integrity
# 3. Monitor concurrent request load
# 4. Check for slow database queries

Service Failures

RPC Connection Issues:

# Check RPC health metrics
curl http://localhost:8080/api/v1/services/scraper/metrics | jq '.rpc_requests_failed, .rpc_rate_limit_hits'

# Troubleshooting steps:
# 1. Test RPC endpoints directly
# 2. Increase sleep intervals if rate limited
# 3. Switch to backup RPC endpoints
# 4. Check network connectivity

Alerting and Monitoring Setup

Prometheus Integration

Khedra can export metrics in Prometheus format for integration with monitoring systems:

# prometheus.yml configuration
scrape_configs:
  - job_name: 'khedra'
    static_configs:
      - targets: ['localhost:8080']
    metrics_path: '/api/v1/metrics'
    params:
      format: ['prometheus']
    scrape_interval: 30s
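
Before wiring this into Prometheus, confirm the target responds in the format Prometheus will request:

# Verify Prometheus-format output directly
curl -s "http://localhost:8080/api/v1/metrics?format=prometheus" | head -20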

Grafana Dashboard

Key metrics to monitor in Grafana:

Performance Dashboard:

  • Blocks per second (Scraper)
  • API response times
  • Memory and CPU usage
  • RPC response times

Health Dashboard:

  • Service uptime
  • Error rates
  • Detection latency
  • System resource utilization

Alerting Rules

Example alerting rules for common issues:

# Slow indexing alert
- alert: SlowIndexing
  expr: khedra_blocks_per_second < 5
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Khedra indexing is slow"
    description: "Indexing rate is {{ $value }} blocks/sec, below threshold"

# High memory usage alert
- alert: HighMemoryUsage
  expr: khedra_memory_usage_bytes > 4000000000
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "High memory usage detected"
    description: "Memory usage is {{ $value }} bytes, above 4GB threshold"

# API response time alert
- alert: SlowAPI
  expr: khedra_api_response_time_ms > 1000
  for: 3m
  labels:
    severity: warning
  annotations:
    summary: "API responses are slow"
    description: "Average response time is {{ $value }}ms"

Custom Monitoring Scripts

#!/bin/bash
# Simple monitoring script
METRICS_URL="http://localhost:8080/api/v1/metrics"

# Check blocks per second
BPS=$(curl -s $METRICS_URL | jq -r '.scraper.blocks_per_second // 0')
if (( $(echo "$BPS < 5" | bc -l) )); then
    echo "WARNING: Slow indexing detected: $BPS blocks/sec"
fi

# Check memory usage
MEMORY=$(curl -s $METRICS_URL | jq -r '.scraper.memory_usage_bytes // 0')
if (( MEMORY > 4000000000 )); then
    echo "CRITICAL: High memory usage: $((MEMORY/1024/1024))MB"
fi

# Check API health
API_TIME=$(curl -s $METRICS_URL | jq -r '.api.api_response_time_ms // 0')
if (( $(echo "$API_TIME > 1000" | bc -l) )); then
    echo "WARNING: Slow API responses: ${API_TIME}ms"
fi
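
Such a script is typically run from cron so warnings surface regularly (the script path below is a placeholder):

# Example crontab entry: run the check every 5 minutes
*/5 * * * * /usr/local/bin/khedra-check.sh >> /var/log/khedra-check.log 2>&1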

Best Practices for Metrics Monitoring

Regular Monitoring

  1. Establish Baselines: Monitor metrics during normal operation to establish performance baselines
  2. Set Appropriate Thresholds: Configure alerts based on your specific environment and requirements
  3. Monitor Trends: Look for gradual degradation over time, not just immediate issues
  4. Correlate Metrics: Use multiple metrics together to diagnose issues accurately

Performance Optimization

  1. Use Metrics for Tuning: Adjust batch sizes and sleep intervals based on actual performance metrics
  2. Monitor Resource Efficiency: Track resource usage to optimize system utilization
  3. Identify Bottlenecks: Use metrics to identify which component is limiting performance
  4. Validate Changes: Use metrics to verify that configuration changes improve performance

Operational Excellence

  1. Automate Monitoring: Set up automated alerts for critical metrics
  2. Create Dashboards: Visualize key metrics for easier monitoring
  3. Document Thresholds: Maintain documentation of what constitutes healthy vs. problematic metrics
  4. Regular Reviews: Periodically review and adjust monitoring thresholds based on operational experience