Monitoring & Observability
Comprehensive monitoring is essential for validator operations. This guide covers metrics collection, alerting, dashboards, and troubleshooting.
Goal: Detect and resolve issues before they impact validator performance or rewards.
Monitoring Stack Overview
Architecture
Pilier Node
├─ Prometheus metrics endpoint (:9615)
│ └─ Exposes: Block height, peers, finality, CPU, memory
│
Prometheus Server
├─ Scrapes metrics every 15 seconds
├─ Stores time-series data (15 days retention)
└─ Evaluates alert rules
│
Alertmanager
├─ Receives alerts from Prometheus
├─ Routes to: Email, Telegram, PagerDuty
└─ Deduplicates and groups alerts
│
Grafana
├─ Visualizes metrics (dashboards)
├─ Queries Prometheus
└─ User interface for operators
Components:
| Component | Purpose | Installation | Cost |
|---|---|---|---|
| Prometheus | Metrics storage & alerting | Self-hosted or cloud | Free (self-hosted) |
| Grafana | Dashboards & visualization | Self-hosted or cloud | Free (self-hosted) |
| Node Exporter | System metrics (CPU, disk, etc.) | Self-hosted | Free |
| Alertmanager | Alert routing & notification | Self-hosted | Free |
| Telegram Bot (optional) | Alert notifications | Cloud (Telegram API) | Free |
Quick Start (5 Minutes)
For validators who want basic monitoring NOW:
Step 1: Enable Prometheus in Node
# Edit systemd service
sudo nano /etc/systemd/system/pilier.service
# Add these flags to ExecStart:
--prometheus-port 9615 \
--prometheus-external
# Reload and restart
sudo systemctl daemon-reload
sudo systemctl restart pilier
Step 2: Verify Metrics Endpoint
# Check if metrics accessible
curl http://localhost:9615/metrics
# Should see output like:
# substrate_block_height{status="best"} 12345
# substrate_finality_grandpa_round 678
# substrate_sub_libp2p_peers_count 5
# ...hundreds of metrics...
If you see metrics → monitoring is enabled! ✅
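To spot-check just the handful of metrics this guide alerts on, you can filter the output (the grep pattern is an assumption — metric names vary slightly across node versions, so adjust it if nothing matches):
```bash
# Pull only the key health metrics from the endpoint
curl -s http://localhost:9615/metrics \
  | grep -E '^substrate_(block_height|sub_libp2p_peers_count|finality_grandpa_round)'
```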
Step 3: Use Public Monitoring (Temporary)
While you set up your own stack:
Pilier Telemetry:
└─ https://telemetry.pilier.net
└─ Shows: Your validator name, block height, peers
└─ Limitations: Only basic metrics, no alerts
Prometheus Cloud (free tier):
└─ Grafana Cloud: https://grafana.com/products/cloud/
└─ Sign up → Add Prometheus data source → Point to your node
└─ Limitations: free tier is capped by active metric series (roughly 10k at the time of writing — sufficient for 1 validator)
Step 4: Set Up Alerts (via Grafana Cloud)
1. Grafana Cloud → Alerting → New Alert Rule
2. Condition: substrate_block_height (best) not increasing for 5 minutes
3. Notification: Email / Telegram
4. Save alert rule
5. Test by stopping node: sudo systemctl stop pilier
6. Should receive alert within 5 minutes
Done! You have basic monitoring. Continue reading for production setup.
Production Setup (Self-Hosted)
System Requirements
Monitoring server (separate from validator):
Recommended specs:
├─ CPU: 2 cores
├─ RAM: 4 GB
├─ Disk: 50 GB SSD (for 15-day metrics retention)
├─ Network: Same datacenter as validator (low latency)
└─ OS: Ubuntu 22.04 or Debian 12
Why separate server:
├─ Validator failure doesn't take down monitoring
├─ Monitoring overhead doesn't affect validator performance
└─ Can monitor multiple validators from one monitoring server
For single validator (cost-conscious):
Option: Co-locate on validator server
├─ Add 2 GB RAM (total 18 GB)
├─ Add 20 GB disk (total 270 GB)
└─ Minor CPU overhead (~5%)
Tradeoff: Less reliable (monitoring fails if validator fails)
Install Prometheus
Step 1: Create prometheus user
sudo useradd --no-create-home --shell /bin/false prometheus
Step 2: Download Prometheus
# Check latest version: https://github.com/prometheus/prometheus/releases
PROM_VERSION="2.48.0"
cd /tmp
wget https://github.com/prometheus/prometheus/releases/download/v${PROM_VERSION}/prometheus-${PROM_VERSION}.linux-amd64.tar.gz
# Verify checksum (from GitHub release page)
sha256sum prometheus-${PROM_VERSION}.linux-amd64.tar.gz
# Extract
tar xvf prometheus-${PROM_VERSION}.linux-amd64.tar.gz
cd prometheus-${PROM_VERSION}.linux-amd64/
Step 3: Install binaries
# Copy binaries
sudo cp prometheus /usr/local/bin/
sudo cp promtool /usr/local/bin/
# Set ownership
sudo chown prometheus:prometheus /usr/local/bin/prometheus
sudo chown prometheus:prometheus /usr/local/bin/promtool
# Copy console templates
sudo mkdir -p /etc/prometheus
sudo cp -r consoles/ console_libraries/ /etc/prometheus/
sudo chown -R prometheus:prometheus /etc/prometheus
Step 4: Create data directory
sudo mkdir -p /var/lib/prometheus
sudo chown prometheus:prometheus /var/lib/prometheus
Step 5: Configure Prometheus
sudo nano /etc/prometheus/prometheus.yml
Content:
global:
scrape_interval: 15s # Scrape targets every 15 seconds
evaluation_interval: 15s # Evaluate rules every 15 seconds
external_labels:
monitor: "pilier-validator"
# Alerting configuration
alerting:
alertmanagers:
- static_configs:
- targets:
- localhost:9093 # Alertmanager (we'll install later)
# Alert rules
rule_files:
- "/etc/prometheus/alert.rules.yml"
# Scrape targets
scrape_configs:
# Pilier validator node
- job_name: "pilier-node"
static_configs:
- targets:
- "localhost:9615" # Change to validator IP if separate server
labels:
instance: "validator-lyon-01"
role: "validator"
# System metrics (via node_exporter)
- job_name: "node-exporter"
static_configs:
- targets:
- "localhost:9100"
labels:
instance: "validator-lyon-01"
# Prometheus itself (self-monitoring)
- job_name: "prometheus"
static_configs:
- targets:
- "localhost:9090"
Step 6: Create systemd service
sudo nano /etc/systemd/system/prometheus.service
Content:
[Unit]
Description=Prometheus
Documentation=https://prometheus.io/docs/introduction/overview/
Wants=network-online.target
After=network-online.target
[Service]
Type=simple
User=prometheus
Group=prometheus
ExecReload=/bin/kill -HUP $MAINPID
ExecStart=/usr/local/bin/prometheus \
--config.file=/etc/prometheus/prometheus.yml \
--storage.tsdb.path=/var/lib/prometheus/ \
--storage.tsdb.retention.time=15d \
--web.console.templates=/etc/prometheus/consoles \
--web.console.libraries=/etc/prometheus/console_libraries \
--web.listen-address=0.0.0.0:9090
SyslogIdentifier=prometheus
Restart=always
RestartSec=10s
[Install]
WantedBy=multi-user.target
Step 7: Start Prometheus
# Set permissions
sudo chown prometheus:prometheus /etc/prometheus/prometheus.yml
# Start service
sudo systemctl daemon-reload
sudo systemctl enable prometheus
sudo systemctl start prometheus
# Check status
sudo systemctl status prometheus
# Check logs
sudo journalctl -u prometheus -f
# Verify web UI (from monitoring server)
curl http://localhost:9090
# Should see: HTML page (Prometheus web UI)
Access web UI: http://YOUR_MONITORING_SERVER_IP:9090
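The HTTP API is useful for scripted checks as well; querying `up` returns the scrape status of every configured target (a sketch, assuming you run it on the monitoring server):
```bash
# 1 = target scraped successfully, 0 = scrape failing
curl -s 'http://localhost:9090/api/v1/query?query=up' | python3 -m json.tool
```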
Install Node Exporter
Node Exporter provides system metrics (CPU, memory, disk, network).
Step 1: Download
NODE_EXPORTER_VERSION="1.7.0"
cd /tmp
wget https://github.com/prometheus/node_exporter/releases/download/v${NODE_EXPORTER_VERSION}/node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64.tar.gz
# Verify checksum
sha256sum node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64.tar.gz
# Extract
tar xvf node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64.tar.gz
cd node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64/
Step 2: Install
sudo cp node_exporter /usr/local/bin/
sudo chown prometheus:prometheus /usr/local/bin/node_exporter
Step 3: Create systemd service
sudo nano /etc/systemd/system/node-exporter.service
Content:
[Unit]
Description=Node Exporter
Documentation=https://github.com/prometheus/node_exporter
Wants=network-online.target
After=network-online.target
[Service]
Type=simple
User=prometheus
Group=prometheus
ExecStart=/usr/local/bin/node_exporter \
--web.listen-address=0.0.0.0:9100
SyslogIdentifier=node_exporter
Restart=always
RestartSec=10s
[Install]
WantedBy=multi-user.target
Step 4: Start Node Exporter
sudo systemctl daemon-reload
sudo systemctl enable node-exporter
sudo systemctl start node-exporter
# Verify
curl http://localhost:9100/metrics | head -20
# Should see: node_cpu_seconds_total, node_memory_MemAvailable_bytes, etc.
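You can also confirm from Prometheus's side that the new target is healthy (plain grep here so there is no jq dependency):
```bash
# Each configured scrape job should report "health":"up"
curl -s http://localhost:9090/api/v1/targets | grep -o '"health":"[a-z]*"'
```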
Install Alertmanager
Alertmanager handles alert notifications (email, Telegram, PagerDuty).
Step 1: Download
ALERTMANAGER_VERSION="0.26.0"
cd /tmp
wget https://github.com/prometheus/alertmanager/releases/download/v${ALERTMANAGER_VERSION}/alertmanager-${ALERTMANAGER_VERSION}.linux-amd64.tar.gz
# Verify checksum
sha256sum alertmanager-${ALERTMANAGER_VERSION}.linux-amd64.tar.gz
# Extract
tar xvf alertmanager-${ALERTMANAGER_VERSION}.linux-amd64.tar.gz
cd alertmanager-${ALERTMANAGER_VERSION}.linux-amd64/
Step 2: Install
sudo cp alertmanager amtool /usr/local/bin/
sudo chown prometheus:prometheus /usr/local/bin/alertmanager
sudo chown prometheus:prometheus /usr/local/bin/amtool
sudo mkdir -p /etc/alertmanager /var/lib/alertmanager
sudo chown prometheus:prometheus /etc/alertmanager /var/lib/alertmanager
Step 3: Configure Alertmanager
sudo nano /etc/alertmanager/alertmanager.yml
Content (Email notifications):
global:
resolve_timeout: 5m
smtp_smarthost: "smtp.gmail.com:587"
smtp_from: "your-email@gmail.com"
smtp_auth_username: "your-email@gmail.com"
smtp_auth_password: "your-app-password" # Gmail: Use App Password, not regular password
smtp_require_tls: true
route:
group_by: ["alertname", "instance"]
group_wait: 30s
group_interval: 5m
repeat_interval: 12h
receiver: "email"
receivers:
- name: "email"
email_configs:
- to: "your-email@gmail.com"
headers:
Subject: "🚨 Pilier Validator Alert: {{ .GroupLabels.alertname }}"
inhibit_rules:
- source_match:
severity: "critical"
target_match:
severity: "warning"
equal: ["alertname", "instance"]
Step 4: Create systemd service
sudo nano /etc/systemd/system/alertmanager.service
Content:
[Unit]
Description=Alertmanager
Documentation=https://prometheus.io/docs/alerting/alertmanager/
Wants=network-online.target
After=network-online.target
[Service]
Type=simple
User=prometheus
Group=prometheus
ExecReload=/bin/kill -HUP $MAINPID
ExecStart=/usr/local/bin/alertmanager \
--config.file=/etc/alertmanager/alertmanager.yml \
--storage.path=/var/lib/alertmanager/ \
--web.listen-address=0.0.0.0:9093
SyslogIdentifier=alertmanager
Restart=always
RestartSec=10s
[Install]
WantedBy=multi-user.target
Step 5: Start Alertmanager
sudo chown prometheus:prometheus /etc/alertmanager/alertmanager.yml
sudo systemctl daemon-reload
sudo systemctl enable alertmanager
sudo systemctl start alertmanager
# Verify
curl http://localhost:9093
# Should see: Alertmanager web UI (HTML)
Access web UI: http://YOUR_MONITORING_SERVER_IP:9093
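As with Prometheus and promtool, the configuration can be linted before (and after) any change using amtool, which was installed alongside Alertmanager above:
```bash
# Validate the Alertmanager configuration
amtool check-config /etc/alertmanager/alertmanager.yml
# Should report the parsed route and receivers with no errors
```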
Configure Alert Rules
Alert rules define when Prometheus should fire alerts.
sudo nano /etc/prometheus/alert.rules.yml
Content:
groups:
- name: validator_alerts
interval: 30s
rules:
# Node is down
- alert: ValidatorDown
expr: up{job="pilier-node"} == 0
for: 2m
labels:
severity: critical
annotations:
summary: "Validator {{ $labels.instance }} is down"
description: "Validator node has been down for more than 2 minutes."
# Block height not increasing (node stuck)
- alert: BlockHeightNotIncreasing
expr: rate(substrate_block_height{status="best"}[5m]) == 0
for: 3m
labels:
severity: critical
annotations:
summary: "Block height not increasing on {{ $labels.instance }}"
description: "Best block height has not increased in 3 minutes. Node may be stuck or not syncing."
# Low peer count
- alert: LowPeerCount
expr: substrate_sub_libp2p_peers_count < 2
for: 5m
labels:
severity: warning
annotations:
summary: "Low peer count on {{ $labels.instance }}"
description: "Validator has fewer than 2 peers for more than 5 minutes. Current: {{ $value }}."
# Finality lag (finalized block behind best block)
- alert: FinalityLag
expr: (substrate_block_height{status="best"} - ignoring(status) substrate_block_height{status="finalized"}) > 10
for: 5m
labels:
severity: warning
annotations:
summary: "Finality lag on {{ $labels.instance }}"
description: "Finalized block is {{ $value }} blocks behind best block."
# High CPU usage
- alert: HighCPUUsage
expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
for: 10m
labels:
severity: warning
annotations:
summary: "High CPU usage on {{ $labels.instance }}"
description: "CPU usage is above 90% for more than 10 minutes. Current: {{ $value | humanize }}%."
# High memory usage
- alert: HighMemoryUsage
expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
for: 10m
labels:
severity: warning
annotations:
summary: "High memory usage on {{ $labels.instance }}"
description: "Memory usage is above 90% for more than 10 minutes. Current: {{ $value | humanize }}%."
# Low disk space
- alert: LowDiskSpace
expr: (1 - (node_filesystem_avail_bytes{mountpoint="/var/lib/pilier"} / node_filesystem_size_bytes{mountpoint="/var/lib/pilier"})) * 100 > 80
for: 5m
labels:
severity: warning
annotations:
summary: "Low disk space on {{ $labels.instance }}"
description: "Disk usage is above 80%. Current: {{ $value | humanize }}%."
# Node not validating (for active validators)
- alert: NotValidating
expr: substrate_sub_libp2p_is_major_syncing == 0 and on(instance) substrate_block_height{status="best"} > 100 and on(instance) rate(substrate_proposer_block_constructed_count[10m]) == 0
for: 30m
labels:
severity: critical
annotations:
summary: "Validator {{ $labels.instance }} not producing blocks"
description: "Validator has not produced any blocks in the last 30 minutes. Check session keys."
Reload Prometheus to apply rules:
# Validate rules
promtool check rules /etc/prometheus/alert.rules.yml
# If valid, reload Prometheus
sudo systemctl reload prometheus
# Verify rules loaded (Prometheus web UI)
# Visit: http://YOUR_MONITORING_SERVER_IP:9090/rules
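Beyond syntax checking, promtool can unit-test rules against synthetic series. A minimal sketch for the ValidatorDown rule (the test file path and sample values are hypothetical):
```bash
# Simulate "up == 0 for ~5 minutes" and assert ValidatorDown fires
cat > /tmp/alert_test.yml <<'EOF'
rule_files:
  - /etc/prometheus/alert.rules.yml
evaluation_interval: 15s
tests:
  - interval: 15s
    input_series:
      - series: 'up{job="pilier-node", instance="validator-lyon-01"}'
        values: "0x20"
    alert_rule_test:
      - eval_time: 5m
        alertname: ValidatorDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: pilier-node
              instance: validator-lyon-01
            exp_annotations:
              summary: "Validator validator-lyon-01 is down"
              description: "Validator node has been down for more than 2 minutes."
EOF
promtool test rules /tmp/alert_test.yml
```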
Install Grafana
Grafana provides visual dashboards for metrics.
Step 1: Install via APT
# Add Grafana GPG key (apt-key is deprecated; use a keyring file instead)
sudo mkdir -p /etc/apt/keyrings
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
# Add repository
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee /etc/apt/sources.list.d/grafana.list
# Update and install
sudo apt update
sudo apt install grafana -y
Step 2: Start Grafana
sudo systemctl daemon-reload
sudo systemctl enable grafana-server
sudo systemctl start grafana-server
# Check status
sudo systemctl status grafana-server
Access web UI: http://YOUR_MONITORING_SERVER_IP:3000
Default login:
- Username: admin
- Password: admin (you will be prompted to change it on first login)
Step 3: Add Prometheus data source
1. Login to Grafana (http://YOUR_IP:3000)
2. Left sidebar → Configuration (⚙️) → Data Sources
3. Click "Add data source"
4. Select "Prometheus"
5. Settings:
- Name: Prometheus
- URL: http://localhost:9090
- Access: Server (default)
6. Click "Save & Test"
7. Should see: "Data source is working" ✅
Step 4: Import Pilier dashboard
Create dashboard file:
# Save this dashboard JSON (create file on your computer)
nano pilier-validator-dashboard.json
Dashboard JSON (copy entire content — note it is the bare dashboard model, which is what Grafana's JSON import expects):
{
"title": "Pilier Validator Dashboard",
"tags": ["pilier", "validator"],
"timezone": "browser",
"panels": [
{
"id": 1,
"title": "Best Block Height",
"type": "graph",
"targets": [
{
"expr": "substrate_block_height{status=\"best\"}",
"legendFormat": "Best Block"
}
],
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 }
},
{
"id": 2,
"title": "Finalized Block Height",
"type": "graph",
"targets": [
{
"expr": "substrate_block_height{status=\"finalized\"}",
"legendFormat": "Finalized Block"
}
],
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 0 }
},
{
"id": 3,
"title": "Peer Count",
"type": "graph",
"targets": [
{
"expr": "substrate_sub_libp2p_peers_count",
"legendFormat": "Peers"
}
],
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 8 }
},
{
"id": 4,
"title": "CPU Usage",
"type": "graph",
"targets": [
{
"expr": "100 - (avg by (instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
"legendFormat": "CPU %"
}
],
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 8 }
},
{
"id": 5,
"title": "Memory Usage",
"type": "graph",
"targets": [
{
"expr": "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100",
"legendFormat": "Memory %"
}
],
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 16 }
},
{
"id": 6,
"title": "Disk Usage",
"type": "graph",
"targets": [
{
"expr": "(1 - (node_filesystem_avail_bytes{mountpoint=\"/var/lib/pilier\"} / node_filesystem_size_bytes{mountpoint=\"/var/lib/pilier\"})) * 100",
"legendFormat": "Disk %"
}
],
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 16 }
}
],
"refresh": "30s",
"time": { "from": "now-6h", "to": "now" }
}
Import dashboard:
1. Grafana → Left sidebar → Dashboards → Import
2. Click "Upload JSON file"
3. Select pilier-validator-dashboard.json
4. Select data source: Prometheus
5. Click "Import"
6. Dashboard loaded! ✅
Alternative: Use community dashboards:
- https://grafana.com/grafana/dashboards/
- Search: "Substrate" or "Polkadot"
- Import via ID (e.g., 13840 - Substrate Node Exporter)
Key Metrics Reference
Blockchain Metrics
| Metric | Description | Normal Range | Alert Threshold |
|---|---|---|---|
substrate_block_height{status="best"} | Best (head) block | Increasing | Not increasing >3 min |
substrate_block_height{status="finalized"} | Finalized block | Increasing | Lag >10 blocks |
substrate_sub_libp2p_peers_count | Connected peers | 3-50 | <2 peers |
substrate_finality_grandpa_round | GRANDPA round | Increasing | Not increasing >5 min |
substrate_proposer_block_constructed_count | Blocks proposed | Varies (validator turn) | 0 for >30 min (if active validator) |
substrate_sync_syncing | Is syncing? | 0 (false) | 1 (true) for >10 min |
System Metrics
| Metric | Description | Normal Range | Alert Threshold |
|---|---|---|---|
| `node_cpu_seconds_total` | CPU time | N/A | >90% usage for >10 min |
| `node_memory_MemAvailable_bytes` | Available memory | >2 GB | <10% free |
| `node_filesystem_avail_bytes` | Available disk space | >50 GB | <20% free |
| `node_network_receive_bytes_total` | Network traffic (received) | Varies | N/A (monitoring only) |
| `node_network_transmit_bytes_total` | Network traffic (sent) | Varies | N/A (monitoring only) |
| `node_load1` | 1-minute load average | <CPU cores | >2x CPU cores |
Query Examples (Prometheus)
Block production rate (blocks per minute):
rate(substrate_block_height{status="best"}[1m]) * 60
Finality lag (blocks behind):
substrate_block_height{status="best"} - substrate_block_height{status="finalized"}
Memory usage (percentage):
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
Disk I/O (MB/s):
rate(node_disk_read_bytes_total[5m]) / 1024 / 1024
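The same queries work from scripts through the HTTP API; a minimal cron-able sketch of the finality-lag check (the 10-block threshold and localhost address are assumptions):
```bash
# Prints "finality lagging" if any series exceeds a 10-block lag
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=(substrate_block_height{status="best"} - ignoring(status) substrate_block_height{status="finalized"}) > 10' \
  | grep -q '"result":\[\]' && echo "finality OK" || echo "finality lagging"
```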
Telegram Alerts (Optional)
For instant notifications on mobile.
Step 1: Create Telegram Bot
1. Open Telegram → Search for "@BotFather"
2. Send: /newbot
3. Follow prompts:
- Bot name: "Pilier Validator Alerts"
- Bot username: "pilier_validator_alerts_bot" (must end in "bot")
4. BotFather replies with:
- Token: 123456789:ABCdefGHIjklMNOpqrsTUVwxyz (SAVE THIS!)
Step 2: Get Your Chat ID
1. Start conversation with your bot (click link from BotFather)
2. Send any message: "Hello"
3. Visit: https://api.telegram.org/bot{TOKEN}/getUpdates
- Replace {TOKEN} with your bot token
4. Find "chat":{"id":123456789} in JSON response
5. Save chat ID: 123456789
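Before involving Alertmanager at all, verify the token and chat ID directly against the Telegram Bot API:
```bash
# Send a direct test message (substitute your real token and chat ID)
TOKEN="123456789:ABCdefGHIjklMNOpqrsTUVwxyz"
CHAT_ID="123456789"
curl -s "https://api.telegram.org/bot${TOKEN}/sendMessage" \
  -d chat_id="${CHAT_ID}" \
  -d text="Monitoring test: hello from the validator"
```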
Step 3: Configure Alertmanager for Telegram
Install prometheus_bot (a separate tool that forwards Alertmanager webhooks to Telegram):
# Clone repository
cd /opt
sudo git clone https://github.com/inCaller/prometheus_bot.git
cd prometheus_bot
# Build (this is a Go project; see the repo README if the build steps change)
sudo apt install golang make -y
sudo make
# Configure
sudo nano config.yaml
config.yaml:
telegram_token: "123456789:ABCdefGHIjklMNOpqrsTUVwxyz"
template_path: "templates/default.tmpl"
time_zone: "UTC"
split_token: "|"
split_msg_byte: 4000
send_only: false
Create systemd service:
sudo nano /etc/systemd/system/prometheus-bot.service
Content:
[Unit]
Description=Prometheus Telegram Bot
After=network.target
[Service]
Type=simple
User=prometheus
WorkingDirectory=/opt/prometheus_bot
# The -c flag points at the config file; check the repo README for current flags
ExecStart=/opt/prometheus_bot/prometheus_bot -c /opt/prometheus_bot/config.yaml
Restart=always
RestartSec=10s
[Install]
WantedBy=multi-user.target
Start service:
sudo systemctl daemon-reload
sudo systemctl enable prometheus-bot
sudo systemctl start prometheus-bot
Update Alertmanager config:
sudo nano /etc/alertmanager/alertmanager.yml
Add a Telegram receiver and switch the default route to it (edit the existing route: block rather than adding a second one — duplicate top-level YAML keys are invalid):
receivers:
  - name: "telegram"
    webhook_configs:
      - url: "http://localhost:9087/alert/YOUR_CHAT_ID"
        send_resolved: true
route:
  receiver: "telegram" # Changed from "email"
Reload Alertmanager:
sudo systemctl reload alertmanager
Step 4: Test Telegram Alerts
# Stop validator node (trigger ValidatorDown alert)
sudo systemctl stop pilier
# Wait 2 minutes (alert threshold)
# You should receive Telegram message:
# "🚨 FIRING: ValidatorDown
# Validator validator-lyon-01 is down
# Validator node has been down for more than 2 minutes."
# Restart validator
sudo systemctl start pilier
# Wait ~1 minute, should receive:
# "✅ RESOLVED: ValidatorDown
# Validator validator-lyon-01 is down"
Advanced Monitoring
Multi-Validator Monitoring
Monitor multiple validators from single Prometheus:
# /etc/prometheus/prometheus.yml
scrape_configs:
- job_name: "validator-01"
static_configs:
- targets:
- "51.210.XXX.1:9615" # Validator 1 IP
labels:
instance: "validator-paris-01"
- job_name: "validator-02"
static_configs:
- targets:
- "51.210.XXX.2:9615" # Validator 2 IP
labels:
instance: "validator-lyon-01"
- job_name: "validator-03"
static_configs:
- targets:
- "51.210.XXX.3:9615" # Validator 3 IP
labels:
instance: "validator-tbilisi-01"
Create multi-validator dashboard in Grafana:
# Panel: Block height (all validators)
substrate_block_height{status="best"}
# Panel: Peer count (all validators)
substrate_sub_libp2p_peers_count
# Use legend: {{ instance }} (shows validator name)
Log Aggregation
Centralize logs from all validators.
Option A: Loki + Grafana (Recommended)
# Install Loki (log aggregation) on the monitoring server
sudo apt install unzip -y
wget https://github.com/grafana/loki/releases/download/v2.9.3/loki-linux-amd64.zip
unzip loki-linux-amd64.zip
sudo cp loki-linux-amd64 /usr/local/bin/loki
# Note: Loki needs its own config file and systemd unit (not shown; see the Loki docs)
# Install Promtail (log shipper, runs on validator)
wget https://github.com/grafana/loki/releases/download/v2.9.3/promtail-linux-amd64.zip
unzip promtail-linux-amd64.zip
sudo cp promtail-linux-amd64 /usr/local/bin/promtail
# Configure Promtail (on validator server)
sudo nano /etc/promtail/config.yml
Promtail config:
server:
http_listen_port: 9080
grpc_listen_port: 0
positions:
filename: /tmp/positions.yaml
clients:
- url: http://MONITORING_SERVER_IP:3100/loki/api/v1/push
scrape_configs:
- job_name: pilier-node
static_configs:
- targets:
- localhost
labels:
job: pilier-node
__path__: /var/log/syslog # Or use journald
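If your node logs only to systemd-journald rather than /var/log/syslog, Promtail can read the journal directly. A sketch of the extra scrape entry (journal support must be compiled into the binary; the official Linux amd64 releases generally include it, but verify against the Promtail docs):
```bash
# Append a journald scrape entry to the Promtail config (indentation matches the list above)
sudo tee -a /etc/promtail/config.yml >/dev/null <<'EOF'
  - job_name: pilier-journal
    journal:
      max_age: 12h
      labels:
        job: pilier-node
    relabel_configs:
      - source_labels: ["__journal__systemd_unit"]
        target_label: "unit"
EOF
```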
Then in Grafana:
- Add Loki data source
- Query logs:
{job="pilier-node"} |= "error"
Option B: ELK Stack (Elasticsearch + Logstash + Kibana)
(More complex, not covered here - use Loki for simplicity)
Uptime Monitoring (External)
Monitor from OUTSIDE your infrastructure (detects network issues).
Option A: UptimeRobot (Free)
1. Sign up: https://uptimerobot.com
2. Add monitor:
- Type: HTTP(s)
- URL: http://YOUR_VALIDATOR_IP:9615/metrics (only if publicly exposed — not recommended for a validator)
- OR: Use custom port check (TCP 30333)
3. Alert contacts: Email, Telegram, webhook
4. Check interval: 5 minutes (free tier)
Option B: Self-Hosted Blackbox Exporter
# Install blackbox_exporter (Prometheus ecosystem)
wget https://github.com/prometheus/blackbox_exporter/releases/download/v0.24.0/blackbox_exporter-0.24.0.linux-amd64.tar.gz
# Configure to ping validator server
# Then scrape from Prometheus
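A sketch of the wiring, assuming blackbox_exporter is unpacked, configured, and running on its default port 9115 (the tcp_connect module is part of its default config; IPs are placeholders):
```bash
# Append a probe job to /etc/prometheus/prometheus.yml, then reload
sudo tee -a /etc/prometheus/prometheus.yml >/dev/null <<'EOF'
  - job_name: "blackbox-p2p"
    metrics_path: /probe
    params:
      module: [tcp_connect]
    static_configs:
      - targets:
          - "VALIDATOR_IP:30333"   # probe the P2P port from outside the node
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: localhost:9115  # address of the blackbox_exporter itself
EOF
sudo systemctl reload prometheus
```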
Troubleshooting Monitoring
Prometheus Not Scraping Targets
Symptom: Targets show as "DOWN" in Prometheus UI (Status → Targets).
Check:
# 1. Is Pilier node metrics endpoint accessible?
curl http://VALIDATOR_IP:9615/metrics
# Should return metrics, not connection refused
# 2. Firewall blocking?
# On validator server:
sudo ufw allow from MONITORING_SERVER_IP to any port 9615
# 3. Is --prometheus-external flag set?
sudo journalctl -u pilier | grep prometheus
# Should see: "Listening on 0.0.0.0:9615" (not 127.0.0.1:9615)
# 4. Check Prometheus config
promtool check config /etc/prometheus/prometheus.yml
# 5. Check Prometheus logs
sudo journalctl -u prometheus -n 50
Grafana Showing "No Data"
Symptom: Dashboards display "No data" or empty graphs.
Check:
# 1. Is Prometheus data source working?
# Grafana → Configuration → Data Sources → Prometheus → "Test"
# Should see: "Data source is working"
# 2. Are metrics actually in Prometheus?
# Prometheus UI: http://MONITORING_IP:9090/graph
# Query: substrate_block_height{status="best"}
# Should see data
# 3. Check time range in Grafana dashboard
# Top-right: Time range selector
# Set to "Last 6 hours" or "Last 24 hours"
# 4. Check PromQL query in panel
# Edit panel → Query tab
# Verify metric name is correct (copy from Prometheus UI)
Alerts Not Firing
Symptom: Expected alert (e.g., ValidatorDown) not triggered.
Check:
# 1. Are alert rules loaded?
# Prometheus UI: http://MONITORING_IP:9090/rules
# Should see: alert.rules.yml rules listed
# 2. Is alert condition met?
# Prometheus UI → Alerts tab
# Should see alert in "Pending" or "Firing" state
# 3. Check Alertmanager receiving alerts
# Alertmanager UI: http://MONITORING_IP:9093/#/alerts
# Should see alerts listed
# 4. Check Alertmanager logs
sudo journalctl -u alertmanager -n 50
# 5. Send a test alert through the whole pipeline with amtool
amtool alert add TestAlert severity=warning --alertmanager.url=http://localhost:9093
# You should receive the notification via your configured receiver (email/Telegram)
High Prometheus Memory Usage
Symptom: Prometheus using >8 GB RAM (excessive for single validator).
Causes:
1. Too many metrics (high cardinality)
2. Long retention time (>30 days)
3. Too frequent scraping (interval <10s)
Fix:
# 1. Reduce retention time
# Edit /etc/systemd/system/prometheus.service:
--storage.tsdb.retention.time=7d # Reduced from the 15d set earlier
# 2. Increase scrape interval
# Edit /etc/prometheus/prometheus.yml:
global:
scrape_interval: 30s # Was 15s
# 3. Restart Prometheus
sudo systemctl restart prometheus
# 4. Monitor memory usage
free -h
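To find out which metrics are driving memory use, promtool can report cardinality for the on-disk TSDB (path matches the storage directory configured earlier; by default it analyzes the most recent block):
```bash
# Report series/label cardinality for the Prometheus TSDB
sudo promtool tsdb analyze /var/lib/prometheus
```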
Best Practices
Do's ✅
- ✅ Monitor from separate server (monitoring survives validator failure)
- ✅ Set up alerts early (don't wait for incident)
- ✅ Test alerts regularly (monthly: stop node, verify alert received)
- ✅ Monitor multiple channels (metrics + logs + external uptime)
- ✅ Review dashboards weekly (spot trends before they become issues)
- ✅ Document alert runbooks (what to do when alert fires)
- ✅ Keep retention short (15 days sufficient, saves disk space)
- ✅ Use Grafana annotations (mark upgrades, incidents on graphs)
Don'ts ❌
- ❌ Don't ignore warnings (warnings become critical if unchecked)
- ❌ Don't set thresholds too tight (avoid alert fatigue)
- ❌ Don't monitor without acting (alerts must have response plan)
- ❌ Don't expose Grafana publicly (without authentication + SSL)
- ❌ Don't collect PII in logs (GDPR compliance)
- ❌ Don't run monitoring on validator (unless no budget for separate server)
Monitoring Checklist
Daily:
- Check Grafana dashboard (5 minutes)
- Verify block height increasing
- Verify finality not lagging
- Check for any alerts (Telegram, email)
Weekly:
- Review alert history (any recurring issues?)
- Check disk space trends (will it fill in next month?)
- Verify peer count stable (not declining over time)
- Test one alert (stop node, verify alert received)
Monthly:
- Review all metrics (spot long-term trends)
- Update Prometheus/Grafana (security patches)
- Clean up old alerts (silence resolved issues)
- Document any incidents (post-mortems)
Quarterly:
- Audit alert rules (are thresholds still appropriate?)
- Review retention settings (adjust if needed)
- Test disaster recovery (restore from backup)
- Train backup operator (someone else should be able to monitor)
Support
Monitoring setup help?
- Forum: forum.pilier.net/operations
- Telegram: t.me/pilier_validators
- Email: validators@pilier.net
Prometheus/Grafana:
- Prometheus docs: https://prometheus.io/docs/
- Grafana docs: https://grafana.com/docs/
- Community dashboards: https://grafana.com/grafana/dashboards/
Document version: 1.0
Last updated: 2026-01-12
Tested on: Ubuntu 22.04, Debian 12