Monitoring & Observability
Comprehensive monitoring is essential for validator operations. This guide covers metrics collection, alerting, dashboards, and troubleshooting.
Goal: Detect and resolve issues before they impact validator performance or rewards.
Monitoring Stack Overview
Architecture
Pilier Node
├─ Prometheus metrics endpoint (:9615)
│ └─ Exposes: Block height, peers, finality, CPU, memory
│
Prometheus Server
├─ Scrapes metrics every 15 seconds
├─ Stores time-series data (15 days retention)
└─ Evaluates alert rules
│
Alertmanager
├─ Receives alerts from Prometheus
├─ Routes to: Email, Telegram, PagerDuty
└─ Deduplicates and groups alerts
│
Grafana
├─ Visualizes metrics (dashboards)
├─ Queries Prometheus
└─ User interface for operators
Components:
| Component | Purpose | Installation | Cost |
|---|---|---|---|
| Prometheus | Metrics storage & alerting | Self-hosted or cloud | Free (self-hosted) |
| Grafana | Dashboards & visualization | Self-hosted or cloud | Free (self-hosted) |
| Node Exporter | System metrics (CPU, disk, etc.) | Self-hosted | Free |
| Alertmanager | Alert routing & notification | Self-hosted | Free |
| Telegram Bot (optional) | Alert notifications | Cloud (Telegram API) | Free |
Quick Start (5 Minutes)
For validators who want basic monitoring NOW:
Step 1: Enable Prometheus in Node
# Edit systemd service
sudo nano /etc/systemd/system/pilier.service
# Add these flags to ExecStart:
--prometheus-port 9615 \
--prometheus-external
# Reload and restart
sudo systemctl daemon-reload
sudo systemctl restart pilier
Step 2: Verify Metrics Endpoint
# Check if metrics accessible
curl http://localhost:9615/metrics
# Should see output like:
# substrate_block_height{status="best"} 12345
# substrate_finality_grandpa_round 678
# substrate_sub_libp2p_peers_count 5
# ...hundreds of metrics...
If you see metrics → monitoring is enabled! ✅
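To spot-check just the handful of metrics this guide alerts on, you can filter the output (the grep pattern is an assumption — metric names vary slightly across node versions, so adjust it if nothing matches):
```bash
# Pull only the key health metrics from the endpoint
curl -s http://localhost:9615/metrics \
  | grep -E '^substrate_(block_height|sub_libp2p_peers_count|finality_grandpa_round)'
```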
Step 3: Use Public Monitoring (Temporary)
While you set up your own stack:
Pilier Telemetry:
└─ https://telemetry.pilier.net
└─ Shows: Your validator name, block height, peers
└─ Limitations: Only basic metrics, no alerts
Prometheus Cloud (free tier):
└─ Grafana Cloud: https://grafana.com/products/cloud/
└─ Sign up → Add Prometheus data source → Point to your node
└─ Limitations: free tier is capped by active metric series (roughly 10k at the time of writing — sufficient for 1 validator)
Step 4: Set Up Alerts (via Grafana Cloud)
1. Grafana Cloud → Alerting → New Alert Rule
2. Condition: substrate_block_height (best) not increasing for 5 minutes
3. Notification: Email / Telegram
4. Save alert rule
5. Test by stopping node: sudo systemctl stop pilier
6. Should receive alert within 5 minutes
Done! You have basic monitoring. Continue reading for production setup.
Production Setup (Self-Hosted)
System Requirements
Monitoring server (separate from validator):
Recommended specs:
├─ CPU: 2 cores
├─ RAM: 4 GB
├─ Disk: 50 GB SSD (for 15-day metrics retention)
├─ Network: Same datacenter as validator (low latency)
└─ OS: Ubuntu 22.04 or Debian 12
Why separate server:
├─ Validator failure doesn't take down monitoring
├─ Monitoring overhead doesn't affect validator performance
└─ Can monitor multiple validators from one monitoring server
For single validator (cost-conscious):
Option: Co-locate on validator server
├─ Add 2 GB RAM (total 18 GB)
├─ Add 20 GB disk (total 270 GB)
└─ Minor CPU overhead (~5%)
Tradeoff: Less reliable (monitoring fails if validator fails)
Install Prometheus
Step 1: Create prometheus user
sudo useradd --no-create-home --shell /bin/false prometheus
Step 2: Download Prometheus
# Check latest version: https://github.com/prometheus/prometheus/releases
PROM_VERSION="2.48.0"
cd /tmp
wget https://github.com/prometheus/prometheus/releases/download/v${PROM_VERSION}/prometheus-${PROM_VERSION}.linux-amd64.tar.gz
# Verify checksum (from GitHub release page)
sha256sum prometheus-${PROM_VERSION}.linux-amd64.tar.gz
# Extract
tar xvf prometheus-${PROM_VERSION}.linux-amd64.tar.gz
cd prometheus-${PROM_VERSION}.linux-amd64/
Step 3: Install binaries
# Copy binaries
sudo cp prometheus /usr/local/bin/
sudo cp promtool /usr/local/bin/
# Set ownership
sudo chown prometheus:prometheus /usr/local/bin/prometheus
sudo chown prometheus:prometheus /usr/local/bin/promtool
# Copy console templates
sudo mkdir -p /etc/prometheus
sudo cp -r consoles/ console_libraries/ /etc/prometheus/
sudo chown -R prometheus:prometheus /etc/prometheus
Step 4: Create data directory
sudo mkdir -p /var/lib/prometheus
sudo chown prometheus:prometheus /var/lib/prometheus
Step 5: Configure Prometheus
sudo nano /etc/prometheus/prometheus.yml
Content:
global:
scrape_interval: 15s # Scrape targets every 15 seconds
evaluation_interval: 15s # Evaluate rules every 15 seconds
external_labels:
monitor: "pilier-validator"
# Alerting configuration
alerting:
alertmanagers:
- static_configs:
- targets:
- localhost:9093 # Alertmanager (we'll install later)
# Alert rules
rule_files:
- "/etc/prometheus/alert.rules.yml"
# Scrape targets
scrape_configs:
# Pilier validator node
- job_name: "pilier-node"
static_configs:
- targets:
- "localhost:9615" # Change to validator IP if separate server
labels:
instance: "validator-lyon-01"
role: "validator"
# System metrics (via node_exporter)
- job_name: "node-exporter"
static_configs:
- targets:
- "localhost:9100"
labels:
instance: "validator-lyon-01"
# Prometheus itself (self-monitoring)
- job_name: "prometheus"
static_configs:
- targets:
- "localhost:9090"
Step 6: Create systemd service
sudo nano /etc/systemd/system/prometheus.service
Content:
[Unit]
Description=Prometheus
Documentation=https://prometheus.io/docs/introduction/overview/
Wants=network-online.target
After=network-online.target
[Service]
Type=simple
User=prometheus
Group=prometheus
ExecReload=/bin/kill -HUP $MAINPID
ExecStart=/usr/local/bin/prometheus \
--config.file=/etc/prometheus/prometheus.yml \
--storage.tsdb.path=/var/lib/prometheus/ \
--storage.tsdb.retention.time=15d \
--web.console.templates=/etc/prometheus/consoles \
--web.console.libraries=/etc/prometheus/console_libraries \
--web.listen-address=0.0.0.0:9090
SyslogIdentifier=prometheus
Restart=always
RestartSec=10s
[Install]
WantedBy=multi-user.target
Step 7: Start Prometheus
# Set permissions
sudo chown prometheus:prometheus /etc/prometheus/prometheus.yml
# Start service
sudo systemctl daemon-reload
sudo systemctl enable prometheus
sudo systemctl start prometheus
# Check status
sudo systemctl status prometheus
# Check logs
sudo journalctl -u prometheus -f
# Verify web UI (from monitoring server)
curl http://localhost:9090
# Should see: HTML page (Prometheus web UI)
Access web UI: http://YOUR_MONITORING_SERVER_IP:9090
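The HTTP API is useful for scripted checks as well; querying `up` returns the scrape status of every configured target (a sketch, assuming you run it on the monitoring server):
```bash
# 1 = target scraped successfully, 0 = scrape failing
curl -s 'http://localhost:9090/api/v1/query?query=up' | python3 -m json.tool
```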
Install Node Exporter
Node Exporter provides system metrics (CPU, memory, disk, network).
Step 1: Download
NODE_EXPORTER_VERSION="1.7.0"
cd /tmp
wget https://github.com/prometheus/node_exporter/releases/download/v${NODE_EXPORTER_VERSION}/node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64.tar.gz
# Verify checksum
sha256sum node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64.tar.gz
# Extract
tar xvf node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64.tar.gz
cd node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64/
Step 2: Install
sudo cp node_exporter /usr/local/bin/
sudo chown prometheus:prometheus /usr/local/bin/node_exporter
Step 3: Create systemd service
sudo nano /etc/systemd/system/node-exporter.service
Content:
[Unit]
Description=Node Exporter
Documentation=https://github.com/prometheus/node_exporter
Wants=network-online.target
After=network-online.target
[Service]
Type=simple
User=prometheus
Group=prometheus
ExecStart=/usr/local/bin/node_exporter \
--web.listen-address=0.0.0.0:9100
SyslogIdentifier=node_exporter
Restart=always
RestartSec=10s
[Install]
WantedBy=multi-user.target
Step 4: Start Node Exporter
sudo systemctl daemon-reload
sudo systemctl enable node-exporter
sudo systemctl start node-exporter
# Verify
curl http://localhost:9100/metrics | head -20
# Should see: node_cpu_seconds_total, node_memory_MemAvailable_bytes, etc.
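You can also confirm from Prometheus's side that the new target is healthy (plain grep here so there is no jq dependency):
```bash
# Each configured scrape job should report "health":"up"
curl -s http://localhost:9090/api/v1/targets | grep -o '"health":"[a-z]*"'
```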
Install Alertmanager
Alertmanager handles alert notifications (email, Telegram, PagerDuty).
Step 1: Download
ALERTMANAGER_VERSION="0.26.0"
cd /tmp
wget https://github.com/prometheus/alertmanager/releases/download/v${ALERTMANAGER_VERSION}/alertmanager-${ALERTMANAGER_VERSION}.linux-amd64.tar.gz
# Verify checksum
sha256sum alertmanager-${ALERTMANAGER_VERSION}.linux-amd64.tar.gz
# Extract
tar xvf alertmanager-${ALERTMANAGER_VERSION}.linux-amd64.tar.gz
cd alertmanager-${ALERTMANAGER_VERSION}.linux-amd64/
Step 2: Install
sudo cp alertmanager amtool /usr/local/bin/
sudo chown prometheus:prometheus /usr/local/bin/alertmanager
sudo chown prometheus:prometheus /usr/local/bin/amtool
sudo mkdir -p /etc/alertmanager /var/lib/alertmanager
sudo chown prometheus:prometheus /etc/alertmanager /var/lib/alertmanager
Step 3: Configure Alertmanager
sudo nano /etc/alertmanager/alertmanager.yml
Content (Email notifications):
global:
resolve_timeout: 5m
smtp_smarthost: "smtp.gmail.com:587"
smtp_from: "your-email@gmail.com"
smtp_auth_username: "your-email@gmail.com"
smtp_auth_password: "your-app-password" # Gmail: Use App Password, not regular password
smtp_require_tls: true
route:
group_by: ["alertname", "instance"]
group_wait: 30s
group_interval: 5m
repeat_interval: 12h
receiver: "email"
receivers:
- name: "email"
email_configs:
- to: "your-email@gmail.com"
headers:
Subject: "🚨 Pilier Validator Alert: {{ .GroupLabels.alertname }}"
inhibit_rules:
- source_match:
severity: "critical"
target_match:
severity: "warning"
equal: ["alertname", "instance"]
Step 4: Create systemd service
sudo nano /etc/systemd/system/alertmanager.service
Content:
[Unit]
Description=Alertmanager
Documentation=https://prometheus.io/docs/alerting/alertmanager/
Wants=network-online.target
After=network-online.target
[Service]
Type=simple
User=prometheus
Group=prometheus
ExecReload=/bin/kill -HUP $MAINPID
ExecStart=/usr/local/bin/alertmanager \
--config.file=/etc/alertmanager/alertmanager.yml \
--storage.path=/var/lib/alertmanager/ \
--web.listen-address=0.0.0.0:9093
SyslogIdentifier=alertmanager
Restart=always
RestartSec=10s
[Install]
WantedBy=multi-user.target
Step 5: Start Alertmanager
sudo chown prometheus:prometheus /etc/alertmanager/alertmanager.yml
sudo systemctl daemon-reload
sudo systemctl enable alertmanager
sudo systemctl start alertmanager
# Verify
curl http://localhost:9093
# Should see: Alertmanager web UI (HTML)
Access web UI: http://YOUR_MONITORING_SERVER_IP:9093
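As with Prometheus and promtool, the configuration can be linted before (and after) any change using amtool, which was installed alongside Alertmanager above:
```bash
# Validate the Alertmanager configuration
amtool check-config /etc/alertmanager/alertmanager.yml
# Should report the parsed route and receivers with no errors
```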
Configure Alert Rules
Alert rules define when Prometheus should fire alerts.
sudo nano /etc/prometheus/alert.rules.yml
Content:
groups:
- name: validator_alerts
interval: 30s
rules:
# Node is down
- alert: ValidatorDown
expr: up{job="pilier-node"} == 0
for: 2m
labels:
severity: critical
annotations:
summary: "Validator {{ $labels.instance }} is down"
description: "Validator node has been down for more than 2 minutes."
# Block height not increasing (node stuck)
- alert: BlockHeightNotIncreasing
expr: rate(substrate_block_height{status="best"}[5m]) == 0
for: 3m
labels:
severity: critical
annotations:
summary: "Block height not increasing on {{ $labels.instance }}"
description: "Best block height has not increased in 3 minutes. Node may be stuck or not syncing."
# Low peer count
- alert: LowPeerCount
expr: substrate_sub_libp2p_peers_count < 2
for: 5m
labels:
severity: warning
annotations:
summary: "Low peer count on {{ $labels.instance }}"
description: "Validator has fewer than 2 peers for more than 5 minutes. Current: {{ $value }}."
# Finality lag (finalized block behind best block)
- alert: FinalityLag
expr: (substrate_block_height{status="best"} - ignoring(status) substrate_block_height{status="finalized"}) > 10
for: 5m
labels:
severity: warning
annotations:
summary: "Finality lag on {{ $labels.instance }}"
description: "Finalized block is {{ $value }} blocks behind best block."
# High CPU usage
- alert: HighCPUUsage
expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
for: 10m
labels:
severity: warning
annotations:
summary: "High CPU usage on {{ $labels.instance }}"
description: "CPU usage is above 90% for more than 10 minutes. Current: {{ $value | humanize }}%."
# High memory usage
- alert: HighMemoryUsage
expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
for: 10m
labels:
severity: warning
annotations:
summary: "High memory usage on {{ $labels.instance }}"
description: "Memory usage is above 90% for more than 10 minutes. Current: {{ $value | humanize }}%."
# Low disk space
- alert: LowDiskSpace
expr: (1 - (node_filesystem_avail_bytes{mountpoint="/var/lib/pilier"} / node_filesystem_size_bytes{mountpoint="/var/lib/pilier"})) * 100 > 80
for: 5m
labels:
severity: warning
annotations:
summary: "Low disk space on {{ $labels.instance }}"
description: "Disk usage is above 80%. Current: {{ $value | humanize }}%."
# Node not validating (for active validators)
- alert: NotValidating
expr: substrate_sub_libp2p_is_major_syncing == 0 and on(instance) substrate_block_height{status="best"} > 100 and on(instance) rate(substrate_proposer_block_constructed_count[10m]) == 0
for: 30m
labels:
severity: critical
annotations:
summary: "Validator {{ $labels.instance }} not producing blocks"
description: "Validator has not produced any blocks in the last 30 minutes. Check session keys."
Reload Prometheus to apply rules:
# Validate rules
promtool check rules /etc/prometheus/alert.rules.yml
# If valid, reload Prometheus
sudo systemctl reload prometheus
# Verify rules loaded (Prometheus web UI)
# Visit: http://YOUR_MONITORING_SERVER_IP:9090/rules
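Beyond syntax checking, promtool can unit-test rules against synthetic series. A minimal sketch for the ValidatorDown rule (the test file path and sample values are hypothetical):
```bash
# Simulate "up == 0 for ~5 minutes" and assert ValidatorDown fires
cat > /tmp/alert_test.yml <<'EOF'
rule_files:
  - /etc/prometheus/alert.rules.yml
evaluation_interval: 15s
tests:
  - interval: 15s
    input_series:
      - series: 'up{job="pilier-node", instance="validator-lyon-01"}'
        values: "0x20"
    alert_rule_test:
      - eval_time: 5m
        alertname: ValidatorDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: pilier-node
              instance: validator-lyon-01
            exp_annotations:
              summary: "Validator validator-lyon-01 is down"
              description: "Validator node has been down for more than 2 minutes."
EOF
promtool test rules /tmp/alert_test.yml
```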
Install Grafana
Grafana provides visual dashboards for metrics.
Step 1: Install via APT
# Add Grafana GPG key (apt-key is deprecated; use a keyring file instead)
sudo mkdir -p /etc/apt/keyrings
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
# Add repository
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee /etc/apt/sources.list.d/grafana.list
# Update and install
sudo apt update
sudo apt install grafana -y
Step 2: Start Grafana
sudo systemctl daemon-reload
sudo systemctl enable grafana-server
sudo systemctl start grafana-server
# Check status
sudo systemctl status grafana-server
Access web UI: http://YOUR_MONITORING_SERVER_IP:3000
Default login:
- Username: admin
- Password: admin (you will be prompted to change it on first login)
Step 3: Add Prometheus data source
1. Login to Grafana (http://YOUR_IP:3000)
2. Left sidebar → Configuration (⚙️) → Data Sources
3. Click "Add data source"
4. Select "Prometheus"
5. Settings:
- Name: Prometheus
- URL: http://localhost:9090
- Access: Server (default)
6. Click "Save & Test"
7. Should see: "Data source is working" ✅
Step 4: Import Pilier dashboard
Create dashboard file:
# Save this dashboard JSON (create file on your computer)
nano pilier-validator-dashboard.json
Dashboard JSON (copy entire content — note it is the bare dashboard model, which is what Grafana's JSON import expects):
{
"title": "Pilier Validator Dashboard",
"tags": ["pilier", "validator"],
"timezone": "browser",
"panels": [
{
"id": 1,
"title": "Best Block Height",
"type": "graph",
"targets": [
{
"expr": "substrate_block_height{status=\"best\"}",
"legendFormat": "Best Block"
}
],
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 }
},
{
"id": 2,
"title": "Finalized Block Height",
"type": "graph",
"targets": [
{
"expr": "substrate_block_height{status=\"finalized\"}",
"legendFormat": "Finalized Block"
}
],
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 0 }
},
{
"id": 3,
"title": "Peer Count",
"type": "graph",
"targets": [
{
"expr": "substrate_sub_libp2p_peers_count",
"legendFormat": "Peers"
}
],
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 8 }
},
{
"id": 4,
"title": "CPU Usage",
"type": "graph",
"targets": [
{
"expr": "100 - (avg by (instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)",
"legendFormat": "CPU %"
}
],
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 8 }
},
{
"id": 5,
"title": "Memory Usage",
"type": "graph",
"targets": [
{
"expr": "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100",
"legendFormat": "Memory %"
}
],
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 16 }
},
{
"id": 6,
"title": "Disk Usage",
"type": "graph",
"targets": [
{
"expr": "(1 - (node_filesystem_avail_bytes{mountpoint=\"/var/lib/pilier\"} / node_filesystem_size_bytes{mountpoint=\"/var/lib/pilier\"})) * 100",
"legendFormat": "Disk %"
}
],
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 16 }
}
],
"refresh": "30s",
"time": { "from": "now-6h", "to": "now" }
}
Import dashboard:
1. Grafana → Left sidebar → Dashboards → Import
2. Click "Upload JSON file"
3. Select pilier-validator-dashboard.json
4. Select data source: Prometheus
5. Click "Import"
6. Dashboard loaded! ✅
Alternative: Use community dashboards:
- https://grafana.com/grafana/dashboards/
- Search: "Substrate" or "Polkadot"
- Import via ID (e.g., 13840 - Substrate Node Exporter)
Key Metrics Reference
Blockchain Metrics
| Metric | Description | Normal Range | Alert Threshold |
|---|---|---|---|
substrate_block_height{status="best"} | Best (head) block | Increasing | Not increasing >3 min |
substrate_block_height{status="finalized"} | Finalized block | Increasing | Lag >10 blocks |
substrate_sub_libp2p_peers_count | Connected peers | 3-50 | <2 peers |
substrate_finality_grandpa_round | GRANDPA round | Increasing | Not increasing >5 min |
substrate_proposer_block_constructed_count | Blocks proposed | Varies (validator turn) | 0 for >30 min (if active validator) |
substrate_sync_syncing | Is syncing? | 0 (false) | 1 (true) for >10 min |
System Metrics
| Metric | Description | Normal Range | Alert Threshold |
|---|---|---|---|
| `node_cpu_seconds_total` | CPU time | N/A | >90% usage for >10 min |
| `node_memory_MemAvailable_bytes` | Available memory | >2 GB | <10% free |
| `node_filesystem_avail_bytes` | Available disk space | >50 GB | <20% free |
| `node_network_receive_bytes_total` | Network traffic (received) | Varies | N/A (monitoring only) |
| `node_network_transmit_bytes_total` | Network traffic (sent) | Varies | N/A (monitoring only) |
| `node_load1` | 1-minute load average | <CPU cores | >2x CPU cores |
Query Examples (Prometheus)
Block production rate (blocks per minute):
rate(substrate_block_height{status="best"}[1m]) * 60
Finality lag (blocks behind):
substrate_block_height{status="best"} - substrate_block_height{status="finalized"}
Memory usage (percentage):
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
Disk I/O (MB/s):
rate(node_disk_read_bytes_total[5m]) / 1024 / 1024
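The same queries work from scripts through the HTTP API; a minimal cron-able sketch of the finality-lag check (the 10-block threshold and localhost address are assumptions):
```bash
# Prints "finality lagging" if any series exceeds a 10-block lag
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=(substrate_block_height{status="best"} - ignoring(status) substrate_block_height{status="finalized"}) > 10' \
  | grep -q '"result":\[\]' && echo "finality OK" || echo "finality lagging"
```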
Telegram Alerts (Optional)
For instant notifications on mobile.
Step 1: Create Telegram Bot
1. Open Telegram → Search for "@BotFather"
2. Send: /newbot
3. Follow prompts:
- Bot name: "Pilier Validator Alerts"
- Bot username: "pilier_validator_alerts_bot" (must end in "bot")
4. BotFather replies with:
- Token: 123456789:ABCdefGHIjklMNOpqrsTUVwxyz (SAVE THIS!)
Step 2: Get Your Chat ID
1. Start conversation with your bot (click link from BotFather)
2. Send any message: "Hello"
3. Visit: https://api.telegram.org/bot{TOKEN}/getUpdates
- Replace {TOKEN} with your bot token
4. Find "chat":{"id":123456789} in JSON response
5. Save chat ID: 123456789
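Before involving Alertmanager at all, verify the token and chat ID directly against the Telegram Bot API:
```bash
# Send a direct test message (substitute your real token and chat ID)
TOKEN="123456789:ABCdefGHIjklMNOpqrsTUVwxyz"
CHAT_ID="123456789"
curl -s "https://api.telegram.org/bot${TOKEN}/sendMessage" \
  -d chat_id="${CHAT_ID}" \
  -d text="Monitoring test: hello from the validator"
```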
Step 3: Configure Alertmanager for Telegram
Install prometheus_bot (a separate tool that forwards Alertmanager webhooks to Telegram):
# Clone repository
cd /opt
sudo git clone https://github.com/inCaller/prometheus_bot.git
cd prometheus_bot
# Build (this is a Go project; see the repo README if the build steps change)
sudo apt install golang make -y
sudo make
# Configure
sudo nano config.yaml
config.yaml:
telegram_token: "123456789:ABCdefGHIjklMNOpqrsTUVwxyz"
template_path: "templates/default.tmpl"
time_zone: "UTC"
split_token: "|"
split_msg_byte: 4000
send_only: false
Create systemd service:
sudo nano /etc/systemd/system/prometheus-bot.service
Content:
[Unit]
Description=Prometheus Telegram Bot
After=network.target
[Service]
Type=simple
User=prometheus
WorkingDirectory=/opt/prometheus_bot
# The -c flag points at the config file; check the repo README for current flags
ExecStart=/opt/prometheus_bot/prometheus_bot -c /opt/prometheus_bot/config.yaml
Restart=always
RestartSec=10s
[Install]
WantedBy=multi-user.target
Start service:
sudo systemctl daemon-reload
sudo systemctl enable prometheus-bot
sudo systemctl start prometheus-bot
Update Alertmanager config:
sudo nano /etc/alertmanager/alertmanager.yml
Add a Telegram receiver and switch the default route to it (edit the existing route: block rather than adding a second one — duplicate top-level YAML keys are invalid):
receivers:
  - name: "telegram"
    webhook_configs:
      - url: "http://localhost:9087/alert/YOUR_CHAT_ID"
        send_resolved: true
route:
  receiver: "telegram" # Changed from "email"
Reload Alertmanager:
sudo systemctl reload alertmanager
Step 4: Test Telegram Alerts
# Stop validator node (trigger ValidatorDown alert)
sudo systemctl stop pilier
# Wait 2 minutes (alert threshold)
# You should receive Telegram message:
# "🚨 FIRING: ValidatorDown
# Validator validator-lyon-01 is down
# Validator node has been down for more than 2 minutes."
# Restart validator
sudo systemctl start pilier
# Wait ~1 minute, should receive:
# "✅ RESOLVED: ValidatorDown
# Validator validator-lyon-01 is down"
Advanced Monitoring
Multi-Validator Monitoring
Monitor multiple validators from single Prometheus:
# /etc/prometheus/prometheus.yml
scrape_configs:
- job_name: "validator-01"
static_configs:
- targets:
- "51.210.XXX.1:9615" # Validator 1 IP
labels:
instance: "validator-paris-01"
- job_name: "validator-02"
static_configs:
- targets:
- "51.210.XXX.2:9615" # Validator 2 IP
labels:
instance: "validator-lyon-01"
- job_name: "validator-03"
static_configs:
- targets:
- "51.210.XXX.3:9615" # Validator 3 IP
labels:
instance: "validator-tbilisi-01"
Create multi-validator dashboard in Grafana:
# Panel: Block height (all validators)
substrate_block_height{status="best"}
# Panel: Peer count (all validators)
substrate_sub_libp2p_peers_count
# Use legend: {{ instance }} (shows validator name)
Log Aggregation
Centralize logs from all validators.
Option A: Loki + Grafana (Recommended)
# Install Loki (log aggregation) on the monitoring server
sudo apt install unzip -y
wget https://github.com/grafana/loki/releases/download/v2.9.3/loki-linux-amd64.zip
unzip loki-linux-amd64.zip
sudo cp loki-linux-amd64 /usr/local/bin/loki
# Note: Loki needs its own config file and systemd unit (not shown; see the Loki docs)
# Install Promtail (log shipper, runs on validator)
wget https://github.com/grafana/loki/releases/download/v2.9.3/promtail-linux-amd64.zip
unzip promtail-linux-amd64.zip
sudo cp promtail-linux-amd64 /usr/local/bin/promtail
# Configure Promtail (on validator server)
sudo nano /etc/promtail/config.yml
Promtail config:
server:
http_listen_port: 9080
grpc_listen_port: 0
positions:
filename: /tmp/positions.yaml
clients:
- url: http://MONITORING_SERVER_IP:3100/loki/api/v1/push
scrape_configs:
- job_name: pilier-node
static_configs:
- targets:
- localhost
labels:
job: pilier-node
__path__: /var/log/syslog # Or use journald
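If your node logs only to systemd-journald rather than /var/log/syslog, Promtail can read the journal directly. A sketch of the extra scrape entry (journal support must be compiled into the binary; the official Linux amd64 releases generally include it, but verify against the Promtail docs):
```bash
# Append a journald scrape entry to the Promtail config (indentation matches the list above)
sudo tee -a /etc/promtail/config.yml >/dev/null <<'EOF'
  - job_name: pilier-journal
    journal:
      max_age: 12h
      labels:
        job: pilier-node
    relabel_configs:
      - source_labels: ["__journal__systemd_unit"]
        target_label: "unit"
EOF
```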
Then in Grafana:
- Add Loki data source
- Query logs:
{job="pilier-node"} |= "error"
Option B: ELK Stack (Elasticsearch + Logstash + Kibana)
(More complex, not covered here - use Loki for simplicity)
Uptime Monitoring (External)
Monitor from OUTSIDE your infrastructure (detects network issues).
Option A: UptimeRobot (Free)
1. Sign up: https://uptimerobot.com
2. Add monitor:
- Type: HTTP(s)
- URL: http://YOUR_VALIDATOR_IP:9615/metrics (only if publicly exposed — not recommended for a validator)
- OR: Use custom port check (TCP 30333)
3. Alert contacts: Email, Telegram, webhook
4. Check interval: 5 minutes (free tier)
Option B: Self-Hosted Blackbox Exporter
# Install blackbox_exporter (Prometheus ecosystem)
wget https://github.com/prometheus/blackbox_exporter/releases/download/v0.24.0/blackbox_exporter-0.24.0.linux-amd64.tar.gz
# Configure to ping validator server
# Then scrape from Prometheus
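A sketch of the wiring, assuming blackbox_exporter is unpacked, configured, and running on its default port 9115 (the tcp_connect module is part of its default config; IPs are placeholders):
```bash
# Append a probe job to /etc/prometheus/prometheus.yml, then reload
sudo tee -a /etc/prometheus/prometheus.yml >/dev/null <<'EOF'
  - job_name: "blackbox-p2p"
    metrics_path: /probe
    params:
      module: [tcp_connect]
    static_configs:
      - targets:
          - "VALIDATOR_IP:30333"   # probe the P2P port from outside the node
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: localhost:9115  # address of the blackbox_exporter itself
EOF
sudo systemctl reload prometheus
```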
Troubleshooting Monitoring
Prometheus Not Scraping Targets
Symptom: Targets show as "DOWN" in Prometheus UI (Status → Targets).
Check:
# 1. Is Pilier node metrics endpoint accessible?
curl http://VALIDATOR_IP:9615/metrics
# Should return metrics, not connection refused
# 2. Firewall blocking?
# On validator server:
sudo ufw allow from MONITORING_SERVER_IP to any port 9615
# 3. Is --prometheus-external flag set?
sudo journalctl -u pilier | grep prometheus
# Should see: "Listening on 0.0.0.0:9615" (not 127.0.0.1:9615)
# 4. Check Prometheus config
promtool check config /etc/prometheus/prometheus.yml
# 5. Check Prometheus logs
sudo journalctl -u prometheus -n 50
Grafana Showing "No Data"
Symptom: Dashboards display "No data" or empty graphs.
Check:
# 1. Is Prometheus data source working?
# Grafana → Configuration → Data Sources → Prometheus → "Test"
# Should see: "Data source is working"
# 2. Are metrics actually in Prometheus?
# Prometheus UI: http://MONITORING_IP:9090/graph
# Query: substrate_block_height{status="best"}
# Should see data
# 3. Check time range in Grafana dashboard
# Top-right: Time range selector
# Set to "Last 6 hours" or "Last 24 hours"
# 4. Check PromQL query in panel
# Edit panel → Query tab
# Verify metric name is correct (copy from Prometheus UI)
Alerts Not Firing
Symptom: Expected alert (e.g., ValidatorDown) not triggered.
Check:
# 1. Are alert rules loaded?
# Prometheus UI: http://MONITORING_IP:9090/rules
# Should see: alert.rules.yml rules listed
# 2. Is alert condition met?
# Prometheus UI → Alerts tab
# Should see alert in "Pending" or "Firing" state
# 3. Check Alertmanager receiving alerts
# Alertmanager UI: http://MONITORING_IP:9093/#/alerts
# Should see alerts listed
# 4. Check Alertmanager logs
sudo journalctl -u alertmanager -n 50
# 5. Send a test alert through the whole pipeline with amtool
amtool alert add TestAlert severity=warning --alertmanager.url=http://localhost:9093
# You should receive the notification via your configured receiver (email/Telegram)
High Prometheus Memory Usage
Symptom: Prometheus using >8 GB RAM (excessive for single validator).
Causes:
1. Too many metrics (high cardinality)
2. Long retention time (>30 days)
3. Too frequent scraping (interval <10s)
Fix:
# 1. Reduce retention time
# Edit /etc/systemd/system/prometheus.service:
--storage.tsdb.retention.time=7d # Reduced from the 15d set earlier
# 2. Increase scrape interval
# Edit /etc/prometheus/prometheus.yml:
global:
scrape_interval: 30s # Was 15s
# 3. Restart Prometheus
sudo systemctl restart prometheus
# 4. Monitor memory usage
free -h
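To find out which metrics are driving memory use, promtool can report cardinality for the on-disk TSDB (path matches the storage directory configured earlier; by default it analyzes the most recent block):
```bash
# Report series/label cardinality for the Prometheus TSDB
sudo promtool tsdb analyze /var/lib/prometheus
```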
Best Practices
Do's ✅
- ✅ Monitor from separate server (monitoring survives validator failure)
- ✅ Set up alerts early (don't wait for incident)
- ✅ Test alerts regularly (monthly: stop node, verify alert received)
- ✅ Monitor multiple channels (metrics + logs + external uptime)
- ✅ Review dashboards weekly (spot trends before they become issues)
- ✅ Document alert runbooks (what to do when alert fires)
- ✅ Keep retention short (15 days sufficient, saves disk space)
- ✅ Use Grafana annotations (mark upgrades, incidents on graphs)
Don'ts ❌
- ❌ Don't ignore warnings (warnings become critical if unchecked)
- ❌ Don't set thresholds too tight (avoid alert fatigue)
- ❌ Don't monitor without acting (alerts must have response plan)
- ❌ Don't expose Grafana publicly (without authentication + SSL)
- ❌ Don't collect PII in logs (GDPR compliance)
- ❌ Don't run monitoring on validator (unless no budget for separate server)
Monitoring Checklist
Daily:
- Check Grafana dashboard (5 minutes)
- Verify block height increasing
- Verify finality not lagging
- Check for any alerts (Telegram, email)
Weekly:
- Review alert history (any recurring issues?)
- Check disk space trends (will it fill in next month?)
- Verify peer count stable (not declining over time)
- Test one alert (stop node, verify alert received)
Monthly:
- Review all metrics (spot long-term trends)
- Update Prometheus/Grafana (security patches)
- Clean up old alerts (silence resolved issues)
- Document any incidents (post-mortems)
Quarterly:
- Audit alert rules (are thresholds still appropriate?)
- Review retention settings (adjust if needed)
- Test disaster recovery (restore from backup)
- Train backup operator (someone else should be able to monitor)
Support
Monitoring setup help?
- Forum: forum.pilier.net/operations
- Telegram: t.me/pilier_validators
- Email: validators@pilier.net
Prometheus/Grafana:
- Prometheus docs: https://prometheus.io/docs/
- Grafana docs: https://grafana.com/docs/
- Community dashboards: https://grafana.com/grafana/dashboards/
Document version: 1.0
Last updated: 2026-01-12
Tested on: Ubuntu 22.04, Debian 12