
Monitoring & Observability

Comprehensive monitoring is essential for validator operations. This guide covers metrics collection, alerting, dashboards, and troubleshooting.

Goal: Detect and resolve issues before they impact validator performance or rewards.


Monitoring Stack Overview

Architecture

Pilier Node
├─ Prometheus metrics endpoint (:9615)
│ └─ Exposes: Block height, peers, finality, CPU, memory

Prometheus Server
├─ Scrapes metrics every 15 seconds
├─ Stores time-series data (15 days retention)
└─ Evaluates alert rules

Alertmanager
├─ Receives alerts from Prometheus
├─ Routes to: Email, Telegram, PagerDuty
└─ Deduplicates and groups alerts

Grafana
├─ Visualizes metrics (dashboards)
├─ Queries Prometheus
└─ User interface for operators

Components:

| Component | Purpose | Installation | Cost |
|---|---|---|---|
| Prometheus | Metrics storage & alerting | Self-hosted or cloud | Free (self-hosted) |
| Grafana | Dashboards & visualization | Self-hosted or cloud | Free (self-hosted) |
| Node Exporter | System metrics (CPU, disk, etc.) | Self-hosted | Free |
| Alertmanager | Alert routing & notification | Self-hosted | Free |
| Telegram Bot (optional) | Alert notifications | Cloud (Telegram API) | Free |

Quick Start (5 Minutes)

For validators who want basic monitoring NOW:

Step 1: Enable Prometheus in Node

# Edit systemd service
sudo nano /etc/systemd/system/pilier.service

# Add these flags to ExecStart:
--prometheus-port 9615 \
--prometheus-external

# Reload and restart
sudo systemctl daemon-reload
sudo systemctl restart pilier
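
For reference, a complete ExecStart might look like the sketch below. Only the two Prometheus flags are the point here; the binary path, base path, and validator name are placeholders, so keep whatever your unit already has:

# Hypothetical example; adapt paths and flags to your existing service file
ExecStart=/usr/local/bin/pilier \
  --validator \
  --name "validator-lyon-01" \
  --base-path /var/lib/pilier \
  --prometheus-port 9615 \
  --prometheus-external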

Step 2: Verify Metrics Endpoint

# Check if metrics accessible
curl http://localhost:9615/metrics

# Should see output like:
# substrate_block_height{status="best"} 12345
# substrate_finality_grandpa_round 678
# substrate_sub_libp2p_peers_count 5
# ...hundreds of metrics...

If you see metrics → monitoring is enabled!


Step 3: Use Public Monitoring (Temporary)

While you set up your own stack:

Pilier Telemetry:
└─ https://telemetry.pilier.net
└─ Shows: Your validator name, block height, peers
└─ Limitations: Only basic metrics, no alerts

Prometheus Cloud (free tier):
└─ Grafana Cloud: https://grafana.com/products/cloud/
└─ Sign up → Add Prometheus data source → Point to your node
└─ Limitations: free tier caps active series (~10k), sufficient for 1 validator

Step 4: Set Up Alerts (via Grafana Cloud)

1. Grafana Cloud → Alerting → New Alert Rule
2. Condition: substrate_block_height (best) not increasing for 5 minutes
3. Notification: Email / Telegram
4. Save alert rule
5. Test by stopping node: sudo systemctl stop pilier
6. Should receive alert within 5 minutes
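
In PromQL terms, the condition in step 2 can be expressed as follows (a sketch using the metric shown in Step 2):

# Fires when the best block has not advanced in the last 5 minutes
increase(substrate_block_height{status="best"}[5m]) == 0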

Done! You have basic monitoring. Continue reading for production setup.


Production Setup (Self-Hosted)

System Requirements

Monitoring server (separate from validator):

Recommended specs:
├─ CPU: 2 cores
├─ RAM: 4 GB
├─ Disk: 50 GB SSD (for 15-day metrics retention)
├─ Network: Same datacenter as validator (low latency)
└─ OS: Ubuntu 22.04 or Debian 12

Why separate server:
├─ Validator failure doesn't take down monitoring
├─ Monitoring overhead doesn't affect validator performance
└─ Can monitor multiple validators from one monitoring server

For single validator (cost-conscious):

Option: Co-locate on validator server
├─ Add 2 GB RAM (total 18 GB)
├─ Add 20 GB disk (total 270 GB)
└─ Minor CPU overhead (~5%)

Tradeoff: Less reliable (monitoring fails if validator fails)

Install Prometheus

Step 1: Create prometheus user

sudo useradd --no-create-home --shell /bin/false prometheus

Step 2: Download Prometheus

# Check latest version: https://github.com/prometheus/prometheus/releases
PROM_VERSION="2.48.0"

cd /tmp
wget https://github.com/prometheus/prometheus/releases/download/v${PROM_VERSION}/prometheus-${PROM_VERSION}.linux-amd64.tar.gz

# Verify checksum (from GitHub release page)
sha256sum prometheus-${PROM_VERSION}.linux-amd64.tar.gz

# Extract
tar xvf prometheus-${PROM_VERSION}.linux-amd64.tar.gz
cd prometheus-${PROM_VERSION}.linux-amd64/

Step 3: Install binaries

# Copy binaries
sudo cp prometheus /usr/local/bin/
sudo cp promtool /usr/local/bin/

# Set ownership
sudo chown prometheus:prometheus /usr/local/bin/prometheus
sudo chown prometheus:prometheus /usr/local/bin/promtool

# Copy console templates
sudo mkdir -p /etc/prometheus
sudo cp -r consoles/ console_libraries/ /etc/prometheus/
sudo chown -R prometheus:prometheus /etc/prometheus

Step 4: Create data directory

sudo mkdir -p /var/lib/prometheus
sudo chown prometheus:prometheus /var/lib/prometheus

Step 5: Configure Prometheus

sudo nano /etc/prometheus/prometheus.yml

Content:

global:
  scrape_interval: 15s     # Scrape targets every 15 seconds
  evaluation_interval: 15s # Evaluate rules every 15 seconds
  external_labels:
    monitor: "pilier-validator"

# Alerting configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - localhost:9093 # Alertmanager (we'll install later)

# Alert rules
rule_files:
  - "/etc/prometheus/alert.rules.yml"

# Scrape targets
scrape_configs:
  # Pilier validator node
  - job_name: "pilier-node"
    static_configs:
      - targets:
          - "localhost:9615" # Change to validator IP if separate server
        labels:
          instance: "validator-lyon-01"
          role: "validator"

  # System metrics (via node_exporter)
  - job_name: "node-exporter"
    static_configs:
      - targets:
          - "localhost:9100"
        labels:
          instance: "validator-lyon-01"

  # Prometheus itself (self-monitoring)
  - job_name: "prometheus"
    static_configs:
      - targets:
          - "localhost:9090"

Step 6: Create systemd service

sudo nano /etc/systemd/system/prometheus.service

Content:

[Unit]
Description=Prometheus
Documentation=https://prometheus.io/docs/introduction/overview/
Wants=network-online.target
After=network-online.target

[Service]
Type=simple
User=prometheus
Group=prometheus
ExecReload=/bin/kill -HUP $MAINPID
ExecStart=/usr/local/bin/prometheus \
--config.file=/etc/prometheus/prometheus.yml \
--storage.tsdb.path=/var/lib/prometheus/ \
--storage.tsdb.retention.time=15d \
--web.console.templates=/etc/prometheus/consoles \
--web.console.libraries=/etc/prometheus/console_libraries \
--web.listen-address=0.0.0.0:9090

SyslogIdentifier=prometheus
Restart=always
RestartSec=10s

[Install]
WantedBy=multi-user.target

Step 7: Start Prometheus

# Set permissions
sudo chown prometheus:prometheus /etc/prometheus/prometheus.yml

# Start service
sudo systemctl daemon-reload
sudo systemctl enable prometheus
sudo systemctl start prometheus

# Check status
sudo systemctl status prometheus

# Check logs
sudo journalctl -u prometheus -f

# Verify web UI (from monitoring server)
curl http://localhost:9090
# Should see: HTML page (Prometheus web UI)

Access web UI: http://YOUR_MONITORING_SERVER_IP:9090
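
You can also confirm scrape targets over the standard Prometheus HTTP API:

# List configured scrape targets and their health
curl -s http://localhost:9090/api/v1/targets | grep -o '"health":"[^"]*"'
# Expect: "health":"up" for each target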


Install Node Exporter

Node Exporter provides system metrics (CPU, memory, disk, network).

Step 1: Download

NODE_EXPORTER_VERSION="1.7.0"

cd /tmp
wget https://github.com/prometheus/node_exporter/releases/download/v${NODE_EXPORTER_VERSION}/node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64.tar.gz

# Verify checksum
sha256sum node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64.tar.gz

# Extract
tar xvf node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64.tar.gz
cd node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64/

Step 2: Install

sudo cp node_exporter /usr/local/bin/
sudo chown prometheus:prometheus /usr/local/bin/node_exporter

Step 3: Create systemd service

sudo nano /etc/systemd/system/node-exporter.service

Content:

[Unit]
Description=Node Exporter
Documentation=https://github.com/prometheus/node_exporter
Wants=network-online.target
After=network-online.target

[Service]
Type=simple
User=prometheus
Group=prometheus
ExecStart=/usr/local/bin/node_exporter \
--web.listen-address=0.0.0.0:9100

SyslogIdentifier=node_exporter
Restart=always
RestartSec=10s

[Install]
WantedBy=multi-user.target

Step 4: Start Node Exporter

sudo systemctl daemon-reload
sudo systemctl enable node-exporter
sudo systemctl start node-exporter

# Verify
curl http://localhost:9100/metrics | head -20
# Should see: node_cpu_seconds_total, node_memory_MemAvailable_bytes, etc.

Install Alertmanager

Alertmanager handles alert notifications (email, Telegram, PagerDuty).

Step 1: Download

ALERTMANAGER_VERSION="0.26.0"

cd /tmp
wget https://github.com/prometheus/alertmanager/releases/download/v${ALERTMANAGER_VERSION}/alertmanager-${ALERTMANAGER_VERSION}.linux-amd64.tar.gz

# Verify checksum
sha256sum alertmanager-${ALERTMANAGER_VERSION}.linux-amd64.tar.gz

# Extract
tar xvf alertmanager-${ALERTMANAGER_VERSION}.linux-amd64.tar.gz
cd alertmanager-${ALERTMANAGER_VERSION}.linux-amd64/

Step 2: Install

sudo cp alertmanager amtool /usr/local/bin/
sudo chown prometheus:prometheus /usr/local/bin/alertmanager
sudo chown prometheus:prometheus /usr/local/bin/amtool

sudo mkdir -p /etc/alertmanager /var/lib/alertmanager
sudo chown prometheus:prometheus /etc/alertmanager /var/lib/alertmanager

Step 3: Configure Alertmanager

sudo nano /etc/alertmanager/alertmanager.yml

Content (Email notifications):

global:
  resolve_timeout: 5m
  smtp_smarthost: "smtp.gmail.com:587"
  smtp_from: "your-email@gmail.com"
  smtp_auth_username: "your-email@gmail.com"
  smtp_auth_password: "your-app-password" # Gmail: Use App Password, not regular password
  smtp_require_tls: true

route:
  group_by: ["alertname", "instance"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  receiver: "email"

receivers:
  - name: "email"
    email_configs:
      - to: "your-email@gmail.com"
        headers:
          Subject: "🚨 Pilier Validator Alert: {{ .GroupLabels.alertname }}"

inhibit_rules:
  - source_match:
      severity: "critical"
    target_match:
      severity: "warning"
    equal: ["alertname", "instance"]
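
Before starting the service, validate the file with amtool (installed alongside alertmanager in Step 2):

# Syntax-check the Alertmanager configuration
amtool check-config /etc/alertmanager/alertmanager.yml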

Step 4: Create systemd service

sudo nano /etc/systemd/system/alertmanager.service

Content:

[Unit]
Description=Alertmanager
Documentation=https://prometheus.io/docs/alerting/alertmanager/
Wants=network-online.target
After=network-online.target

[Service]
Type=simple
User=prometheus
Group=prometheus
ExecReload=/bin/kill -HUP $MAINPID
ExecStart=/usr/local/bin/alertmanager \
--config.file=/etc/alertmanager/alertmanager.yml \
--storage.path=/var/lib/alertmanager/ \
--web.listen-address=0.0.0.0:9093

SyslogIdentifier=alertmanager
Restart=always
RestartSec=10s

[Install]
WantedBy=multi-user.target

Step 5: Start Alertmanager

sudo chown prometheus:prometheus /etc/alertmanager/alertmanager.yml

sudo systemctl daemon-reload
sudo systemctl enable alertmanager
sudo systemctl start alertmanager

# Verify
curl http://localhost:9093
# Should see: Alertmanager web UI (HTML)

Access web UI: http://YOUR_MONITORING_SERVER_IP:9093


Configure Alert Rules

Alert rules define when Prometheus should fire alerts.

sudo nano /etc/prometheus/alert.rules.yml

Content:

groups:
  - name: validator_alerts
    interval: 30s
    rules:
      # Node is down
      - alert: ValidatorDown
        expr: up{job="pilier-node"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Validator {{ $labels.instance }} is down"
          description: "Validator node has been down for more than 2 minutes."

      # Block height not increasing (node stuck)
      - alert: BlockHeightNotIncreasing
        expr: rate(substrate_block_height{status="best"}[5m]) == 0
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "Block height not increasing on {{ $labels.instance }}"
          description: "Best block height has not increased in 3 minutes. Node may be stuck or not syncing."

      # Low peer count
      - alert: LowPeerCount
        expr: substrate_sub_libp2p_peers_count < 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low peer count on {{ $labels.instance }}"
          description: "Validator has fewer than 2 peers for more than 5 minutes. Current: {{ $value }}."

      # Finality lag (finalized block behind best block).
      # Note: "ignoring (status)" is required because the two series carry
      # different values of the status label and would not match otherwise.
      - alert: FinalityLag
        expr: (substrate_block_height{status="best"} - ignoring (status) substrate_block_height{status="finalized"}) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Finality lag on {{ $labels.instance }}"
          description: "Finalized block is {{ $value }} blocks behind best block."

      # High CPU usage
      - alert: HighCPUUsage
        expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is above 90% for more than 10 minutes. Current: {{ $value | humanize }}%."

      # High memory usage
      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is above 90% for more than 10 minutes. Current: {{ $value | humanize }}%."

      # Low disk space
      - alert: LowDiskSpace
        expr: (1 - (node_filesystem_avail_bytes{mountpoint="/var/lib/pilier"} / node_filesystem_size_bytes{mountpoint="/var/lib/pilier"})) * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Disk usage is above 80%. Current: {{ $value | humanize }}%."

      # Node not validating (for active validators).
      # "and on (instance)" keeps the vectors matchable despite differing labels.
      - alert: NotValidating
        expr: substrate_sub_libp2p_is_major_syncing == 0 and on (instance) substrate_block_height{status="best"} > 100 and on (instance) rate(substrate_proposer_block_constructed_count[10m]) == 0
        for: 30m
        labels:
          severity: critical
        annotations:
          summary: "Validator {{ $labels.instance }} not producing blocks"
          description: "Validator has not produced any blocks in the last 30 minutes. Check session keys."

Reload Prometheus to apply rules:

# Validate rules
promtool check rules /etc/prometheus/alert.rules.yml

# If valid, reload Prometheus
sudo systemctl reload prometheus

# Verify rules loaded (Prometheus web UI)
# Visit: http://YOUR_MONITORING_SERVER_IP:9090/rules
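
To exercise the whole pipeline without stopping your node, you can inject a synthetic alert with amtool (the label values here are arbitrary):

# Fire a test alert directly at Alertmanager
amtool alert add TestAlert severity=warning instance=validator-lyon-01 \
  --alertmanager.url=http://localhost:9093

# List currently firing alerts
amtool alert query --alertmanager.url=http://localhost:9093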

Install Grafana

Grafana provides visual dashboards for metrics.

Step 1: Install via APT

# Add Grafana GPG key (apt-key is deprecated on modern Debian/Ubuntu; use a keyring)
sudo mkdir -p /etc/apt/keyrings
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null

# Add repository
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee /etc/apt/sources.list.d/grafana.list

# Update and install
sudo apt update
sudo apt install grafana -y

Step 2: Start Grafana

sudo systemctl daemon-reload
sudo systemctl enable grafana-server
sudo systemctl start grafana-server

# Check status
sudo systemctl status grafana-server

Access web UI: http://YOUR_MONITORING_SERVER_IP:3000

Default login:

  • Username: admin
  • Password: admin (will prompt to change on first login)

Step 3: Add Prometheus data source

1. Login to Grafana (http://YOUR_IP:3000)
2. Left sidebar → Configuration (⚙️) → Data Sources
3. Click "Add data source"
4. Select "Prometheus"
5. Settings:
   - Name: Prometheus
   - URL: http://localhost:9090
   - Access: Server (default)
6. Click "Save & Test"
7. Should see: "Data source is working" ✅

Step 4: Import Pilier dashboard

Create dashboard file:

# Save this dashboard JSON (create file on your computer)
nano pilier-validator-dashboard.json

Dashboard JSON (copy entire content):

{
  "dashboard": {
    "title": "Pilier Validator Dashboard",
    "tags": ["pilier", "validator"],
    "timezone": "browser",
    "panels": [
      {
        "id": 1,
        "title": "Best Block Height",
        "type": "graph",
        "targets": [
          { "expr": "substrate_block_height{status=\"best\"}", "legendFormat": "Best Block" }
        ],
        "gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 }
      },
      {
        "id": 2,
        "title": "Finalized Block Height",
        "type": "graph",
        "targets": [
          { "expr": "substrate_block_height{status=\"finalized\"}", "legendFormat": "Finalized Block" }
        ],
        "gridPos": { "h": 8, "w": 12, "x": 12, "y": 0 }
      },
      {
        "id": 3,
        "title": "Peer Count",
        "type": "graph",
        "targets": [
          { "expr": "substrate_sub_libp2p_peers_count", "legendFormat": "Peers" }
        ],
        "gridPos": { "h": 8, "w": 12, "x": 0, "y": 8 }
      },
      {
        "id": 4,
        "title": "CPU Usage",
        "type": "graph",
        "targets": [
          { "expr": "100 - (avg by (instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)", "legendFormat": "CPU %" }
        ],
        "gridPos": { "h": 8, "w": 12, "x": 12, "y": 8 }
      },
      {
        "id": 5,
        "title": "Memory Usage",
        "type": "graph",
        "targets": [
          { "expr": "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100", "legendFormat": "Memory %" }
        ],
        "gridPos": { "h": 8, "w": 12, "x": 0, "y": 16 }
      },
      {
        "id": 6,
        "title": "Disk Usage",
        "type": "graph",
        "targets": [
          { "expr": "(1 - (node_filesystem_avail_bytes{mountpoint=\"/var/lib/pilier\"} / node_filesystem_size_bytes{mountpoint=\"/var/lib/pilier\"})) * 100", "legendFormat": "Disk %" }
        ],
        "gridPos": { "h": 8, "w": 12, "x": 12, "y": 16 }
      }
    ],
    "refresh": "30s",
    "time": { "from": "now-6h", "to": "now" }
  }
}

Import dashboard:

1. Grafana → Left sidebar → Dashboards → Import
2. Click "Upload JSON file"
3. Select pilier-validator-dashboard.json
4. Select data source: Prometheus
5. Click "Import"
6. Dashboard loaded!

Alternative: browse the Grafana dashboard library (https://grafana.com/grafana/dashboards/) for community Substrate/Polkadot node dashboards and import them by ID.


Key Metrics Reference

Blockchain Metrics

| Metric | Description | Normal Range | Alert Threshold |
|---|---|---|---|
| substrate_block_height{status="best"} | Best (head) block | Increasing | Not increasing >3 min |
| substrate_block_height{status="finalized"} | Finalized block | Increasing | Lag >10 blocks |
| substrate_sub_libp2p_peers_count | Connected peers | 3-50 | <2 peers |
| substrate_finality_grandpa_round | GRANDPA round | Increasing | Not increasing >5 min |
| substrate_proposer_block_constructed_count | Blocks proposed | Varies (validator turn) | 0 for >30 min (if active validator) |
| substrate_sub_libp2p_is_major_syncing | Major sync in progress? | 0 (false) | 1 (true) for >10 min |

System Metrics

| Metric | Description | Normal Range | Alert Threshold |
|---|---|---|---|
| node_cpu_seconds_total | CPU time | N/A | >90% usage for >10 min |
| node_memory_MemAvailable_bytes | Available memory | >2 GB | <10% free |
| node_filesystem_avail_bytes | Available disk space | >50 GB | <20% free |
| node_network_receive_bytes_total | Network traffic (received) | Varies | N/A (for monitoring only) |
| node_network_transmit_bytes_total | Network traffic (sent) | Varies | N/A (for monitoring only) |
| node_load1 | 1-minute load average | <CPU cores | >2x CPU cores |

Query Examples (Prometheus)

Block production rate (blocks per minute):

rate(substrate_block_height{status="best"}[1m]) * 60

Finality lag (blocks behind):

substrate_block_height{status="best"} - ignoring (status) substrate_block_height{status="finalized"}

Memory usage (percentage):

(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100

Disk I/O (MB/s):

rate(node_disk_read_bytes_total[5m]) / 1024 / 1024
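
Network receive rate (Mbit/s; device!="lo" excludes the loopback interface):

rate(node_network_receive_bytes_total{device!="lo"}[5m]) * 8 / 1000 / 1000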

Telegram Alerts (Optional)

For instant notifications on mobile.

Step 1: Create Telegram Bot

1. Open Telegram → Search for "@BotFather"
2. Send: /newbot
3. Follow prompts:
   - Bot name: "Pilier Validator Alerts"
   - Bot username: "pilier_validator_alerts_bot" (must end with _bot)
4. BotFather replies with:
   - Token: 123456789:ABCdefGHIjklMNOpqrsTUVwxyz (SAVE THIS!)

Step 2: Get Your Chat ID

1. Start conversation with your bot (click link from BotFather)
2. Send any message: "Hello"
3. Visit: https://api.telegram.org/bot{TOKEN}/getUpdates
   - Replace {TOKEN} with your bot token
4. Find "chat":{"id":123456789} in JSON response
5. Save chat ID: 123456789
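
If jq is installed, steps 3-5 collapse into one command (the token value is the placeholder from Step 1):

# Print the chat ID from the most recent message sent to the bot
curl -s "https://api.telegram.org/bot123456789:ABCdefGHIjklMNOpqrsTUVwxyz/getUpdates" \
  | jq '.result[-1].message.chat.id'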

Step 3: Configure Alertmanager for Telegram

Install prometheus_bot (a separate webhook-to-Telegram bridge, written in Go):

# Clone repository
cd /opt
sudo git clone https://github.com/inCaller/prometheus_bot.git
cd prometheus_bot

# Build (requires Go)
sudo apt install golang-go make -y
sudo make

# Configure
sudo nano config.yaml

config.yaml:

telegram_token: "123456789:ABCdefGHIjklMNOpqrsTUVwxyz"
template_path: "templates/default.tmpl"
time_zone: "UTC"
split_token: "|"
split_msg_byte: 4000
send_only: false

Create systemd service:

sudo nano /etc/systemd/system/prometheus-bot.service

Content:

[Unit]
Description=Prometheus Telegram Bot
After=network.target

[Service]
Type=simple
User=prometheus
WorkingDirectory=/opt/prometheus_bot
ExecStart=/opt/prometheus_bot/prometheus_bot -c /opt/prometheus_bot/config.yaml
Restart=always
RestartSec=10s

[Install]
WantedBy=multi-user.target

Start service:

sudo systemctl daemon-reload
sudo systemctl enable prometheus-bot
sudo systemctl start prometheus-bot

Update Alertmanager config:

sudo nano /etc/alertmanager/alertmanager.yml

Add Telegram receiver:

receivers:
  - name: "telegram"
    webhook_configs:
      - url: "http://localhost:9087/alert/YOUR_CHAT_ID"
        send_resolved: true

route:
  receiver: "telegram" # Change default receiver to Telegram

Reload Alertmanager:

sudo systemctl reload alertmanager

Step 4: Test Telegram Alerts

# Stop validator node (trigger ValidatorDown alert)
sudo systemctl stop pilier

# Wait 2 minutes (alert threshold)
# You should receive Telegram message:
# "🚨 FIRING: ValidatorDown
# Validator validator-lyon-01 is down
# Validator node has been down for more than 2 minutes."

# Restart validator
sudo systemctl start pilier

# Wait ~1 minute, should receive:
# "✅ RESOLVED: ValidatorDown
# Validator validator-lyon-01 is down"

Advanced Monitoring

Multi-Validator Monitoring

Monitor multiple validators from single Prometheus:

# /etc/prometheus/prometheus.yml

scrape_configs:
- job_name: "validator-01"
static_configs:
- targets:
- "51.210.XXX.1:9615" # Validator 1 IP
labels:
instance: "validator-paris-01"

- job_name: "validator-02"
static_configs:
- targets:
- "51.210.XXX.2:9615" # Validator 2 IP
labels:
instance: "validator-lyon-01"

- job_name: "validator-03"
static_configs:
- targets:
- "51.210.XXX.3:9615" # Validator 3 IP
labels:
instance: "validator-tbilisi-01"

Create multi-validator dashboard in Grafana:

# Panel: Block height (all validators)
substrate_block_height{status="best"}

# Panel: Peer count (all validators)
substrate_sub_libp2p_peers_count

# Use legend: {{ instance }} (shows validator name)
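
A useful derived query: the spread between the most and least advanced validator, which surfaces a single lagging node at a glance:

# Blocks between the highest and lowest best block across all validators
max(substrate_block_height{status="best"}) - min(substrate_block_height{status="best"})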

Log Aggregation

Centralize logs from all validators.

Option A: Loki + Grafana (Recommended)

# Install Loki (log aggregation, runs on monitoring server)
wget https://github.com/grafana/loki/releases/download/v2.9.3/loki-linux-amd64.zip
unzip loki-linux-amd64.zip
sudo cp loki-linux-amd64 /usr/local/bin/loki
# Note: Loki also needs its own config file and systemd unit (see Grafana's
# Loki docs); only the Promtail side is shown in detail here.

# Install Promtail (log shipper, runs on validator)
wget https://github.com/grafana/loki/releases/download/v2.9.3/promtail-linux-amd64.zip
unzip promtail-linux-amd64.zip
sudo cp promtail-linux-amd64 /usr/local/bin/promtail

# Configure Promtail (on validator server)
sudo nano /etc/promtail/config.yml

Promtail config:

server:
  http_listen_port: 9080
  grpc_listen_port: 0

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://MONITORING_SERVER_IP:3100/loki/api/v1/push

scrape_configs:
  - job_name: pilier-node
    static_configs:
      - targets:
          - localhost
        labels:
          job: pilier-node
          __path__: /var/log/syslog # Or read from journald (see sketch below)
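
If your node logs only to journald rather than /var/log/syslog, Promtail can read the journal directly. A sketch replacing the static_configs block above (the promtail user typically needs membership in the systemd-journal group):

scrape_configs:
  - job_name: pilier-journal
    journal:
      max_age: 12h
      labels:
        job: pilier-node
    relabel_configs:
      # Expose the systemd unit name as a queryable label
      - source_labels: ["__journal__systemd_unit"]
        target_label: "unit"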

Then in Grafana:

  • Add Loki data source
  • Query logs: {job="pilier-node"} |= "error"

Option B: ELK Stack (Elasticsearch + Logstash + Kibana)

(More complex and not covered here; use Loki for simplicity.)


Uptime Monitoring (External)

Monitor from OUTSIDE your infrastructure (detects network issues).

Option A: UptimeRobot (Free)

1. Sign up: https://uptimerobot.com
2. Add monitor:
   - Type: HTTP(s)
   - URL: http://YOUR_VALIDATOR_IP:9615/metrics (if publicly exposed)
   - OR: use a custom TCP port check (port 30333)
3. Alert contacts: Email, Telegram, webhook
4. Check interval: 5 minutes (free tier)

Option B: Self-Hosted Blackbox Exporter

# Install blackbox_exporter (Prometheus ecosystem)
wget https://github.com/prometheus/blackbox_exporter/releases/download/v0.24.0/blackbox_exporter-0.24.0.linux-amd64.tar.gz
tar xvf blackbox_exporter-0.24.0.linux-amd64.tar.gz
sudo cp blackbox_exporter-0.24.0.linux-amd64/blackbox_exporter /usr/local/bin/

# Configure a probe module, then scrape it from Prometheus (sketch below)
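
A minimal sketch of both halves: a tcp_connect module for blackbox_exporter, plus a Prometheus job that routes the probe through it. The relabeling below is the standard blackbox pattern; the validator IP is a placeholder:

# /etc/blackbox_exporter/blackbox.yml
modules:
  tcp_connect:
    prober: tcp
    timeout: 5s

# Add to /etc/prometheus/prometheus.yml
scrape_configs:
  - job_name: "blackbox-p2p"
    metrics_path: /probe
    params:
      module: [tcp_connect]
    static_configs:
      - targets:
          - "VALIDATOR_IP:30333" # Public P2P port, probed from outside
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: localhost:9115 # blackbox_exporter's default listen port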

Troubleshooting Monitoring

Prometheus Not Scraping Targets

Symptom: Targets show as "DOWN" in Prometheus UI (Status → Targets).

Check:

# 1. Is Pilier node metrics endpoint accessible?
curl http://VALIDATOR_IP:9615/metrics
# Should return metrics, not connection refused

# 2. Firewall blocking?
# On validator server:
sudo ufw allow from MONITORING_SERVER_IP to any port 9615

# 3. Is --prometheus-external flag set?
sudo journalctl -u pilier | grep -i prometheus
# The exporter should bind to 0.0.0.0:9615, not 127.0.0.1:9615

# 4. Check Prometheus config
promtool check config /etc/prometheus/prometheus.yml

# 5. Check Prometheus logs
sudo journalctl -u prometheus -n 50

Grafana Showing "No Data"

Symptom: Dashboards display "No data" or empty graphs.

Check:

# 1. Is Prometheus data source working?
# Grafana → Configuration → Data Sources → Prometheus → "Test"
# Should see: "Data source is working"

# 2. Are metrics actually in Prometheus?
# Prometheus UI: http://MONITORING_IP:9090/graph
# Query: substrate_block_height{status="best"}
# Should see data

# 3. Check time range in Grafana dashboard
# Top-right: Time range selector
# Set to "Last 6 hours" or "Last 24 hours"

# 4. Check PromQL query in panel
# Edit panel → Query tab
# Verify metric name is correct (copy from Prometheus UI)

Alerts Not Firing

Symptom: Expected alert (e.g., ValidatorDown) not triggered.

Check:

# 1. Are alert rules loaded?
# Prometheus UI: http://MONITORING_IP:9090/rules
# Should see: alert.rules.yml rules listed

# 2. Is alert condition met?
# Prometheus UI → Alerts tab
# Should see alert in "Pending" or "Firing" state

# 3. Check Alertmanager receiving alerts
# Alertmanager UI: http://MONITORING_IP:9093/#/alerts
# Should see alerts listed

# 4. Check Alertmanager logs
sudo journalctl -u alertmanager -n 50

# 5. Send a test alert with amtool (verifies routing and notification delivery)
amtool alert add TestAlert severity=warning --alertmanager.url=http://localhost:9093

High Prometheus Memory Usage

Symptom: Prometheus using >8 GB RAM (excessive for single validator).

Causes:

1. Too many metrics (high cardinality)
2. Long retention time (>30 days)
3. Too frequent scraping (interval <10s)

Fix:

# 1. Reduce retention time
# Edit /etc/systemd/system/prometheus.service:
--storage.tsdb.retention.time=15d # Was 30d

# 2. Increase scrape interval
# Edit /etc/prometheus/prometheus.yml:
global:
  scrape_interval: 30s # Was 15s

# 3. Restart Prometheus
sudo systemctl restart prometheus

# 4. Monitor memory usage
free -h
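
To find out which metrics dominate, promtool can analyze the most recent TSDB block for cardinality (read-only; run against the data directory):

# Report highest-cardinality metric names and label pairs
promtool tsdb analyze /var/lib/prometheus/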

Best Practices

Do's ✅

  • Monitor from separate server (monitoring survives validator failure)
  • Set up alerts early (don't wait for incident)
  • Test alerts regularly (monthly: stop node, verify alert received)
  • Monitor multiple channels (metrics + logs + external uptime)
  • Review dashboards weekly (spot trends before they become issues)
  • Document alert runbooks (what to do when alert fires)
  • Keep retention short (15 days sufficient, saves disk space)
  • Use Grafana annotations (mark upgrades, incidents on graphs)

Don'ts ❌

  • Don't ignore warnings (warnings become critical if unchecked)
  • Don't set thresholds too tight (avoid alert fatigue)
  • Don't monitor without acting (alerts must have response plan)
  • Don't expose Grafana publicly (without authentication + SSL)
  • Don't collect PII in logs (GDPR compliance)
  • Don't run monitoring on validator (unless no budget for separate server)

Monitoring Checklist

Daily:

  • Check Grafana dashboard (5 minutes)
  • Verify block height increasing
  • Verify finality not lagging
  • Check for any alerts (Telegram, email)

Weekly:

  • Review alert history (any recurring issues?)
  • Check disk space trends (will it fill in next month?)
  • Verify peer count stable (not declining over time)
  • Test one alert (stop node, verify alert received)

Monthly:

  • Review all metrics (spot long-term trends)
  • Update Prometheus/Grafana (security patches)
  • Clean up old alerts (silence resolved issues)
  • Document any incidents (post-mortems)

Quarterly:

  • Audit alert rules (are thresholds still appropriate?)
  • Review retention settings (adjust if needed)
  • Test disaster recovery (restore from backup)
  • Train backup operator (someone else should be able to monitor)

Support

Monitoring setup help?

Prometheus/Grafana:

  • Prometheus documentation: https://prometheus.io/docs/
  • Grafana documentation: https://grafana.com/docs/

Document version: 1.0

Last updated: 2026-01-12

Tested on: Ubuntu 22.04, Debian 12