Security Procedures
Incident response protocols, emergency procedures, and communication templates for Pilier validators.
Reading time: 12 minutes
Incident Classification
All incidents are classified by severity; the severity level determines the maximum response time.
Severity Levels
| Level | Name | Response Time | Examples |
|---|---|---|---|
| 🔴 Critical | Network-breaking | <2 hours | Key compromise, consensus failure, network attack |
| 🟠 High | Service degraded | <24 hours | Hardware failure, performance issues, missed blocks |
| 🟡 Medium | Minor disruption | <7 days | Monitoring alerts, certificate expiration, log rotation |
| 🟢 Low | Informational | <30 days | Routine maintenance, documentation updates |
Critical Incidents (2-Hour Response)
1. Key Compromise
Definition: Session keys or validator account credentials have been exposed or are suspected to be compromised.
Immediate actions (within 15 minutes):
1. STOP validator node immediately
└─ systemctl stop pilier-node
2. Rotate session keys
└─ Generate new keys on secure offline machine
3. Alert other validators
└─ Telegram: @pilier_validators
└─ Subject: "URGENT: validator-{id} key compromise suspected"
4. Notify core team
└─ Email: security@pilier.org
└─ Phone: +33 X XX XX XX XX (24/7 hotline)
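A minimal sketch of steps 1 and 2 above, assuming a systemd unit named pilier-node (as used here) and a Substrate-style node exposing the standard JSON-RPC on localhost (port 9933 is an assumption; adjust to your configuration). The rotation call must be run on a host you now trust (rebuilt or replaced), never on the compromised machine; if your keys are produced in the offline ceremony described under "Within 2 hours", follow that process instead.
```bash
# Step 1: stop the (possibly compromised) validator immediately
sudo systemctl stop pilier-node
sudo systemctl status pilier-node --no-pager   # confirm the service is really down

# Step 2 (on a trusted, rebuilt host only): generate fresh session keys.
# Assumes a Substrate-style node with its JSON-RPC bound to localhost:9933;
# the call creates new keys in the node keystore and returns their public keys as hex.
curl -s -H "Content-Type: application/json" \
  -d '{"id":1, "jsonrpc":"2.0", "method":"author_rotateSessionKeys", "params":[]}' \
  http://localhost:9933
# Keep the returned hex string: it goes into the governance proposal
# "Rotate session keys for validator-{id}".
```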
Within 2 hours:
1. Complete forensic analysis
├─ How were keys compromised? (phishing, malware, insider?)
├─ Review access logs (who accessed server?)
├─ Check for unauthorized transactions
└─ Document timeline
2. Generate new session keys (secure ceremony)
├─ Use air-gapped machine if available
├─ Store in hardware security module (HSM) if available
└─ Backup encrypted with strong passphrase
3. Submit governance proposal: "Rotate session keys for validator-{id}"
├─ Explain incident (transparency)
├─ Provide new session keys
└─ Request emergency approval (fast-track: 48 hours instead of 14 days)
4. Document incident
└─ Use Incident Report Template (see below)
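A minimal sketch of the access-log review in step 1 above, assuming an Ubuntu/Debian host (log paths differ on other distributions); preserve evidence before wiping or rebuilding anything.
```bash
# Who logged in, and from where?
last -a | head -50
sudo lastb -a | head -50     # failed login attempts

# SSH authentication history (Debian/Ubuntu path; use journalctl on other distros)
sudo grep -E "Accepted|Failed" /var/log/auth.log | tail -100

# Recently modified binaries and configs (possible tampering)
sudo find /usr/local/bin /etc -mtime -7 -type f -ls

# Active sessions and unexpected listeners
who
sudo ss -tulpn

# Preserve evidence before rebuilding the host
sudo tar czf /root/forensics-$(date +%F).tar.gz /var/log
```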
Follow-up (within 7 days):
1. Security audit
├─ Review all access controls
├─ Scan for malware/backdoors
├─ Update passwords, SSH keys, firewall rules
└─ Consider engaging external security firm
2. Post-mortem report
├─ What happened? (root cause)
├─ How was it detected?
├─ What was the impact?
├─ How do we prevent recurrence?
└─ Publish on forum (transparency)
3. Insurance claim (if applicable)
└─ Notify cyber liability insurer within 72 hours
2. Network Attack (DDoS, Eclipse)
Definition: Malicious traffic directed at a single validator, or a coordinated attack against the network as a whole.
DDoS (Distributed Denial of Service):
Symptoms:
├─ Abnormally high inbound traffic (10-100× normal)
├─ Node unresponsive (cannot sync blocks)
├─ CPU/bandwidth maxed out
└─ Peers disconnecting
Immediate actions (within 30 minutes):
1. Enable DDoS mitigation
├─ Cloudflare: Enable "I'm Under Attack" mode
├─ Firewall: Rate-limit connections (iptables / ufw)
├─ Null-route attacking IPs (if identifiable)
└─ Switch to backup IP if available
2. Alert other validators
└─ Telegram: "validator-{id} under DDoS, investigating"
3. Contact hosting provider
├─ Request upstream DDoS protection
├─ Consider temporary IP change
└─ Log attack traffic (for analysis)
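A minimal sketch of the firewall mitigations in step 1 above. The p2p port 30333 and the attacking IP are assumptions and placeholders; substitute your node's actual listening port and the observed source addresses.
```bash
# Rate-limit new inbound connections on the p2p port (30333 assumed; use your node's port).
sudo ufw limit 30333/tcp

# Same idea with iptables: drop sources opening too many parallel connections.
sudo iptables -A INPUT -p tcp --syn --dport 30333 \
  -m connlimit --connlimit-above 25 -j DROP

# Null-route an identified attacking IP (203.0.113.10 is a documentation placeholder).
sudo ip route add blackhole 203.0.113.10

# Capture a sample of the attack traffic for the post-incident analysis.
sudo tcpdump -i any -n -c 1000 -w /root/ddos-$(date +%F-%H%M).pcap port 30333
```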
Within 2 hours:
1. Assess impact
├─ How long was validator offline?
├─ Missed blocks / finality votes?
└─ Any data loss?
2. Restore service
├─ Bring node back online (with mitigation active)
├─ Verify sync status (check latest block)
└─ Monitor for continued attack
3. Document attack
├─ Attack duration (start / end time)
├─ Attack vector (UDP flood, SYN flood, application layer?)
├─ Source IPs (if known)
└─ Mitigation effectiveness
Eclipse Attack:
Symptoms:
├─ Node isolated from legitimate peers
├─ Only connected to attacker-controlled peers
├─ Receives invalid blocks / false data
└─ Appears to be syncing but on wrong chain
Immediate actions (within 15 minutes):
1. Disconnect all peers
└─ Restart node with --reserved-only flag
2. Connect to known-good validators
└─ Use explicit --reserved-nodes list (trusted validators only)
3. Verify chain state
├─ Compare block hash with other validators
├─ Check finality (via telemetry or block explorer)
└─ Re-sync if on wrong fork
4. Alert network
└─ Telegram: "Eclipse attack detected on validator-{id}"
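A minimal sketch of steps 1-3 above, assuming a Substrate-style CLI (the --reserved-only and --reserved-nodes flags referenced above) and a local RPC port of 9933. The multiaddress, peer ID, and block height are placeholders; in production, add the flags to the pilier-node service file rather than running the binary by hand.
```bash
# Restart the node so it only talks to explicitly trusted validators.
# The multiaddress and peer ID below are placeholders for real trusted validators.
pilier-node \
  --validator \
  --reserved-only \
  --reserved-nodes /ip4/198.51.100.7/tcp/30333/p2p/12D3KooWExamplePeerId

# Verify chain state: the hash for a given height must match what other
# validators report (block height 123456 is illustrative; RPC port 9933 assumed).
curl -s -H "Content-Type: application/json" \
  -d '{"id":1, "jsonrpc":"2.0", "method":"chain_getBlockHash", "params":[123456]}' \
  http://localhost:9933
```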
3. Runtime Bug (Consensus-Breaking)
Definition: Critical bug in blockchain runtime causing network halt or invalid state.
Symptoms:
Network-wide:
├─ Finality stalled (no new finalized blocks)
├─ Validators producing conflicting blocks
├─ Invalid state transitions
└─ Nodes crashing repeatedly
Immediate actions (within 1 hour):
1. STOP validator node (if instructed by core team)
└─ Prevent further damage to chain state
2. Join emergency coordination
└─ Telegram: @pilier_validators_emergency
└─ Core team will provide instructions
3. Test proposed fix on local testnet
├─ Core team provides patched runtime
├─ Validator tests on isolated node
└─ Verify fix resolves issue
4. Coordinate upgrade
├─ All validators must upgrade simultaneously
├─ Agree on block height for activation
└─ Execute on schedule (no early/late upgrades)
Within 2 hours:
1. Execute emergency governance vote
├─ Proposal: "Emergency runtime upgrade to fix [bug]"
├─ Fast-track voting: 24-48 hours (instead of 14 days)
├─ Validators vote based on testnet results
└─ Requires 80% approval (high bar for emergency)
2. Deploy fix
├─ Update node binary
├─ Restart validator
├─ Verify network recovers
└─ Monitor for 24 hours
3. Document incident
└─ Post-mortem published within 7 days
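A minimal sketch of the "verify network recovers" check in step 2 above, assuming the standard Substrate JSON-RPC methods on localhost port 9933 (an assumption; adjust to your node).
```bash
# Confirm the patched runtime is active (specVersion should match the emergency release).
curl -s -H "Content-Type: application/json" \
  -d '{"id":1, "jsonrpc":"2.0", "method":"state_getRuntimeVersion", "params":[]}' \
  http://localhost:9933

# Confirm finality has resumed: the finalized head should change between two checks.
curl -s -H "Content-Type: application/json" \
  -d '{"id":1, "jsonrpc":"2.0", "method":"chain_getFinalizedHead", "params":[]}' \
  http://localhost:9933
sleep 60
curl -s -H "Content-Type: application/json" \
  -d '{"id":1, "jsonrpc":"2.0", "method":"chain_getFinalizedHead", "params":[]}' \
  http://localhost:9933   # a different hash here means finalization is moving again
```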
4. Hardware Failure
Definition: Critical hardware component failed (disk, RAM, CPU, network).
Symptoms:
Common failures:
├─ Disk failure (I/O errors, filesystem corruption)
├─ RAM failure (kernel panics, random crashes)
├─ Network card failure (no connectivity)
└─ Power supply failure (unexpected shutdowns)
Immediate actions (within 30 minutes):
1. Diagnose failure
├─ Check system logs: journalctl -xe
├─ Test hardware: smartctl (disks), memtest (RAM)
└─ Identify failed component
2. Failover to backup (if available)
├─ Switch DNS to backup server IP
├─ Sync blockchain data from snapshot/backup
├─ Start validator on backup hardware
└─ Estimate: 1-4 hours to restore
3. Alert other validators
└─ Telegram: "validator-{id} hardware failure, restoring from backup, ETA 2 hours"
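A minimal sketch of the diagnostics in step 1 above. The disk device name is a placeholder, and smartctl (from smartmontools) may need to be installed.
```bash
# Recent service and kernel errors
journalctl -xe --no-pager | tail -100
journalctl -k -p err --no-pager | tail -50      # kernel-level errors only

# Disk health (replace /dev/nvme0n1 with your actual device)
sudo smartctl -H /dev/nvme0n1
sudo smartctl -a /dev/nvme0n1 | grep -iE "error|wear|temperature"

# Filesystem / I/O errors and memory errors reported by the kernel
dmesg | grep -iE "i/o error|ext4|xfs" | tail -20
dmesg | grep -iE "mce|memory error" | tail -20  # a full RAM test (memtest86+) needs a reboot
```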
Within 6 hours:
1. If no backup: Emergency hardware replacement
├─ Order replacement part (same-day delivery if possible)
├─ Or rent temporary cloud server (OVH, Hetzner)
├─ Sync blockchain (may take 6-24 hours for full sync)
└─ Resume validation
2. Document downtime
├─ Failure timestamp
├─ Root cause (component failure)
├─ Restoration time
└─ Missed blocks / votes
3. Post-incident review
├─ Why no backup? (if applicable)
├─ How to prevent? (RAID, redundant PSU, monitoring)
└─ Update disaster recovery plan
5. Persistent Downtime (>10 Days)
Definition: Validator offline for more than 10 consecutive days without communication.
This triggers the governance removal process.
Validator obligations during extended downtime:
If you know you'll be offline >24 hours:
1. Notify other validators IMMEDIATELY
└─ Telegram: "validator-{id} will be offline [duration] due to [reason]"
2. Provide ETA for restoration
└─ "Expect to be back online by [date/time]"
3. Daily status updates (if downtime extends)
└─ "Still working on [issue], ETA now [new date]"
If downtime exceeds 10 days:
└─ Expect governance proposal: "Remove validator-{id} for persistent downtime"
└─ You can submit counter-proposal: "Extend grace period, validator returning [date]"
For other validators:
If peer validator offline >10 days with no communication:
1. Attempt contact (all channels)
├─ Email: validator-ops@entity.org
├─ Phone: Emergency contact number
├─ Social media: LinkedIn, Twitter (last resort)
└─ Document contact attempts (for governance proposal)
2. Submit removal proposal (if no response after 15 days)
├─ Evidence: On-chain telemetry (last heartbeat)
├─ Justification: Non-responsive, Charter violation
├─ Grace period: 30-day notice before removal
└─ Voting period: 14 days
3. Execute removal (if approved)
├─ Remove from session keys (runtime call)
├─ Archive validator data (for transparency)
└─ Redistribute block production among remaining validators
High Priority Incidents (24-Hour Response)
1. Performance Degradation
Symptoms:
Validator producing <90% of expected blocks:
├─ Expected: ~20% of blocks (if 5 validators)
├─ Actual: <18% of blocks
└─ Duration: >7 consecutive days
Actions (within 24 hours):
1. Diagnose root cause
├─ Check CPU usage: top, htop
├─ Check disk I/O: iostat, iotop
├─ Check network: ping latency, packet loss
├─ Check peers: how many connected? (should be 10+)
└─ Check logs: any errors? warnings?
2. Apply fixes
├─ If CPU bound: Upgrade to higher core count
├─ If disk I/O bound: Switch to NVMe SSD
├─ If network: Optimize firewall rules, switch ISP
├─ If peer issues: Add more bootnodes
└─ If logs show errors: Update node binary, clear cache
3. Monitor improvement
├─ Track block production rate (next 48 hours)
├─ Should return to >95% expected blocks
└─ If not: Consider hardware upgrade or hosting change
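A minimal sketch of the checks in step 1 above, assuming the standard Substrate system_health RPC on localhost port 9933, the sysstat package for iostat, and a placeholder peer hostname.
```bash
# CPU and load at a glance
top -b -n 1 | head -15

# Disk I/O saturation (iostat is in the sysstat package)
iostat -x 1 3

# Network latency and packet loss toward a known-good peer (placeholder hostname)
ping -c 20 validator2.example.org

# Peer count and sync state (standard Substrate system_health RPC; port 9933 assumed).
# Fewer than ~10 peers suggests a connectivity problem.
curl -s -H "Content-Type: application/json" \
  -d '{"id":1, "jsonrpc":"2.0", "method":"system_health", "params":[]}' \
  http://localhost:9933

# Warnings and errors from the node itself over the last hour
journalctl -u pilier-node -p warning --since "1 hour ago" --no-pager
```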
2. Missed Runtime Upgrade
Scenario: Network upgraded to new runtime, but validator still running old version.
Symptoms:
├─ Validator producing blocks, but they're being rejected
├─ "Invalid runtime version" errors in logs
├─ Finality participation drops to 0%
└─ Telemetry shows "outdated" status
Actions (within 6 hours):
1. Identify missed upgrade
├─ Check governance proposals: Was there a runtime upgrade?
├─ Check the network's current runtime version: query rpc.pilier.net and compare it to your node's
└─ Check Telegram announcements (core team posts upgrade notices)
2. Upgrade immediately
├─ Download latest binary: wget https://releases.pilier.net/v1.x.x
├─ Stop node: systemctl stop pilier-node
├─ Replace binary: mv pilier-node /usr/local/bin/
├─ Start node: systemctl start pilier-node
└─ Verify sync: check logs for successful block import
3. Apologize + document
├─ Telegram: "validator-{id} missed runtime upgrade, now fixed"
├─ Forum post: Explain why missed (monitoring gap? missed announcement?)
└─ Update procedures to prevent recurrence (subscribe to announcements)
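A minimal sketch of steps 1 and 2 above, assuming rpc.pilier.net accepts JSON-RPC over HTTPS, the local node listens on port 9933, and jq is installed; the release URL keeps the placeholder version from step 2.
```bash
# Compare the network's runtime version with your node's
curl -s -H "Content-Type: application/json" \
  -d '{"id":1, "jsonrpc":"2.0", "method":"state_getRuntimeVersion", "params":[]}' \
  https://rpc.pilier.net | jq '.result.specVersion'
curl -s -H "Content-Type: application/json" \
  -d '{"id":1, "jsonrpc":"2.0", "method":"state_getRuntimeVersion", "params":[]}' \
  http://localhost:9933 | jq '.result.specVersion'

# If the local specVersion is lower, upgrade the binary (release URL as in step 2)
wget https://releases.pilier.net/v1.x.x -O pilier-node
chmod +x pilier-node
sudo systemctl stop pilier-node
sudo mv pilier-node /usr/local/bin/pilier-node
sudo systemctl start pilier-node
journalctl -u pilier-node -f    # watch for successful block imports
```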
Medium Priority Incidents (7-Day Response)
1. Certificate Expiration
TLS/SSL certificates expire (if using HTTPS for RPC/telemetry).
Actions (within 7 days before expiry):
1. Renew certificate
├─ Let's Encrypt: certbot renew
├─ Or manual: Generate new CSR, get signed cert
└─ Update nginx/apache config
2. Restart web server
└─ systemctl restart nginx
3. Verify
└─ Check expiry: openssl s_client -connect validator.pilier.net:443
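A minimal sketch of the renewal and verification steps above, assuming certbot with nginx; the hostname is the one from step 3.
```bash
# Renew Let's Encrypt certificates (dry run first to catch configuration problems)
sudo certbot renew --dry-run
sudo certbot renew
sudo systemctl restart nginx

# Verify the new expiry date (-servername is needed when the server uses SNI)
echo | openssl s_client -connect validator.pilier.net:443 \
  -servername validator.pilier.net 2>/dev/null | openssl x509 -noout -enddate
```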
2. Monitoring Alerts Misconfigured
False positives or missing alerts.
Actions (within 7 days):
1. Review alert thresholds
├─ Too sensitive? (alerting on every minor spike)
├─ Too lax? (missed actual outage)
└─ Adjust in Prometheus / Grafana
2. Test alerts
└─ Simulate failure (stop node briefly, verify alert fires)
3. Document tuning
└─ Update monitoring runbook
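A minimal sketch of the alert test in step 2 above, assuming Prometheus runs locally on port 9090 and jq is installed; announce the test in the validator channel before stopping the node.
```bash
# Stop the node briefly to trigger the "node down" alert
sudo systemctl stop pilier-node
sleep 180                                   # wait longer than the alert's "for:" duration

# Confirm Prometheus now reports the scrape target as down (up == 0)
curl -s 'http://localhost:9090/api/v1/query?query=up' | jq .

# Restore service and confirm the alert resolves
sudo systemctl start pilier-node
```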
Low Priority Incidents (30-Day Response)
1. Routine Maintenance
Scheduled server updates, OS patches.
Actions (within 30 days):
1. Plan maintenance window
├─ Choose low-traffic period (weekends)
├─ Notify other validators 48 hours in advance
└─ Expected downtime: <1 hour
2. Execute maintenance
├─ apt update && apt upgrade (Ubuntu)
├─ Restart server if kernel updated
└─ Verify validator resumes normally
3. Document
└─ Log maintenance in runbook (for auditing)
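A minimal sketch of step 2 above for an Ubuntu host; the reboot-required marker file is standard Ubuntu behaviour.
```bash
# Apply pending updates (announce the maintenance window 48 hours in advance)
sudo apt update && sudo apt upgrade -y

# Reboot only if the kernel or core libraries require it
if [ -f /var/run/reboot-required ]; then
  sudo reboot
fi

# After the reboot (or immediately if none was needed), confirm the validator is healthy
sudo systemctl status pilier-node --no-pager
journalctl -u pilier-node --since "10 minutes ago" --no-pager | tail -20
```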
Emergency Contacts
24/7 Hotline
Critical incidents only (key compromise, network attack):
📞 Phone: +33 X XX XX XX XX
📧 Email: security@pilier.net
⏰ Response time: <30 minutes
Validator Communication
All validators:
💬 Telegram: @pilier_validators (private channel)
📧 Email: validators@pilier.org