
Security Procedures

Incident response protocols, emergency procedures, and communication templates for Pilier validators.

Reading time: 12 minutes


Incident Classification

All incidents are classified by severity and response time.

Severity Levels

| Level | Name | Response Time | Examples |
| ----- | ---- | ------------- | -------- |
| 🔴 Critical | Network-breaking | <2 hours | Key compromise, consensus failure, network attack |
| 🟠 High | Service degraded | <24 hours | Hardware failure, performance issues, missed blocks |
| 🟡 Medium | Minor disruption | <7 days | Monitoring alerts, certificate expiration, log rotation |
| 🟢 Low | Informational | <30 days | Routine maintenance, documentation updates |

Critical Incidents (2-Hour Response)

1. Key Compromise

Definition: Session keys or validator account credentials exposed or suspected to be compromised.

Immediate actions (within 15 minutes):

1. STOP validator node immediately
└─ systemctl stop pilier-node

2. Rotate session keys
└─ Generate new keys on secure offline machine

3. Alert other validators
├─ Telegram: @pilier_validators
└─ Subject: "URGENT: validator-{id} key compromise suspected"

4. Notify core team
├─ Email: security@pilier.org
└─ Phone: +33 X XX XX XX XX (24/7 hotline)
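
A minimal sketch of steps 1-2, assuming a systemd-managed node and a Substrate-style `key generate` subcommand (verify the exact invocation against your pilier-node build):

# On the compromised server: stop the service and keep it stopped
systemctl stop pilier-node
systemctl disable pilier-node    # prevent an accidental restart before keys are rotated

# On a separate, offline machine: generate fresh session keys
# (hypothetical invocation; adapt to the actual pilier-node key subcommands)
pilier-node key generate --scheme sr25519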

Within 2 hours:

1. Complete forensic analysis
├─ How were keys compromised? (phishing, malware, insider?)
├─ Review access logs (who accessed server?)
├─ Check for unauthorized transactions
└─ Document timeline

2. Generate new session keys (secure ceremony)
├─ Use air-gapped machine if available
├─ Store in hardware security module (HSM) if available
└─ Backup encrypted with strong passphrase

3. Submit governance proposal: "Rotate session keys for validator-{id}"
├─ Explain incident (transparency)
├─ Provide new session keys
└─ Request emergency approval (fast-track: 48 hours instead of 14 days)

4. Document incident
└─ Use Incident Report Template (see below)
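
For the access-log review in step 1, a sketch of the usual starting points on a systemd/OpenSSH host (log locations and unit names vary by distribution):

last -a | head -n 30                                         # recent interactive sessions
who -a                                                       # sessions open right now
journalctl -u ssh --since "48 hours ago" | grep -i accepted  # successful SSH logins (unit may be sshd)
grep "Accepted" /var/log/auth.log | tail -n 50               # Debian/Ubuntu auth log, if present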

Follow-up (within 7 days):

1. Security audit
├─ Review all access controls
├─ Scan for malware/backdoors
├─ Update passwords, SSH keys, firewall rules
└─ Consider engaging external security firm

2. Post-mortem report
├─ What happened? (root cause)
├─ How was it detected?
├─ What was the impact?
├─ How do we prevent recurrence?
└─ Publish on forum (transparency)

3. Insurance claim (if applicable)
└─ Notify cyber liability insurer within 72 hours

2. Network Attack (DDoS, Eclipse)

Definition: Malicious traffic targeting the validator, or a network-wide attack.

DDoS (Distributed Denial of Service):

Symptoms:
├─ Abnormally high inbound traffic (10-100× normal)
├─ Node unresponsive (cannot sync blocks)
├─ CPU/bandwidth maxed out
└─ Peers disconnecting

Immediate actions (within 30 minutes):

1. Enable DDoS mitigation
├─ Cloudflare: Enable "I'm Under Attack" mode
├─ Firewall: Rate-limit connections (iptables / ufw)
├─ Null-route attacking IPs (if identifiable)
└─ Switch to backup IP if available

2. Alert other validators
└─ Telegram: "validator-{id} under DDoS, investigating"

3. Contact hosting provider
├─ Request upstream DDoS protection
├─ Consider temporary IP change
└─ Log attack traffic (for analysis)
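
A hedged sketch of the rate-limit and null-route steps above, assuming the default Substrate p2p port 30333 (adjust to your configuration); the blocked prefix is a documentation example only:

# iptables: throttle new inbound connections to the p2p port
iptables -A INPUT -p tcp --dport 30333 -m conntrack --ctstate NEW \
         -m limit --limit 30/minute --limit-burst 20 -j ACCEPT
iptables -A INPUT -p tcp --dport 30333 -m conntrack --ctstate NEW -j DROP

# ufw alternative: deny sources that open too many connections in a short window
ufw limit 30333/tcp

# Null-route an identified attacking range (example prefix only)
ip route add blackhole 203.0.113.0/24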

Within 2 hours:

1. Assess impact
├─ How long was validator offline?
├─ Missed blocks / finality votes?
└─ Any data loss?

2. Restore service
├─ Bring node back online (with mitigation active)
├─ Verify sync status (check latest block)
└─ Monitor for continued attack

3. Document attack
├─ Attack duration (start / end time)
├─ Attack vector (UDP flood, SYN flood, application layer?)
├─ Source IPs (if known)
└─ Mitigation effectiveness

Eclipse Attack:

Symptoms:
├─ Node isolated from legitimate peers
├─ Only connected to attacker-controlled peers
├─ Receives invalid blocks / false data
└─ Appears to be syncing but on wrong chain

Immediate actions (within 15 minutes):

1. Disconnect all peers
└─ Restart node with --reserved-only flag

2. Connect to known-good validators
└─ Use explicit --reserved-nodes list (trusted validators only)

3. Verify chain state
├─ Compare block hash with other validators
├─ Check finality (via telemetry or block explorer)
└─ Re-sync if on wrong fork

4. Alert network
└─ Telegram: "Eclipse attack detected on validator-{id}"

3. Runtime Bug (Consensus-Breaking)

Definition: Critical bug in blockchain runtime causing network halt or invalid state.

Symptoms:

Network-wide:
├─ Finality stalled (no new finalized blocks)
├─ Validators producing conflicting blocks
├─ Invalid state transitions
└─ Nodes crashing repeatedly

Immediate actions (within 1 hour):

1. STOP validator node (if instructed by core team)
└─ Prevent further damage to chain state

2. Join emergency coordination
├─ Telegram: @pilier_validators_emergency
└─ Core team will provide instructions

3. Test proposed fix on local testnet
├─ Core team provides patched runtime
├─ Validator tests on isolated node
└─ Verify fix resolves issue

4. Coordinate upgrade
├─ All validators must upgrade simultaneously
├─ Agree on block height for activation
└─ Execute on schedule (no early/late upgrades)
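
One way to run step 3 in isolation, assuming the patched binary supports the standard --dev/--tmp flags and exposes the usual state_getRuntimeVersion RPC (a sketch, not the core team's official test procedure; binary name and port are placeholders):

# Throwaway local chain with the patched binary
./pilier-node-patched --dev --tmp

# In a second terminal: confirm the runtime version it reports
curl -s -H "Content-Type: application/json" \
     -d '{"id":1,"jsonrpc":"2.0","method":"state_getRuntimeVersion","params":[]}' \
     http://localhost:9944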

Within 2 hours:

1. Execute emergency governance vote
├─ Proposal: "Emergency runtime upgrade to fix [bug]"
├─ Fast-track voting: 24-48 hours (instead of 14 days)
├─ Validators vote based on testnet results
└─ Requires 80% approval (high bar for emergency)

2. Deploy fix
├─ Update node binary
├─ Restart validator
├─ Verify network recovers
└─ Monitor for 24 hours

3. Document incident
└─ Post-mortem published within 7 days

4. Hardware Failure

Definition: Critical hardware component failed (disk, RAM, CPU, network).

Symptoms:

Common failures:
├─ Disk failure (I/O errors, filesystem corruption)
├─ RAM failure (kernel panics, random crashes)
├─ Network card failure (no connectivity)
└─ Power supply failure (unexpected shutdowns)

Immediate actions (within 30 minutes):

1. Diagnose failure
├─ Check system logs: journalctl -xe
├─ Test hardware: smartctl (disks), memtest (RAM)
└─ Identify failed component

2. Failover to backup (if available)
├─ Switch DNS to backup server IP
├─ Sync blockchain data from snapshot/backup
├─ Start validator on backup hardware
└─ Estimate: 1-4 hours to restore

3. Alert other validators
└─ Telegram: "validator-{id} hardware failure, restoring from backup, ETA 2 hours"

Within 6 hours:

1. If no backup: Emergency hardware replacement
├─ Order replacement part (same-day delivery if possible)
├─ Or rent temporary cloud server (OVH, Hetzner)
├─ Sync blockchain (may take 6-24 hours for full sync)
└─ Resume validation

2. Document downtime
├─ Failure timestamp
├─ Root cause (component failure)
├─ Restoration time
└─ Missed blocks / votes

3. Post-incident review
├─ Why no backup? (if applicable)
├─ How to prevent? (RAID, redundant PSU, monitoring)
└─ Update disaster recovery plan

5. Persistent Downtime (>10 Days)

Definition: Validator offline for more than 10 consecutive days without communication.

This triggers governance removal process.

Validator obligations during extended downtime:

If you know you'll be offline >24 hours:

1. Notify other validators IMMEDIATELY
└─ Telegram: "validator-{id} will be offline [duration] due to [reason]"

2. Provide ETA for restoration
└─ "Expect to be back online by [date/time]"

3. Daily status updates (if downtime extends)
└─ "Still working on [issue], ETA now [new date]"

If downtime exceeds 10 days:
├─ Expect governance proposal: "Remove validator-{id} for persistent downtime"
└─ You can submit counter-proposal: "Extend grace period, validator returning [date]"

For other validators:

If peer validator offline >10 days with no communication:

1. Attempt contact (all channels)
├─ Email: validator-ops@entity.org
├─ Phone: Emergency contact number
├─ Social media: LinkedIn, Twitter (last resort)
└─ Document contact attempts (for governance proposal)

2. Submit removal proposal (if no response after 15 days)
├─ Evidence: On-chain telemetry (last heartbeat)
├─ Justification: Non-responsive, Charter violation
├─ Grace period: 30-day notice before removal
└─ Voting period: 14 days

3. Execute removal (if approved)
├─ Remove validator from the active set (runtime call)
├─ Archive validator data (for transparency)
└─ Redistribute block production among remaining validators

High Priority Incidents (24-Hour Response)

1. Performance Degradation

Symptoms:

Validator producing <90% of expected blocks:
├─ Expected: ~20% of blocks (if 5 validators)
├─ Actual: <18% of blocks
└─ Duration: >7 consecutive days

Actions (within 24 hours):

1. Diagnose root cause
├─ Check CPU usage: top, htop
├─ Check disk I/O: iostat, iotop
├─ Check network: ping latency, packet loss
├─ Check peers: how many connected? (should be 10+)
└─ Check logs: any errors? warnings?

2. Apply fixes
├─ If CPU bound: Upgrade to higher core count
├─ If disk I/O bound: Switch to NVMe SSD
├─ If network: Optimize firewall rules, switch ISP
├─ If peer issues: Add more bootnodes
└─ If logs show errors: Update node binary, clear cache

3. Monitor improvement
├─ Track block production rate (next 48 hours)
├─ Should return to >95% expected blocks
└─ If not: Consider hardware upgrade or hosting change
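
A sketch of the diagnosis step above, assuming the node's RPC listens on localhost:9944 and the sysstat tools are installed:

# Peer count and sync status (system_health is a standard Substrate RPC)
curl -s -H "Content-Type: application/json" \
     -d '{"id":1,"jsonrpc":"2.0","method":"system_health","params":[]}' \
     http://localhost:9944
# result contains "peers", "isSyncing", "shouldHavePeers"

# CPU and disk hot spots
pidstat -u -p "$(pidof pilier-node)" 5 3   # CPU usage of the node process
iostat -x 5 3                              # per-device utilisation and await times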

2. Missed Runtime Upgrade

Scenario: Network upgraded to new runtime, but validator still running old version.

Symptoms:

├─ Validator producing blocks, but they're being rejected
├─ "Invalid runtime version" errors in logs
├─ Finality participation drops to 0%
└─ Telemetry shows "outdated" status

Actions (within 6 hours):

1. Identify missed upgrade
├─ Check governance proposals: Was there a runtime upgrade?
├─ Check the current runtime version: query rpc.pilier.net and compare with your local node (see the sketch after this list)
└─ Check Telegram announcements (core team posts upgrade notices)

2. Upgrade immediately
├─ Download latest binary: wget https://releases.pilier.net/v1.x.x
├─ Stop node: systemctl stop pilier-node
├─ Replace binary: mv pilier-node /usr/local/bin/
├─ Start node: systemctl start pilier-node
└─ Verify sync: check logs for successful block import

3. Apologize + document
├─ Telegram: "validator-{id} missed runtime upgrade, now fixed"
├─ Forum post: Explain why missed (monitoring gap? missed announcement?)
└─ Update procedures to prevent recurrence (subscribe to announcements)
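
A sketch of the version check in step 1, comparing your local node with the public endpoint (specVersion is the field that changes on a runtime upgrade; the local RPC port is an assumption):

for endpoint in http://localhost:9944 https://rpc.pilier.net; do
  echo "== $endpoint"
  curl -s -H "Content-Type: application/json" \
       -d '{"id":1,"jsonrpc":"2.0","method":"state_getRuntimeVersion","params":[]}' \
       "$endpoint" | grep -o '"specVersion":[0-9]*'
done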

Medium Priority Incidents (7-Day Response)

1. Certificate Expiration

TLS/SSL certificates expire (if using HTTPS for RPC/telemetry).

Actions (within 7 days before expiry):

1. Renew certificate
├─ Let's Encrypt: certbot renew
├─ Or manual: Generate new CSR, get signed cert
└─ Update nginx/apache config

2. Restart web server
└─ systemctl restart nginx

3. Verify
└─ Check expiry: echo | openssl s_client -connect validator.pilier.net:443 2>/dev/null | openssl x509 -noout -enddate
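
To automate that check (for cron or a monitoring probe), a sketch using openssl's -checkend:

# Exit non-zero (and warn) if the certificate expires within 30 days (2592000 s)
echo | openssl s_client -connect validator.pilier.net:443 -servername validator.pilier.net 2>/dev/null \
  | openssl x509 -noout -checkend 2592000 \
  || echo "Certificate expires in <30 days - renew now"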

2. Monitoring Alerts Misconfigured

False positives or missing alerts.

Actions (within 7 days):

1. Review alert thresholds
├─ Too sensitive? (alerting on every minor spike)
├─ Too lax? (missed actual outage)
└─ Adjust in Prometheus / Grafana

2. Test alerts
└─ Simulate failure (stop node briefly, verify alert fires)

3. Document tuning
└─ Update monitoring runbook
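
A sketch of the simulated-failure test in step 2, assuming Prometheus Alertmanager with its v2 API on localhost:9093 (adjust the wait to your alert's "for:" duration):

systemctl stop pilier-node
sleep 300                                                      # wait past the alert's pending window
curl -s http://localhost:9093/api/v2/alerts | grep -i pilier   # the downtime alert should be listed
systemctl start pilier-node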

Low Priority Incidents (30-Day Response)

1. Routine Maintenance

Scheduled server updates, OS patches.

Actions (within 30 days):

1. Plan maintenance window
├─ Choose low-traffic period (weekends)
├─ Notify other validators 48 hours in advance
└─ Expected downtime: <1 hour

2. Execute maintenance
├─ apt update && apt upgrade (Ubuntu)
├─ Restart server if kernel updated
└─ Verify validator resumes normally

3. Document
└─ Log maintenance in runbook (for auditing)
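
A sketch of the maintenance steps on an Ubuntu host (service name follows the rest of this document):

apt update && apt upgrade -y
[ -f /var/run/reboot-required ] && echo "Kernel updated - reboot during the window"
# after any reboot:
systemctl status pilier-node --no-pager    # service active?
journalctl -u pilier-node -f               # watch block import resume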

Emergency Contacts

24/7 Hotline

Critical incidents only (key compromise, network attack):

📞 Phone: +33 X XX XX XX XX
📧 Email: security@pilier.org
Response time: <30 minutes


Validator Communication

All validators:

💬 Telegram: @pilier_validators (private channel)
📧 Email: validators@pilier.org
🌐 Forum: forum.pilier.net/validators

Emergency coordination:

💬 Telegram: @pilier_validators_emergency (critical incidents only)


Core Team

Technical support:

📧 tech-support@pilier.net
⏰ Response: <24 hours (business days)

Governance questions:

📧 governance@pilier.net


Communication Requirements

When to Alert Other Validators

Always alert for:

  • ✅ Critical incidents (key compromise, attack, network issue)
  • ✅ Planned downtime >1 hour
  • ✅ Performance degradation (producing <90% of expected blocks)
  • ✅ Hardware failures (even if quickly resolved)

Optional (but recommended) alert for:

  • ⚠️ Routine maintenance (<1 hour downtime)
  • ⚠️ Minor issues (resolved within 30 minutes)

Alert Format (Telegram)

Quick alert:

🔴 validator-lyon-01: CRITICAL
Issue: Key compromise suspected
Status: Node stopped, rotating keys
ETA: 2 hours
Contact: ops@univ-lyon.fr

Update:

🟢 validator-lyon-01: RESOLVED
Issue: Key compromise (root cause: phishing)
Actions: Keys rotated, governance proposal submitted
Status: Node back online
Post-mortem: Will publish within 48 hours

Incident Report Format (Email)

Send to: validators@pilier.org
Subject: Incident Report - validator-{id} - [Date]

Template:

## Incident Summary

**Validator:** validator-lyon-01
**Date:** 2027-03-15
**Severity:** Critical
**Duration:** 3 hours 45 minutes
**Impact:** Missed 225 blocks, 0 finality votes during outage

## Timeline

2027-03-15 14:23 UTC: Phishing email received by operator
2027-03-15 14:35 UTC: Operator clicked link, entered credentials
2027-03-15 15:00 UTC: Suspicious activity detected (IP from unknown location)
2027-03-15 15:10 UTC: Validator node stopped (security measure)
2027-03-15 15:15 UTC: Other validators alerted via Telegram
2027-03-15 15:30 UTC: Session keys rotated (offline machine)
2027-03-15 16:45 UTC: Governance proposal submitted (fast-track)
2027-03-15 18:05 UTC: Validator back online (new keys)

## Root Cause

Phishing attack targeting validator operator.
Email impersonated Pilier core team, requesting "urgent security update."

## Impact Assessment

**Network impact:**
├─ Validator offline: 3h 45min
├─ Missed blocks: 225 / 900 (25% of validator's responsibility during outage)
├─ Finality: Unaffected (4/5 validators still active, >2/3 threshold maintained)
└─ Network continued operating normally

**Validator impact:**
├─ Uptime: 99.48% (month-to-date, including this incident)
├─ Reputation: Minor hit (first incident in 6 months)
└─ Financial: No loss (insurance covers incident response costs)

## Corrective Actions

**Immediate:**
├─ Session keys rotated and stored in HSM
├─ 2FA enabled on all accounts
├─ Email filters updated (block phishing domains)
└─ Operator re-trained on security awareness

**Long-term:**
├─ Hardware security keys (YubiKey) ordered for all operators
├─ Phishing simulation training (quarterly)
├─ Consider moving to hardware-isolated signing (air-gapped)
└─ Update security runbook

## Lessons Learned

1. Phishing remains #1 attack vector (human error)
2. Response time was good (detected within 25 minutes)
3. Network resilience confirmed (4/5 validators sufficient)
4. Need better operator training (scheduled for Q2)

## Attachments

├─ Phishing email (screenshot)
├─ Access logs (sanitized)
└─ Governance proposal #127 (key rotation)

---

Submitted by: ops@univ-lyon.fr
Date: 2027-03-16


Post-Incident Procedures

1. Post-Mortem Report (Within 48-72 Hours)

Required for:

  • Critical incidents
  • High-impact incidents (>4 hours downtime)
  • Security breaches

Published on:

  • Forum: forum.pilier.net/validators/incidents
  • Governance portal (linked to incident)

Template: See Incident Report Format above


2. Update Runbooks

After every incident:


1. Review what worked / didn't work
2. Update procedures in internal runbook
3. Share improvements with other validators (forum post)
4. Update monitoring / alerting (prevent recurrence)


3. Insurance Claims (If Applicable)

For covered incidents:


1. Notify insurer within 72 hours
├─ Email: claims@cyber-insurer.com
└─ Reference: Policy #XXX-YYYY

2. Provide documentation
├─ Incident report (see template above)
├─ Forensic analysis (if available)
├─ Cost breakdown (consultant fees, hardware replacement)
└─ Proof of incident (logs, screenshots)

3. Claim processing
├─ Typical timeline: 30-60 days
└─ Reimbursement: Direct deposit or check


Security Checklist (Quarterly Review)

Every 3 months, validators should:


□ Review access controls (who has SSH access?)
□ Rotate passwords / SSH keys
□ Update firewall rules (any new IPs to whitelist?)
□ Test backup restoration (can you actually restore from backup?)
□ Review monitoring alerts (any false positives / missed alerts?)
□ Update node binary (latest stable version?)
□ Check certificate expiry (renew if <30 days remaining)
□ Review insurance coverage (still adequate? any incidents to report?)
□ Test emergency contact (core team sends test alert)
□ Update disaster recovery plan (any changes to infrastructure?)


Templates

Template 1: Incident Report

# Incident Report: [Brief Description]

**Validator:** validator-{id}
**Date:** YYYY-MM-DD
**Severity:** Critical / High / Medium / Low
**Duration:** X hours Y minutes
**Impact:** [Brief summary]

## Timeline

YYYY-MM-DD HH:MM UTC: [Event 1]
YYYY-MM-DD HH:MM UTC: [Event 2]
...

## Root Cause

[What caused the incident?]

## Impact Assessment

**Network impact:**
├─ [Metric 1]
├─ [Metric 2]
└─ [Conclusion]

**Validator impact:**
├─ [Metric 1]
├─ [Metric 2]
└─ [Conclusion]

## Corrective Actions

**Immediate:**
├─ [Action 1]
├─ [Action 2]
└─ [Action 3]

**Long-term:**
├─ [Action 1]
├─ [Action 2]
└─ [Action 3]

## Lessons Learned

1. [Lesson 1]
2. [Lesson 2]
3. [Lesson 3]

## Attachments

├─ [File 1]
└─ [File 2]

---

Submitted by: [email]
Date: YYYY-MM-DD

Template 2: Downtime Notification (Email)

To: validators@pilier.org
Subject: Planned Downtime - validator-{id} - [Date]

Hello fellow validators,

This is to notify you of planned maintenance for validator-lyon-01.

**Maintenance Window:**
├─ Start: 2027-04-20 02:00 UTC (Saturday)
├─ Duration: ~1 hour
└─ End: 2027-04-20 03:00 UTC (expected)

**Reason:**
Routine OS updates (security patches) + hardware inspection.

**Expected Impact:**
├─ Validator offline during maintenance window
├─ Missed blocks: ~600 (negligible, network continues with 4/5 validators)
└─ Finality: Unaffected (>2/3 threshold maintained)

**Contact:**
If any issues arise, reach me at:
├─ Email: ops@univ-lyon.fr
├─ Telegram: @lyon_validator_ops
└─ Phone: +33 X XX XX XX XX

Thank you for your understanding.

---
University of Lyon Validator Team
validator-lyon-01

Template 3: Emergency Contact List

# Emergency Contacts - validator-lyon-01

**Last updated:** 2027-03-01

## Primary Contacts

**Validator Operator:**
├─ Name: Jean Dupont
├─ Email: ops@univ-lyon.fr
├─ Phone: +33 6 XX XX XX XX (24/7)
├─ Telegram: @lyon_validator_ops
└─ Backup: Marie Martin (backup-ops@univ-lyon.fr, +33 6 YY YY YY YY)

**University IT Department:**
├─ Email: support-it@univ-lyon.fr
├─ Phone: +33 4 XX XX XX XX (business hours)
└─ Emergency: +33 6 ZZ ZZ ZZ ZZ (after hours)

## External Contacts

**Pilier Core Team:**
├─ Security hotline: +33 X XX XX XX XX
├─ Email: security@pilier.org
└─ Telegram: @pilier_validators_emergency

**Hosting Provider (OVH):**
├─ Support: +33 9 XX XX XX XX
├─ Email: support@ovh.com
└─ Customer ID: ABC123456

**Insurance (Cyber Liability):**
├─ Provider: Hiscox
├─ Policy #: XXX-YYYY-ZZZZ
├─ Claims: +33 1 XX XX XX XX
└─ Email: claims@hiscox.fr

## Internal Escalation

**Level 1:** Validator operator (Jean Dupont)
**Level 2:** Backup operator (Marie Martin)
**Level 3:** University IT Manager (Pierre Lefebvre, it-manager@univ-lyon.fr)
**Level 4:** University CIO (Sophie Bernard, cio@univ-lyon.fr)

---

**Test Schedule:**
Emergency contacts tested quarterly (Jan, Apr, Jul, Oct)
Last test: 2027-01-15 (PASSED)
Next test: 2027-04-15

Template 4: Post-Mortem

# Post-Mortem: [Incident Title]

**Date:** YYYY-MM-DD
**Authors:** [Name(s)]
**Status:** Draft / Final
**Related Incident:** [Link to incident report]

---

## Executive Summary

[2-3 sentences: What happened, impact, resolution]

---

## What Happened?

[Detailed narrative: chronological story of incident]

---

## Root Cause Analysis

**Primary cause:**
[What was the immediate trigger?]

**Contributing factors:**
├─ [Factor 1: e.g., insufficient monitoring]
├─ [Factor 2: e.g., lack of backup]
└─ [Factor 3: e.g., operator error]

**5 Whys:**

1. Why did X happen? → Because Y
2. Why did Y happen? → Because Z
3. Why did Z happen? → Because A
4. Why did A happen? → Because B
5. Why did B happen? → **Root cause: C**

---

## What Went Well?

├─ [Positive 1: e.g., quick detection]
├─ [Positive 2: e.g., good communication]
└─ [Positive 3: e.g., backup worked]

---

## What Didn't Go Well?

├─ [Negative 1: e.g., slow response]
├─ [Negative 2: e.g., missing documentation]
└─ [Negative 3: e.g., unclear responsibilities]

---

## Action Items

| Action | Owner | Deadline | Priority |
| ---------- | ------ | ---------- | -------- |
| [Action 1] | [Name] | YYYY-MM-DD | High |
| [Action 2] | [Name] | YYYY-MM-DD | Medium |
| [Action 3] | [Name] | YYYY-MM-DD | Low |

---

## Lessons Learned

1. [Lesson 1]
2. [Lesson 2]
3. [Lesson 3]

---

## Timeline (Detailed)

| Time (UTC) | Event | Action Taken |
| ---------- | --------- | ------------ |
| HH:MM | [Event 1] | [Action 1] |
| HH:MM | [Event 2] | [Action 2] |
| ... | ... | ... |

---

## Metrics

**Downtime:** X hours Y minutes
**Missed blocks:** N / M (N%)
**Finality impact:** Yes / No
**User impact:** [Description]
**Cost:** €X (hardware, consulting, etc.)

---

**Published:** YYYY-MM-DD
**Forum:** [Link]
**Governance:** [Link if proposal submitted]

Summary

Incident classification:

  • 🔴 Critical (2h): Key compromise, network attack, runtime bug, hardware failure
  • 🟠 High (24h): Performance issues, missed upgrades
  • 🟡 Medium (7d): Certs, monitoring
  • 🟢 Low (30d): Routine maintenance

Communication:

  • Alert validators for all critical incidents
  • Provide ETA and regular updates
  • Post incident reports within 48 hours

Documentation:

  • Use templates (incident report, downtime notice, post-mortem)
  • Publish on forum (transparency)
  • Update runbooks (continuous improvement)

Emergency contacts:

  • 📞 Security hotline: +33 X XX XX XX XX
  • 💬 Telegram: @pilier_validators_emergency
  • 📧 Email: security@pilier.org

Next Steps

For validators:

  1. ✅ Save emergency contact list (Template 3)
  2. ✅ Test incident response (simulate hardware failure)
  3. ✅ Set up monitoring alerts (Prometheus + Grafana)
  4. ✅ Review quarterly security checklist
  5. 📧 Questions? Email: validators@pilier.org

Support

📧 Security: security@pilier.org
💬 Telegram: @pilier_validators
🌐 Forum: forum.pilier.net/validators
📞 Emergency: +33 X XX XX XX XX (24/7)