openwrt/docs/implementation-plan.md
Dan Head 4c0982f854 chore: initial repo setup with baseline config backup
- Pull current config from router (OpenWRT 24.10.2)
- Add backup, safe-apply, and push-all scripts
- Add CLAUDE.md with workflow rules and context
- Add network-map.md with current topology and planned VLANs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-02 23:14:06 +01:00

VLAN Implementation Plan

Guiding Principles

  • Every risky change goes through safe-apply.sh with a revert window
  • Build alongside, then cut over — new VLANs and SSIDs are created while the existing flat network stays up; the cutover is a single planned step
  • Servers migrate before clients — HA and other services need stable IPs before IoT/media devices reconnect to them
  • Have a fallback — keep a phone on mobile data during the cutover so you can SSH into the router if WiFi drops and doesn't recover
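The revert-window idea in the first principle can be illustrated locally. The repo's safe-apply.sh is not shown in this document, so the following is only a minimal sketch of the general pattern, not the script itself: the function name, the `.bak` suffix and the confirm-flag file are all hypothetical.

```shell
# Hypothetical sketch of "apply with a revert window" (not the repo's safe-apply.sh).
# apply_with_revert <live-file> <new-file> <window-seconds> <confirm-flag>
apply_with_revert() {
    live=$1 new=$2 window=$3 flag=$4
    backup="$live.bak"
    cp "$live" "$backup" || return 1      # keep the known-good copy
    cp "$new" "$live" || return 1         # apply the candidate config
    waited=0
    while [ "$waited" -lt "$window" ]; do
        if [ -e "$flag" ]; then           # operator confirmed in time
            rm -f "$flag" "$backup"
            echo "confirmed"
            return 0
        fi
        sleep 1
        waited=$((waited + 1))
    done
    cp "$backup" "$live"                  # window expired: roll back
    rm -f "$backup"
    echo "reverted"
    return 1
}
```

On a real router the timer has to run on the router itself, since the point of the window is that the operator may have lost their connection.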

Prerequisites (Complete Before Any Router Changes)

  • Fill in all MAC addresses in vlan-requirements.md
  • Note Shield TV's current hostname/IP from LuCI
  • Document all current port forwards (see docs/network-map.md → Port Forwards)
  • Note any hardcoded IPs in Home Assistant — Frigate (10.0.0.12) and Enphase Envoy (10.0.0.144); Frigate also has doorbell camera IP (10.0.0.41) hardcoded in its config
  • DNS records confirmed — managed in router config/dhcp, not PiHole (no local DNS records in PiHole UI or pihole.toml)
  • Add PiHole Local DNS records (Settings → Local DNS → DNS Records) for split-horizon DNS — internal clients resolve service hostnames to everlost's internal IP directly, bypassing hairpin and keeping services reachable during WAN outages:
    • jester.danielhead.com → 10.0.0.2
    • wayfaerer.danielhead.com → 10.0.0.2
    • wg0.danielhead.com → 10.0.0.2
    • (add any future service subdomains here too)
  • Push updated config/dhcp to remove now-redundant dnsmasq domain entries: ./scripts/safe-apply.sh dhcp 5
  • Collect MAC addresses for internet-allowed IoT devices from LuCI → Network → DHCP Leases (Hypervolt, OCTO-CADLITE, HP printer, Alarmo, Envoy) — fill into vlan-requirements.md
  • Complete the br-guest port assignment test (see docs/pre-implementation-findings.md → Pending Validation Test)
  • Push updated config/network to remove LAN4 from br-guest
  • Run ./scripts/backup.sh to snapshot current working config

Phase 0 — Upgrade router to openwrt-25.12.2

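Upgrade OpenWRT to the latest stable version using sysupgrade. The ramips/mt7621 target supports config-preserving upgrades, and command-line sysupgrade keeps /etc/config/* by default; only the -n flag discards the configuration. The -k flag used below additionally records the list of installed packages in /etc/backup/installed_packages.txt, which is useful when reinstalling packages afterwards.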

Pre-flight:

# Snapshot current config into the repo first
./scripts/backup.sh

# Verify the backup looks correct before proceeding
git diff config/

Copy firmware to the router and verify checksum:

# Check available space first
ssh openwrt "df -h /tmp"

# Copy the firmware binary from this repo to the router
scp openwrt-25.12.2-ramips-mt7621-tplink_archer-ax23-v1-squashfs-sysupgrade.bin openwrt:/tmp/

# Verify checksum matches the value on downloads.openwrt.org
ssh openwrt "sha256sum /tmp/openwrt-25.12.2-ramips-mt7621-tplink_archer-ax23-v1-squashfs-sysupgrade.bin"
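Comparing the digest by eye is error-prone, so the check can be scripted. This is a minimal sketch, assuming `sha256sum` is available wherever it runs; `verify_sha256` is a hypothetical helper, and the expected value still has to be copied by hand from the sha256sums file on downloads.openwrt.org.

```shell
# verify_sha256 <file> <expected-hex-digest>
# Succeeds only if the file's SHA-256 digest equals the expected value.
verify_sha256() {
    actual=$(sha256sum "$1" | cut -d' ' -f1)
    [ "$actual" = "$2" ]
}

# Example (placeholder digest, replace with the published one):
#   verify_sha256 firmware.bin "<digest from downloads.openwrt.org>" && echo "checksum OK"
```

The same function can be pasted into an `ssh openwrt` session to check the copy in /tmp rather than the local file.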

Apply the upgrade:

# Config in /etc/config/* is preserved by default (only -n wipes it); -k also saves the installed-package list
ssh openwrt "sysupgrade -k /tmp/openwrt-25.12.2-ramips-mt7621-tplink_archer-ax23-v1-squashfs-sysupgrade.bin"

The router will reboot. Reconnect after ~2 minutes.

Verify:

ssh openwrt "cat /etc/openwrt_release"    # confirm new version
ssh openwrt "uci show network.lan.ipaddr" # confirm LAN IP intact
./scripts/backup.sh                        # confirm config still matches repo
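The version check can be reduced to a single comparable string. A small sketch, assuming /etc/openwrt_release contains a line of the form `DISTRIB_RELEASE='x.y.z'`; `release_of` is a hypothetical helper name.

```shell
# release_of: reads openwrt_release content on stdin and prints the
# DISTRIB_RELEASE value without its surrounding quotes.
release_of() {
    sed -n 's/^DISTRIB_RELEASE=.\(.*\).$/\1/p'
}

# Example:
#   [ "$(ssh openwrt 'cat /etc/openwrt_release' | release_of)" = "25.12.2" ] && echo "upgrade OK"
```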

Rollback: sysupgrade does not support automatic rollback. If the router becomes unreachable after upgrading, connect via ethernet and access it at 192.168.1.1 (default IP after a reset). Restore config using the Clean Restore steps at the bottom of this document.

Phase 1 — Install Required Packages

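Low risk. Packages are additive; nothing changes until configured. OpenWrt 25.12 replaces opkg with apk, so once Phase 0 is complete the equivalent of the command below is apk update && apk add avahi-daemon kmod-bridge.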

ssh openwrt "opkg update && opkg install avahi-daemon kmod-bridge"
  • avahi-daemon — mDNS reflection across VLANs
  • kmod-bridge — kernel bridging support for VLAN interfaces (may already be present)

Verify: ssh openwrt "avahi-daemon --version"


Phase 2 — Create VLAN Interfaces (network config)

Edit config/network to add VLAN bridge interfaces alongside the existing br-lan.

New interfaces to add:

| Interface | Bridge | Subnet | VLAN ID |
|---|---|---|---|
| lan_trusted | br-trusted | 10.0.1.1/24 | 1 |
| lan_servers | br-servers | 10.0.10.1/24 | 10 |
| lan_iot | br-iot | 10.0.20.1/24 | 20 |
| lan_media | br-media | 10.0.30.1/24 | 30 |
| lan_guest | br-guest | 10.0.40.1/24 | 40 |

The existing flat br-lan (10.0.0.1/24) stays untouched until cutover.

./scripts/safe-apply.sh network 10

Verify: ssh openwrt "ip addr show" (new bridge interfaces should appear)

Rollback: If the router becomes unreachable, it auto-reverts in 10 minutes
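As an illustration of what each new entry in config/network might look like, the sketch below prints one bridge device plus its interface. The helper name is hypothetical and the port assignment (here `lan2` for servers) is an assumption borrowed from the one-port-per-VLAN example later in this document; review the output before appending it to the file.

```shell
# emit_vlan_bridge <name> <port> <router-ip>
# Prints a UCI bridge device and a static interface on a /24.
emit_vlan_bridge() {
    cat <<EOF
config device
    option name 'br-$1'
    option type 'bridge'
    list ports '$2'

config interface 'lan_$1'
    option device 'br-$1'
    option proto 'static'
    option ipaddr '$3'
    option netmask '255.255.255.0'
EOF
}

# Example: emit_vlan_bridge servers lan2 10.0.10.1
```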


Phase 3 — Configure DHCP Pools

Edit config/dhcp to add a pool for each new VLAN interface. Each pool advertises:

  • Gateway: the router's IP on that VLAN (e.g. 10.0.1.1)
  • DNS: PiHole (10.0.10.2)
  • Static leases for servers, Shield TV, and doorbell camera
./scripts/safe-apply.sh dhcp 5

Verify: Connect a test device to the router via ethernet, manually set IP to e.g. 10.0.1.100/24 gateway 10.0.1.1 — confirm it can ping the gateway.
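A pool and a static lease for config/dhcp could be generated along these lines. This is a sketch only: the helper names, the start/limit/leasetime values and the MAC address in the example are placeholders, not values from this repo. DHCP option 3 is the gateway and option 6 the DNS server.

```shell
# emit_dhcp_pool <interface> <gateway-ip> <dns-ip>
emit_dhcp_pool() {
    cat <<EOF
config dhcp '$1'
    option interface '$1'
    option start '100'
    option limit '150'
    option leasetime '12h'
    list dhcp_option '3,$2'
    list dhcp_option '6,$3'
EOF
}

# emit_static_lease <name> <mac> <ip>
emit_static_lease() {
    cat <<EOF
config host
    option name '$1'
    option mac '$2'
    option ip '$3'
EOF
}

# Example:
#   emit_dhcp_pool lan_trusted 10.0.1.1 10.0.10.2
#   emit_static_lease shield-tv AA:BB:CC:DD:EE:FF 10.0.30.2   # placeholder MAC
```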


Phase 4 — Configure Firewall Zones and Rules

Edit config/firewall to add zones for each VLAN and the cross-VLAN rules from vlan-requirements.md. The existing lan zone stays in place.

Key rules to implement:

  • trusted → internet allow
  • trusted → media allow (Cast ports + Sonos ports)
  • trusted → servers allow (SSH + Nginx)
  • servers → iot allow all
  • servers → media allow all
  • media → servers allow (Plex TCP 32400, Jellyfin TCP 8096)
  • iot → internet block by default — set IoT zone forward policy to REJECT
  • iot → internet explicit allows for: Hypervolt (10.0.20.2), OCTO-CADLITE (10.0.20.3), HP printer (10.0.20.4), Alarmo (10.0.20.5), Envoy (10.0.20.6)
  • guest → internet allow only
  • DNS hijack: redirect all outbound TCP/UDP 53 to PiHole (10.0.10.2)

Note: The per-device IoT allow rules depend on static leases being in place (Phase 3) so those devices have predictable IPs. Verify static leases are active before applying firewall rules.

./scripts/safe-apply.sh firewall 10

Verify: Zones appear in LuCI → Network → Firewall
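The per-device IoT allows are repetitive, so they lend themselves to generation. The sketch below assumes the zones are named `iot` and `wan` and that the rule names are free to choose; both are assumptions, as are the helper names. The IPs are the ones listed in the key rules above.

```shell
# emit_iot_allow <device-name> <src-ip>
# Prints one rule letting a single IoT device reach the internet.
emit_iot_allow() {
    cat <<EOF
config rule
    option name 'iot_${1}_internet'
    option src 'iot'
    option src_ip '$2'
    option dest 'wan'
    option proto 'all'
    option target 'ACCEPT'
EOF
}

# Prints rules for every internet-allowed IoT device.
emit_all_iot_allows() {
    while read -r name ip; do
        emit_iot_allow "$name" "$ip"
        echo
    done <<EOF
hypervolt 10.0.20.2
octo_cadlite 10.0.20.3
hp_printer 10.0.20.4
alarmo 10.0.20.5
envoy 10.0.20.6
EOF
}
```

With the IoT zone's forward policy set to REJECT, anything not matched by one of these rules stays blocked.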


Phase 5 — Add New SSIDs

Edit config/wireless to add new SSIDs mapped to VLAN bridge interfaces. Do not change Moonshield yet — it stays on the flat br-lan for now.

New SSIDs to add:

| SSID | Interface | Band |
|---|---|---|
| Cloud Connected | br-iot | 2.4GHz |
| Pinball Map | br-media | 5GHz + 2.4GHz |
| Passenger | br-guest | 2.4GHz |
./scripts/safe-apply.sh wireless 5

Verify: New SSIDs appear on a phone. Connect a test device to each and confirm it gets an IP in the right subnet (e.g. Passenger → 10.0.40.x).
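The subnet check for each test device can be written down as a small lookup. A sketch with hypothetical helper names; the SSID-to-subnet mapping follows the tables in Phases 2 and 5. The trailing dot in each prefix matters, since it stops a prefix like 10.0.1 from also matching 10.0.10.x.

```shell
# expected_prefix <ssid>: prints the subnet prefix a client should land in.
expected_prefix() {
    case "$1" in
        "Cloud Connected") echo "10.0.20." ;;
        "Pinball Map")     echo "10.0.30." ;;
        "Passenger")       echo "10.0.40." ;;
        *) return 1 ;;
    esac
}

# ip_matches_ssid <ssid> <ip>: 0 if the IP is in the right subnet,
# 1 if it is not, 2 if the SSID is unknown.
ip_matches_ssid() {
    prefix=$(expected_prefix "$1") || return 2
    case "$2" in
        "$prefix"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example: ip_matches_ssid "Passenger" 10.0.40.17 && echo "right subnet"
```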


Phase 6 — Migrate Servers (Maintenance Window Begins)

From this point, brief outages are expected. Ensure your phone is on mobile data.

Update static DHCP leases in config/dhcp to assign new IPs (10.0.10.x) to server devices. Move them from the flat br-lan DHCP to the lan_servers DHCP.

For each server (everlost, homeassistant, frigate, jester, wayfaerer):

  1. Push updated DHCP config
  2. SSH into the server and run sudo dhclient -r && sudo dhclient (or reboot) to renew its lease
  3. Confirm it gets its new 10.0.10.x IP

After all servers have new IPs:

Update config/firewall port forwards to reflect new server IPs:

| Name | Proto | WAN Port | Old Dest IP | New Dest IP |
|---|---|---|---|---|
| HTTP | TCP | 80 | 10.0.0.2 | 10.0.10.2 |
| HTTPS | TCP | 443 | 10.0.0.2 | 10.0.10.2 |
| Wireguard | UDP | 51820 | 10.0.0.2 | 10.0.10.2 |
| SSH - Everlost | TCP | 22563 | 10.0.0.2 | 10.0.10.2 |
| SSH - Home Assistant | TCP | 22553 | 10.0.0.11 | 10.0.10.3 |
| SSH - Frigate | TCP | 22583 | 10.0.0.12 | 10.0.10.4 |
| SSH - Jester | TCP | 22573 | 10.0.0.21 | 10.0.10.10 |
| SSH - Wayfaerer | TCP | 22593 | 10.0.0.22 | 10.0.10.11 |
| Plex - Jester | TCP | 32400 | 10.0.0.21 | 10.0.10.10 |
| Plex - Wayfaerer | TCP | 32450 | 10.0.0.22 | 10.0.10.11 |
./scripts/safe-apply.sh firewall 5
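The old-to-new mapping for the port forwards can be applied mechanically before pushing. The sketch below assumes config/firewall stores values in single quotes on `option dest_ip` lines (as `uci export` writes them); `remap_dest_ips` is a hypothetical helper. Matching the closing quote is what keeps 10.0.0.2 from also rewriting 10.0.0.21 and 10.0.0.22.

```shell
# remap_dest_ips: reads a UCI firewall file on stdin and writes it to stdout
# with the server dest_ip values moved to the servers VLAN.
remap_dest_ips() {
    sed -e "/dest_ip/s/'10\.0\.0\.2'/'10.0.10.2'/" \
        -e "/dest_ip/s/'10\.0\.0\.11'/'10.0.10.3'/" \
        -e "/dest_ip/s/'10\.0\.0\.12'/'10.0.10.4'/" \
        -e "/dest_ip/s/'10\.0\.0\.21'/'10.0.10.10'/" \
        -e "/dest_ip/s/'10\.0\.0\.22'/'10.0.10.11'/"
}

# Example:
#   remap_dest_ips < config/firewall > config/firewall.new
#   diff config/firewall config/firewall.new   # review before replacing the file
```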
  • Update hardcoded IPs in Home Assistant integrations:
    • Frigate (Settings → Integrations → Frigate): change host from 10.0.0.12 → 10.0.10.4
  • Confirm PiHole dashboard is reachable at 10.0.10.2

Update PiHole Local DNS records (Settings → Local DNS → DNS Records) to point to everlost's new IP:

| Name | Old IP | New IP |
|---|---|---|
| jester.danielhead.com | 10.0.0.2 | 10.0.10.2 |
| wayfaerer.danielhead.com | 10.0.0.2 | 10.0.10.2 |
| wg0.danielhead.com | 10.0.0.2 | 10.0.10.2 |

Update WireGuard config on everlost:

  1. Update wg-easy client DNS setting from 10.0.0.2 → 10.0.10.2 and regenerate client configs
  2. Verify from a WG-connected device: nslookup homeassistant.danielhead.com should return 10.0.10.2
  3. Verify WireGuard-connected devices can still reach proxied services

Verify: Home Assistant loads, all integrations show as connected, Nginx proxy still routes external traffic correctly, WireGuard clients can reach internal services.

Add temporary lan → servers firewall rule:

IoT and media devices are still on Moonshield (br-lan, 10.0.0.x) and need to keep reaching HA, Frigate etc. while you migrate them at your own pace. Add a temporary allow-all forwarding rule from the lan zone to the servers zone:

uci add firewall rule
uci set firewall.@rule[-1].name='temp_lan_to_servers'
uci set firewall.@rule[-1].src='lan'
uci set firewall.@rule[-1].dest='servers'
uci set firewall.@rule[-1].proto='all'   # rules default to TCP+UDP only; 'all' also covers ICMP
uci set firewall.@rule[-1].target='ACCEPT'
uci commit firewall
./scripts/safe-apply.sh firewall 5

Remember to remove this rule after Phase 7 — once all IoT and media devices have migrated off Moonshield, this rule is no longer needed and leaves an unintended hole.


Phase 7 — Migrate IoT Devices

  1. Connect each IoT device to Cloud Connected SSID
    • ESPHome devices: forget current WiFi in ESPHome config and re-provision, or just update SSID in the ESPHome dashboard
    • Other devices: reconnect via their app or settings
  2. Devices will get IPs in 10.0.20.x
  3. HA should rediscover ESPHome devices automatically via mDNS within a few minutes
  4. Confirm each device shows as available in HA

After IoT devices have new IPs:

  • Update hardcoded IPs in Home Assistant integrations:
    • Enphase Envoy (Settings → Integrations → Enphase Envoy): change host from 10.0.0.144 → 10.0.20.6 (the Envoy's IoT address used in the Phase 4 allow list)
  • Update doorbell camera IP in Frigate's config: change from 10.0.0.41 → the camera's static-lease address in 10.0.20.x (10.0.20.1 is the router's own address on the IoT VLAN), then restart Frigate

Remove the temporary lan → servers rule (added at end of Phase 6) once all IoT and media devices are off Moonshield:

# Find and delete the rule by name
uci delete firewall.$(uci show firewall | grep 'temp_lan_to_servers' | cut -d. -f2)
uci commit firewall
./scripts/safe-apply.sh firewall 5
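The one-line delete above runs `uci delete firewall.` with an empty section if the rule name is not found. A slightly more defensive sketch is to resolve the section first and only delete when something was matched; `rule_section_by_name` is a hypothetical helper that parses `uci show firewall` output.

```shell
# rule_section_by_name <rule-name>
# Reads `uci show firewall` output on stdin and prints the section
# (e.g. @rule[12]) whose name option equals <rule-name>; prints nothing if absent.
rule_section_by_name() {
    sed -n "s/^firewall\.\(.*\)\.name='$1'\$/\1/p" | head -n 1
}

# Example:
#   section=$(ssh openwrt "uci show firewall" | rule_section_by_name temp_lan_to_servers)
#   [ -n "$section" ] && ssh openwrt "uci delete 'firewall.$section' && uci commit firewall"
```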

Verify: All ESPHome entities, voice assistants, blinds, and sensors show as available in Home Assistant. Test a blind, a sensor reading, and a voice command. Confirm Frigate shows the doorbell camera stream.


Phase 8 — Migrate Media Devices

  1. Connect Shield TV to Pinball Map SSID
    • It will get 10.0.30.2 (static lease)
    • Open Plex and Jellyfin and, if the server is not auto-discovered, update its address to 10.0.10.10 (jester.lan)
  2. Connect consoles and speakers to Pinball Map SSID
  3. Test casting from a phone (still on flat network at this point) to speakers and Shield

Verify: Plex/Jellyfin plays content, Cast works from phone, Music Assistant in HA can control speakers, HA Shield integration shows as connected.


Phase 9 — Cutover: Move Moonshield to Trusted VLAN

This is the final disruptive step. Moonshield will briefly drop all connected devices while it moves to br-trusted.

Before starting: Plug your laptop into LAN 3 (reserved for trusted VLAN). This gives you a wired fallback — if Moonshield doesn't come back up cleanly, you keep your connection to the router and can intervene.

Edit config/wireless — change Moonshield's interface from br-lan to br-trusted.

./scripts/safe-apply.sh wireless 5

All phones and laptops on Moonshield will disconnect and immediately reconnect to the same SSID, picking up new IPs in 10.0.1.x. This typically takes 5 to 15 seconds.

Verify: Phone reconnects to Moonshield, gets 10.0.1.x IP, internet works, can cast to speakers/Shield, can reach Nginx-proxied services.


Phase 10 — DNS Hijacking

Confirm DNS hijacking rule is active:

ssh openwrt "nft list ruleset | grep -A2 'dns'"

Test it's working by temporarily setting a device's DNS to 8.8.8.8 — it should still resolve via PiHole (check PiHole query logs).
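The same test can be run from a laptop without changing its DNS settings, by asking 8.8.8.8 directly for a name that only PiHole answers with an internal address. This is a sketch under two assumptions: `dig` is installed on the test machine, and the name used has a PiHole Local DNS record (for example wg0.danielhead.com → 10.0.10.2 after Phase 6). `hijack_active` is a hypothetical helper.

```shell
# hijack_active <name> <expected-internal-ip>
# Succeeds if a query addressed to 8.8.8.8 comes back with the internal IP,
# which implies the router redirected it to PiHole.
hijack_active() {
    answer=$(dig +short "$1" @8.8.8.8 | head -n 1)
    [ "$answer" = "$2" ]
}

# Example: hijack_active wg0.danielhead.com 10.0.10.2 && echo "DNS hijack is active"
```

If the hijack is not in place, the real 8.8.8.8 returns the public record and the check fails.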


Phase 11 — avahi-daemon (mDNS Reflection)

Reflects mDNS across trusted, servers, media and IoT VLANs so that:

  • Phones (trusted) can discover Cast devices and speakers (media)
  • HA (servers) can discover IoT and media devices
  • Phones (trusted) can discover the HP printer (IoT) via AirPrint

The config is stored at files/avahi-daemon.conf in this repo. It is not a UCI file — it must be pushed manually and is not covered by safe-apply.sh.

# Install package (if not already done in Phase 1)
ssh openwrt "opkg update && opkg install avahi-daemon"

# Push config
scp files/avahi-daemon.conf openwrt:/etc/avahi/avahi-daemon.conf

# Enable and restart
ssh openwrt "/etc/init.d/avahi-daemon enable && /etc/init.d/avahi-daemon restart"

Note: There is no auto-revert safety net for this file. If avahi causes problems, disable it with ssh openwrt "/etc/init.d/avahi-daemon stop" — it is not load-bearing for routing or connectivity.

Verify: Cast devices (speakers, Shield) appear in Google Home app and in Music Assistant from a phone on Moonshield (trusted). Confirm the HP printer is discoverable via AirPrint from a phone.
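The contents of files/avahi-daemon.conf are not reproduced in this document. For orientation, a reflector setup usually comes down to two settings, `allow-interfaces` and `enable-reflector`; the sketch below prints a minimal file of that shape. The helper name is hypothetical and the interface list is taken from the VLANs named above.

```shell
# emit_avahi_conf <comma-separated-interfaces>
# Prints a minimal avahi-daemon.conf that reflects mDNS between the given interfaces.
emit_avahi_conf() {
    cat <<EOF
[server]
use-ipv4=yes
use-ipv6=no
allow-interfaces=$1

[reflector]
enable-reflector=yes
EOF
}

# Example: emit_avahi_conf br-trusted,br-servers,br-media,br-iot
```

Leaving br-guest out of the list keeps guest devices from seeing mDNS announcements from the other VLANs.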


Phase 12 — Clean Up Flat Network

Once everything is verified on the new VLANs, remove the old flat br-lan interface and its DHCP pool from the config.

./scripts/safe-apply.sh network 10
./scripts/safe-apply.sh dhcp 5

Run ./scripts/backup.sh to commit the final clean state.


Phase 13 — WAN Failover (Separate Session)

Once VLANs are stable and bedded in, tackle failover as a standalone change:

Device: GL-XE300 (Puli) 4G router, currently at 192.168.8.1 running GL.iNet 4.3.27 (OpenWRT 22.03.4).

Pre-flight: reconfigure XE300 subnet

Before wiring it in, change the XE300's LAN subnet from 192.168.8.0/24 to a 10.0.x.x range consistent with the VLAN layout. A sensible choice is 10.0.100.0/24 (XE300 at 10.0.100.1). Do this via the GL.iNet web UI (Network → LAN IP) before connecting it to the main router.

Steps:

  1. Install mwan3 package
  2. Repurpose a LAN port as WAN2 (network config change)
  3. Connect XE300 LAN port to that repurposed port
  4. Configure mwan3 health checks and failover policy
  5. Test by temporarily unplugging the primary WAN

XE300 management access

By default, LAN devices cannot reach the XE300 web UI or SSH because WAN interfaces are in the untrusted firewall zone. To retain management access from the trusted VLAN, add to the main router config:

  • A static route for 10.0.100.0/24 via the WAN2 interface (OpenWRT may add this automatically when the interface comes up)
  • A firewall rule: trusted → 10.0.100.1 allow TCP 22, 80, 443

Without this, the only way to reach the XE300 is via SSH on the main router itself (which is directly on the 10.0.100.x subnet via WAN2).
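The management firewall rule could look roughly like the output of the sketch below. The name of the zone that WAN2 ends up in is not decided in this document, so it is passed in as a parameter; the helper name and rule name are hypothetical.

```shell
# emit_xe300_mgmt_rule <wan2-zone-name>
# Prints a rule allowing SSH and web UI access from trusted to the XE300.
emit_xe300_mgmt_rule() {
    cat <<EOF
config rule
    option name 'trusted_to_xe300_mgmt'
    option src 'trusted'
    option dest '$1'
    option dest_ip '10.0.100.1'
    option proto 'tcp'
    option dest_port '22 80 443'
    option target 'ACCEPT'
EOF
}

# Example (zone name is an assumption): emit_xe300_mgmt_rule wan2
```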


DDNS — WireGuard Endpoint on Failover

When WAN2 takes over, the public IP changes. The only service that needs to remain reachable externally during a failover is WireGuard — once connected to the VPN, split-horizon DNS handles everything else internally.

Pre-flight: dedicated WireGuard hostname

Create a Cloudflare A record for a dedicated WireGuard endpoint hostname (e.g. wg0.danielhead.com) pointing to the current fibre WAN IP. Set TTL to 60 seconds. Update all WireGuard client configs to use this hostname as their endpoint if they don't already.

Pre-flight: Cloudflare API token

In Cloudflare dashboard → My Profile → API Tokens, create a token with:

  • Permission: Zone → DNS → Edit
  • Zone: danielhead.com only

Steps:

  1. Install packages:

    ssh openwrt "opkg update && opkg install ddns-scripts ddns-scripts-cloudflare"
    
  2. Add to config/ddns (create file if it doesn't exist):

    config service 'wg_endpoint'
        option service_name    'cloudflare.com-v4'
        option enabled         '1'
        option lookup_host     'wg0.danielhead.com'
        option domain          'wg0@danielhead.com'   # host@zone form used by the Cloudflare updater
        option username        'Bearer'
        option password        '<CLOUDFLARE_API_TOKEN from .env>'
        option ip_source       'web'
        option ip_url          'https://checkip.amazonaws.com'   # a single URL
        option check_interval  '5'
        option check_unit      'minutes'
        option force_interval  '72'
        option force_unit      'hours'
    

    ip_source web queries an external service to get the current public IP regardless of which WAN interface is active — the correct approach for mwan3 setups where the active interface changes dynamically.

    Credentials: CLOUDFLARE_API_TOKEN is in .env (gitignored). When applying, substitute the value manually — do not commit the token into config/ddns.

  3. Push config:

    ./scripts/safe-apply.sh ddns 5
    
  4. Enable and start the ddns service (after the config is in place, so the first run uses it):

    ssh openwrt "/etc/init.d/ddns enable && /etc/init.d/ddns start"
    

Behaviour:

  • ddns checks the public IP every 5 minutes via the external lookup service set in ip_url
  • While WAN1 is up, the public IP matches the Cloudflare record — no update
  • When WAN2 takes over, within 5 minutes ddns detects the new IP and updates wg0.danielhead.com in Cloudflare
  • WireGuard clients re-resolve the hostname (within ~60s due to TTL) and reconnect
  • When WAN1 recovers and mwan3 fails back, the record is updated back to the fibre IP within 5 minutes
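The timings above add up to a rough upper bound on how long WireGuard stays unreachable after a failover: one full ddns check interval plus the DNS TTL. The sketch below does that arithmetic; it deliberately ignores mwan3's own detection time and the Cloudflare API round trip, which come on top. The function name is hypothetical.

```shell
# worst_case_failover_seconds <check-interval-minutes> <dns-ttl-seconds>
worst_case_failover_seconds() {
    echo $(( $1 * 60 + $2 ))
}

worst_case_failover_seconds 5 60   # prints 360
```

Shortening check_interval reduces this bound at the cost of more frequent lookups against the external IP service.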

Verify:

Simulate a failover by unplugging the primary WAN. After 5 minutes check that wg0.danielhead.com has updated to the 4G IP:

nslookup wg0.danielhead.com 9.9.9.9

Confirm a WireGuard client can reconnect after the DNS TTL expires.


Future: Managed Switch Migration

When a managed switch is added, the migration is a config/network-only change. Firewall zones, DHCP pools and wireless config are all unaffected - the VLAN identities and IP ranges stay identical.

Current approach - one physical port per VLAN:

config device
    option name 'br-servers'
    option type 'bridge'
    list ports 'lan2'

config device
    option name 'br-iot'
    option type 'bridge'
    list ports 'lan3'

config interface 'lan_servers'
    option device 'br-servers'
    ...

config interface 'lan_iot'
    option device 'br-iot'
    ...

With managed switch - single trunk port, 802.1Q VLAN filtering:

config device
    option name 'br-trunk'
    option type 'bridge'
    list ports 'lan2'           # single cable to managed switch
    option vlan_filtering '1'

config bridge-vlan
    option device 'br-trunk'
    option vlan '10'            # servers VLAN ID
    list ports 'lan2:t'         # tagged on trunk

config bridge-vlan
    option device 'br-trunk'
    option vlan '20'            # IoT VLAN ID
    list ports 'lan2:t'

config interface 'lan_servers'
    option device 'br-trunk.10' # was: 'br-servers'
    ...

config interface 'lan_iot'
    option device 'br-trunk.20' # was: 'br-iot'
    ...

On the managed switch side, set the uplink port as a tagged trunk for VLANs 10, 20, 30 etc., and set each downstream port as an untagged access port for whichever VLAN it belongs to.
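Since the bridge-vlan sections differ only in the VLAN ID, they can be generated for the full set of VLANs rather than written out by hand. A sketch with a hypothetical helper name, using the same `br-trunk` and `lan2` values as the example above:

```shell
# emit_trunk_vlans <bridge> <trunk-port> <vlan-id>...
# Prints one tagged bridge-vlan section per VLAN ID.
emit_trunk_vlans() {
    bridge=$1
    port=$2
    shift 2
    for vid in "$@"; do
        cat <<EOF
config bridge-vlan
    option device '$bridge'
    option vlan '$vid'
    list ports '$port:t'

EOF
    done
}

# Example: emit_trunk_vlans br-trunk lan2 10 20 30 40
```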


Rollback Reference

| Situation | Action |
|---|---|
| Router unreachable after a change | Wait for auto-revert (5-10 min window set in safe-apply.sh) |
| Rolled back but want to retry | Fix the config file, run safe-apply.sh again |
| Something subtle is broken after confirming | git diff config/ to see what changed, ./scripts/safe-apply.sh <file> to re-push a previous version |
| Complete disaster | SSH in and run firstboot (factory reset), then restore from git using the sequence below |

Clean Restore from Git

Use this after a factory reset (firstboot) or a clean firmware flash. After either, the router is at its default IP 192.168.1.1 - ssh openwrt won't work until the network config is pushed first.

Requirements: laptop connected via ethernet to a LAN port on the router.

# 1. Push network config to restore the correct LAN IP (10.0.0.1)
ssh root@192.168.1.1 "cat > /etc/config/network" < config/network
ssh root@192.168.1.1 "uci commit network && reload_config"

# 2. Wait a few seconds for the interface to come back, then push everything else
./scripts/push-all.sh

# 3. Reinstall packages (adjust list to what was installed at time of restore)
ssh openwrt "opkg update && opkg install avahi-daemon kmod-bridge"
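The "wait a few seconds" in step 2 can be made explicit by polling the restored LAN IP before pushing the rest. A minimal sketch, assuming a Linux laptop (where `ping -W` takes seconds); `wait_for_host` is a hypothetical helper.

```shell
# wait_for_host <host> <max-tries>
# Pings once per attempt; succeeds as soon as the host answers.
wait_for_host() {
    tries=0
    while [ "$tries" -lt "$2" ]; do
        if ping -c 1 -W 1 "$1" >/dev/null 2>&1; then
            return 0
        fi
        tries=$((tries + 1))
        sleep 1
    done
    return 1
}

# Example: wait_for_host 10.0.0.1 30 && ./scripts/push-all.sh
```

The laptop may also need a new DHCP lease or a temporary static address in 10.0.0.x before the router answers at its restored IP.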

What the repo covers: all six UCI config files (dhcp, dropbear, firewall, network, system, wireless).

What it does not cover:

  • Packages - must be reinstalled manually (see step 3)
  • /etc/avahi/avahi-daemon.conf - not a UCI file, push manually with scp files/avahi-daemon.conf openwrt:/etc/avahi/avahi-daemon.conf (config stored in files/ in this repo)
  • SSH host keys - regenerated on clean flash; first reconnect will show a known_hosts warning, clear with ssh-keygen -R openwrt