openwrt/docs/implementation-plan.md
Dan Head 1c59ca4af4 chore: initial repo setup with baseline config backup
- Pull current config from router (OpenWRT 24.10.2)
- Add backup, safe-apply, and push-all scripts
- Add CLAUDE.md with workflow rules and context
- Add network-map.md with current topology and planned VLANs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
2026-04-02 22:22:08 +01:00

# VLAN Implementation Plan
## Guiding Principles
- **Every risky change goes through `safe-apply.sh`** with a revert window
- **Build alongside, then cut over** — new VLANs and SSIDs are created while the existing flat network stays up; the cutover is a single planned step
- **Servers migrate before clients** — HA and other services need stable IPs before IoT/media devices reconnect to them
- **Have a fallback** — keep a phone on mobile data during the cutover so you can SSH into the router if WiFi drops and doesn't recover
---
## Prerequisites (Complete Before Any Router Changes)
- [x] Fill in all MAC addresses in `vlan-requirements.md`
- [x] Note Shield TV's current hostname/IP from LuCI
- [x] Document all current port forwards (see `docs/network-map.md` → Port Forwards)
- [x] Note any hardcoded IPs in Home Assistant — Frigate (`10.0.0.12`) and Enphase Envoy (`10.0.0.144`); Frigate also has doorbell camera IP (`10.0.0.41`) hardcoded in its config
- [x] DNS records confirmed — managed in router `config/dhcp`, not PiHole (no local DNS records in PiHole UI or pihole.toml)
- [ ] Add PiHole Local DNS records (Settings → Local DNS → DNS Records) for split-horizon DNS — internal clients resolve service hostnames to everlost's internal IP directly, bypassing hairpin and keeping services reachable during WAN outages:
  - `jester.danielhead.com` → `10.0.0.2`
  - `wayfaerer.danielhead.com` → `10.0.0.2`
  - `wg0.danielhead.com` → `10.0.0.2`
  - (add any future service subdomains here too)
- [x] Push updated `config/dhcp` to remove now-redundant dnsmasq domain entries: `./scripts/safe-apply.sh dhcp 5`
- [x] Collect MAC addresses for internet-allowed IoT devices from LuCI → Network → DHCP Leases (Hypervolt, OCTO-CADLITE, HP printer, Alarmo, Envoy) — fill into `vlan-requirements.md`
- [x] Complete the br-guest port assignment test (see `docs/pre-implementation-findings.md` → Pending Validation Test)
- [ ] Push updated `config/network` to remove LAN4 from br-guest
- [ ] Run `./scripts/backup.sh` to snapshot current working config
---
## Phase 0 — Upgrade router to openwrt-25.12.2
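Upgrade OpenWRT to the latest stable version using sysupgrade. sysupgrade keeps the configuration (including `/etc/config/*`) by default; only the `-n` flag discards it and leaves the router factory-fresh. The `-k` flag used below additionally saves a list of the installed packages to `/etc/backup/installed_packages.txt`. The packages themselves do not survive the upgrade and are reinstalled in Phase 1.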
**Pre-flight:**
```bash
# Snapshot current config into the repo first
./scripts/backup.sh
# Verify the backup looks correct before proceeding
git diff config/
```
**Copy firmware to the router and verify checksum:**
```bash
# Check available space first
ssh openwrt "df -h /tmp"
# Copy the firmware binary from this repo to the router
scp openwrt-25.12.2-ramips-mt7621-tplink_archer-ax23-v1-squashfs-sysupgrade.bin openwrt:/tmp/
# Verify checksum matches the value on downloads.openwrt.org
ssh openwrt "sha256sum /tmp/openwrt-25.12.2-ramips-mt7621-tplink_archer-ax23-v1-squashfs-sysupgrade.bin"
```
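The manual comparison against the published checksum can be scripted. A minimal sketch, assuming the expected digest has been copied by hand from the `sha256sums` file on downloads.openwrt.org; `verify_sha256` is a hypothetical helper name, not part of this repo's scripts. It can be run locally against the firmware file before the `scp`, for example `verify_sha256 <firmware>.bin <digest>`.

```bash
# verify_sha256 FILE EXPECTED_HEX
# Succeeds only if FILE's SHA-256 digest equals EXPECTED_HEX (lowercase hex).
verify_sha256() {
    file=$1
    expected=$2
    # sha256sum prints "<digest>  <filename>"; keep the first field only
    actual=$(sha256sum "$file" | cut -d' ' -f1)
    if [ "$actual" = "$expected" ]; then
        echo "OK: $file"
    else
        echo "MISMATCH: $file (got $actual)" >&2
        return 1
    fi
}
```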
**Apply the upgrade:**
```bash
# Config in /etc/config/* is kept by default (-n would discard it); -k also saves the installed-package list
ssh openwrt "sysupgrade -k /tmp/openwrt-25.12.2-ramips-mt7621-tplink_archer-ax23-v1-squashfs-sysupgrade.bin"
```
The router will reboot. Reconnect after ~2 minutes.
**Verify:**
```bash
ssh openwrt "cat /etc/openwrt_release" # confirm new version
ssh openwrt "uci show network.lan.ipaddr" # confirm LAN IP intact
./scripts/backup.sh # confirm config still matches repo
```
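The version check can be made pass/fail rather than eyeballed. A sketch, assuming `/etc/openwrt_release` contains a `DISTRIB_RELEASE='...'` line and has first been copied locally, for example with `ssh openwrt "cat /etc/openwrt_release" > /tmp/openwrt_release`; `release_version` and `check_release` are hypothetical helper names.

```bash
# release_version FILE
# Prints the DISTRIB_RELEASE value from an openwrt_release-style file.
release_version() {
    # Lines look like: DISTRIB_RELEASE='25.12.2'
    sed -n "s/^DISTRIB_RELEASE='\(.*\)'\$/\1/p" "$1"
}

# check_release FILE WANTED
# Fails with a message if the file does not report the wanted release.
check_release() {
    got=$(release_version "$1")
    if [ "$got" != "$2" ]; then
        echo "expected $2, got ${got:-nothing}" >&2
        return 1
    fi
    echo "release $got confirmed"
}

# Example, once /tmp/openwrt_release exists:
#   check_release /tmp/openwrt_release 25.12.2
```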
> **Rollback:** sysupgrade does not support automatic rollback. If the router becomes unreachable after upgrading, connect via ethernet and access it at `192.168.1.1` (default IP after a reset). Restore config using the Clean Restore steps at the bottom of this document.
## Phase 1 — Install Required Packages
Low risk. Packages are additive, nothing changes until configured.
```bash
# OpenWrt 25.12 replaces opkg with apk; on 24.10 and earlier use "opkg update && opkg install" instead
ssh openwrt "apk update && apk add avahi-daemon kmod-bridge"
```
- `avahi-daemon` — mDNS reflection across VLANs
- `kmod-bridge` — kernel bridging support for VLAN interfaces (may already be present)
**Verify:** `ssh openwrt "avahi-daemon --version"`
---
## Phase 2 — Create VLAN Interfaces (network config)
Edit `config/network` to add VLAN bridge interfaces alongside the existing `br-lan`.
**New interfaces to add:**
| Interface | Bridge | Subnet | VLAN ID |
|---------------|--------------|--------------|---------|
| `lan_trusted` | `br-trusted` | 10.0.1.1/24 | 1 |
| `lan_servers` | `br-servers` | 10.0.10.1/24 | 10 |
| `lan_iot` | `br-iot` | 10.0.20.1/24 | 20 |
| `lan_media` | `br-media` | 10.0.30.1/24 | 30 |
| `lan_guest` | `br-guest` | 10.0.40.1/24 | 40 |
The existing flat `br-lan` (10.0.0.1/24) stays untouched until cutover.
```bash
./scripts/safe-apply.sh network 10
```
**Verify:** `ssh openwrt "ip addr show"` — new bridge interfaces should appear
**Rollback:** If router becomes unreachable, it auto-reverts in 10 minutes
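
Each row of the table above corresponds to one bridge device stanza and one interface stanza in `config/network`. A minimal sketch of a generator for those stanzas, assuming static addressing with a /24 netmask. `emit_vlan` is a hypothetical helper, and the `lan2` port is taken from the managed-switch comparison later in this document, not from the table. Review the output before pasting it into `config/network`.

```bash
# emit_vlan IFACE BRIDGE ROUTER_IP PORT
# Prints a bridge device stanza and a static interface stanza in UCI syntax.
emit_vlan() {
    cat <<EOF
config device
    option name '$2'
    option type 'bridge'
    list ports '$4'

config interface '$1'
    option device '$2'
    option proto 'static'
    option ipaddr '$3'
    option netmask '255.255.255.0'
EOF
}

# Example: the servers VLAN from the table
emit_vlan lan_servers br-servers 10.0.10.1 lan2
```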
---
## Phase 3 — Configure DHCP Pools
Edit `config/dhcp` to add a pool for each new VLAN interface. Each pool advertises:
- Gateway: the router's IP on that VLAN (e.g. `10.0.1.1`)
- DNS: PiHole (`10.0.10.2`)
- Static leases for servers, Shield TV, and doorbell camera
```bash
./scripts/safe-apply.sh dhcp 5
```
**Verify:** Connect a test device to the router via ethernet, manually set IP to e.g. `10.0.1.100/24` gateway `10.0.1.1` — confirm it can ping the gateway.
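
A sketch of what one pool and one static lease might look like in `config/dhcp`. The `start`, `limit` and `leasetime` values are assumptions (OpenWrt's usual defaults), the MAC address is a placeholder, and `emit_pool` / `emit_lease` are hypothetical helpers. DHCP option 6 is the DNS server option; dnsmasq advertises the router's own address on the interface as the gateway unless told otherwise.

```bash
# emit_pool IFACE DNS_IP
# Prints a DHCP pool stanza that hands out DNS_IP as the resolver.
emit_pool() {
    cat <<EOF
config dhcp '$1'
    option interface '$1'
    option start '100'
    option limit '150'
    option leasetime '12h'
    list dhcp_option '6,$2'
EOF
}

# emit_lease NAME MAC IP
# Prints a static lease stanza.
emit_lease() {
    cat <<EOF
config host
    option name '$1'
    option mac '$2'
    option ip '$3'
EOF
}

emit_pool lan_media 10.0.10.2
emit_lease shield-tv 00:00:00:00:00:00 10.0.30.2   # placeholder MAC
```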
---
## Phase 4 — Configure Firewall Zones and Rules
Edit `config/firewall` to add zones for each VLAN and the cross-VLAN rules from `vlan-requirements.md`. The existing `lan` zone stays in place.
Key rules to implement:
- `trusted → internet` allow
- `trusted → media` allow (Cast ports + Sonos ports)
- `trusted → servers` allow (SSH + Nginx)
- `servers → iot` allow all
- `servers → media` allow all
- `media → servers` allow (Plex TCP 32400, Jellyfin TCP 8096)
- `iot → internet` **block by default** — set IoT zone forward policy to REJECT
- `iot → internet` explicit allows for: Hypervolt (`10.0.20.2`), OCTO-CADLITE (`10.0.20.3`), HP printer (`10.0.20.4`), Alarmo (`10.0.20.5`), Envoy (`10.0.20.6`)
- `guest → internet` allow only
- DNS hijack: redirect all outbound TCP/UDP 53 to PiHole (`10.0.10.2`)
> **Note:** The per-device IoT allow rules depend on static leases being in place (Phase 3) so those devices have predictable IPs. Verify static leases are active before applying firewall rules.
```bash
./scripts/safe-apply.sh firewall 10
```
**Verify:** Zones appear in LuCI → Network → Firewall
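
The per-device IoT allows are repetitive enough to generate. A sketch, assuming the zones are named `iot` and `wan` (the document only says "internet") and that each device should be allowed any protocol. `emit_iot_allow` is a hypothetical helper; the IPs come from the list above. Firewall rules match only TCP and UDP unless `proto` is set, hence the explicit `all`.

```bash
# emit_iot_allow NAME IP
# Prints a UCI rule letting one IoT device out to the WAN zone.
emit_iot_allow() {
    cat <<EOF
config rule
    option name 'iot_allow_$1'
    option src 'iot'
    option src_ip '$2'
    option dest 'wan'
    option proto 'all'
    option target 'ACCEPT'

EOF
}

# One rule per internet-allowed device
while read -r name ip; do
    emit_iot_allow "$name" "$ip"
done <<'LIST'
hypervolt 10.0.20.2
octo_cadlite 10.0.20.3
hp_printer 10.0.20.4
alarmo 10.0.20.5
envoy 10.0.20.6
LIST
```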
---
## Phase 5 — Add New SSIDs
Edit `config/wireless` to add new SSIDs mapped to VLAN bridge interfaces. **Do not change Moonshield yet** — it stays on the flat `br-lan` for now.
New SSIDs to add:
| SSID | Interface | Band |
|-----------------|------------|---------------|
| Cloud Connected | `br-iot` | 2.4GHz |
| Pinball Map | `br-media` | 5GHz + 2.4GHz |
| Passenger | `br-guest` | 2.4GHz |
```bash
./scripts/safe-apply.sh wireless 5
```
**Verify:** New SSIDs appear on a phone. Connect a test device to each and confirm it gets an IP in the right subnet (e.g. Passenger → 10.0.40.x).
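
A sketch of one SSID stanza for `config/wireless`. The radio name (`radio0`), the `psk2` encryption mode and the key are placeholders, and `emit_ssid` is a hypothetical helper. Note that the `network` option takes the logical interface name (for example `lan_guest`), which OpenWrt then attaches to that interface's bridge (`br-guest`).

```bash
# emit_ssid RADIO SSID NETWORK
# Prints an access-point stanza bound to one logical network.
emit_ssid() {
    cat <<EOF
config wifi-iface
    option device '$1'
    option mode 'ap'
    option ssid '$2'
    option network '$3'
    option encryption 'psk2'
    option key 'CHANGE_ME'
EOF
}

emit_ssid radio0 'Passenger' lan_guest
```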
---
## Phase 6 — Migrate Servers (Maintenance Window Begins)
> From this point, brief outages are expected. Ensure your phone is on mobile data.
Update static DHCP leases in `config/dhcp` to assign new IPs (10.0.10.x) to server devices. Move them from the flat `br-lan` DHCP to the `lan_servers` DHCP.
**For each server (everlost, homeassistant, frigate, jester, wayfaerer):**
1. Push updated DHCP config
2. SSH into the server and run `sudo dhclient -r && sudo dhclient` (or reboot) to renew its lease
3. Confirm it gets its new `10.0.10.x` IP
**After all servers have new IPs:**
Update `config/firewall` port forwards to reflect new server IPs:
| Name | Proto | WAN Port | Old Dest IP | New Dest IP |
|----------------------|-------|----------|-------------|-------------|
| HTTP | TCP | 80 | 10.0.0.2 | 10.0.10.2 |
| HTTPS | TCP | 443 | 10.0.0.2 | 10.0.10.2 |
| Wireguard | UDP | 51820 | 10.0.0.2 | 10.0.10.2 |
| SSH - Everlost | TCP | 22563 | 10.0.0.2 | 10.0.10.2 |
| SSH - Home Assistant | TCP | 22553 | 10.0.0.11 | 10.0.10.3 |
| SSH - Frigate | TCP | 22583 | 10.0.0.12 | 10.0.10.4 |
| SSH - Jester | TCP | 22573 | 10.0.0.21 | 10.0.10.10 |
| SSH - Wayfaerer | TCP | 22593 | 10.0.0.22 | 10.0.10.11 |
| Plex - Jester | TCP | 32400 | 10.0.0.21 | 10.0.10.10 |
| Plex - Wayfaerer | TCP | 32450 | 10.0.0.22 | 10.0.10.11 |
```bash
./scripts/safe-apply.sh firewall 5
```
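The old-to-new mapping in the table can be applied to `config/firewall` mechanically. A sketch, assuming the file quotes values with single quotes as `uci export` does; `remap_server_ips` is a hypothetical helper. Matching the surrounding quotes matters: an unanchored `10.0.0.2` would also rewrite the start of `10.0.0.21` and `10.0.0.22`. Write the result to a new file and review it with `diff` before replacing the original.

```bash
# remap_server_ips
# Reads a UCI config on stdin and rewrites quoted old server IPs to their
# new 10.0.10.x addresses (mapping taken from the port-forward table).
remap_server_ips() {
    sed \
        -e "s/'10\.0\.0\.2'/'10.0.10.2'/g" \
        -e "s/'10\.0\.0\.11'/'10.0.10.3'/g" \
        -e "s/'10\.0\.0\.12'/'10.0.10.4'/g" \
        -e "s/'10\.0\.0\.21'/'10.0.10.10'/g" \
        -e "s/'10\.0\.0\.22'/'10.0.10.11'/g"
}

# Example:
#   remap_server_ips < config/firewall > /tmp/firewall.new
#   diff config/firewall /tmp/firewall.new
```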
- Update hardcoded IPs in Home Assistant integrations:
  - **Frigate** (Settings → Integrations → Frigate): change host from `10.0.0.12` → `10.0.10.4`
- Confirm PiHole dashboard is reachable at `10.0.10.2`
**Update PiHole Local DNS records** (Settings → Local DNS → DNS Records) to point to everlost's new IP:
| Name | Old IP | New IP |
|------|--------|--------|
| jester.danielhead.com | 10.0.0.2 | 10.0.10.2 |
| wayfaerer.danielhead.com | 10.0.0.2 | 10.0.10.2 |
| wg0.danielhead.com | 10.0.0.2 | 10.0.10.2 |
**Update WireGuard config on everlost:**
1. Update wg-easy client DNS setting from `10.0.0.2` → `10.0.10.2` and regenerate client configs
2. Verify from a WG-connected device: `nslookup homeassistant.danielhead.com` should return `10.0.10.2`
3. Verify WireGuard-connected devices can still reach proxied services
**Verify:** Home Assistant loads, all integrations show as connected, Nginx proxy still routes external traffic correctly, WireGuard clients can reach internal services.
**Add temporary `lan → servers` firewall rule:**
IoT and media devices are still on Moonshield (`br-lan`, 10.0.0.x) and need to keep reaching HA, Frigate etc. while you migrate them at your own pace. Add a temporary allow-all forwarding rule from the `lan` zone to the `servers` zone:
```bash
uci add firewall rule
uci set firewall.@rule[-1].name='temp_lan_to_servers'
uci set firewall.@rule[-1].src='lan'
uci set firewall.@rule[-1].dest='servers'
uci set firewall.@rule[-1].target='ACCEPT'
uci commit firewall
./scripts/safe-apply.sh firewall 5
```
> **Remember to remove this rule after Phase 7** — once all IoT and media devices have migrated off Moonshield, this rule is no longer needed and leaves an unintended hole.
---
## Phase 7 — Migrate IoT Devices
1. Connect each IoT device to **Cloud Connected** SSID
   - ESPHome devices: forget current WiFi in ESPHome config and re-provision, or just update SSID in the ESPHome dashboard
   - Other devices: reconnect via their app or settings
2. Devices will get IPs in `10.0.20.x`
3. HA should rediscover ESPHome devices automatically via mDNS within a few minutes
4. Confirm each device shows as available in HA
**After IoT devices have new IPs:**
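- Update hardcoded IPs in Home Assistant integrations:
  - **Enphase Envoy** (Settings → Integrations → Enphase Envoy): change host from `10.0.0.144` → `10.0.20.6` (the Envoy's address in the Phase 4 allow list)
- Update the doorbell camera IP in Frigate's config: change from `10.0.0.41` to the camera's new static lease in `10.0.20.x` (not `10.0.20.1`, which is the router's own address on the IoT VLAN), then restart Frigate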
**Remove the temporary `lan → servers` rule** (added at end of Phase 6) once all IoT and media devices are off Moonshield:
```bash
# Find and delete the rule by name
uci delete firewall.$(uci show firewall | grep 'temp_lan_to_servers' | cut -d. -f2)
uci commit firewall
./scripts/safe-apply.sh firewall 5
```
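The lookup in the command above works because `uci show firewall` prints anonymous sections as `firewall.@rule[N].option='value'`. A slightly stricter sketch that matches only the `name` option, and only exactly, is shown below; `section_by_name` is a hypothetical helper that reads `uci show firewall` output on stdin. On the router, its output could then be used as `uci delete "firewall.$section"`.

```bash
# section_by_name NAME
# Reads `uci show firewall` output on stdin and prints the section
# reference (e.g. @rule[7]) whose name option is exactly NAME.
section_by_name() {
    # Input lines look like: firewall.@rule[7].name='temp_lan_to_servers'
    sed -n "s/^firewall\.\(@[a-z_]*\[[0-9]*]\)\.name='$1'\$/\1/p"
}

# Demo with canned input in place of a live router
printf "firewall.@rule[6].name='Allow-Ping'\nfirewall.@rule[7].name='temp_lan_to_servers'\nfirewall.@rule[7].src='lan'\n" \
    | section_by_name temp_lan_to_servers
```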
**Verify:** All ESPHome entities, voice assistants, blinds, and sensors show as available in Home Assistant. Test a blind, a sensor reading, and a voice command. Confirm Frigate shows the doorbell camera stream.
---
## Phase 8 — Migrate Media Devices
1. Connect Shield TV to **Pinball Map** SSID
   - It will get `10.0.30.2` (static lease)
   - Open Plex and Jellyfin and update the server address to `10.0.10.10` (jester.lan, per the Phase 6 port-forward table) if it is not auto-discovered
2. Connect consoles and speakers to **Pinball Map** SSID
3. Test casting from a phone (still on flat network at this point) to speakers and Shield
**Verify:** Plex/Jellyfin plays content, Cast works from phone, Music Assistant in HA can control speakers, HA Shield integration shows as connected.
---
## Phase 9 — Cutover: Move Moonshield to Trusted VLAN
This is the final disruptive step. Moonshield will briefly drop all connected devices while it moves to `br-trusted`.
**Before starting:** Plug your laptop into **LAN 3** (reserved for trusted VLAN). This gives you a wired fallback — if Moonshield doesn't come back up cleanly, you keep your connection to the router and can intervene.
Edit `config/wireless` — change Moonshield's interface from `br-lan` to `br-trusted`.
```bash
./scripts/safe-apply.sh wireless 5
```
All phones and laptops on Moonshield will disconnect and immediately reconnect to the same SSID, picking up new IPs in `10.0.1.x`. This typically takes 5–15 seconds.
**Verify:** Phone reconnects to Moonshield, gets `10.0.1.x` IP, internet works, can cast to speakers/Shield, can reach Nginx-proxied services.
---
## Phase 10 — DNS Hijacking
Confirm DNS hijacking rule is active:
```bash
ssh openwrt "nft list ruleset | grep -A2 'dns'"
```
Test it's working by temporarily setting a device's DNS to `8.8.8.8` — it should still resolve via PiHole (check PiHole query logs).
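
For reference, a hijack rule of this kind is usually a DNAT redirect per client zone. A sketch, assuming zones named `trusted`, `iot`, `media`, `guest` and `servers`; the actual rule in this repo's `config/firewall` may differ, and `emit_dns_hijack` is a hypothetical helper. The `servers` zone is left out as a source so that PiHole's own upstream queries are not redirected back to itself.

```bash
# emit_dns_hijack ZONE
# Prints a redirect that sends all port-53 traffic from ZONE to PiHole.
emit_dns_hijack() {
    cat <<EOF
config redirect
    option name 'dns_hijack_$1'
    option src '$1'
    option src_dport '53'
    option proto 'tcp udp'
    option dest 'servers'
    option dest_ip '10.0.10.2'
    option dest_port '53'
    option target 'DNAT'

EOF
}

for zone in trusted iot media guest; do
    emit_dns_hijack "$zone"
done
```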
---
## Phase 11 — avahi-daemon (mDNS Reflection)
Reflects mDNS across trusted, servers, media and IoT VLANs so that:
- Phones (trusted) can discover Cast devices and speakers (media)
- HA (servers) can discover IoT and media devices
- Phones (trusted) can discover the HP printer (IoT) via AirPrint
The config is stored at `files/avahi-daemon.conf` in this repo. It is **not** a UCI file — it must be pushed manually and is not covered by `safe-apply.sh`.
```bash
# Install package (if not already done in Phase 1)
ssh openwrt "apk update && apk add avahi-daemon"   # OpenWrt 25.12 uses apk; older releases use opkg
# Push config
scp files/avahi-daemon.conf openwrt:/etc/avahi/avahi-daemon.conf
# Enable and restart
ssh openwrt "/etc/init.d/avahi-daemon enable && /etc/init.d/avahi-daemon restart"
```
> **Note:** There is no auto-revert safety net for this file. If avahi causes problems, stop it with `ssh openwrt "/etc/init.d/avahi-daemon stop"`, and run `/etc/init.d/avahi-daemon disable` as well if it should stay off after a reboot. It is not load-bearing for routing or connectivity.
**Verify:** Cast devices (speakers, Shield) appear in Google Home app and in Music Assistant from a phone on Moonshield (trusted). Confirm the HP printer is discoverable via AirPrint from a phone.
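
The contents of `files/avahi-daemon.conf` are not reproduced in this plan. As a rough guide only, a reflector config for the four VLANs listed above typically needs little more than the following; the bridge names come from the Phase 2 table, the IPv4-only choice is an assumption, and `emit_avahi_conf` is a hypothetical helper. The guest bridge is deliberately absent.

```bash
# emit_avahi_conf INTERFACES
# Prints a minimal avahi-daemon.conf that reflects mDNS between the
# comma-separated INTERFACES.
emit_avahi_conf() {
    cat <<EOF
[server]
use-ipv4=yes
use-ipv6=no
allow-interfaces=$1

[reflector]
enable-reflector=yes
EOF
}

emit_avahi_conf br-trusted,br-servers,br-media,br-iot
```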
---
## Phase 12 — Clean Up Flat Network
Once everything is verified on the new VLANs, remove the old flat `br-lan` interface and its DHCP pool from the config.
```bash
./scripts/safe-apply.sh network 10
./scripts/safe-apply.sh dhcp 5
```
Run `./scripts/backup.sh` to commit the final clean state.
---
## Phase 13 — WAN Failover (Separate Session)
Once VLANs are stable and bedded in, tackle failover as a standalone change:
**Device:** GL-XE300 (Puli) 4G router, currently at `192.168.8.1` running GL.iNet 4.3.27 (OpenWRT 22.03.4).
**Pre-flight: reconfigure XE300 subnet**
Before wiring it in, change the XE300's LAN subnet from `192.168.8.0/24` to a `10.0.x.x` range consistent with the VLAN layout. A sensible choice is `10.0.100.0/24` (XE300 at `10.0.100.1`). Do this via the GL.iNet web UI (Network → LAN IP) before connecting it to the main router.
**Steps:**
1. Install `mwan3` package
2. Repurpose a LAN port as WAN2 (network config change)
3. Connect XE300 LAN port to that repurposed port
4. Configure `mwan3` health checks and failover policy
5. Test by temporarily unplugging the primary WAN
**XE300 management access**
By default, LAN devices cannot reach the XE300 web UI or SSH because WAN interfaces are in the untrusted firewall zone. To retain management access from the trusted VLAN, add to the main router config:
- A static route for `10.0.100.0/24` via the WAN2 interface (OpenWRT may add this automatically when the interface comes up)
- A firewall rule: `trusted → 10.0.100.1` allow TCP 22, 80, 443
Without this, the only way to reach the XE300 is via SSH on the main router itself (which is directly on the `10.0.100.x` subnet via WAN2).
---
### DDNS — WireGuard Endpoint on Failover
When WAN2 takes over, the public IP changes. The only service that needs to remain reachable externally during a failover is WireGuard — once connected to the VPN, split-horizon DNS handles everything else internally.
**Pre-flight: dedicated WireGuard hostname**
Create a Cloudflare A record for a dedicated WireGuard endpoint hostname (e.g. `wg0.danielhead.com`) pointing to the current fibre WAN IP. Set TTL to 60 seconds. Update all WireGuard client configs to use this hostname as their endpoint if they don't already.
**Pre-flight: Cloudflare API token**
In Cloudflare dashboard → My Profile → API Tokens, create a token with:
- Permission: `Zone → DNS → Edit`
- Zone: `danielhead.com` only
**Steps:**
1. Install packages:
```bash
ssh openwrt "apk update && apk add ddns-scripts ddns-scripts-cloudflare"   # OpenWrt 25.12 uses apk; older releases use opkg
```
2. Add to `config/ddns` (create file if it doesn't exist):
```
config ddns 'wg_endpoint'
    option service_name 'cloudflare.com-v4'
    option enabled '1'
    option lookup_host 'wg0.danielhead.com'
    option domain 'wg0.danielhead.com'
    option zone 'danielhead.com'
    option username 'Bearer'
    option password '<CLOUDFLARE_API_TOKEN from .env>'
    option ip_source 'web'
    # ip_url takes a single URL; alternatives: https://icanhazip.com, https://ifconfig.me
    option ip_url 'https://checkip.amazonaws.com'
    option check_interval '5'
    option unit_check 'minutes'
    option force_interval '72'
    option unit_force 'hours'
```
`ip_source web` queries an external service to get the current public IP regardless of which WAN interface is active — the correct approach for mwan3 setups where the active interface changes dynamically.
> **Credentials:** `CLOUDFLARE_API_TOKEN` is in `.env` (gitignored). When applying, substitute the value manually — do not commit the token into `config/ddns`.
3. Enable and start the ddns service:
```bash
ssh openwrt "/etc/init.d/ddns enable && /etc/init.d/ddns start"
```
4. Push config:
```bash
./scripts/safe-apply.sh ddns 5
```
**Behaviour:**
- ddns polls every 5 minutes, fetching the current public IP from the service configured in `ip_url`
- While WAN1 is up, the public IP matches the Cloudflare record — no update
- When WAN2 takes over, within 5 minutes ddns detects the new IP and updates `wg0.danielhead.com` in Cloudflare
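- WireGuard clients that re-resolve the endpoint hostname pick up the new IP once the 60-second TTL expires and reconnect; clients that only resolve the endpoint when the tunnel is brought up need the tunnel toggled off and on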
- When WAN1 recovers and mwan3 fails back, the record is updated back to the fibre IP within 5 minutes
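
The timings above add up to a worst-case delay before clients can see the new address. A small sketch of the arithmetic, using the 5-minute check interval and 60-second TTL from this section; it does not include the time mwan3 takes to declare WAN1 down, which depends on health-check settings not yet chosen. `worst_case_dns_lag` is a hypothetical helper.

```bash
# worst_case_dns_lag CHECK_INTERVAL_MIN TTL_SECONDS
# Prints the worst-case seconds between the public IP changing and a
# resolver serving the updated record: one full check interval plus one TTL.
worst_case_dns_lag() {
    check_min=$1
    ttl_s=$2
    echo $(( check_min * 60 + ttl_s ))
}

worst_case_dns_lag 5 60   # prints 360
```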
**Verify:**
Simulate a failover by unplugging the primary WAN. After 5 minutes check that `wg0.danielhead.com` has updated to the 4G IP:
```bash
nslookup wg0.danielhead.com 9.9.9.9
```
Confirm a WireGuard client can reconnect after the DNS TTL expires.
---
## Future: Managed Switch Migration
When a managed switch is added, the migration is a `config/network`-only change. Firewall zones, DHCP pools and wireless config are all unaffected - the VLAN identities and IP ranges stay identical.
**Current approach - one physical port per VLAN:**
```
config device
    option name 'br-servers'
    option type 'bridge'
    list ports 'lan2'

config device
    option name 'br-iot'
    option type 'bridge'
    list ports 'lan3'

config interface 'lan_servers'
    option device 'br-servers'
    ...

config interface 'lan_iot'
    option device 'br-iot'
    ...
```
**With managed switch - single trunk port, 802.1Q VLAN filtering:**
```
config device
    option name 'br-trunk'
    option type 'bridge'
    list ports 'lan2'              # single cable to managed switch
    option vlan_filtering '1'

config bridge-vlan
    option device 'br-trunk'
    option vlan '10'               # servers VLAN ID
    list ports 'lan2:t'            # tagged on trunk

config bridge-vlan
    option device 'br-trunk'
    option vlan '20'               # IoT VLAN ID
    list ports 'lan2:t'

config interface 'lan_servers'
    option device 'br-trunk.10'    # was: 'br-servers'
    ...

config interface 'lan_iot'
    option device 'br-trunk.20'    # was: 'br-iot'
    ...
```
On the managed switch side, set the uplink port as a tagged trunk for VLANs 10, 20, 30 etc., and set each downstream port as an untagged access port for whichever VLAN it belongs to.
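
The trunk config above shows only VLANs 10 and 20; the remaining VLANs follow the same pattern. A sketch that prints one `bridge-vlan` stanza per VLAN ID, using the IDs from the Phase 2 table. `emit_trunk_vlan` is a hypothetical helper, and how the trusted VLAN (ID 1) should be carried on the trunk is left open here.

```bash
# emit_trunk_vlan BRIDGE VLAN_ID PORT
# Prints a bridge-vlan stanza carrying VLAN_ID tagged on PORT.
emit_trunk_vlan() {
    cat <<EOF
config bridge-vlan
    option device '$1'
    option vlan '$2'
    list ports '$3:t'

EOF
}

# servers, IoT, media, guest
for vid in 10 20 30 40; do
    emit_trunk_vlan br-trunk "$vid" lan2
done
```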
---
## Rollback Reference
| Situation | Action |
|---------------------------------------------|--------------------------------------------------------------------------------------------------------|
| Router unreachable after a change | Wait for auto-revert (5–10 min window set in safe-apply.sh) |
| Rolled back but want to retry | Fix the config file, run safe-apply.sh again |
| Something subtle is broken after confirming | `git diff config/` to see what changed, `./scripts/safe-apply.sh <file>` to re-push a previous version |
| Complete disaster | SSH in and run `firstboot` (factory reset) — then restore from git using the sequence below |
---
## Clean Restore from Git
Use this after a factory reset (`firstboot`) or a clean firmware flash. After either, the router is at its default IP `192.168.1.1` - `ssh openwrt` won't work until the network config is pushed first.
**Requirements:** laptop connected via ethernet to a LAN port on the router.
```bash
# 1. Push network config to restore the correct LAN IP (10.0.0.1)
ssh root@192.168.1.1 "cat > /etc/config/network" < config/network
ssh root@192.168.1.1 "uci commit network && reload_config"
# 2. Wait a few seconds for the interface to come back, then push everything else
./scripts/push-all.sh
# 3. Reinstall packages (adjust list to what was installed at time of restore)
ssh openwrt "apk update && apk add avahi-daemon kmod-bridge"   # OpenWrt 25.12 uses apk; older releases use opkg
```
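The "wait a few seconds" in step 2 can be replaced with an explicit retry loop so that `push-all.sh` only runs once the router answers on its restored address. A sketch; `wait_until` is a hypothetical helper, and the probe command is up to you, for example `wait_until 30 2 ping -c 1 10.0.0.1 && ./scripts/push-all.sh`.

```bash
# wait_until TRIES DELAY_SECONDS CMD [ARGS...]
# Runs CMD up to TRIES times, sleeping DELAY_SECONDS after each failure.
# Returns 0 as soon as CMD succeeds, 1 if it never does.
wait_until() {
    tries=$1
    delay=$2
    shift 2
    i=0
    while [ "$i" -lt "$tries" ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    return 1
}
```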
**What the repo covers:** all six UCI config files (`dhcp`, `dropbear`, `firewall`, `network`, `system`, `wireless`).
**What it does not cover:**
- Packages - must be reinstalled manually (see step 3)
- `/etc/avahi/avahi-daemon.conf` - not a UCI file, push manually with `scp files/avahi-daemon.conf openwrt:/etc/avahi/avahi-daemon.conf` (config stored in `files/` in this repo)
- SSH host keys - regenerated on clean flash; first reconnect will show a `known_hosts` warning, clear with `ssh-keygen -R openwrt`