All notable changes to MNM are documented in this file. Format follows Keep a Changelog.
Pre-v1.0 baseline tag. No code changes from prior commit — this tag marks the reference point for the v1.0 architectural rework documented in CLAUDE.md.
This release includes all completed work through Phase 2.9 (see CLAUDE.md "Completed Phases" section for the full list).
- Controller-DB race on cold start — controller now retries DB connection for up to 5 minutes (5s intervals) instead of failing immediately and falling back to JSON mode. Eliminates the need to manually restart the controller after bootstrap completes. If the DB is unreachable for the full timeout, a prominent warning banner appears on the dashboard.
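The retry behavior above can be sketched as a bounded retry loop. This is a minimal illustration, not the actual MNM implementation: the function name, injectable `connect`/`sleep` parameters, and return-None-on-timeout convention are all assumptions for the sketch.

```python
import time

def connect_with_retry(connect, timeout_s=300, interval_s=5, sleep=time.sleep):
    """Retry `connect()` until it succeeds or `timeout_s` elapses.

    Returns the connection on success, or None after the deadline so the
    caller can fall back to degraded mode and surface a warning banner.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            return connect()
        except Exception:
            if time.monotonic() >= deadline:
                return None  # caller shows the degraded-mode banner
            sleep(interval_s)
```

Injecting `sleep` keeps the loop testable without real delays; the default values mirror the 5-minute/5-second behavior described above.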
- Degraded-mode dashboard banner — dashboard now shows a visible warning when `db_connected` is false, explaining the impact and recovery steps.
- Added `docs/CODING_STANDARDS.md` establishing project coding standards. The pre-commit self-check checklist will be applied to all future code-touching work.
- Node/endpoint terminology rework — introduces the distinction between "nodes" (onboarded infrastructure that feeds MNM data) and "endpoints" (passively-discovered devices). New `GET /api/nodes` endpoint returns onboarded devices with poll health status. `GET /api/devices` now returns a `301` redirect to `/api/nodes` (deprecated).
- Nodes page (`/nodes`) — dedicated page listing all onboarded infrastructure with health dots (green/yellow/red/gray), platform, primary IP, role, location, per-job poll status dots, last polled time, and a Poll Now button. Uses the DataTable component with search and filtering.
- `GET /api/nodes/macs` — returns MAC addresses belonging to onboarded nodes, used by the endpoints page to filter out infrastructure devices.
- UI navigation reorganization — nav bar grouped into sections: Intelligence (Nodes, Endpoints, Events), Operations (Discovery, Jobs, Logs), with visual separators and group labels. Discovery moved from top-level to Operations. Updated on all 8 pages.
- Endpoints page node filtering — endpoints page now excludes MACs belonging to onboarded nodes, showing only passively-discovered endpoints.
- Multi-IP endpoint correlation fix — builds a cross-device MAC→IPs map from ALL nodes' ARP tables and sweep observations before upserting endpoints. Devices like the SRX320 with multiple IPs across different interfaces now get all IPs merged into `additional_ips`. New `get_mac_ip_map_from_observations()` helper supplements ARP data with sweep-discovered IP→MAC associations.
- Routing collection — three new polling job types: `collect_routes` (NAPALM `get_route_to`), `collect_bgp` (NAPALM `get_bgp_neighbors`), and VRF context via `get_network_instances`. Routes and BGP neighbors stored in controller PostgreSQL (`routes` and `bgp_neighbors` tables) with upsert-on-conflict semantics. Default intervals: `MNM_POLL_ROUTES_INTERVAL=3600`, `MNM_POLL_BGP_INTERVAL=3600`.
- Route and BGP API endpoints — `GET /api/routes`, `GET /api/routes/{node}`, `GET /api/routes/advisories`, `GET /api/bgp`, `GET /api/bgp/{node}`. Route advisories surface next-hops not in IPAM as discovery candidates.
- Route advisories dashboard card — shows routes pointing to unknown next-hop gateways with an "Add to sweep" action. Similar pattern to the subnet and LLDP advisories.
- Dashboard polling table — added Routes and BGP columns to polling status card. Maintenance card shows route/BGP row counts.
- Prune integration — daily prune job now cleans old `routes` and `bgp_neighbors` rows past the retention window. Preview endpoint includes route/BGP counts.
- Raw network data persistence — ARP, MAC, and LLDP collection now persists per-node to `node_arp_entries`, `node_mac_entries`, and `node_lldp_entries` tables (in addition to feeding the correlation engine). Per-node API endpoints: `GET /api/nodes/{name}/arp`, `/mac-table`, `/lldp`, `/fib`. FIB populated from SNMP route data. All tables included in the daily prune.
- Node detail page (`/nodes/{name}`) — tabbed view with ARP Table, MAC Address Table, Routing Table, LLDP Neighbors, BGP Peers, and Poll History. Each section uses DataTable with search/filter and count badges. Replaces the Nautobot link on the nodes list.
- Global network search (`/search`) — auto-detecting search bar: MAC, IP, CIDR prefix, or hostname. Cross-node queries via `GET /api/search?q={query}`. Search icon added to the nav on all pages. Results grouped by type with links to node detail and endpoint pages.
- Hop-limited auto-discovery — opt-in LLDP neighbor walk during sweep or manual trigger. New `auto_discover_hops` parameter (0–5, default 0) on the sweep form and API. Sequential onboarding with BFS traversal, visited-set loop prevention, exclusion-list respect, and an `MNM_AUTO_DISCOVER_MAX` hard cap (default 10). IP resolved from LLDP chassis ID, Nautobot lookup, or DNS. New module `auto_discover.py`.
- Auto-discovery API — `POST /api/discover/auto` (manual trigger), `GET /api/discover/auto/history`, `GET /api/discover/auto/recent` (dashboard advisory). Progress appears in the sweep scan log.
- Recently Auto-Discovered dashboard card — shows nodes auto-discovered in the last 24 hours with parent node, hop depth, and onboarding status. Rule 6 compliance visibility.
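The global network search auto-detects whether a query is a MAC, IP, CIDR prefix, or hostname. A minimal sketch of such detection, assuming nothing about MNM's actual code (the function name, regex, and labels are illustrative):

```python
import ipaddress
import re

# Matches colon- or hyphen-separated MAC addresses, e.g. aa:bb:cc:dd:ee:ff
MAC_RE = re.compile(r"^([0-9a-f]{2}[:\-]){5}[0-9a-f]{2}$", re.I)

def detect_query_type(q: str) -> str:
    """Classify a raw search string as 'mac', 'cidr', 'ip', or 'hostname'."""
    q = q.strip()
    if MAC_RE.match(q):
        return "mac"
    try:
        ipaddress.ip_network(q, strict=False)
        # Bare addresses also parse as /32 networks, so a slash means CIDR.
        return "cidr" if "/" in q else "ip"
    except ValueError:
        return "hostname"
```

Anything that parses as neither MAC nor address falls through to a hostname substring search.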
- Discovery page auto-discover field — number input for hop limit (0–5) in the sweep form.
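The hop-limited auto-discovery traversal can be sketched as a breadth-first walk with a visited set, exclusion list, and hard cap. This is an illustration under stated assumptions: `get_neighbors` is an injected lookup, and the function/parameter names are not MNM's actual API.

```python
from collections import deque

def walk_neighbors(seed, get_neighbors, max_hops=0, hard_cap=10, excluded=()):
    """BFS over LLDP neighbors up to `max_hops` from `seed`.

    The visited set prevents loops, `excluded` respects the exclusion
    list, and `hard_cap` bounds total discoveries per run.
    """
    visited = {seed}
    discovered = []
    queue = deque([(seed, 0)])
    while queue and len(discovered) < hard_cap:
        node, depth = queue.popleft()
        if depth >= max_hops:
            continue  # don't expand beyond the hop limit
        for nbr in get_neighbors(node):
            if nbr in visited or nbr in excluded:
                continue
            visited.add(nbr)
            discovered.append(nbr)
            if len(discovered) >= hard_cap:
                break
            queue.append((nbr, depth + 1))
    return discovered
```

With the default `max_hops=0` the walk discovers nothing, matching the opt-in behavior described above.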
- Modular per-device polling — replaces the monolithic endpoint collector with independent `collect_arp()`, `collect_mac()`, `collect_dhcp()`, and `collect_lldp()` job functions per device. New `device_polls` table tracks per-device/per-job-type state (last success, last error, next due, interval, enabled). Per-device sequential execution avoids overlapping NAPALM sessions; cross-device parallel dispatch with configurable concurrency. 10% random jitter on `next_due` prevents a thundering herd. Seeds from Nautobot inventory on first startup. Legacy monolithic collector gated behind `MNM_LEGACY_COLLECTOR=true`. Default intervals: `MNM_POLL_ARP_INTERVAL=300`, `MNM_POLL_MAC_INTERVAL=300`, `MNM_POLL_DHCP_INTERVAL=600`, `MNM_POLL_LLDP_INTERVAL=3600`.
- Jobs page (`/jobs`) — consolidated view of all background tasks: Sweep Scheduler, Modular Poller, Proxmox Collector, Database Prune, and the legacy Endpoint Collector. Shows status, schedule interval, last run, next run, duration, result summary, and a Run Now button per job. `GET /api/jobs` consolidated endpoint.
- Polling API endpoints — `GET /api/polling/status`, `GET /api/polling/status/{device}`, `POST /api/polling/trigger/{device}`, `POST /api/polling/trigger/{device}/{job_type}`, `PUT /api/polling/config/{device}/{job_type}`. Trigger endpoints return 202 Accepted.
- Dashboard Polling Status card — per-device rows with green/yellow/red/gray status dots for ARP, MAC, DHCP, and LLDP collection. "Poll Now" button per device. Auto-refreshes with the dashboard interval.
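The 10% jitter on `next_due` can be sketched as below. A minimal illustration, not the actual scheduler: the function name and the injectable `now`/`rng` parameters are assumptions made for testability.

```python
import random
from datetime import datetime, timedelta, timezone

def next_due(interval_s: int, now=None, rng=random.random):
    """Schedule the next poll one interval out, plus up to 10% random
    jitter so devices seeded at the same time don't all fire together."""
    now = now or datetime.now(timezone.utc)
    jitter_s = interval_s * 0.10 * rng()  # uniform in [0, 10%) of interval
    return now + timedelta(seconds=interval_s + jitter_s)
```

For a 300-second ARP interval this yields a due time between 300 and 330 seconds out, spreading load without delaying any single device by more than 10%.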
- Onboarding progress tracker — in-memory per-IP state machine (`submitting → queued → running → succeeded | failed | timeout`) surfaced via `GET /api/discover/onboarding` and in the sweep host Details panel.
- Sweep-scheduled trigger — `POST /api/discover/sweep-scheduled` re-runs the first saved sweep schedule without requiring a request body (used by the Jobs page Run Now button).
- Add-to-sweep prefill — dashboard "Discovered Subnets" → "Add to sweep" now pre-fills the CIDR textarea on the discovery page, shows an instructional banner, and scrolls to the form.
- Device interface subnet auto-expansion — sweep schedule loop auto-adds subnets from onboarded device interface IPs to the sweep scope. Subnet advisory filters them out.
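The per-IP onboarding tracker is a small state machine. A sketch with guarded transitions, mirroring the states listed above; the transition table, class, and method names are hypothetical, not MNM's actual implementation.

```python
# Hypothetical transition table for the onboarding states listed above.
TRANSITIONS = {
    "submitting": {"queued", "failed"},
    "queued": {"running", "timeout"},
    "running": {"succeeded", "failed", "timeout"},
    # terminal states (succeeded/failed/timeout) allow no transitions
}

class OnboardingTracker:
    """In-memory per-IP onboarding state with guarded transitions."""

    def __init__(self):
        self._state = {}

    def start(self, ip):
        self._state[ip] = "submitting"

    def advance(self, ip, new_state):
        cur = self._state.get(ip)
        if new_state not in TRANSITIONS.get(cur, set()):
            raise ValueError(f"illegal transition {cur} -> {new_state}")
        self._state[ip] = new_state

    def state(self, ip):
        return self._state.get(ip)
```

Rejecting illegal transitions keeps the Details panel from ever showing an impossible sequence such as a succeeded host going back to running.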
- Nautobot Patch 4 — `patches/patch_processor_schema_logging.py` promotes schema validation errors from DEBUG to WARNING so the actual missing field is visible in JobResult logs without enabling debug mode.
- Documentation index — new `docs/INDEX.md` with role-based navigation.
- Table styling consistency — removed the `full-width-table` edge-bleed wrapper from all pages; all tables now use the card-padded style matching Sweep History.
- Discovery page UX — moved the CIDR instruction into the textarea label, moved the pipeline description under the progress bar, made the scan log box vertically resizable (80px–500px).
- Container Status moved to bottom of dashboard page.
- Jobs nav link added to all pages (Dashboard, Discovery, Endpoints, Events, Logs, Jobs).
- Sweep stop button — now cancels within seconds instead of hanging for minutes. Cancel check added inside the onboarding poll loop (every iteration) and the port probe queue (`semaphore.acquire` with a 2s timeout).
- Five discovery pipeline bugs causing every onboarding to fail silently:
  - Location type: saved schedule pointed at Region (no `dcim.device` content type); a startup migration auto-repoints it to Site.
  - JobResult stuck PENDING: controller reads `celery-task-meta-*` from Redis db 1 as a fallback when the DB row stays PENDING (upstream plugin bug).
  - Status string case: poll loop checked `"completed"`/`"failed"`, but Nautobot 3.x uses `SUCCESS`/`FAILURE`.
  - IPAM IntegrityError: `delete_standalone_ip()` removes unattached IP records before onboarding to prevent a duplicate-key crash.
  - `find_device_by_ip`: rewrote to use the `IPAddress → Interface → Device` lookup chain (Nautobot 3.x rejects `?primary_ip4__host=`).
- Container Status crash — `ImageNotFound` when Docker prunes old images after a rebuild; falls back to `Config.Image` from container attrs.
- Database maintenance — daily prune + on-demand admin endpoints
  - New helpers in `controller/app/endpoint_store.py`: `prune_old_events`, `prune_old_observations`, `prune_orphaned_watches`, `prune_stale_sentinels`, `prune_all`, `prune_preview`, `maintenance_stats`. Each operation is structured-logged under the dedicated `prune` module.
  - Background scheduled prune task in the controller, configurable via `MNM_RETENTION_DAYS` (default 365) and `MNM_PRUNE_INTERVAL_HOURS` (default 24). The first run is delayed 2 minutes after startup to let the system settle.
  - New API endpoints: `GET /api/admin/maintenance` (row counts + last prune summary + retention setting), `GET /api/admin/prune/preview` (dry-run row counts), `POST /api/admin/prune` (run prune now)
  - New "Database Maintenance" card on the dashboard with row counts (endpoints, events, observations, watches), oldest-event/observation timestamps, retention setting, "Preview Prune" + "Run Prune Now" buttons (with a confirm dialog per Rule 9), and a last-run summary
  - Sentinel-row reaping: sweep-only endpoints with `current_switch = '(none)'` that age past the retention window are deleted
  - Orphaned watchlist entries (watches whose target MAC has been purged) are cleaned up
- Docker log rotation on every service — added an `x-default-logging` YAML anchor and applied `logging: *default-logging` (json-file driver, 10MB × 3 files = 30MB cap per container) to all 11 services in `docker-compose.yml`. Caps unbounded container log growth.
- Nautobot retention documentation — new section in `docs/CONFIGURATION.md` covering `JOB_RESULT_RETENTION` and `CHANGELOG_RETENTION`, where to set them, and how they interact with MNM's own prune loop.
- Patch 3 in `nautobot/Dockerfile` — tolerate missing chassis serials in Sync Network Data. Some devices (factory-reset, virtual chassis members, mid-RMA, certain EX2300 states) return command-getter results without a usable `serial` key, which crashed the entire upstream sync run with `KeyError: 'serial'`. The patch defaults to `""` and falls back to hostname-only matching when no non-empty serials were collected. Implemented as a small Python script (`nautobot/patches/patch_diffsync_utils.py`) since the multi-line replacement isn't sed-friendly. Tracking upstream.
- Connector framework — Proxmox VE (read-only)
  - New `controller/app/connectors/` package establishing the connector pattern (read-only client + scheduler + state + Prometheus exposition + endpoint store integration)
  - `controller/app/connectors/proxmox.py` — Proxmox VE API client. Collects nodes, VMs, LXC containers, host status, storage pools, ZFS pools, and physical disks/SMART
  - Per-VM/CT MAC extraction from `net0`/`net1`/... config lines, upserted into the endpoint store with `data_source="proxmox"` (cross-correlates with infrastructure ARP/MAC table data)
  - Prometheus metrics endpoint `GET /api/proxmox/metrics` exposing `mnm_proxmox_node_*`, `mnm_proxmox_vm_*`, `mnm_proxmox_ct_*`, `mnm_proxmox_storage_*`, `mnm_proxmox_zfs_pool_*`, `mnm_proxmox_disk_*`
  - New Prometheus scrape job `proxmox` targeting `controller:9090`
  - New Grafana dashboard "MNM - Proxmox Overview" with node health, VM inventory, top VMs by CPU/memory, ZFS pool usage with thresholds (green <70%, yellow 70-85%, red >85%), pool health table, ZFS growth over time, and physical disk SMART health
  - Dashboard card on `/` showing per-node CPU/memory/uptime, a ZFS storage usage summary, and a red alert banner for unhealthy pools/disks
  - New env vars: `PROXMOX_HOST`, `PROXMOX_TOKEN_ID`, `PROXMOX_TOKEN_SECRET`, `PROXMOX_VERIFY_SSL`, `PROXMOX_INTERVAL_SECONDS`
  - Connector is fully optional — disabled when `PROXMOX_HOST` is not set
  - New classifications added to discovery: `virtual_machine`, `container`, `hypervisor`
  - New doc: `docs/CONNECTORS.md` — connector framework reference and Proxmox setup guide (API token, PVEAuditor role, env vars)
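A metrics endpoint like the one above serves gauges in the Prometheus text exposition format. A minimal sketch of that rendering, assuming nothing about the connector's internals (the function name and label sets are illustrative; the metric names follow the `mnm_proxmox_*` families listed above):

```python
def render_metrics(samples):
    """Render (name, labels, value) tuples as Prometheus text exposition.

    Each sample becomes one line: name{label="value",...} value
    """
    lines = []
    for name, labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"
```

Sorting labels gives deterministic output, which keeps scrapes diff-friendly when debugging.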
- Phase 2.7 — Endpoint Correlation Engine
  - Dedicated `mnm_controller` PostgreSQL database (on the existing `mnm-postgres` instance) for the controller's persistent state
  - SQLAlchemy + asyncpg async ORM layer (`controller/app/db.py`)
  - New tables: `endpoints` (MAC-keyed identity), `endpoint_events` (movement/IP/hostname change log), `sweep_runs`, `collection_runs`, `ip_observations`, `kv_config`
  - Automatic one-shot migration from `endpoints.json` and `config.json` on first startup; old files are renamed to `*.json.migrated`
  - Endpoint diff/event detection: `appeared`, `moved_port`, `moved_switch`, `ip_changed`, `hostname_changed`
  - New API endpoints: `GET /api/endpoints/{mac}`, `/{mac}/history`, `/{mac}/timeline`, `/api/endpoints/events`, `/api/endpoints/conflicts`
  - Endpoint detail page with full timeline narrative for any MAC
  - Network activity feed page (`/events`) — filter by event type and time window, with IP conflict detection
  - Dashboard "Recent Events" card showing the last 10 endpoint changes
  - Bootstrap script idempotently creates the controller database on `mnm-postgres`
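The diff/event detection above compares an endpoint's prior and current state and emits event types. A minimal sketch under stated assumptions: the field names and the rule that a switch move suppresses a port move are illustrative choices, not MNM's confirmed logic.

```python
def diff_endpoint(old, new):
    """Compare prior vs. current endpoint state, returning event types.

    A first sighting yields only 'appeared'; a switch change subsumes
    the accompanying port change (illustrative precedence choice).
    """
    if old is None:
        return ["appeared"]
    events = []
    if old["switch"] != new["switch"]:
        events.append("moved_switch")
    elif old["port"] != new["port"]:
        events.append("moved_port")
    if old["ip"] != new["ip"]:
        events.append("ip_changed")
    if old["hostname"] != new["hostname"]:
        events.append("hostname_changed")
    return events
```

Each returned event type maps to a row in an event log table like `endpoint_events`, which the timeline and activity feed then render.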
- SNMP-based endpoint collection (replaces NAPALM proxy for ARP/MAC tables)
- DataTable component with pagination, sorting, column preferences
- User preferences panel (theme, page size, auto-refresh interval)
- Configurable concurrency: `MNM_SWEEP_CONCURRENCY`, `MNM_COLLECTION_CONCURRENCY`, `MNM_API_CONCURRENCY`
- Timing instrumentation on sweep and collection (per-host, per-device, elapsed)
- IEEE OUI database with 39,178 vendor prefixes
- Infrastructure endpoint collection from device ARP, MAC, and DHCP tables
- Endpoint correlation: ARP (IP→MAC) + MAC table (MAC→port→VLAN) + DHCP (MAC→hostname)
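The correlation above joins three per-source maps on MAC address. A minimal sketch of that join, with illustrative input structures (the real collectors feed richer records):

```python
def correlate(arp, mac_table, dhcp):
    """Join per-source maps keyed by MAC into one endpoint record each.

    arp:       {mac: ip}               from device ARP tables
    mac_table: {mac: (port, vlan)}     from switch MAC tables
    dhcp:      {mac: hostname}         from DHCP bindings
    """
    endpoints = {}
    for mac in set(arp) | set(mac_table) | set(dhcp):
        port, vlan = mac_table.get(mac, (None, None))
        endpoints[mac] = {
            "ip": arp.get(mac),
            "port": port,
            "vlan": vlan,
            "hostname": dhcp.get(mac),
        }
    return endpoints
```

Taking the union of the three key sets means a MAC seen by only one source still produces a (partial) endpoint record rather than being dropped.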
- Endpoints UI page with sortable/filterable table
- 10 new endpoint_* custom fields on IP Address for infrastructure data
- Scheduled endpoint collection every 15 minutes
- Junos DHCP server binding collection via NETCONF RPC
- Shodan-style service fingerprinting: banner grab, HTTP headers/title, TLS certs, SSH banners
- Enriched network sweep with SNMP, MAC, DNS, and port discovery
- Device classification engine (network_device, server, web_service, printer, access_point, endpoint)
- Nautobot IPAM integration — all discovered IPs recorded with 11 custom fields
- Periodic re-sweep scheduling via Controller UI
- 365-day Prometheus retention (up from 15 days) with 10 GB size cap
- Comprehensive documentation: DISCOVERY.md, MONITORING.md, TROUBLESHOOTING.md, API.md
- CHANGELOG.md
- Endpoint collection uses SNMP walks instead of NAPALM proxy (10x faster)
- Discovery sweep: pre-fetches known IPs in bulk, parallel enrichment per host, batched IPAM writes
- Bootstrap auto-skips on subsequent runs when already initialized (>5000 device types + custom fields)
- All tables use full viewport width with pagination
- Prometheus retention configurable via `PROMETHEUS_RETENTION_DAYS` env var
- Discovery UI shows enriched data: DNS name, MAC vendor, open ports, SNMP name, classification
- Bootstrap creates 11 discovery custom fields on IP Address model (idempotent)
- `first_seen` preserved on endpoint updates (was being overwritten every collection)
- Hostname populated from DHCP, DNS, and SNMP sources (priority order)
- Cleared stuck Nautobot jobs in "Pending" state from previous sessions
- MNM Controller container (FastAPI + vanilla JS) on port 9090
- Password-gated dashboard with container status and service links
- Seed-and-sweep network discovery with operator-defined scope
- LLDP neighbor advisory panel (Human-in-the-Loop, Rule 6)
- Nautobot API proxy endpoints for devices, locations, secrets groups
- Docker SDK integration for container health monitoring
- Traefik v2.11 reverse proxy with Docker label routing
- Prometheus v3.4.0 metric storage
- SNMP Exporter v0.29.0 polling IF-MIB
- gnmic for gNMI streaming telemetry
- Grafana 12 with Device Dashboard and Network Overview
- Anonymous read-only Grafana access
- Worker healthcheck: `celery inspect ping`
- Scheduler healthcheck: `/proc/1/cmdline` check
- Custom Nautobot 3.0 image with device-onboarding, welcome-wizard, NAPALM
- Docker Compose: PostgreSQL 15, Redis 7, Nautobot, Celery worker/scheduler
- Idempotent bootstrap with 5,200+ community device types
- Juniper devices onboarded (EX series, SRX series)
- Netmiko `read_timeout` patched to 120s
- `tcp_ping` null-port crash (defaults to port 22)
- Nornir credential env var split (`NAUTOBOT_NAPALM_*` vs `NAPALM_*`)
- SSH `conn_timeout` raised to 30s for slower devices