In the last 24 months we have walked into the back-of-house of operators with subscriber bases ranging from 8,000 to 220,000. The headline problems differ. The underlying problems are almost identical. This is the field-tested list of those underlying problems, and what we keep doing about them.
1. Two NMSes, one for each OLT vendor
Whenever an operator acquires another, two NMSes show up. Each is "the source of truth" for its own kit. Engineers keep both tabs open. Mistakes happen in the joins. Consolidation onto a vendor-agnostic NMS pays for itself within 9–12 months in licences alone and within 3–4 months in error reduction.
2. The CRM does not know what the network is doing
Customer care is on the line with a complaining subscriber, and the only signal they have from the network is "active" / "not active." If the subscriber's WAN is up but the WiFi at the CPE is misconfigured, the agent has no way to know. Closing this signal gap is the single biggest move available to most operators.
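As a rough illustration of what a richer signal could look like, here is a minimal sketch that turns a few per-subscriber network fields into a one-line hint for the care agent. The field names (wan_up, wifi_enabled, wifi_clients) and the care_hint function are hypothetical; a real ACS or NMS would expose its own names and more states.

```python
from dataclasses import dataclass


@dataclass
class CpeStatus:
    # Hypothetical fields, assumed to be pulled from the network side.
    wan_up: bool
    wifi_enabled: bool
    wifi_clients: int


def care_hint(status: CpeStatus) -> str:
    """Turn raw CPE state into a one-line hint for the care agent."""
    if not status.wan_up:
        return "WAN down: check access network or CPE power"
    if not status.wifi_enabled:
        return "WAN up, WiFi disabled at the CPE"
    if status.wifi_clients == 0:
        return "WAN up, WiFi on, no clients associated"
    return "WAN and WiFi look healthy"


# The case from the text: WAN is fine, WiFi is misconfigured.
print(care_hint(CpeStatus(wan_up=True, wifi_enabled=False, wifi_clients=0)))
```

Even three or four distinct states like these give the agent more to work with than "active" / "not active."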
3. Provisioning takes longer than it should because of one missing API
In nine out of ten manual-provisioning shops, there is exactly one tool (often the OLT controller) that does not have a usable API. So a human runs a CLI script. Everything else is automated. The cost of that one CLI script is days of latency and a long tail of typos.
4. Topology is in three different spreadsheets
Field has a Google Sheet. Engineering has a Visio diagram. The NMS has its own auto-discovered graph. No two of them agree. Root-cause analysis (RCA) cannot work without a canonical topology, and a canonical topology means one derived from the live network via LLDP/CDP, augmented by hand only at the seams.
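A first step towards a canonical topology is simply measuring how far the hand-kept sources have drifted from the live graph. The sketch below assumes each source can be exported as a list of (device, device) links; the function names and sample device names are illustrative only, and collecting the LLDP/CDP neighbours is left out.

```python
def normalise(links):
    """Treat links as undirected: (a, b) and (b, a) are the same edge."""
    return {frozenset(link) for link in links}


def disagreements(live, manual_sources):
    """Compare each hand-kept source against the live (LLDP/CDP-derived) graph.

    live: iterable of (device, device) links from the network.
    manual_sources: dict of source name -> iterable of links.
    """
    live_edges = normalise(live)
    report = {}
    for name, links in manual_sources.items():
        edges = normalise(links)
        report[name] = {
            "not_in_live": edges - live_edges,          # stale or mistyped
            "missing_from_source": live_edges - edges,  # never written down
        }
    return report


live = [("olt1", "agg1"), ("agg1", "core1")]
sources = {"field_sheet": [("agg1", "olt1"), ("olt1", "core1")]}
report = disagreements(live, sources)
```

Here the field sheet agrees on the olt1/agg1 link, lists an olt1/core1 link the network does not have, and omits agg1/core1. The "missing_from_source" entries are where the live graph should win; the "not_in_live" entries are the seams that need a human decision.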
5. Firmware is way out of date — or way too fresh
There is no consistent policy. Either CPEs are running 4-year-old firmware with known CVEs, or they were force-pushed to the latest beta last month and 2% of them now hang every 12 hours. The fix is a staged rollout policy with health gates, baked into the ACS.
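The shape of a staged rollout with health gates can be sketched in a few lines. This is an illustration under stated assumptions, not any ACS vendor's API: the upgrade and healthy callbacks, the wave fractions, and the 1% failure threshold are all hypothetical placeholders.

```python
def staged_rollout(devices, stages, upgrade, healthy, max_failure_rate=0.01):
    """Upgrade devices in growing waves; halt if a wave fails its health gate.

    stages: cumulative fractions of the fleet, e.g. (0.01, 0.10, 1.0).
    upgrade(device): pushes the firmware (hypothetical callback).
    healthy(device): True if the device passes post-upgrade checks.
    Returns (upgraded_devices, completed); completed is False if halted.
    """
    upgraded = []
    done = 0
    for fraction in stages:
        # Each wave covers at least one new device, never more than the fleet.
        target = min(len(devices), max(done + 1, int(len(devices) * fraction)))
        wave = devices[done:target]
        if not wave:
            break
        for device in wave:
            upgrade(device)
        upgraded.extend(wave)
        done = target
        # Health gate: stop before the next, larger wave if too many fail.
        failures = sum(1 for device in wave if not healthy(device))
        if failures / len(wave) > max_failure_rate:
            return upgraded, False
    return upgraded, True
```

With a fleet of 100 and stages of 1%, 10% and 100%, a bad build that breaks one device in the second wave stops the rollout at 10 devices instead of reaching the whole fleet. In practice the health check would run after a soak period long enough to catch slow failures such as a hang every 12 hours.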
6. The alarm noise floor is too high
A typical NMS surfaces 4,000–20,000 alarms a month. A typical NOC reads 50. The rest are dropped on the floor, and the real ones are buried among them. AI suppression based on topology and historical correlation cuts the visible volume by 80–95% without losing the actionable alarms.
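The topology half of that suppression is simple enough to sketch: if an upstream device is already in alarm, alarms from everything behind it are symptoms, not causes. The example below covers only this rule (no historical correlation, no learning), and it assumes a tree-shaped access network described by a hypothetical parent_of mapping.

```python
def suppress_by_topology(alarming, parent_of):
    """Keep only alarms with no alarming device anywhere upstream.

    alarming: set of device names currently in alarm.
    parent_of: dict mapping each device to its upstream parent
               (devices at the root are absent or map to None).
    """
    visible = set()
    for node in alarming:
        ancestor = parent_of.get(node)
        seen = set()  # guards against loops in bad topology data
        while ancestor is not None and ancestor not in seen:
            if ancestor in alarming:
                break  # an upstream alarm explains this one; suppress it
            seen.add(ancestor)
            ancestor = parent_of.get(ancestor)
        else:
            visible.add(node)  # walked to the root without finding a cause
    return visible


parent_of = {"olt1": "agg1", "olt2": "agg1",
             "ont1": "olt1", "ont2": "olt1", "ont3": "olt2"}
alarming = {"olt1", "ont1", "ont2", "ont3"}
visible = suppress_by_topology(alarming, parent_of)
```

In this example the two ONT alarms behind olt1 are hidden, leaving olt1 and ont3 for the NOC to read. The rule is only as good as the topology it runs on, which is one more reason the canonical graph matters.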
7. No backup of OLT running-config — anywhere
We have walked into ops centres where the most expensive single device on the network has no running-config backup beyond "the engineer's laptop." When that OLT dies on Saturday night, the recovery time is days, not hours. Daily automated backups into a git-style versioned store are table stakes.
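One way to get git-style behaviour without much machinery is to name each snapshot by the hash of its content and write a new one only when the config has changed. The sketch below assumes the running-config has already been fetched as text; how it is pulled from the OLT is vendor-specific and not shown, and the directory layout and LATEST pointer file are illustrative choices.

```python
import hashlib
from pathlib import Path
from typing import Optional


def backup_config(store: Path, device: str, running_config: str) -> Optional[Path]:
    """Save a snapshot named by its content hash; skip if nothing changed.

    Returns the new snapshot path, or None when the config matches the
    most recent snapshot for this device.
    """
    device_dir = store / device
    device_dir.mkdir(parents=True, exist_ok=True)
    digest = hashlib.sha256(running_config.encode("utf-8")).hexdigest()
    latest = device_dir / "LATEST"  # pointer to the newest snapshot's hash
    if latest.exists() and latest.read_text() == digest:
        return None
    snapshot = device_dir / f"{digest}.cfg"
    snapshot.write_text(running_config)
    latest.write_text(digest)
    return snapshot
```

Run daily from a scheduler, this keeps every distinct config the device has ever had and makes "what changed, and when" a file listing. Committing the store directory to an actual git repository would add diffs and authorship on top.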
8. RADIUS is the secret single point of failure
At the operator we last benchmarked, every minute of RADIUS unavailability cost roughly 400 sessions and a wave of inbound calls. The RADIUS server was running on one VM, behind one load balancer, with no clustered failover. This is, painfully, the most common configuration we find.
9. Reports take three days to produce
A monthly executive report shouldn't take three days of analyst time. It does, because the data lives in four tools and the analyst is the integration. AI-assisted report generation with templated narratives turns that into an hour — and the analyst spends the saved time on something only humans can do.
10. The "knowledge" lives in two people's heads
Every ISP has at least one engineer who knows exactly which OLT port has the dodgy SFP and which subscriber is always going to call on Friday night. That knowledge is not written down. When the engineer changes jobs, six months of operational quality goes with them. Codifying tribal knowledge into the platform — as policy, as profiles, as monitors — is the work most operators put off and most regret putting off.
How we score on day one
When we engage with a new operator, we run a 90-minute audit against these ten patterns. The output is a heat-map: red (priority fix), amber (queue), green (already healthy). An operator with three or fewer reds is unusual.
