Automating root-cause analysis on a GPON network

A subscriber goes offline at 21:14. By 21:15 the customer is on the phone. By 21:17 a Tier-1 agent has read the symptom into a ticket. By 21:23 a NOC engineer is in five tools simultaneously trying to figure out whether it is a real fault, a billing block, or a momentary blip. By 21:40 a truck is dispatched. In 60% of cases, the truck did not need to roll.

This is the everyday cost of manual root-cause analysis. It is not the technicians' fault. The data exists; the workflow does not. The single highest-leverage change a GPON operator can make in 2026 is to automate this loop end-to-end. Here is how that loop should look and how to evaluate any RCA engine that claims to do it.

The physics of a GPON fault

In a Passive Optical Network, the actionable signals are concentrated in a small set of measurements. Knowing this is the foundation of automated RCA — because the AI has to know what to look at.

Optical Rx/Tx power at the ONT and at the OLT PON port (dBm).
OLT port status (up/down, last flap, error counters).
PON tree neighbours — sibling ONTs on the same splitter chain.
Authentication and session state (TR-069 inform, RADIUS, DHCP).
Recent configuration changes (last ACS push, firmware update).
Field-history correlation — has this ONT had a complaint in the last 30 days?
Weather signals where available — cold/heat events affecting outside-plant fibres.
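
To make the list concrete, here is a minimal sketch of how those signals might be gathered into one record per ONT. The class name, field names, and the single rolled-up session flag are illustrative assumptions, not NetXol's schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OntSnapshot:
    """Hypothetical per-ONT evidence record; every field name is illustrative."""
    ont_id: str
    rx_power_dbm: Optional[float]        # None when the ONT cannot be read
    last_good_rx_dbm: Optional[float]    # last known good Rx power
    olt_port_up: bool
    siblings_down: int                   # sibling ONTs on the same splitter chain that are down
    siblings_total: int
    session_active: bool                 # TR-069 / RADIUS / DHCP state rolled into one flag
    hours_since_config_change: Optional[float]
    complaints_last_30d: int = 0
    weather_alert: bool = False


def rx_drop_db(snap: OntSnapshot) -> Optional[float]:
    """dB lost relative to last known good; a positive value means a weaker signal."""
    if snap.rx_power_dbm is None or snap.last_good_rx_dbm is None:
        return None
    return snap.last_good_rx_dbm - snap.rx_power_dbm
```

Keeping the last known good reading next to the current one lets the engine reason about the change in optical power rather than the absolute level.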

The five-second backtrace

An automated RCA engine looks at all of that simultaneously. It does not run a sequential script; it runs a directed graph backwards from the symptom. "ONU offline" becomes "Authentication missing." That becomes "Was there an LOS?" That becomes "What is the Rx power and how does it compare to the last known good?" That becomes "Are siblings on the splitter also down?" Each branch contributes evidence with a weight.
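
The backward walk described above can be sketched as a breadth-first traversal over a small weighted graph. The node names, edges, and weights below are placeholders chosen for illustration; a real engine would derive them from data rather than hard-code them.

```python
from collections import deque

# Hypothetical graph: each edge points from an observation to the question it
# raises next. The weights are placeholders, not tuned values.
GRAPH = {
    "onu_offline": [("auth_missing", 1.0)],
    "auth_missing": [("los_seen", 0.8), ("billing_block", 0.2)],
    "los_seen": [("rx_power_dropped", 0.6), ("siblings_down", 0.4)],
}


def backtrace(symptom, observed):
    """Walk the graph from the symptom, following only branches the evidence
    supports, and return the accumulated weight of every node reached."""
    scores = {symptom: 1.0}
    queue = deque([symptom])
    while queue:
        node = queue.popleft()
        for child, weight in GRAPH.get(node, []):
            if not observed.get(child, False):
                continue  # the evidence does not support this branch
            contribution = scores[node] * weight
            if contribution > scores.get(child, 0.0):
                scores[child] = contribution
                queue.append(child)
    return scores


if __name__ == "__main__":
    evidence = {"auth_missing": True, "los_seen": True, "siblings_down": True}
    print(backtrace("onu_offline", evidence))
```

Multiplying weights along a path is only one possible way to combine evidence; it is used here because it keeps every score between 0 and 1.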

How NetXol's RCA Engine reasons (typical 1.8 s end-to-end)

1. Hydrate (~120 ms)

Pull last-known telemetry for the ONT, its OLT port, its siblings, its last 30 days of complaints, and its config history.

2. Diagnose (~250 ms)

Walk a hypothesis graph: power loss, fibre cut, OLT-port flap, splitter loss, ONU hardware fault, auth issue, billing block.
Score each branch against the evidence.

3. Rank (~80 ms)

Probabilistic rollup. Confidence is a real percentage, not a label like "high."

4. Recommend (~50 ms)

Map the top hypothesis to a remediation: reboot, re-auth, push profile, dispatch.
Estimate cost of being wrong (e.g. unnecessary truck roll).

5. Act or escalate (variable)

If confidence is above policy threshold and the action is reversible, execute and verify.
Otherwise present to the NOC with full evidence.
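
The last three steps (rank, recommend, act or escalate) can be sketched in a few lines. This is an illustration under stated assumptions: the remediation table, the cost units, the 0.95 threshold, and the use of plain normalisation in place of a full probabilistic rollup are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical remediation table: action, whether it is reversible, and the
# cost (arbitrary units) of taking it when the hypothesis turns out wrong.
REMEDIATIONS = {
    "auth_issue": ("re-auth", True, 1.0),
    "onu_hw_fault": ("reboot", True, 2.0),
    "fibre_cut": ("dispatch", False, 150.0),
}


@dataclass
class Decision:
    hypothesis: str
    confidence: float
    action: str
    expected_wrong_cost: float
    auto_execute: bool


def rank(scores):
    """Normalise non-negative branch scores so they sum to 1."""
    total = sum(scores.values())
    if total <= 0:
        raise ValueError("no evidence for any hypothesis")
    return {h: s / total for h, s in scores.items()}


def decide(scores, threshold=0.95):
    """Pick the top hypothesis, map it to an action, and apply the policy gate."""
    probs = rank(scores)
    top = max(probs, key=probs.get)
    confidence = probs[top]
    action, reversible, wrong_cost = REMEDIATIONS[top]
    return Decision(
        hypothesis=top,
        confidence=confidence,
        action=action,
        # Chance of being wrong times what the wrong action would cost.
        expected_wrong_cost=(1.0 - confidence) * wrong_cost,
        # Auto-act only when the action is reversible and confidence clears policy.
        auto_execute=reversible and confidence >= threshold,
    )


if __name__ == "__main__":
    print(decide({"auth_issue": 97.0, "onu_hw_fault": 2.0, "fibre_cut": 1.0}))
```

In this sketch a dispatch is never auto-executed, however confident the engine is, because it is marked irreversible; that case would be presented to the NOC with its evidence.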

What "confidence" should mean

A common failure mode of early RCA tools was a confidence number that meant nothing — a hand-tuned weight on a rule. A modern RCA engine should give you a probability that survives Bayesian sanity-checks: 50% means it is genuinely a coin-flip given current evidence; 95% means action without human review is reasonable for low-cost actions.

Calibration test

Sample 100 RCA outputs flagged as "92% confident." Of those, roughly 92 should be correct on review. If only 70 are correct, the model is over-confident and you should not allow it to auto-act yet.
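
A minimal sketch of that test, assuming reviewed outputs are available as (confidence, was_correct) pairs. The function names and the 0.05 tolerance are assumptions; a real check should also allow for sampling noise in small bins.

```python
def calibration_by_bin(predictions, n_bins=10):
    """predictions: iterable of (confidence in [0, 1], was_correct).
    Returns {bin_index: (mean_confidence, observed_accuracy, count)}."""
    bins = {}
    for conf, correct in predictions:
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 falls in the top bin
        bins.setdefault(idx, []).append((conf, bool(correct)))
    report = {}
    for idx, items in sorted(bins.items()):
        n = len(items)
        mean_conf = sum(c for c, _ in items) / n
        accuracy = sum(ok for _, ok in items) / n
        report[idx] = (mean_conf, accuracy, n)
    return report


def is_overconfident(mean_conf, accuracy, tolerance=0.05):
    """True when stated confidence exceeds observed accuracy by more than the tolerance."""
    return mean_conf - accuracy > tolerance


if __name__ == "__main__":
    # 100 outputs flagged at 92% confidence, of which only 70 were correct on review.
    sample = [(0.92, True)] * 70 + [(0.92, False)] * 30
    for idx, (conf, acc, n) in calibration_by_bin(sample).items():
        print(idx, round(conf, 2), round(acc, 2), n, is_overconfident(conf, acc))
```

Plotting mean confidence against observed accuracy per bin gives the calibration plot to ask a vendor for.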

The role of topology

Half the unnecessary truck rolls we see in the field could have been avoided with a single piece of context: "this is the third ONT on the same PON port to flap in 30 minutes." A single ONT report looks like a customer problem; three siblings down looks like a fibre cut. The pattern only emerges when topology is in the same query plane as telemetry.
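
The sibling check can be sketched as a windowed count over flap events. The event tuple shape, the port naming, and the threshold of three ONTs are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def sibling_flaps(events, pon_port, now, window=timedelta(minutes=30)):
    """Distinct ONTs on pon_port that flapped inside the window.
    events: iterable of (timestamp, pon_port, ont_id) tuples (hypothetical shape)."""
    cutoff = now - window
    return {ont for ts, port, ont in events if port == pon_port and cutoff <= ts <= now}


def looks_like_shared_fault(events, pon_port, now, min_onts=3):
    """True when enough siblings flapped recently to suggest a fault upstream of the ONTs."""
    return len(sibling_flaps(events, pon_port, now)) >= min_onts


if __name__ == "__main__":
    now = datetime(2026, 1, 1, 21, 14)
    events = [
        (now - timedelta(minutes=25), "olt1/0/3", "ont-a"),
        (now - timedelta(minutes=10), "olt1/0/3", "ont-b"),
        (now, "olt1/0/3", "ont-c"),
    ]
    print(looks_like_shared_fault(events, "olt1/0/3", now))
```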

This is why NetXol builds an LLDP/CDP-derived live topology graph as a first-class object in the platform. Every alarm carries its position in the graph. Every RCA hypothesis can ask "who else is downstream of this device?" without leaving the engine.
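
The "who else is downstream of this device?" question reduces to a graph traversal. A minimal sketch follows, assuming the topology is held as a parent-to-children mapping; the device names are hypothetical and this is not NetXol's data model.

```python
def downstream(topology, device):
    """All devices reachable below `device`.
    topology maps each parent to a list of its direct children."""
    seen = set()
    stack = list(topology.get(device, []))
    while stack:
        node = stack.pop()
        if node in seen:
            continue  # guard against loops in a badly discovered graph
        seen.add(node)
        stack.extend(topology.get(node, []))
    return seen


if __name__ == "__main__":
    topology = {
        "olt1": ["olt1/0/3"],
        "olt1/0/3": ["splitter-a"],
        "splitter-a": ["ont-a", "ont-b", "splitter-b"],
        "splitter-b": ["ont-c"],
    }
    print(sorted(downstream(topology, "splitter-a")))
```

With this in place, an alarm on a splitter can be expanded to the set of ONTs it affects before any single-customer hypothesis is scored.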

Evaluating an RCA engine — questions to ask

1. What modalities does it fuse? (telemetry / topology / history / billing / weather)
2. Is "confidence" calibrated? Ask for the calibration plot.
3. Can it act, not just recommend? Under what policy is action allowed?
4. Does it explain itself? You should be able to read the evidence trail for any conclusion.
5. How does it handle unknown failure modes? Does it gracefully escalate rather than guess?
6. Is its data plane multi-vendor? GPON without multi-vendor support is a museum piece.

Measured outcomes

When this loop runs end-to-end, the numbers we observe across operator deployments are consistent. MTTR drops by 40–60%. Truck rolls drop by 30–50%. Customer complaints associated with diagnosed faults drop further still, because many faults are remediated before the customer notices.

Mean time to repair: −47%
Inbound fault complaints: −38%
Avoidable truck rolls: −42%
