We recently added a wide area repeater to our network of mostly NSM2's.
Two nodes appear to be computing network route metrics incorrectly.
Let's call the nodes A, R, and B - with R being the high elevation repeater.
Node A has LQ of 100 to node R and NLQ of 94
Node B has LQ of 100 to node R and NLQ of 94
Node A has LQ of 0 to node B and NLQ of 55 (thus B hears A poorly, and A does not hear B at all).
Node A correctly computes a 2-hop route to B via R, and computes the correct value of ETX (2.13)
Node B, however, decides that node A is a neighbor and computes a 1-hop direct route to A, not using R.
ETX values are not displayed for single-hop paths; however, that value should be 1 / (0.55 * 0.0) = infinity.
Another node, C, behaves exactly the same way as B.
It would seem that there should not be a neighbor displayed if either the LQ or NLQ value is zero,
since it should result in an infinite route cost. If LQ = 0, then the node seems to correctly void
the neighbor, however when NLQ = 0, there is a neighbor claimed -- perhaps that may be closely related to the error.
-- Tom, N5EG
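For readers following the arithmetic, here is a minimal sketch of the ETX calculation described above, assuming the usual formulation ETX = 1 / (LQ * NLQ) per link and a path cost equal to the sum of the link costs. The function names are made up for illustration and are not taken from OLSR or the AREDN firmware.

```python
import math

def link_etx(lq: float, nlq: float) -> float:
    """ETX of one link: 1 / (LQ * NLQ); infinite if either direction is dead."""
    product = lq * nlq
    return math.inf if product == 0 else 1.0 / product

def path_etx(links):
    """Path cost, assumed here to be the sum of the per-link ETX values."""
    return sum(link_etx(lq, nlq) for lq, nlq in links)

# Values from the report: A-R and R-B both have LQ 100% / NLQ 94%.
via_r = path_etx([(1.00, 0.94), (1.00, 0.94)])
# Direct A-B link: one direction is 55%, the other 0%.
direct = link_etx(0.55, 0.0)

print(f"via R: {via_r:.2f}, direct: {direct}")  # via R: 2.13, direct: inf
```

Under these assumptions the two-hop path reproduces the reported 2.13, and the direct link can never win on cost.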
Secondly, I’ll note that the Mesh status page doesn’t show you the route a packet will take. The node is indeed a neighbor as far as the node is concerned: it’s a device we can hear directly, but one that may not necessarily hear us. Just because you see the node as a neighbor on the Mesh status page doesn’t mean that it will be used as the path; it just signifies that we can see packets from it directly.
The odds are it actually is using the path via R (a traceroute from your PC would show this).
In addition, if you believe this is an actual flaw, it should probably be opened in our trouble ticket system, Bloodhound ( http://bloodhound.aredn.org )
see
https://www.aredn.org/content/how-determine-which-path-being-used#comments
tracing the route can be interesting indeed.
I previously observed that OLSR (used in AREDN) tends to favor weak links with fewer hops over strong links with extra hops.
I think the problem is that the algo uses only link quality and not link capacity (the OEM firmware has both indicators).
What's missing from consideration is the action of the wireless chip driver, which automatically adjusts the bit rate downward on weak links to improve the link quality. So you can have good link quality while running at a low MCS index. For example, the Rocket M5 will run down to -96 dBm at MCS 0, so any single-hop path in the -90 dBm region will likely be chosen over a 2-hop path running at perhaps 6x that data rate...
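To illustrate the point, here is a rough sketch that weights each hop's ETX by its data rate, similar in spirit to an "expected transmission time" metric. This is not how OLSR in AREDN computes routes. The ETX values, the 6.5 Mbps rate for the weak link, and the 6x rate for the strong links are illustrative assumptions, and the sketch ignores contention between hops sharing a channel.

```python
def airtime_cost(hops, packet_bits=12000):
    """Sum of expected airtime in seconds per hop: ETX * packet_bits / rate."""
    return sum(etx * packet_bits / (rate_mbps * 1e6) for etx, rate_mbps in hops)

# Each hop is (ETX, data rate in Mbps); all numbers below are assumed.
one_hop = [(1.5, 6.5)]                   # weak direct link at a low MCS index
two_hop = [(1.06, 39.0), (1.06, 39.0)]   # two strong links at 6x the rate

etx_one = sum(etx for etx, _ in one_hop)
etx_two = sum(etx for etx, _ in two_hop)

print("ETX prefers one hop:", etx_one < etx_two)
print("Airtime prefers two hops:", airtime_cost(two_hop) < airtime_cost(one_hop))
```

With these numbers, ETX alone picks the single weak hop, while the rate-weighted cost picks the two-hop path.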
one would assume that the route is direct rather than through R. This is
difficult to troubleshoot as attempts to access node B from node A frequently
time out.
and email it to me.
one would assume that the route is direct rather than through R.”
Well, when you put it that way, yes, I can see how one would think that is the case. That itself might be worth a ticket as an enhancement even if the rest of it doesn’t pan out.
that B shows A dropped out of the neighbor column and moved into the remote column, with ETX of about
2.13. That makes B now reachable from A via R bi-directionally and everything works correctly.
The originally reported problem had been stable for days (i.e. no ability to move bits between A and B)
until the cold front. Traceroutes from both ends yield the correct A-R-B and B-R-A paths respectively.
This means that we will need to await weather that brings the problem back, and get the folks on both ends
of the path taking data at the same time.
-- Tom, N5EG
The A and B nodes cannot communicate with each other at times? If so, this would indicate they are establishing a direct (unstable) OLSR routing link, and A would be expected to show as a direct neighbor (with lag time to drop from the list). This established link would have to beat the 2.13 ETX of the two-hop path for data to be routed over it. This means the LQ/NLQ would have to be ~68.5% in both directions. Is this occurring? Or is this only an issue of how information is showing in mesh status?
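The ~68.5% figure can be checked with a short sketch, assuming ETX = 1 / (LQ * NLQ) and equal quality in both directions (LQ = NLQ = q), so that ETX = 1 / q^2. The function name is hypothetical.

```python
import math

def min_symmetric_quality(target_etx: float) -> float:
    """With LQ = NLQ = q, ETX = 1 / q**2, so q must exceed sqrt(1 / target_etx)."""
    return math.sqrt(1.0 / target_etx)

q = min_symmetric_quality(2.13)
print(f"{q:.1%}")  # 68.5%
```

Any direct link with both LQ and NLQ below that value costs more than the 2.13 two-hop path.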
This touches on the root design limits in OLSRv1, which, as KE2N noted, falls short of the ideal target (one we can get closer to, but never reach perfectly). LQ, SNR, and similar are poor data points because they don't tell us what we directly want to know: "what path, right now, yields the best throughput (and latency if that is a concern)".
The current mesh status information can also show 2 paths to a neighbor, dtdlink and RF, with no indication of which path traffic routes over, although we can infer that no traffic is routed over a 0% LQ link and that dtdlink wins over RF to neighbors. There is little or no discernible routing information between remote nodes. The mesh status page falls short in a number of ways as a routing view. To date, its primary purpose has been elsewhere, e.g. showing what services are on the mesh network.
Joe AE6XE
I am physically located at A. B is at another person's house.
The problem first started when we put up the high elevation repeater and everyone re-aimed antennas towards it. After doing so I noted that I was not able to easily exchange data between A and B. Since both A and B have what appear to be excellent links with R, this was puzzling. Since I am physically located at the A node, it was easy to query A for information. As noted, A had the correct web display of information (B was not a neighbor, R was a neighbor, B was reachable as a remote node, ETX = 2.13).
At node B the web display showed A as a neighbor, with LQ at a low value varying from 25% to 45% and NLQ = zero percent. Node B did not show node A as a remote node. This data was gathered by the owner of node B, physically present at node B, and sent to me by email. I infer that node B should have calculated an ETX of 'infinite value' for the 1-hop path to A and thus preferred the route via R (B-->R-->A 2-hop route), since it should have computed to ETX = 2.13. This is where I hypothesize the problem was occurring. Why would an ETX=infinite route be better than an ETX=2.13 route? The terrible performance of the selected route seemed to indicate that B was indeed attempting to route traffic directly to A. So this would not appear to be a small difference in path selection, but rather a quite large discrepancy.
It stayed this way for about 72 hours quite consistently until the cold front blew through and killed the signals that B was hearing from A (about two hours after I submitted the first report). Once that happened, node B showed that A is not a neighbor and is instead a 2-hop remote node with ETX = 2.13. B is now routing correctly, and there is very good and very fast information exchange between A and B via R (path confirmed with traceroutes from the two ends).
The owner of node B just figured out how to retrieve the large set of extensive routing and link tables out of the LAN port of a directly attached node in text form, which includes the computed ETX values for neighbor nodes (a key piece of information). So if / when the problem recurs we will attempt to gather a complete picture of things from both A and B including the computed ETX values for neighbor nodes as well as remote nodes, gateways, etc.
Does that answer or clarify?
-- Tom, N5EG
Joe AE6XE
I encourage you to submit a feature request ticket on this if you feel it is something we should have in AREDN. So far no one has officially requested this yet.
To date, AREDN has run on "what likely requires the least RF transmissions"; the "highest speed path" hasn't been specified as a core option of what AREDN is supposed to target in its routes. Essentially this is an AREDN protocol definition and not necessarily an OLSR limitation (OLSR can be expanded to support any metrics we want; we just have not done it). Now, what requires the least RF transmissions and retransmissions will sometimes coincide with the fastest route, but this will not always be the case.
From A to B: traceroute results: A to R to B.
From B to A: traceroute results: B to A
Node B shows LQ = 88% and NLQ = 0% to node A as a neighbor
Node A shows B is not a neighbor, but is a remote node, 2 hops, ETX = 2.79
I've left voicemail with the node owner to try to capture node B's routing data.
The response over the network is too sluggish to do much remotely.
-- Tom, N5EG
and making it freely available. This is really a great project for the Ham community, and the team
deserves a lot of accolades.
My initial concern is that it stayed in this state for more than 72 hours when first encountered,
making node B unreachable for days at a time. The node owner downloaded node B tables yesterday
morning about 60~90 minutes after I captured the asymmetric routing, but by then things had fixed
themselves and routing was symmetrical - and working well.
It is interesting to note that when he pulled the table, node B considered node A to be a neighbor with
LQ = 0.19 NLQ = 0.000 ETX = INFINITE (spelled that way in the table) and was not a remote node.
But the routing was in fact as a remote node, with the correct path being computed. The web display
not matching the routing is inconvenient but not the important issue.
Continuous link state changes are tough on all routing protocols, even in the commercial Internet.
We must have encountered some kind of link condition where it is perhaps coming and going
periodically to trigger and persist a transient condition.
-- Tom, N5EG