Routes computed incorrectly

Mon, 04/02/2018 - 16:39

N5EG

We recently added a wide area repeater to our network of mostly NSM2's.
Two nodes appear to be computing network route metrics incorrectly.
Let's call the nodes A, R, and B - with R being the high elevation repeater.

Node A has LQ of 100 to node R and NLQ of 94
Node B has LQ of 100 to node R and NLQ of 94

Node A has LQ of 0 to node B and NLQ of 55 (thus B hears A poorly, and A does not hear B at all).

Node A correctly computes a 2-hop route to B via R, and computes the correct value of ETX (2.13)
Node B, however, decides that node A is a neighbor and computes a 1-hop direct route to A, not using R.
ETX values are not displayed for single-hop paths; however, that value should be 1 / (0.55 * 0.0) = infinity
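
As a rough check of the numbers above, here is a minimal Python sketch of the ETX arithmetic. It assumes the usual OLSR convention that a link's ETX is 1 / (LQ * NLQ) and that a path's ETX is the sum of its links. The function names `link_etx` and `path_etx` are made up for illustration and are not part of olsrd or AREDN.

```python
import math


def link_etx(lq: float, nlq: float) -> float:
    """ETX of one link: 1 / (LQ * NLQ), with LQ and NLQ as fractions 0.0-1.0."""
    product = lq * nlq
    # A link that is dead in either direction can never deliver and
    # acknowledge a packet, so its cost is treated as infinite.
    return math.inf if product == 0 else 1.0 / product


def path_etx(links) -> float:
    """ETX of a multi-hop path: the sum of the per-link ETX values."""
    return sum(link_etx(lq, nlq) for lq, nlq in links)


via_r = path_etx([(1.00, 0.94), (1.00, 0.94)])  # A-R, then R-B
direct = link_etx(0.55, 0.0)                    # A-B, dead in one direction

print(f"via R:  {via_r:.2f}")
print(f"direct: {direct}")
print("preferred:", "via R" if via_r < direct else "direct")
```

Under these assumptions the 2-hop path comes out at about 2.13, matching what node A displays, and the direct link can never win a lowest-cost comparison.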

Another node, C, behaves exactly the same way as B.

It would seem that no neighbor should be displayed if either the LQ or NLQ value is zero,
since that should result in an infinite route cost. If LQ = 0, the node seems to correctly void
the neighbor; however, when NLQ = 0, a neighbor is still claimed -- perhaps that is closely related to the error.

-- Tom, N5EG

Mon, 04/02/2018 - 18:56

KG6JEI

Please attach a support data file for each node in question while the issue is present. This can be obtained by going to the Setup > Administration page.

Secondly, I’ll note that the Mesh Status page doesn’t show you the route a packet will take. From the node’s point of view, A is indeed a neighbor: it’s a device we can hear directly, even though it may not necessarily hear us. Just because you see the node as a neighbor on the Mesh Status page doesn’t mean that it will be used as the path; it just signifies that we can see packets from it directly.

The odds are it actually is using the path via R (a traceroute from your PC would show this).

In addition, if you believe this is an actual flaw, it should probably be opened in our trouble ticket system, Bloodhound ( http://bloodhound.aredn.org )

Mon, 04/02/2018 - 19:35

(Reply to #2) #3

KE2N

trace

see
https://www.aredn.org/content/how-determine-which-path-being-used#comments

tracing the route can be interesting indeed.

I previously observed that OLSR (used in AREDN) tends to favor weak links with fewer hops over strong links with extra hops.

I think the problem is that the algo uses only link quality and not link capacity (the OEM firmware has both indicators).

What's missing from consideration is the action of the wireless chip driver, which automatically adjusts the bit rate downward on weak links to improve the link quality. So you can have good link quality while running at a low MCS index. For example, the Rocket M5 will run down to -96 dBm at MCS 0, so any single-hop path in the -90 dBm region will likely be chosen over a 2-hop path running at perhaps 6x that data rate...
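
To make the capacity point concrete, here is a small Python sketch that compares plain ETX with an airtime-weighted cost (ETX times packet size divided by bit rate). This is only an illustration of the idea, not what OLSR in AREDN computes. The ETX values, the bit rates (6.5 and 39 Mbit/s), the packet size, and the names `airtime_us` and `path_totals` are all hypothetical.

```python
def airtime_us(etx: float, rate_mbps: float, packet_bits: int = 12_000) -> float:
    """Expected airtime in microseconds to get one packet across one link.

    etx counts expected transmissions; bits / (Mbit/s) gives microseconds.
    Protocol overhead (preambles, ACKs, contention) is ignored.
    """
    return etx * packet_bits / rate_mbps


def path_totals(hops):
    """Return (total ETX, total airtime in us) for a list of (etx, rate) hops."""
    total_etx = sum(etx for etx, _ in hops)
    total_airtime = sum(airtime_us(etx, rate) for etx, rate in hops)
    return total_etx, total_airtime


# Hypothetical paths: (ETX, bit rate in Mbit/s) for each hop.
weak_direct = [(1.2, 6.5)]                      # one marginal hop at a low rate
strong_two_hop = [(1.06, 39.0), (1.06, 39.0)]   # two solid hops at 6x the rate

for name, hops in (("direct", weak_direct), ("two-hop", strong_two_hop)):
    etx, airtime = path_totals(hops)
    print(f"{name}: ETX {etx:.2f}, airtime {airtime:.0f} us")
```

With these made-up numbers, ETX alone prefers the direct hop (1.20 versus 2.12), while the airtime-weighted cost prefers the two-hop path.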

Mon, 04/02/2018 - 21:37

(Reply to #3) #4

N5EG

Thank you for the prompt replies and help!

Since Node B does not show node A as a remote node (only a local node),
one would assume that the route is direct rather than through R. This is
difficult to troubleshoot, as attempts to access node B from node A frequently
time out.

I will gather traceroutes and data files, although this requires coordination with a
person at site B to gather the data and reverse-direction traceroutes from their end
and email them to me.

-- Tom, N5EG

Tue, 04/03/2018 - 01:57

(Reply to #4) #5

KG6JEI

“Since Node B does not show node A as a remote node (only a local node),
one would assume that the route is direct rather than through R. “

Well, when you put it that way, yes, I can see how one would think that is the case. That itself might be worth a ticket as an enhancement, even if the rest of it doesn’t pan out.

Wed, 04/04/2018 - 05:34

N5EG

An update - a cold front came through. That has degraded the B hearing A path enough
that B shows A dropped out of the neighbor column and moved into the remote column, with ETX of about
2.13. That makes B now reachable from A via R bi-directionally and everything works correctly.

The originally reported problem had been stable for days (i.e. no ability to move bits between A and B)
until the cold front. Traceroutes from both ends now yield the correct A-R-B and B-R-A paths respectively.

This means that we will need to await weather that brings the problem back, and get the folks on both ends
of the path taking data at the same time.

-- Tom, N5EG

Thu, 04/05/2018 - 16:00

(Reply to #6) #7

AE6XE

Tom, can you clarify?

The A and B nodes cannot communicate with each other at times? If so, this would indicate they are establishing a direct (unstable) OLSR routing link, and each would be expected to show as a direct neighbor (with some lag time before dropping from the list). This established link would have to beat the 2.13 ETX of the 2-hop path for data to be routed over it. This means the LQ/NLQ would have to be ~68.5% in both directions. Is this occurring? Or is this only an issue of how information is shown in mesh status?
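
For reference, the ~68.5% figure follows from ETX = 1 / (LQ * NLQ) with LQ equal to NLQ. A short Python sketch of that arithmetic, where the function name `min_symmetric_quality` is made up for illustration:

```python
import math


def min_symmetric_quality(etx_to_beat: float) -> float:
    """LQ = NLQ value at which a single link's ETX equals etx_to_beat."""
    # ETX = 1 / (LQ * NLQ); with LQ == NLQ == q this gives q = sqrt(1 / ETX).
    return math.sqrt(1.0 / etx_to_beat)


threshold = min_symmetric_quality(2.13)
print(f"{threshold:.1%}")  # prints 68.5%
```

Any direct link whose LQ and NLQ are both above this value has an ETX below 2.13 and would be preferred over the 2-hop path on ETX alone.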

This touches on the root design limits in OLSRv1, which, as KE2N noted, falls short of the ideal target (which we can get closer to, but will never hit perfectly). LQ, SNR, or similar are poorer data points, because they don't tell us what we directly want to know, "what path, right now, yields the best through-put (and latency if that is a concern)".

The current mesh status information can also show 2 paths to a neighbor, dtdlink and RF, with no indication of which path traffic routes over, although we can infer that no traffic is routed over a 0% LQ link and that dtdlink wins over RF to neighbors. There is little or no discernible routing information between remote nodes. The mesh status page falls short in a number of ways as a routing view. To date, its primary purpose has been elsewhere, e.g. showing what services are on the mesh network.

Joe AE6XE

Thu, 04/05/2018 - 16:56

(Reply to #7) #8

N5EG

Hi Joe. Thanks for following up. Let me try to clarify.

I am physically located at A. B is at another person's house.

The problem first started when we put up the high elevation repeater and everyone re-aimed antennas towards it. After doing so, I noted that I was not able to easily exchange data between A and B. Since both A and B have what appear to be excellent links with R, this was puzzling. Because I am physically located at the A node, it was easy to query A for information. As noted, A had the correct web display of information (B was not a neighbor, R was a neighbor, B was reachable as a remote node, ETX = 2.13).

At node B the web display showed A as a neighbor, with a low LQ varying from 25% to 45% and NLQ = zero percent. Node B did not show node A as a remote node. This data was gathered by the owner of node B, physically present at node B, and sent to me by email. I infer that node B should have calculated an ETX of 'infinite value' for the 1-hop path to A and thus preferred the route via R (B-->R-->A, a 2-hop route), since that should have computed to ETX = 2.13. This is where I hypothesize the problem was occurring. Why would an ETX=infinite route be better than an ETX=2.13 route? The terrible performance of the selected route seemed to indicate that B was indeed attempting to route traffic directly to A. So this would not appear to be a small difference in path selection, but rather a quite large discrepancy.

It stayed this way for about 72 hours quite consistently, until the cold front blew through and killed the signals that B was hearing from A (about two hours after I submitted the first report). Once that happened, node B showed A as no longer a neighbor but instead a 2-hop remote node with ETX = 2.13. B is now routing correctly: there is very good and very fast information exchange between A and B via R (path confirmed with traceroutes from the two ends).

The owner of node B just figured out how to retrieve the large set of extensive routing and link tables out of the LAN port of a directly attached node in text form, which includes the computed ETX values for neighbor nodes (a key piece of information). So if / when the problem recurs we will attempt to gather a complete picture of things from both A and B including the computed ETX values for neighbor nodes as well as remote nodes, gateways, etc.

Does that answer or clarify ?

-- Tom, N5EG

Thu, 04/05/2018 - 18:13

(Reply to #8) #9

AE6XE

Tom, what we've done in a couple of cases locally is have A and B tweak their antennas (Nanostations?) away from each other while still keeping a quality link to 'R'. This has the potential to make the symptoms go away. The other option is tweaking the antennas towards each other to improve the link quality (still keeping a quality link to 'R' too). Work to stay out of the middle zone, in limbo -- either get a quality direct link between A and B, or make sure no direct link exists at all.

Joe AE6XE

Thu, 04/05/2018 - 17:14

(Reply to #9) #10

KG6JEI

" LQ, SNR, or similar are poorer data points, because they don't tell us what we directly want to know, "what path, right now, yields the best through-put (and latency if that is a concern)". "

I encourage you to submit a feature request ticket on this if you feel it is something we should have in AREDN. So far no one has officially requested this yet.

To date, AREDN has run on "what likely requires the least RF transmissions"; the "highest speed path" hasn't been specified as a core goal of what AREDN is supposed to target in its routes. Essentially, this is an AREDN protocol definition and not necessarily an OLSR limitation (OLSR can be expanded to support any metrics we want; we just have not done it). Now, what requires the least RF transmissions and retransmissions will sometimes coincide with the fastest route, but this will not always be the case.

Sat, 04/07/2018 - 10:53

#11

N5EG

The problem has returned. I was able to ssh to the remote node (very sluggish) and do a traceroute back to myself.

From A to B: traceroute results: A to R to B.
From B to A: traceroute results: B to A

Node B shows LQ = 88% and NLQ = 0% to node A as a neighbor
Node A shows B is not a neighbor, but is a remote node, 2 hops, ETX = 2.79

I've left voicemail with the node owner to try to capture node B's routing data.
The response over the network is too sluggish to do much remotely.

-- Tom, N5EG

Sun, 04/08/2018 - 03:17

#12

AE6XE

Tom, when a link (like this one from A to B) is on the edge or marginal, the symptoms you see are expected. When one node stops hearing another, there is a delay in propagating the information to the other nodes in the mesh network. During this lag time, the routing information can be bogus until all nodes are updated. If the link is going up and down a lot, it is disruptive. Even if it stays up a lot, but has, say, ~50% LQ and NLQ (ETX = 4), some services like video streaming start to become unusable. Compare this to a path of 4 perfect hops (also ETX = 4), where video streams work great.
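
One way to see why two paths with the same ETX can behave so differently: ETX is only an expected value. The Python sketch below compares the marginal link with the 4-hop path, assuming independent losses and a hypothetical limit of 7 attempts per packet. Neither assumption comes from AREDN, and the function names are made up for illustration.

```python
def link_etx(lq: float, nlq: float) -> float:
    """ETX of one link: 1 / (LQ * NLQ), with LQ and NLQ as fractions."""
    return 1.0 / (lq * nlq)


def residual_loss(lq: float, nlq: float, attempts: int) -> float:
    """Probability that a packet is still undelivered after `attempts` tries.

    Assumes each try succeeds independently with probability LQ * NLQ.
    """
    return (1.0 - lq * nlq) ** attempts


marginal_etx = link_etx(0.5, 0.5)        # one link at ~50% LQ and NLQ
four_hop_etx = 4 * link_etx(1.0, 1.0)    # four perfect hops in series

attempts = 7  # hypothetical retry limit, for illustration only
marginal_loss = residual_loss(0.5, 0.5, attempts)
per_hop_loss = residual_loss(1.0, 1.0, attempts)
four_hop_loss = 1.0 - (1.0 - per_hop_loss) ** 4

print(f"ETX:  marginal {marginal_etx:.1f}, four-hop {four_hop_etx:.1f}")
print(f"loss: marginal {marginal_loss:.1%}, four-hop {four_hop_loss:.1%}")
```

Both paths score ETX = 4, but under these assumptions the marginal link still loses a noticeable share of packets even after retries, while the perfect hops lose none.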

Sun, 04/08/2018 - 10:10

(Reply to #12) #13

N5EG

Thanks for your comments. I appreciate all that the AREDN team has done in creating the software
and making it freely available. This is really a great project for the Ham community, and the team
deserves a lot of accolades.

My initial concern is that it stayed in this state for more than 72 hours when first encountered,
making node B unreachable for days at a time. The node owner downloaded node B tables yesterday
morning about 60~90 minutes after I captured the asymmetric routing, but by then things had fixed
themselves and routing was symmetrical - and working well.

It is interesting to note that when he pulled the table, node B considered node A to be a neighbor with
LQ = 0.19 NLQ = 0.000 ETX = INFINITE (spelled that way in the table) and was not a remote node.
But the routing was in fact as a remote node, with the correct path being computed. The web display
not matching the routing is inconvenient but not the important issue.

Continuous link state changes are tough on all routing protocols, even in the commercial Internet.
We must have encountered some kind of link condition where it is perhaps coming and going
periodically to trigger and persist a transient condition.

-- Tom, N5EG
