We have an XM NanoStation M2 that I only have RF access to. I did an OTA upgrade to the latest production release. When I checked a day later over RF, it had gone "deaf". The mesh status of the local node I was using to connect showed an IP address instead of a name for the node in question, with good LQ but zero NLQ.
I had someone with physical access cycle power for me and it recovered. However, a few days later it went deaf again.
Is this still a known issue? Should I downgrade to the previous production version for now?
KD2EVR, we need a support data download to see what is going on. About 10 minutes after booting, capture a baseline support download. Then, when you see the symptoms, capture another download. Let's see where the data leads.
I'm not aware of NLQ% suddenly going to 0% in 3.18.9.0. I don't recall any issues entered or forum threads about it.
Joe AE6XE
Node A = your local node
Node B = the node you need to arrange physical access to
In mesh status on node A, you see (node B 100% LQ 0% NLQ) in the Neighbor column, correct? As you've indicated, this means node B is not receiving OLSR packets from, or hearing, node A. Node B probably has another link to a third node C. Otherwise, B would not show as a direct neighbor on A, because OLSR has not completed the handshake in both directions to establish a direct link between A and B. When node C relays information about B back to node A in OLSR, B can show up on A as a neighbor. Maybe one of these other weak-link neighbors is node C.
The original deaf channel condition was selective in that the node would stop hearing some neighbors but still hear others. The symptoms would generally take hours to days to appear. On node B, we're looking for the log information in /tmp/rssi.log, which will be in the support download. There is logic to trigger the radio to recalibrate and start over when there are unexpected jumps (up or down) in received signal strengths. If the signal from a given neighbor degrades by more than ~2 standard deviations from its average over the last 1-hour window, the receiver is triggered to recalibrate. This logic has prevented these symptoms from showing up for a couple of years now. I suppose we could change this to a tighter ~1 std dev test on your node. But the data will tell us if we're looking in the right place.
Other possibilities include local interference, which is generally the first suspect. Has anything changed recently at node B's site? Interference was a big factor in deaf channel symptoms. But if the radio is struggling to survive amid other noise or interference, there's not much that can be done except to make the interference go away.
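For illustration only, here is a minimal Python sketch of the kind of rolling-window test described above. It is not the actual rssi_monitor code (which is Perl). The class name RssiWindow, the min_samples warm-up, the use of a population standard deviation, and the choice to add every sample to the history are all assumptions made for the example.

```python
from collections import deque
from statistics import mean, pstdev


class RssiWindow:
    """Rolling window of one-minute signal samples for one neighbor (hypothetical helper)."""

    def __init__(self, size=60, k=2.0, min_samples=10):
        self.samples = deque(maxlen=size)  # last `size` samples, in dB (60 = 1 hour)
        self.k = k                         # how many standard deviations are tolerated
        self.min_samples = min_samples     # assumption: wait for some history first

    def should_recalibrate(self, sample_db):
        """Return True if sample_db is more than k std devs from the window average."""
        trigger = False
        if len(self.samples) >= self.min_samples:
            avg = mean(self.samples)
            sd = pstdev(self.samples)
            # Assumption: jumps in either direction (up or down) count.
            trigger = abs(sample_db - avg) > self.k * sd
        self.samples.append(sample_db)
        return trigger


if __name__ == "__main__":
    window = RssiWindow(size=60, k=2.0)
    for s in [29, 31] * 10:          # steady signal hovering around 30 dB
        window.should_recalibrate(s)
    print("sudden 5 dB drop triggers:", window.should_recalibrate(25))
```

Lowering k (for example from 2.0 to 1.0) makes the test fire on smaller deviations, which is the "tighter" test mentioned above. A very slow drift can still go unnoticed, because the window average drifts along with the signal.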
Joe AE6XE
We're on the same page with the exception that there's no alternate route for Node A. See the attached diagram if you're interested.
I use the A' AirRouter as a convenience to check on the mesh from my house; it is occasionally heard weakly by B. It is not powered up 24/7. Node D faces away from B but is occasionally heard very weakly. Everything is on channel -2, 10 MHz bandwidth.
I have not attempted to access B through C or D - I will try today. If successful I can easily get your support data and trigger a wifi scan and/or reboot.
The only recent change is that Node D was added in October; maybe recent conditions (leaves dropping) are starting to allow it to be heard a little by B. I can always try putting it on another channel.
Joe AE6XE
I captured some support files:
1. While deaf
2. Shortly after reboot
3. Shortly after reboot 2 the other nodes dropped off the mesh status briefly and then came back
4. Ten minutes after reboot.
KD2EVR,
It looks like the condition started shortly after Dec 31 21:30. The log file has data available from Dec 26 to Dec 31. Each day there were a handful of times rssi_monitor triggered the radio to recalibrate. Then it stops taking any action as a couple of signals are attenuated out and only the strong signal remains, although at lower SNR now, and this has continued through today. The logic treated the other signals as interference and made the environment pristine to receive only the remaining strongest signal.
What must be happening at this site is that the conditions are just right (or wrong) and the receiver drifts into this state very slowly, so slowly that the recalibration trigger is never called to prevent the condition. You can make the test more sensitive by changing the threshold that triggers recalibration. To do this, find and edit the following lines in "/usr/local/bin/rssi_monitor":
from: $sdV3 = int(3 * $rssiHist{$_}{"sdV"} + .5);
to: $sdV3 = int(2 * $rssiHist{$_}{"sdV"} + .5);
AND
from: $sdH3 = int(3 * $rssiHist{$_}{"sdH"} + .5);
to: $sdH3 = int(2 * $rssiHist{$_}{"sdH"} + .5);
Change the '3' to a '2' in both lines. The node will then trigger a recalibration when a received signal drops from the rolling average of the Received Signal Strength (RSS) by about 2 standard deviations rather than 3; the "+ .5" inside int() just rounds the result to the nearest whole number. The unit is dB. One-minute samples over the last 1-hour window are used to track the average and standard deviation of the RSS for each neighbor. This data can be found in /tmp/rssi.dat. For dual-polarity devices there are 2 values for AVE and 2 values for STD for each neighbor.
Monitor the file /tmp/rssi.log to see when the trigger is called and how often. You can adjust this threshold calculation until the sensitivity is just right to ensure the condition no longer occurs. For some environmental reason, your 'lucky' site needs a more sensitive threshold.
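As a worked example of the arithmetic, the Python snippet below mirrors the Perl expression int(N * sd + .5) from the lines quoted above and compares the resulting threshold for N = 3 and N = 2. The function name threshold_db and the sample standard deviations are made up for illustration; how rssi_monitor then compares a sample against this threshold is not shown here.

```python
def threshold_db(sd, multiplier):
    """Same arithmetic as the Perl int(multiplier * sd + .5): round to the nearest whole dB."""
    # int() truncates toward zero in both Perl and Python, so adding 0.5 first
    # rounds a non-negative value to the nearest integer.
    return int(multiplier * sd + 0.5)


if __name__ == "__main__":
    # Hypothetical standard deviations (in dB) for one neighbor.
    for sd in (1.0, 1.5, 2.4):
        print(f"sd={sd} dB: 3x -> {threshold_db(sd, 3)} dB, 2x -> {threshold_db(sd, 2)} dB")
```

With a standard deviation of 1.5 dB, for example, the threshold moves from 5 dB to 3 dB, so a smaller drop from the rolling average is enough to trigger a recalibration.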
Joe AE6XE