Starting a new thread to deal with on-going problem. For details, please see last few posts in this thread:
https://www.arednmesh.org/content/bullet-m2hp-xw-firmware
As of Sunday 20-Jan-2019, the Rocket M2 (with 120° sector) on Mt Potosi in the SW corner of the LV area is still unavailable. This is a very important node here (in the short-to-interim timeframe) and this problem has crippled many new mode users who were relying upon linking thru Potosi.
Basically we can "see" the node in WiFi Scan at good signal strength and with the proper SSID etc. It just seems like it won't handshake with other nodes which used to hit it fairly easily. This removes any IP-based options for trouble-shooting.
It almost looks like it flipped into AP mode instead of Mesh mode. Physical access to this node is currently problematic, and I don't have full details yet.
I have two questions now so as to hopefully move forward on this:
1- is there any way to use that MAC address to connect over RF to the node if it's in Mesh operation?
2- if we wanted to try to connect to the node in AP mode, how would we configure one of our operational nodes to contact it (on -2/10), and then could we somehow remotely reboot that node back into proper operation?
TIA,
- Don - AA7AU
The SNR charts must be MAC based as Potosi still shows up with a consistent reading for real-time SNR - so that can't be IP-based.
Just adding another data point - still need HELP!
- Don - AA7AU
On a node receiving the Potosi signal, please grab a support download file. That data includes the output of "iw dev wlan0 station dump", the command that lists the stations the node is connected to over the 802.11 adhoc network. If Potosi is listed in this output, that confirms it is still in mesh mode. If it has an 802.11n adhoc connection, then we'd be looking at the next level, OLSR activity, which exchanges IP addresses and hostnames. This gets a bit more technical, but on your local node, install the tcpdump package, run "tcpdump -i wlan0 port 698" from the command line, and look to see if any data is coming from the Potosi node. If not, then OLSR is not functioning at Potosi.
Joe AE6XE
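For anyone who wants to script the first check, here is a minimal Python sketch that looks for a given MAC in saved "iw dev wlan0 station dump" text. The function names are made up for illustration, and it assumes each peer starts with a "Station <mac> (on <iface>)" header line, as in the output posted in this thread. It only parses text; it does not run iw itself.

```python
import re

# Header line printed for each peer, e.g. "Station dc:9f:db:36:81:99 (on wlan0)".
# Assumption: this header format matches the output shown in this thread.
STATION_RE = re.compile(r"^\s*Station\s+([0-9A-Fa-f:]{17})\s+\(on\s+(\S+)\)")


def listed_stations(dump_text):
    """Return the set of lower-cased MAC addresses found in the dump text."""
    macs = set()
    for line in dump_text.splitlines():
        match = STATION_RE.match(line)
        if match:
            macs.add(match.group(1).lower())
    return macs


def is_station_listed(dump_text, mac):
    """True if `mac` (any letter case) appears as a station in the dump text."""
    return mac.lower() in listed_stations(dump_text)
```

Save the command output to a file on the node, copy it off, and pass the file contents as dump_text along with the MAC seen in WiFi Scan.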
Potosi's MAC is DC:9F:DB:36:81:99 - still shows up in WiFi scans. Here's your data:
root@W7HEN-HARC-M2R90-TDY:~# iw dev wlan0 station dump
Station dc:9f:db:36:81:99 (on wlan0)
inactive time: 350 ms
rx bytes: 2632078556
rx packets: 12549391
tx bytes: 821955257
tx packets: 6158694
tx retries: 4135115
tx failed: 1683
rx drop misc: 1136849
signal: -82 [-85, -85] dBm
signal avg: -82 [-84, -87] dBm
tx bitrate: 19.5 MBit/s MCS 2
rx bitrate: 39.0 MBit/s MCS 10
expected throughput: 13.366Mbps
authorized: yes
authenticated: yes
associated: yes
preamble: long
WMM/WME: yes
MFP: no
TDLS peer: no
DTIM period: 0
beacon interval:100
connected time: 1500991 seconds
What's next?
Potosi remains unresponsive on the mesh IP-layer,
- Don - AA7AU
edited to add: this data is from a node which has NOT rebooted since before Potosi went missing.
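A couple of derived figures make the dump above easier to read. This is just arithmetic on the posted counters (Python used as a calculator); the variable names are mine, and what counts as a "bad" retry rate is not something this thread establishes.

```python
# Values copied from the station dump posted above.
connected_seconds = 1500991
tx_packets = 6158694
tx_retries = 4135115
tx_failed = 1683

SECONDS_PER_DAY = 24 * 60 * 60

# How long the local node has considered this peer connected.
connected_days = connected_seconds / SECONDS_PER_DAY

# Average number of retries per transmitted packet, and the share of
# packets that were reported as failed.
retries_per_packet = tx_retries / tx_packets
failed_fraction = tx_failed / tx_packets

print(f"connected for about {connected_days:.1f} days")
print(f"retries per tx packet: {retries_per_packet:.2f}")
print(f"tx failed: {failed_fraction:.4%} of tx packets")
```

The connected time works out to a little over 17 days, which fits the note that this node has not rebooted since before Potosi went missing.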
You'll need to locally sync with owners of the node to gain access to further investigate. Don KE6BXT is at Quartzsite, not sure about Frank to gain access.
Joe AE6XE
OK, did the tcpdump now: tcpdump -i wlan0 port 698
just cycles thru a few known nodes except not sure about this one;
22:06:46.172362 IP 10.71.178.144.698 > 10.255.255.255.698: OLSRv4, seq 0xfde8, length 60
but no entries for Potosi and its old IP#
Looks like it's not talking .... but it still shows up in WiFi scan this eve.
Frank responded to my earlier email this evening: "... no remote resets available. Physical hill top access is not likely any time soon."
Is there anything else we can try remotely?
Thanks,
- Don - AA7AU
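Rather than eyeballing the scrolling output, the "tcpdump -i wlan0 port 698" capture can be saved to a file and the OLSR sender addresses pulled out. A rough Python sketch follows, assuming the lines look like the one quoted above; the function name is hypothetical, and the IP to check for has to be supplied by whoever knows Potosi's old address.

```python
import re

# Matches lines such as:
# 22:06:46.172362 IP 10.71.178.144.698 > 10.255.255.255.698: OLSRv4, seq 0xfde8, length 60
# Assumption: default tcpdump text output, source port 698, as quoted in this thread.
OLSR_LINE_RE = re.compile(
    r"IP\s+(\d{1,3}(?:\.\d{1,3}){3})\.698\s+>\s+\S+:\s+OLSRv4"
)


def olsr_sources(capture_text):
    """Return the set of source IPs seen sending OLSRv4 packets."""
    sources = set()
    for line in capture_text.splitlines():
        match = OLSR_LINE_RE.search(line)
        if match:
            sources.add(match.group(1))
    return sources
```

If the node's old IP is not in the returned set after a few minutes of capture, that matches the "not talking" conclusion.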
You might try this command to see if any traffic is coming out, "tcpdump -i wlan0 ether host dc:9f:db:36:81:99".
Joe AE6XE
Thanks, Joe!
Ran the "tcpdump -i wlan0 ether host dc:9f:db:36:81:99" and got *nothing* for three minutes of waiting:
root@W7HEN-HARC-M2R90-TDY:~# tcpdump -i wlan0 ether host dc:9f:db:36:81:99
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on wlan0, link-type EN10MB (Ethernet), capture size 262144 bytes
^C
0 packets captured
0 packets received by filter
0 packets dropped by kernel
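If this check gets repeated, the closing summary can be read by a script instead of by eye. A small Python sketch, assuming the three summary lines are worded as shown above; the function names are illustrative only.

```python
import re

# The three counters tcpdump prints on exit, as seen in the output above.
SUMMARY_RE = re.compile(
    r"^(\d+) packets? (captured|received by filter|dropped by kernel)",
    re.MULTILINE,
)


def tcpdump_summary(text):
    """Map each summary label to its packet count."""
    return {label: int(count) for count, label in SUMMARY_RE.findall(text)}


def peer_is_silent(text):
    """True if the capture saw no packets at all for the filter used."""
    return tcpdump_summary(text).get("captured") == 0
```

With the ether host filter, a silent result means no frames to or from that MAC reached the capture on wlan0 during the run.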
Understand you need a support data file from the node using LAN to try to figure this one out. I hope we can get that for you. However, I have no control over who will ultimately end up at the site and perform the power cycle. I will inform Frank (no reply yet to my last email) that this data capture needs to be done to help AREDN going forward.
I think that this Potosi failure certainly makes it real clear to all that a hub-and-spoke network design is a very poor choice - when the central focus stumbles and fails ... and even more so when that central point is only accessible at certain times of the year and then with difficulty. I have five new mesh users here in the HOA-dominated west side of Henderson, all pointed at Potosi (with no other current alternative) who are now deaf! --sigh--
- Don - AA7AU
Don, central site nodes are maintainable if you have alternate ways into them. Most all such sites in the SoCal network have access through a separate channel and usually on a different band. In addition, I would never place a node at a hard-to-get-to site without having a managed PoE switch that you can turn power off/on to each node. It's all in designing to a set of requirements which must include maintainability.
Andre, K6AH
Thanks Andre. Sorry for the overly broad comment. You are absolutely right and AREDN has good guidance on how to properly design networking.
The fellow in charge of this site now writes that they intend to implement a power-cycle type control over another RF access. But for now that node is unavailable; hopefully we'll get that data capture before power-cycle. But, I'm somewhat out of the loop on that.
Up in Idaho, our shoe-string budget sometimes precludes common-sense things like remote control over POE. But we're working on it. Luckily we have mostly a true interconnected mesh up there and the mountain top mesh node is not central to continued operations.
Thanks for all you do,
- Don - AA7AU
Interestingly, a different node here in Vegas seems to be exhibiting this same behavior as of around 2:45 AM this morning. Like Potosi, I see signal strength indicated and its MAC address shows up in a WiFi scan - "????" is shown in the host name. This node is DTD'd on the mountaintop; coming in through that link doesn't work either. Thankfully, this one is easier to physically access.
This is a Rocket M5 XW that has been running a beta build (it's what was available when we installed and didn't want to risk an OTA upgrade). It has been running for ~2 months; probably 30+ days since a reboot.
Any predictions on what it will take to get it running?
root@K7FYI-MKTK5HP-SE:~# iw dev wlan0 station dump
Rick
K7FYI
Has the snow melted yet? I hear people will be skiing in Mammoth until July this year.
Thankfully, this one is only ~3,400' so no snow there. The higher peaks around town... still plenty of white!
Yesterday, on a trip to the mountain, we got the troublesome node back up after a reboot. Unfortunately, I couldn't get into it on the LAN side to get the support data file (...turns out, I had the wrong IP address). While at the site, I upgraded the firmware to 3.19.3.0. Everything was running fine when we left the mountain; when I checked on it from home, I found that it had stopped responding ~30 min earlier. The symptoms seemed to be the same.
Today, we traveled back to the mountain and this time I was able to get the support data file (attached). We now have the PoEs for both mesh nodes at this site on a remotely accessible power switch on another network so we can cycle them without traveling to the site.
Maybe this is a red herring, but here is our only thought on what's related to the sudden crashes of this node: A local user on a 31 mile path has been connecting to this node with a Ubiquiti AirGrid. The AirGrid is running 3.19.3.0 firmware and I previously used it at my house (25 mile path) to connect to the same node without problems. We've noticed that when that device connects now, the node seems less responsive to other users. I don't have any firm metrics on the connection, but know it is marginal at best. Out of an abundance of caution, he powered it off this morning and it remains off.
Hopefully the support data is helpful!
Rick
K7FYI
What I found in the data download was the suspected situation where OLSR is misbehaving -- it's running, but not producing the hostname or routing information fundamental for the node to operate. There is a 'watchdog' script that restarts OLSR, but the watchdog script is not running to perform this step. It is not obvious why this olsr-watchdog script is not running, so I need to dig around a bit to figure out why.
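For readers unfamiliar with the idea, a watchdog is just a loop that probes a service and restarts it after repeated failures. The Python sketch below is a generic illustration only. It is NOT the AREDN olsr-watchdog script, and every name and default in it (probe, restart, max_failures, and so on) is hypothetical.

```python
import time


def watchdog(probe, restart, interval_s=60.0, max_failures=3,
             cycles=None, sleep=time.sleep):
    """Generic watchdog loop (illustrative, not AREDN's implementation).

    probe()   -> True if the service looks healthy, False otherwise.
    restart() -> called after `max_failures` consecutive failed probes.
    cycles    -> number of probe rounds to run; None means run forever.
    Returns the number of restarts performed.
    """
    failures = 0
    restarts = 0
    done = 0
    while cycles is None or done < cycles:
        if probe():
            failures = 0          # healthy again, clear the streak
        else:
            failures += 1
            if failures >= max_failures:
                restart()
                restarts += 1
                failures = 0      # give the restarted service a fresh start
        done += 1
        sleep(interval_s)
    return restarts
```

Note that a loop like this is only useful while it is itself running, which is exactly the failure described above: the watchdog process was not there to do the restart.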
Meanwhile, what is unique about the site(s)? This is the only reported instance, so we're seeing the planets only aligning for your group for some reason. Is there any extra heavy use of:
1) repeatedly refreshing the mesh status on these nodes?
2) continually polling the node over the mesh with sysinfo or other map generating scripts?
3) Is the marginal link continually going up/down, or is it always established and just marginal?
4) other?
Let's connect up separately. KE6BXT previously set up a tunnel to connect in, but I haven't set up my client yet. I'll work to access tomorrow night.
Joe AE6XE
I don't "think" there is anything unique about the site or its use, and it's behaved flawlessly for about 45 days; then these two crashes. I do have a habit of setting mesh status to "Auto", but I'm normally viewing my QTH node in my browser (and I leave the browser window up 24x7).
Two nodes with strong connections are always present; one node running an omni comes and goes (very low LQ; high NLQ, as viewed from the mountaintop site) and I'm not sure how well the AirGrid site was connected in most recently (I was out of town until Friday night).
Iperfspeed is the only service installed; my sense is that it's used very, very infrequently.
What I find interesting is that the Potosi node appears to have the same problem... different hardware (M2 XM, IIRC). It seems odd that something so rare affects two separate mountaintop nodes here in Vegas.
Let me know what you think when you've tunneled in and feel free to contact me offline. Appreciate the help!
Rick
K7FYI
Frank was right - it did take until May. Thanks Frank!
- Don - AA7AU