Potosi M2Rocket still off-line

Sun, 01/20/2019 - 13:50

AA7AU

Starting a new thread to deal with on-going problem. For details, please see last few posts in this thread:
https://www.arednmesh.org/content/bullet-m2hp-xw-firmware

As of Sunday 20-Jan-2019, the Rocket M2 (with 120° sector) on Mt Potosi in the SW corner of the LV area is still unavailable. This is a very important node here (in the short to interim timeframe), and this problem has crippled many new mesh users who were relying upon linking thru Potosi.

Basically we can "see" the node in WiFi Scan at good signal strength and with the proper SSID etc. It just seems like it won't handshake with other nodes which used to hit it fairly easily. This removes any IP-based options for trouble-shooting.

It almost looks like it flipped into AP mode instead of Mesh mode. Physical access to this node is currently problematic, and I don't have full details yet.

I have two questions now so as to hopefully move forward on this:

1- is there any way to use that MAC address to connect over RF to the node if it's in Mesh operation?

2- if we wanted to try to connect to the node in AP mode, how would we configure one of our operational nodes to contact it (on -2/10), and then could we somehow remotely reboot that node back into proper operation?

TIA,
- Don - AA7AU

Sun, 01/20/2019 - 16:47

AA7AU

Still shows up in Real-Time SNR charts

The SNR charts must be MAC based as Potosi still shows up with a consistent reading for real-time SNR - so that can't be IP-based.

Just adding another data point - still need HELP!

- Don - AA7AU

Sun, 01/20/2019 - 17:09

AE6XE

If you see this node on 10MHz channel width, then it couldn't be in AP mode: no settings were ever defined to put it in that state. There have been occurrences of moisture shorting out the cat5 wiring, which can put a node in firstboot state (same as pressing the remote reset button on a UBNT power brick for ~15 seconds). In that case the node would be in firstboot, but the AP would be on a standard 20MHz channel.

On a node receiving the Potosi signal, please grab a support download file. This data includes the output of a command that shows whether the node is connected to an 802.11 adhoc network, "iw dev wlan0 station dump". If Potosi is listed in this output, that confirms it is still in mesh mode. If it has an 802.11n adhoc connection, then we'd look at the next level, for OLSR activity to exchange IP addresses and hostnames. This gets a bit more technical, but on your local node, install the tcpdump package and from the command line run "tcpdump -i wlan0 port 698" and look to see if any data is coming from the Potosi node. If not, then OLSR is not functioning at Potosi.
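For anyone who would rather script the first check than read the dump by eye, here is a minimal Python sketch. It is not part of the AREDN firmware. It assumes the text of "iw dev wlan0 station dump" has already been captured into a string; the function names and the placeholder MAC address 00:11:22:33:44:55 are made up for illustration.

```python
import re

# Matches the header line iw prints for each peer, for example:
#   Station 00:11:22:33:44:55 (on wlan0)
STATION_RE = re.compile(
    r"^Station\s+([0-9a-fA-F:]{17})\s+\(on\s+(\S+)\)", re.MULTILINE
)


def station_macs(dump_text):
    """Return the set of lower-cased peer MAC addresses found in the dump."""
    return {mac.lower() for mac, _iface in STATION_RE.findall(dump_text)}


def is_peer_listed(dump_text, mac):
    """True if `mac` appears as a station in the dump (case-insensitive)."""
    return mac.lower() in station_macs(dump_text)


if __name__ == "__main__":
    # Placeholder dump text; the MAC below is hypothetical.
    sample = (
        "Station 00:11:22:33:44:55 (on wlan0)\n"
        "        inactive time: 350 ms\n"
    )
    print(is_peer_listed(sample, "00:11:22:33:44:55"))
```

If the peer is listed, the 802.11 adhoc link exists and the problem is more likely at the OLSR level.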

Joe AE6XE

Sun, 01/20/2019 - 22:19

(Reply to #3) #4

AA7AU

Potosi is the first entry in that list

Potosi's MAC is DC:9F:DB:36:81:99 - still shows up in WiFi scans. Here's your data:

root@W7HEN-HARC-M2R90-TDY:~# iw dev wlan0 station dump
Station dc:9f:db:36:81:99 (on wlan0)
        inactive time: 350 ms
        rx bytes:       2632078556
        rx packets:     12549391
        tx bytes:       821955257
        tx packets:     6158694
        tx retries:     4135115
        tx failed:      1683
        rx drop misc:   1136849
        signal:         -82 [-85, -85] dBm
        signal avg:     -82 [-84, -87] dBm
        tx bitrate:     19.5 MBit/s MCS 2
        rx bitrate:     39.0 MBit/s MCS 10
        expected throughput:    13.366Mbps
        authorized:     yes
        authenticated: yes
        associated:     yes
        preamble:       long
        WMM/WME:        yes
        MFP:            no
        TDLS peer:      no
        DTIM period:    0
        beacon interval:100
        connected time: 1500991 seconds

What's next?

Potosi remains unresponsive on the mesh IP-layer,
- Don - AA7AU

edited to add: this data is from a node which has NOT rebooted since before Potosi went missing.
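A quick back-of-the-envelope reading of two counters from the dump above, as a small Python sketch. The numbers are copied from the dump; the helper names are made up for illustration. Dividing "tx retries" by "tx packets" is only one assumed way to read those counters, not a definition taken from the iw documentation.

```python
def retry_ratio(tx_retries, tx_packets):
    """Retries per transmitted packet; 0.0 if nothing was sent."""
    return tx_retries / tx_packets if tx_packets else 0.0


def seconds_to_days(seconds):
    """Convert a duration in seconds to days."""
    return seconds / 86400


if __name__ == "__main__":
    # Values copied from the station dump for dc:9f:db:36:81:99.
    print(f"retries per tx packet: {retry_ratio(4135115, 6158694):.2f}")
    print(f"connected time: {seconds_to_days(1500991):.1f} days")
```

The connected time works out to a bit over 17 days, so the 802.11 link itself has been up continuously for that long.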

Sun, 01/20/2019 - 23:33

(Reply to #4) #5

AE6XE

This says that there is an 802.11 adhoc connection between the Potosi node and this node. It looks like about a 17 dB SNR received signal. The Potosi node is live and making a wireless link. The next step is to run the tcpdump command to see if OLSR is up and sending out hello packets. I suspect there are none, and thus no IP traffic can be exchanged, as there is no routing information.

You'll need to locally sync with owners of the node to gain access to further investigate. Don KE6BXT is at Quartzsite, not sure about Frank to gain access.

Joe AE6XE

Mon, 01/21/2019 - 01:19

AA7AU

TCPDUMP doesn't find Potosi

OK, did the tcpdump now: tcpdump -i wlan0 port 698
just cycles thru a few known nodes except not sure about this one;
22:06:46.172362 IP 10.71.178.144.698 > 10.255.255.255.698: OLSRv4, seq 0xfde8, length 60
but no entries for Potosi and its old IP#

Looks like it's not talking ... but it still shows up in WiFi scan this eve.
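For a longer capture, picking through the output by eye gets tedious. Below is a minimal Python sketch that lists which source IPs sent OLSR packets. It assumes the tcpdump output was saved as text and shows numeric addresses in the same form as the line quoted above (if names are resolved instead, running tcpdump with -n keeps them numeric). The function names and the address 10.0.0.1 are hypothetical stand-ins; Potosi's real IP is not filled in here.

```python
import re

# Example line this pattern is written against:
# 22:06:46.172362 IP 10.71.178.144.698 > 10.255.255.255.698: OLSRv4, seq 0xfde8, length 60
OLSR_LINE_RE = re.compile(
    r"^\S+\s+IP\s+(\d+\.\d+\.\d+\.\d+)\.698\s+>\s+\S+:\s+OLSRv4"
)


def olsr_senders(tcpdump_text):
    """Return the set of source IPs seen sending OLSRv4 packets on port 698."""
    senders = set()
    for line in tcpdump_text.splitlines():
        match = OLSR_LINE_RE.match(line.strip())
        if match:
            senders.add(match.group(1))
    return senders


def heard_from(tcpdump_text, ip):
    """True if `ip` shows up as an OLSR sender in the capture text."""
    return ip in olsr_senders(tcpdump_text)


if __name__ == "__main__":
    capture = (
        "22:06:46.172362 IP 10.71.178.144.698 > 10.255.255.255.698: "
        "OLSRv4, seq 0xfde8, length 60\n"
    )
    print(sorted(olsr_senders(capture)))
    # 10.0.0.1 is a placeholder for the node being looked for.
    print(heard_from(capture, "10.0.0.1"))
```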

Frank responded to my earlier email this evening: "... no remote resets available. Physical hill top access is not likely any time soon."

Is there anything else we can try remotely?

Thanks,
- Don - AA7AU

Mon, 01/21/2019 - 11:00

(Reply to #6) #7

AE6XE

Not much that can be done at this point. The node has a watchdog reset feature for when OLSR stops responding on the node, so how it got into this state is unexplained. I'd want a support data download, which can be obtained from a laptop on the LAN of the node at the site, before rebooting it.

You might try this command to see if any traffic is coming out, "tcpdump -i wlan0 ether host dc:9f:db:36:81:99".

Joe AE6XE

Mon, 01/21/2019 - 12:19

(Reply to #7) #8

AA7AU

Nada

Thanks, Joe!

Ran the "tcpdump -i wlan0 ether host dc:9f:db:36:81:99" and got *nothing* for three minutes of waiting:
root@W7HEN-HARC-M2R90-TDY:~# tcpdump -i wlan0 ether host dc:9f:db:36:81:99
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on wlan0, link-type EN10MB (Ethernet), capture size 262144 bytes
^C
0 packets captured
0 packets received by filter
0 packets dropped by kernel

Understand you need a support data file from the node using LAN to try to figure this one out. I hope we can get that for you. However, I have no control over who will ultimately end up at the site and perform the power cycle. I will inform Frank (no reply yet to my last email) that this data capture needs to be done to help AREDN going forward.

I think that this Potosi failure certainly makes it real clear to all that a hub-and-spoke network design is a very poor choice when the central focus stumbles and fails ... and even more so when that central point is only accessible at certain times of the year, and then with difficulty. I have five new mesh users here in the HOA-dominated west side of Henderson, all pointed at Potosi (with no other current alternative), who are now deaf! --sigh--

- Don - AA7AU

Mon, 01/21/2019 - 12:31

(Reply to #8) #9

K6AH

Hub and Spoke Not Always a Bad Choice

Don, central site nodes are maintainable if you have alternate ways into them. Most all such sites in the SoCal network have access through a separate channel and usually on a different band. In addition, I would never place a node at a hard-to-get-to site without having a managed PoE switch that you can turn power off/on to each node. It's all in designing to a set of requirements which must include maintainability.

Andre, K6AH

Mon, 01/21/2019 - 12:59

(Reply to #9) #10

AA7AU

Apologies for the overly broad comment.

Thanks Andre. Sorry for the overly broad comment. You are absolutely right and AREDN has good guidance on how to properly design networking.

The fellow in charge of this site now writes that they intend to implement a power-cycle type control over another RF access. But for now that node is unavailable; hopefully we'll get that data capture before power-cycle. But, I'm somewhat out of the loop on that.

Up in Idaho, our shoe-string budget sometimes precludes common-sense things like remote control over POE. But we're working on it. Luckily we have mostly a true interconnected mesh up there and the mountain top mesh node is not central to continued operations.

Thanks for all you do,
- Don - AA7AU

Sat, 04/27/2019 - 10:27

#11

K7FYI

Different Node; Same Symptoms

Joe:
Interestingly, a different node here in Vegas seems to be exhibiting this same behavior as of around 2:45 AM this morning. Like Potosi, I see signal strength indicated and its MAC address shows up in a WiFi scan - "????" is shown in the host name. This node is DTD'd on the mountaintop; coming in through that link doesn't work either. Thankfully, this one is easier to physically access.

This is a Rocket M5 XW that has been running a beta build (it's what was available when we installed, and we didn't want to risk an OTA upgrade). It has been running for ~2 months; probably 30+ days since a reboot.

Any predictions on what it will take to get it running?

root@K7FYI-MKTK5HP-SE:~# iw dev wlan0 station dump
Station fc:ec:da:66:bc:37 (on wlan0)
        inactive time: 60 ms
        rx bytes: 769707
        rx packets: 12411
        tx bytes: 0
        tx packets: 0
        tx retries: 0
        tx failed: 0
        rx drop misc: 0
        signal: -69 [-69, -83] dBm
        signal avg: -68 [-68, -82] dBm
        tx bitrate: 6.0 MBit/s
        authorized: yes
        authenticated: yes
        associated: yes
        preamble: long
        WMM/WME: yes
        MFP: no
        TDLS peer: no
        DTIM period: 0
        beacon interval:100
        short slot time:yes
        connected time: 2734 seconds
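
One thing that stands out in this dump is that the rx counters are non-zero while tx bytes and tx packets are both 0. A minimal Python sketch for flagging peers like that in a saved dump follows. The function names are made up for illustration, and treating "rx > 0 and tx == 0" as a sign of a one-way link is an assumption, not an AREDN diagnostic.

```python
def parse_station_dump(dump_text):
    """Parse `iw dev <if> station dump` text into {mac: {field: value}}."""
    stations = {}
    current = None
    for raw in dump_text.splitlines():
        line = raw.strip()
        if not line:
            continue  # tolerate blank lines between fields
        if line.startswith("Station "):
            current = line.split()[1].lower()
            stations[current] = {}
        elif current is not None and ":" in line:
            key, _, value = line.partition(":")
            stations[current][key.strip()] = value.strip()
    return stations


def receive_only_peers(stations):
    """Return MACs of peers we have received from but never sent to."""
    result = []
    for mac, fields in stations.items():
        rx = int(fields.get("rx packets", "0"))
        tx = int(fields.get("tx packets", "0"))
        if rx > 0 and tx == 0:
            result.append(mac)
    return result


if __name__ == "__main__":
    sample = (
        "Station fc:ec:da:66:bc:37 (on wlan0)\n"
        "        rx packets: 12411\n"
        "        tx packets: 0\n"
    )
    print(receive_only_peers(parse_station_dump(sample)))
```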

Rick
K7FYI

Sat, 04/27/2019 - 11:11

(Reply to #11) #12

AE6XE

Rick, at the end of January, a watchdog fix went into the Nightly Build to reset OLSR if it froze up while the process was still running. This must be a slightly older firmware version ... or a different root cause. We really need to capture a support data download (locally from the LAN of the node) before the nodes are rebooted, to confirm the state and what the issue is.

Has the snow melted yet? I hear people will be skiing in Mammoth until July this year.

Sat, 04/27/2019 - 17:45

(Reply to #12) #13

K7FYI

Thanks, Joe - will do. We're trying to arrange a trip up today or tomorrow. I do think the firmware is older than the OLSR fix. In retrospect, I should have risked the OTA upgrade.

Thankfully, this one is only ~3,400' so no snow there. The higher peaks around town... still plenty of white!

Sun, 04/28/2019 - 18:49

#14

K7FYI

Joe: Good news, bad news and good news:

Yesterday, on a trip to the mountain, we got the troublesome node back up after a reboot. Unfortunately, I couldn't get into it on the LAN side to get the support data file (...turns out, I had the wrong IP address). While at the site, I upgraded the firmware to 3.19.3.0. Everything was running fine when we left the mountain; when I checked on it from home, I found that it had stopped responding ~30 min earlier. The symptoms seemed to be the same.

Today, we traveled back to the mountain and this time I was able to get the support data file (attached). We now have the PoEs for both mesh nodes at this site on a remotely accessible power switch on another network so we can cycle them without traveling to the site.

Maybe this is a red herring, but here is our only thought on what's related to the sudden crashes of this node: A local user on a 31 mile path has been connecting to this node with a Ubiquiti AirGrid. The AirGrid is running 3.19.3.0 firmware and I previously used it at my house (25 mile path) to connect to the same node without problems. We've noticed that when that device connects now, the node seems less responsive to other users. I don't have any firm metrics on the connection, but know it is marginal at best. Out of an abundance of caution, he powered it off this morning and it remains off.

Hopefully the support data is helpful!

Rick
K7FYI

Support File Attachments:

supportdata-K7RSW-K7FYI-RM590-West-201904281948.tgz

Sun, 04/28/2019 - 22:44

(Reply to #14) #15

AE6XE

Rick, this is good data and does show an issue. I don't believe the marginal AirGrid link is a root cause, although it could be exacerbating the failure. Any marginal link, particularly if traffic is trying to get across it, will tie up the channel for all the other traffic. With extra handshaking to try and get the traffic across, latency gets worse and VOIP traffic can start losing bits, if bad enough.
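
To put a rough number on how a slow link ties up the channel, here is a small Python sketch of the arithmetic. It is a simplification under stated assumptions: it counts only the time to send the payload bits at a given rate and ignores preambles, ACKs, retries, and contention. The 1500-byte payload is an assumed size, and the 6.0 and 39.0 MBit/s rates are borrowed from station dumps earlier in this thread purely as examples.

```python
def payload_airtime_us(payload_bytes, rate_mbit_s):
    """Microseconds needed to send the payload bits at the given rate.

    Bits divided by Mbit/s gives microseconds. Overhead is ignored.
    """
    return payload_bytes * 8 / rate_mbit_s


if __name__ == "__main__":
    slow = payload_airtime_us(1500, 6.0)   # marginal link at a low rate
    fast = payload_airtime_us(1500, 39.0)  # healthier link at a higher rate
    print(f"6.0 MBit/s:  {slow:.0f} us")
    print(f"39.0 MBit/s: {fast:.0f} us")
    print(f"ratio: {slow / fast:.1f}x")
```

Every frame sent at the low rate occupies the shared channel several times longer than the same frame at the higher rate, and retries on a marginal link multiply that further.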

What I found in the data download was the suspected situation where OLSR is misbehaving: it's running, but not producing the hostname or routing information fundamental for the node to operate. There is a 'watchdog' script that restarts OLSR, but the watchdog script is not running to perform this step. It is not obvious why this olsr-watchdog script is not running, so I need to dig around a bit to figure out why.

Meanwhile, what is unique about the site(s)? This is the only reported instance, so we're seeing the planets only aligning for your group for some reason. Is there any extra heavy use of:

1) repeatedly refreshing the mesh status on these nodes?
2) continually polling the node over the mesh with sysinfo or other map generating scripts?
3) Is the marginal link continually going up/down, or is it always established and just marginal?
4) other?

Let's connect up separately. KE6BXT previously set up a tunnel to connect in, but I haven't set up my client yet. I'll work on getting access tomorrow night.

Joe AE6XE

Sun, 04/28/2019 - 23:02

(Reply to #15) #16

K7FYI

Excellent. I'm glad the data is helpful.

I don't "think" there is anything unique about the site or its use, and it's behaved flawlessly for about 45 days; then these two crashes. I do have a habit of setting mesh status to "Auto", but I'm normally viewing my QTH node in my browser (and I leave the browser window up 24x7).

Two nodes with strong connections are always present; one node running an omni comes and goes (very low LQ; high NLQ, as viewed from the mountaintop site) and I'm not sure how well the AirGrid site was connected in most recently (I was out of town until Friday night).

Iperfspeed is the only service installed; my sense is that it's used very, very infrequently.

What I find interesting is that the Potosi node appears to have the same problem... different hardware (M2 XM, IIRC). It seems odd that something so rare affects two separate mountaintop nodes here in Vegas.

Let me know what you think when you've tunneled in and feel free to contact me offline. Appreciate the help!

Rick
K7FYI

Wed, 05/08/2019 - 18:48

#17

ke6bxt

QST QST QST KE6BXT-N7ZEV-Potosi-M2R-54-129-153 is back on the air.

Wed, 05/08/2019 - 20:29

(Reply to #17) #18

AA7AU

May 8th

Frank was right - it did take until May. Thanks Frank!

- Don - AA7AU
