intermittent hanging
We're experiencing an intermittent problem where our board seems to hang after power-up. We haven't been able to get the problem to reproduce reliably, but we see it quite frequently when we power on 20 boards (about 1 out of every 4 power cycles will have a single non-responsive board out of the 20, seemingly randomly selected).
On power-up, our app should spawn our highest priority tcp thread which listens on a port, accepts a connection, and reads from the connection. It also spawns a lower priority thread that sends UDP heartbeats every second. While testing this problem, all we were doing was powering the boards up and checking in WireShark for all 20 UDP heartbeats, then powering them back down and repeating the process (no TCP connections were attempted on the PC side). We noticed several behaviors when we got one to fail:
We were not able to open a TCP connection from the PC side. In a separate incident, we saw one fail, looked at the NetBurner's Ethernet port, and noticed that no LEDs were illuminated. We unplugged the Ethernet cable and plugged it back in, and then everything came back up and worked fine (LEDs lit/blinked, heartbeats sent, and we were able to establish a TCP connection).
So I'm just looking for any suggestions/ideas on what I can do or try. I can't say for sure where the code is hanging...and our board doesn't even have a serial port for spitting out debug info or LEDs for...blinking info (worst design decision ever). I have an eval kit that I can run the code on, but it seems nearly impossible to replicate this problem on a single board.
I compile on NNDK 2.6.3 and target the MOD5272.
Re: intermittent hanging
What is the other end of the ethernet cable plugged into? Is the power to that device cycled at the same time?
Re: intermittent hanging
They are all connected to a single ethernet switch, and it is not power cycled.
Not sure if this makes a difference, but we have all 20 boards configured to use hard-coded static IPs, so the startup order is: InitializeStack(), GetInterfaceBlock(), set the interface block's netIP, netIpMask, netIpGate, and netDNS, OSChangePrio, EnableAutoUpdate, initialize some mailboxes, OSTaskCreate tcpMonitor, OSTaskCreate heartbeatMonitor.
Re: intermittent hanging
Attached is a condensed version of code that replicates the problem (and I verified this many times). The test code sends a heartbeat every 2 seconds and when you open a tcp session on 192.168.13.4 port 51200 it will echo back anything received. Note that to test this out on 20 cards, I just had to change the hard coded address 20 times and rebuild the binary.
I also saw a different behavior when the problem arose: The ethernet port LEDs were on constantly/solid (as opposed to before when they were both constantly off). Unplugging the ethernet cable and plugging back in still got the program running again.
All suggestions are welcome.
Attachment: main.cpp (9.19 KiB)
Re: intermittent hanging
I'm wondering if this is a power supply issue. The reason is that the link lights are controlled by the Ethernet PHY. The PHY is essentially a state machine that runs independently of the processor to negotiate the link. So the fact that the LEDs stay on or off all the time would indicate to me that somehow the current supply or rise time of the power supply is causing the PHY chip to lock up or go into an undefined state. It should not have anything to do with the code you have running on the processor.
Or, it could be that the switch you are using doesn't like all the links coming up at the same time. Maybe take one unit that constantly fails and disconnect all the others from the switch.
Re: intermittent hanging
That all sounds plausible to me. For the power-supply rise-time scenario, is there a way to perform a hard-reset on the PHY chip? I'm thinking a work-around might be to delay during the processor startup (giving the power-supply time to stabilize), then I can issue a reset on the PHY and initialize the stack.
Or maybe there is a way I can do a run-time check to see if the PHY is hosed and then issue a reset.
Re: intermittent hanging
Maybe a check for link, and if there is none, do a soft reset? Not sure how to do a hard reset of just the PHY.
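A minimal sketch of that "check for link, reset if it stays down" idea, written as portable C++ so it can be tried off the board. Everything here is an assumption for illustration: LinkWatchdog is a hypothetical helper, not an NNDK class, and the link probe and reset action are passed in as callbacks. On the board, the probe would presumably wrap EtherLink(), and the reset action would be whatever soft reset the platform provides.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical helper, not part of the NNDK: counts consecutive polls that
// report "no link" and fires a reset action once a threshold is reached.
class LinkWatchdog {
public:
    LinkWatchdog(std::function<bool()> linkUp,
                 std::function<void()> resetAction,
                 int maxMisses)
        : linkUp_(std::move(linkUp)),
          resetAction_(std::move(resetAction)),
          maxMisses_(maxMisses) {
        assert(maxMisses_ > 0);
    }

    // Call once per polling interval (for example, from a low priority task).
    // Returns true if this call triggered the reset action.
    bool poll() {
        if (linkUp_()) {
            misses_ = 0;       // link is healthy, start counting afresh
            return false;
        }
        if (++misses_ < maxMisses_) {
            return false;      // tolerate brief dropouts and slow negotiation
        }
        misses_ = 0;           // avoid resetting again on the very next poll
        resetAction_();
        return true;
    }

private:
    std::function<bool()> linkUp_;
    std::function<void()> resetAction_;
    int maxMisses_;
    int misses_ = 0;
};
```

The threshold matters because link negotiation takes some time after power-up, so a single "no link" reading should not trigger a reset on its own.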
Re: intermittent hanging
The EtherLink() function seems to always be returning false....no matter what. Is there another way to do this?
Re: intermittent hanging
I see something in my release notes on 2.6.7 stating it fixed a callback function issue for a change in link status. Not sure if that has anything to do with the system call for link, or if the callback is for some other use. I have not used it. You need that to work. I think the factory program shows link status on the serial port. I would try that first. If that does not work, I would update the tools or contact NetBurner support to get it resolved.
Re: intermittent hanging
I was mistaken; I just wasn't giving the EtherLink() function long enough to return true. My startup code now pends until EtherLink() returns true before spawning the heartbeat thread. I also removed the TCP thread altogether just to see if it mattered (just monitoring for UDP heartbeats now). It did not matter, as I am still seeing the problem.
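One caveat with pending on the link at startup: if the PHY really is locked up, an unbounded wait never returns. A bounded wait at least lets the startup code fall through to a recovery path. The sketch below is portable C++ under stated assumptions: waitForLink is a hypothetical helper rather than an NNDK call, the link probe stands in for EtherLink(), and the pause callback stands in for whatever RTOS delay the board uses.

```cpp
#include <cassert>
#include <functional>

// Hypothetical helper, not an NNDK call: polls `linkUp` up to `maxPolls`
// times, calling `pause` between attempts. Returns true as soon as the link
// is reported up, false if it never comes up within the allowed polls.
bool waitForLink(const std::function<bool()>& linkUp,
                 const std::function<void()>& pause,
                 int maxPolls) {
    assert(maxPolls > 0);
    for (int i = 0; i < maxPolls; ++i) {
        if (linkUp()) {
            return true;       // link came up, safe to start network tasks
        }
        if (i + 1 < maxPolls) {
            pause();           // wait before the next probe, not after the last
        }
    }
    return false;              // caller decides: retry, reset, or carry on
}
```

A false return would be the point to try a reset or some other recovery step instead of spawning the heartbeat thread against a dead link.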