SBL2e fragmented receive data

Post by **pbreed** » Sat May 08, 2021 2:40 pm

Somehow, when you post code on a NetBurner forum, one might assume, unless it was explicitly stated otherwise, that this is code running on a NetBurner... not on a Chrome box. I asked VERY explicitly on May 5th for you to send me the code running on the NetBurner...
Problem 1) You seem to assume that the entire message block will arrive in one TCP chunk. TCP is a stream protocol, not a packet protocol, so a 15-byte batch of bytes could easily show up as 15 single-byte reads.
If you need your result as a single message, not having packetization on the receive end is a fundamental understanding/coding error...
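
To illustrate the stream-versus-packet point, here is a minimal sketch of receive-side packetization. It is written in Python only for brevity and is not the code from this thread. The `LineFramer` class, its method names, and the sample reply string are all hypothetical, and it assumes each reply ends in a LF.

```python
from typing import List


class LineFramer:
    """Accumulates arbitrary TCP chunks and hands back complete messages."""

    def __init__(self, terminator: bytes = b"\n"):
        self.terminator = terminator
        self._buf = bytearray()

    def feed(self, chunk: bytes) -> List[bytes]:
        """Add newly received bytes; return every message completed by them."""
        self._buf.extend(chunk)
        messages = []
        while True:
            idx = self._buf.find(self.terminator)
            if idx < 0:
                break  # no complete message yet, keep the partial data
            end = idx + len(self.terminator)
            messages.append(bytes(self._buf[:end]))
            del self._buf[:end]
        return messages


# Hypothetical 15-byte reply, delivered as 15 single-byte reads...
reply = b"12.345,0.678 V\n"
framer = LineFramer()
one_byte_reads = []
for i in range(len(reply)):
    one_byte_reads.extend(framer.feed(reply[i:i + 1]))

# ...or delivered as one read. Both produce the same framed message.
single_read = LineFramer().feed(reply)
print(one_byte_reads == single_read)
```

The caller never needs to know how the bytes were split on the wire; it only acts when `feed` returns a complete message.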

Problem 2) You have built a 1 second (960 msec) delay into your TCP receive and are asking me to fix a 1 second delay????
Fundamentally, you should be using some kind of select loop for your receive, and exit the loop when you see a whole packet.
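
One way such a select loop might look, again as a Python sketch rather than anyone's actual code. `recv_until` and its parameters are hypothetical names, the 2-second overall timeout is an arbitrary assumption, and the LF terminator is assumed from the discussion. The local socket pair at the bottom only stands in for the real TCP connection.

```python
import select
import socket
import time


def recv_until(sock, terminator=b"\n", timeout=2.0):
    """Read until the terminator has been seen, the peer closes, or the
    overall timeout expires. Returns whatever was received."""
    buf = bytearray()
    deadline = time.monotonic() + timeout
    while terminator not in buf:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # overall timeout expired
        readable, _, _ = select.select([sock], [], [], remaining)
        if not readable:
            break  # nothing arrived before the deadline
        chunk = sock.recv(4096)
        if not chunk:
            break  # peer closed the connection
        buf.extend(chunk)
    return bytes(buf)


# Stand-in for the device: the reply is written in two separate pieces.
a, b = socket.socketpair()
b.sendall(b"1.2")
b.sendall(b"34\n")
print(recv_until(a))
```

The loop returns as soon as the terminator shows up, so there is no fixed delay on the normal path; the timeout only matters when the reply really is incomplete. The same `select` and `recv` calls exist in C on Linux.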

Now, you can force the SBL2e to send ALL data in a single packet following a LF.
This will force the data going on the wire to be a single packet...
Use the custom packetization settings described earlier...
This may fix your issue, but it's completely legal for the receive-side TCP, depending on buffers and various other internals, to give your reads the result in multiple pieces EVEN if it arrived in one packet.

If you really want "packets" then use UDP, but understand it's also legal for the networking system to throw away UDP packets without notification. So if you need reliable, guaranteed comms, use TCP and packetize your result.

Now, Chromebooks are a bit of a weird... beast... To really be sure where the delay is, you need to look at a Wireshark trace. Hooking both the Chromebook and the SBL2e up to an Ethernet switch you can monitor, and capturing the packets, is the only 100% sure way to know where the delay is.

There are socket options on the Chrome end that can clean things up.
(Turning off Nagle, i.e. setting TCP_NODELAY, on the Chrome end with socket options will likely help you here.)
Is programming the Chromebook basically gcc/C++ on Linux?
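
For reference, a minimal sketch of setting that socket option on the client end, shown in Python for brevity. `enable_nodelay` is a hypothetical helper name, and the commented-out connect call uses placeholder values. `TCP_NODELAY` disables Nagle's algorithm only for data this socket sends; it does not change how the other end batches its own transmissions.

```python
import socket


def enable_nodelay(sock):
    """Disable Nagle's algorithm on a TCP socket and confirm it took effect."""
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
print(enable_nodelay(sock))
# sock.connect((DEVICE_IP, DEVICE_PORT))  # placeholders, not real values
sock.close()
```

In C on Linux the equivalent is a `setsockopt` call with `IPPROTO_TCP` and `TCP_NODELAY`.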

So fixing your receive logic to not have ANY delays and to wait for the trailing LF would be a good thing no matter what else...

Post by **mrburns42** » Sat May 08, 2021 7:35 pm

All you seem capable of doing is attacking me and criticizing me. You do not even read what I post.

First, I am interfacing to a Chroma Load Box. I am not talking to a Chromebook, which you keep mentioning. Chroma is a company that makes test equipment. A Load Box is a variable power load that can be controlled either from the front panel or, in the older model that I have, via an RS232 port.

Second, I understand that TCP may divide the data across multiple packets. The 950 ms delay was implemented precisely to ensure that it would be impossible for any packets to still be pending. Instead of reading my explanation, you keep falsely saying that I requested that you change the code to accommodate that delay. I explained all this and even drew a flowchart to show the reason for the code.

In the case of the Chroma Load Box, the terminator character could provide a workaround to flush the data. However, for others reading these posts, what about systems that do not have a terminator and have random lengths of data? In those scenarios, what I am experiencing would result in loss of data or misplaced data.
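
For readers following along, the idle-gap approach for data with no terminator can be sketched as below. This is an illustration only, not the code from this thread: `recv_until_idle` and `max_total` are hypothetical names, and only the 0.95-second quiet period is taken from the delay mentioned above. Note its limitation: a quiet period cannot distinguish "the message is finished" from "the sender has stalled", which is exactly the failure mode described here.

```python
import select
import socket
import time


def recv_until_idle(sock, idle=0.95, max_total=10.0):
    """Collect bytes until the line has been quiet for `idle` seconds,
    the peer closes, or `max_total` seconds have passed."""
    buf = bytearray()
    deadline = time.monotonic() + max_total
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # hard upper bound reached
        readable, _, _ = select.select([sock], [], [], min(idle, remaining))
        if not readable:
            break  # quiet period elapsed, treat the message as complete
        chunk = sock.recv(4096)
        if not chunk:
            break  # peer closed the connection
        buf.extend(chunk)
    return bytes(buf)


# Local socket pair standing in for the real connection.
a, b = socket.socketpair()
b.sendall(b"abc")
print(recv_until_idle(a, idle=0.1))
```

The idle timer restarts after every chunk, so a slow but steady sender is read in full, while every message pays the full quiet period at the end.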

Post by **KE5FX** » Sun May 09, 2021 10:21 pm

Not an NB employee, and I just skimmed the thread, but: TCP_NODELAY is almost certainly your problem (or rather the lack of it.) Everybody who ever does anything with TCP/IP has to learn this lesson the hard way, it seems.

To blame the right person, you need a time machine. Don't forget your gun, future generations of network programmers will thank you.

Post by **mrburns42** » Tue May 11, 2021 6:56 am

Attached are screen shots of the SBL2e Configuration Web pages. Just so it is abundantly clear, this is running the stock firmware that ships with the device. The only items that I changed are shown in the screen shots.

I attempted to get a Wireshark capture. However, that is difficult since the communication is from an embedded Linux device to the SBL2e via an Ethernet switch. The Ethernet switch is routing the traffic only between those two ports. I attempted an ARP flood to force the Ethernet switch into broadcast mode. However, that showed only the traffic from the Linux device to the SBL2e and not the return traffic, which is the desired info. If I can locally find a managed switch with port mirroring, I will try again.

The system consists of an Ethernet switch, one or more embedded Linux devices, the SBL2e, and a Windows PC with two Ethernet ports. The Windows PC is running the DHCP server. The Windows PC is booted first and never rebooted. The purpose of the Windows PC is only to provide a GUI interface to this local network of 169.254.0.XX. I found that the problem is dependent on the boot order of devices in the system. If the SBL2e is booted last, then the problem will occur. However, if the SBL2e is booted first, before the Ethernet switch or the embedded Linux boxes, then the problem is not found. I ran an overnight test with the SBL2e as the first boot device and had zero problems. After that, I manually power cycled the SBL2e so that it booted with the Ethernet switch already running, and the problem was quickly seen.

After recreating the problem with an SBL2e power cycle, I removed the SBL2e Ethernet cable from the Ethernet switch for a few seconds and re-attached it. That disconnect and reconnect was enough to make the problem go away. Those steps imply to me that there is some necessary initialization code within the SBL2e that only gets executed when the Ethernet link is down. If the SBL2e is powered first, it would find the Ethernet link down. If the SBL2e is powered last, then the link is already active during its boot.

It should be noted that the SBL2e is at a fixed IP address and only one piece of software on the embedded Linux side knows what the fixed IP address is. That software is not part of the Linux OS and only runs by me manually executing it from a console. So there is nothing that would create any kind of race condition on booting.

Post by **mrburns42** » Tue May 11, 2021 11:15 am

Attached is a Wireshark capture of the issue. I also included a screenshot of the TCP stream at the problem point. The Chroma Load Box is on Port 24. There is also a Fluke 45 meter on Port 23, which will appear in the capture as well. After a significant wait with no further receive packets to provide the missing terminator, I send "0x20, 0x20, 0x0A", which is two spaces and a linefeed. That combination is ignored by the Chroma itself, which keeps the Chroma in sync with what my code is doing. After that combination was sent, the final three characters popped up on the receive side.
Sometimes, like this instance, only one set of "0x20, 0x20, 0x0A" is required. However, other times up to three sets must be sent before I get the final data and terminator back. Note: to see the extra bogus commands, you must view in HEX as all of the three characters are whitespace.
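
The wake-up workaround described above could be sketched roughly as follows. This illustrates the described behaviour and is not the poster's actual code: `read_reply_with_nudge`, the 3-second wait, and the cap of three nudges are assumptions (the thread only says "a significant wait" and "up to three sets"). The local socket pair stands in for the connection to the SBL2e.

```python
import select
import socket

# Two spaces and a linefeed, which the thread says the Chroma ignores.
NUDGE = bytes([0x20, 0x20, 0x0A])


def read_reply_with_nudge(sock, wait=3.0, max_nudges=3):
    """Read until a LF arrives. If the line goes quiet for `wait` seconds
    first, send the whitespace nudge, at most `max_nudges` times."""
    buf = bytearray()
    nudges = 0
    while True:
        readable, _, _ = select.select([sock], [], [], wait)
        if readable:
            chunk = sock.recv(4096)
            if not chunk:
                break  # peer closed the connection
            buf.extend(chunk)
            if b"\n" in buf:
                break  # terminator received
        elif nudges < max_nudges:
            sock.sendall(NUDGE)  # try to wake the stalled sender
            nudges += 1
        else:
            break  # give up and return the partial reply
    return bytes(buf), nudges


# Healthy case: the full reply is already waiting, so no nudge is sent.
a, b = socket.socketpair()
b.sendall(b"1.234\n")
print(read_reply_with_nudge(a, wait=0.2))
```

Returning the nudge count alongside the data makes it easy to log how often the stall occurs during a long test run.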

A correction to the previous post. Disconnecting the Ethernet cable will not always fix the issue. I had one instance where the problem remained. However, thus far, placing the SBL2e first in the boot order keeps the problem from occurring.

Post by **mrburns42** » Thu May 13, 2021 6:36 am

When running the SBL2e with the default settings that it ships with, it sends back the data with only one byte per TCP packet. While very inefficient, that would not be an issue except that the SBL2e sometimes gets stuck at the end of the serial transmission and fails to send the last couple of characters. Attached is a Wireshark capture that shows it. Also included is a screenshot of the Wireshark decode of the problem area. In the screenshot, I placed a green diamond at the last good packet. The SBL2e sent it at 21.125275 and my end acknowledged it at 21.127316. At that point the SBL2e stopped sending data. My code waited several seconds and then sent three wake-up "whitespace" messages. At the orange diamond, the SBL2e finally realizes that data remains and sends it at 26.728712.

In this case only a single character remained, the linefeed terminator. In other cases, a couple of characters remained. In those cases, the final packet from the SBL2e will have all the remaining data and not individual bytes. The attached case is the simplest one. In the cases with more than one remaining character, the SBL2e appears to send a packet out of order. My side sends back another ACK of the last good packet to signal what it is missing. The SBL2e ignores those duplicate ACKs and the data transmission stalls again until I send the wake-up "whitespace".

This problem does not appear for me if the SBL2e has the "Enable custom packetization logic" check box selected. If I select that option and leave everything else at the defaults, then it sends multiple bytes per packet, and the problem was not seen on an extended overnight run. The User Manual does not document the operation when this check box is in its default, unchecked state. For any others with intermittent data problems on the SBL2e, I would recommend enabling those boxes, even if you want what appears to be the default settings.

Post by **pbreed** » Thu May 13, 2021 9:15 am

Please help me understand how this is connected:

Is this correct?

Chroma Load box <RS-232> SBL2e <TCP> <What is this device?>

What is the network topology between the SBL2e and whatever is connecting to it?
In the final device on the end of the TCP connection, what code is running there?
Have you set the TCP_NODELAY socket option there?

Is there a wireless connection involved?
If the ACK going back to the SBL2e is getting lost, you would see exactly what you are seeing...
The fact that you are seeing out-of-order packets says something strange is going on with the network.

If you do a flood ping from the end device to the SBL2e, what sort of rate/loss do you see?

Post by **TomNB** » Thu May 13, 2021 9:19 am

I think the <What is this device?> is his ARM processor running his custom code. It is an NXP processor running Linux, with an Ethernet connection (no Wi-Fi).

Post by **mrburns42** » Thu May 13, 2021 12:32 pm

Attached is a diagram of the connections. The NXP ARM processor needs to control the Chroma Load Box and a Fluke 45 meter. (Both instruments show this problem.) The ARM processor has no available ports to attach these instruments to. That is why the SBL2e is used. I have swapped the SBL2e onto three different HW setups. All of my HW show the same issue. So, the problem is not the Ethernet cabling, the Ethernet switch, the ARM processor or the Windows PC, as all have been swapped.

The ping that is included with the Linux build of the ARM processor does not include a flood option. If I run a normal ping and then terminate I get the output shown below. I believe the network is stable since the ARM processor actually boots over Ethernet by fetching files from the Windows PC.
I can also ping from the Windows PC to the SBL2e with no lost packets.

TCP_NODELAY is set on the Linux side. My analysis of the Wireshark capture that I sent is that the SBL2e was sent the ACK for the next-to-last packet, and then the SBL2e did not send the last packet until after I woke it back up with serial TX requests.

# ping 169.254.0.55
PING 169.254.0.55 (169.254.0.55): 56 data bytes
64 bytes from 169.254.0.55: seq=0 ttl=64 time=1.244 ms
64 bytes from 169.254.0.55: seq=1 ttl=64 time=1.216 ms
64 bytes from 169.254.0.55: seq=2 ttl=64 time=1.193 ms
64 bytes from 169.254.0.55: seq=3 ttl=64 time=1.187 ms
64 bytes from 169.254.0.55: seq=4 ttl=64 time=1.185 ms
64 bytes from 169.254.0.55: seq=5 ttl=64 time=1.181 ms
64 bytes from 169.254.0.55: seq=6 ttl=64 time=1.201 ms
64 bytes from 169.254.0.55: seq=7 ttl=64 time=1.187 ms
64 bytes from 169.254.0.55: seq=8 ttl=64 time=1.185 ms
64 bytes from 169.254.0.55: seq=9 ttl=64 time=1.189 ms
64 bytes from 169.254.0.55: seq=10 ttl=64 time=1.184 ms
64 bytes from 169.254.0.55: seq=11 ttl=64 time=1.187 ms
64 bytes from 169.254.0.55: seq=12 ttl=64 time=1.183 ms
64 bytes from 169.254.0.55: seq=13 ttl=64 time=1.186 ms
64 bytes from 169.254.0.55: seq=14 ttl=64 time=1.205 ms
64 bytes from 169.254.0.55: seq=15 ttl=64 time=1.186 ms
64 bytes from 169.254.0.55: seq=16 ttl=64 time=1.197 ms
64 bytes from 169.254.0.55: seq=17 ttl=64 time=1.187 ms
64 bytes from 169.254.0.55: seq=18 ttl=64 time=1.190 ms
64 bytes from 169.254.0.55: seq=19 ttl=64 time=1.188 ms
64 bytes from 169.254.0.55: seq=20 ttl=64 time=1.187 ms
64 bytes from 169.254.0.55: seq=21 ttl=64 time=1.192 ms
64 bytes from 169.254.0.55: seq=22 ttl=64 time=1.186 ms
64 bytes from 169.254.0.55: seq=23 ttl=64 time=1.185 ms
64 bytes from 169.254.0.55: seq=24 ttl=64 time=1.189 ms
^C
--- 169.254.0.55 ping statistics ---
25 packets transmitted, 25 packets received, 0% packet loss

Post by **mrburns42** » Thu May 13, 2021 7:52 pm

I changed my code to pause for 60 seconds after detecting a missing terminator instead of sending the whitespace wake-up TCP commands. Once I saw the printout that it was in this 60-second wait, I sent ping requests from a separate console window. These pings were coming from an entirely separate OS instance. The original TCP connection was still open and still waiting. Attached is a screenshot of the Wireshark capture. You can see the last packet was successful and then the SBL2e just sat there until the pings arrived eight seconds later. After the pings woke it up, it sent the final packet back to the original TCP connection.
