Questions about select().

Posted: Mon Feb 27, 2012 7:01 am
by fanat9
I use select() and include my only TCP file descriptor in two fd_sets: in read_set, to wake up select() when data arrives, and in error_ds, to wake up select() if something goes wrong with my TCP socket. I also include my custom FD in write_set.
But I don't get remote host disconnects or other TCP errors this way. If I specify a timeout for select() and then check TcpGetLastRxTime(imsSocket), I get the current time, so I can't detect a disconnect this way either. The only way I have managed to get a disconnect error is to try to read from the socket each time select() times out.

I have no connection when I try to send a "keepalive" packet to the remote host, but TcpGetLastRxTime() returns a new value each time I try, which I find strange.

And I would really prefer to sleep in select() forever, until there is data ready to read or any kind of problem with the socket, because sending data to the remote host is a higher priority.

Does anybody have experience with select()? What strategy do you use to identify problems with a TCP socket?
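
For reference, the setup described in this post can be sketched as follows. This is a minimal sketch written against POSIX select(), not the NetBurner API, so that it can run on a desktop system; the NetBurner select() may take different arguments and behave differently. The names wait_for_socket and demo_wait_for_socket are hypothetical. Note that on POSIX systems an orderly close by the peer is normally reported as "readable" (with read() then returning 0), not through the exception set.

```c
#include <assert.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <unistd.h>

enum wait_result { WAIT_FAILED = -1, WAIT_TIMEOUT = 0, WAIT_READABLE = 1, WAIT_EXCEPTION = 2 };

/* Hypothetical helper: block in select() until fd is readable, has an
 * exceptional condition, or timeout_ms elapses. A negative timeout_ms
 * means "wait forever". */
enum wait_result wait_for_socket(int fd, long timeout_ms)
{
    fd_set read_set, error_set;
    struct timeval tv, *tvp = NULL;

    assert(fd >= 0 && fd < FD_SETSIZE);
    FD_ZERO(&read_set);
    FD_ZERO(&error_set);
    FD_SET(fd, &read_set);    /* wake up when data arrives        */
    FD_SET(fd, &error_set);   /* wake up on exceptional condition */

    if (timeout_ms >= 0) {
        tv.tv_sec = timeout_ms / 1000;
        tv.tv_usec = (timeout_ms % 1000) * 1000;
        tvp = &tv;
    }

    int n = select(fd + 1, &read_set, NULL, &error_set, tvp);
    if (n < 0)  return WAIT_FAILED;
    if (n == 0) return WAIT_TIMEOUT;
    if (FD_ISSET(fd, &error_set)) return WAIT_EXCEPTION;
    return WAIT_READABLE;
}

/* Usage sketch on a local socket pair; returns 0 if both steps behave
 * as the comments describe. */
int demo_wait_for_socket(void)
{
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) return -1;

    int idle = wait_for_socket(sv[0], 50);   /* nothing sent: times out */
    ssize_t sent = write(sv[1], "x", 1);     /* peer sends one byte     */
    int ready = wait_for_socket(sv[0], 50);  /* now readable            */

    close(sv[0]);
    close(sv[1]);
    return (idle == WAIT_TIMEOUT && sent == 1 && ready == WAIT_READABLE) ? 0 : 1;
}
```

A silent peer (for example, an unplugged cable with no data in flight) produces only WAIT_TIMEOUT in this sketch, which is the situation the post describes.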

Re: Questions about select().

Posted: Mon Feb 27, 2012 9:02 am
by Chris Ruff
In your application, are you listening or are you a client?
Generally speaking, your problem is a system problem: the client should poke the listener occasionally to reset a timer. If the timer goes off, you scuttle the connection socket and commence to re-listen for the client to reconnect. If one of the connections is legacy, or you can't change the code and there is no way to develop a low-resource keepalive, you will need to continue to fall through select() and take a look around.

Chris

Re: Questions about select().

Posted: Mon Feb 27, 2012 11:04 am
by rnixon
I agree with Chris. To put it another way, there is nothing in the TCP protocol to detect a disconnect the way you are trying to do it. If there is no data that has not yet been acked, then TCP doesn't know whether the other side went away. This is called a half-open socket for a server. Some of the basics:

1. If you are transmitting data and one side crashes, then you will get a socket error because the data will not be acked.
2. A keep-alive, if used properly, will work. It essentially works like item 1, by detecting data that isn't acked, or by the other side sending an RST.
3. Create your own "heartbeat" by sending out a packet to the client every so often, but you would need to be in control of the s/w running on the other side as well.

Re: Questions about select().

Posted: Mon Feb 27, 2012 4:01 pm
by fanat9
Well... I just look at select() as a "high level" function. What I mean by that:

Clearly, to get data ready to read (ready bits set in "read_fds"), a lot of things have to be done. But it's all done behind the scenes (handling Ethernet frames, then the IP protocol on top of Ethernet, and finally TCP). So, on the NetBurner platform (or should I use the word uCOS?), we have the Ethernet driver thread/task with priority 38, the IP task with priority 39, the TCP task with priority 40, and finally the user tasks.

So I naturally expected at least some connection problems to be reported through the error file descriptor set: for example, an unplugged Ethernet cable, a packet from the remote system with the FIN bit set, or "no response from remote host" when the KeepAlive feature is turned on.

Now KeepAlive: yes, to get the status of a TCP connection you have to try to send something to the remote host. But that is part of the TCP protocol. It's called KeepAlive, which is basically an ACK segment with no new data sent to the remote system, and the remote system has to ACK it the same way it ACKs any other packet. So it can be used without any modifications to the remote system, as long as it complies with TCP.
And I use this socket option on MS Windows (http://msdn2.microsoft.com/en-us/library/ms740476.aspx) and Linux (http://tldp.org/HOWTO/html_single/TCP-Keepalive-HOWTO/). And I don't have to manually check TcpGetLastRxTime(): I just get error events/signals if the remote system doesn't respond.
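
For comparison, enabling keepalive on Linux looks roughly like the sketch below. It assumes the Linux option names TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT (other systems name or expose these differently), and the function names are hypothetical. It says nothing about how NetBurner handles keepalive. The second function gives the approximate time of silence before the kernel reports a dead peer: the idle period plus one interval per unanswered probe.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: turn on TCP keepalive for fd using Linux option
 * names. Returns 0 on success, -1 on failure. */
int enable_keepalive(int fd, int idle_s, int interval_s, int probes)
{
    int on = 1;

    assert(idle_s > 0 && interval_s > 0 && probes > 0);
    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof on) != 0) return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle_s, sizeof idle_s) != 0) return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &interval_s, sizeof interval_s) != 0) return -1;
    if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &probes, sizeof probes) != 0) return -1;
    return 0;
}

/* Approximate seconds of silence before a dead peer is reported:
 * the idle period plus one interval per unanswered probe. */
int keepalive_detect_seconds(int idle_s, int interval_s, int probes)
{
    return idle_s + interval_s * probes;
}

/* Usage sketch: enable keepalive on a fresh socket and read the flag back.
 * Returns 0 if the option was set. */
int demo_keepalive(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) return -1;

    int rc = enable_keepalive(fd, 60, 10, 5);
    int on = 0;
    socklen_t len = sizeof on;
    if (rc == 0 && getsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, &len) != 0) rc = -1;

    close(fd);
    return (rc == 0 && on != 0) ? 0 : 1;
}
```

With 60 seconds idle, a 10 second interval and 5 probes, keepalive_detect_seconds() gives 110 seconds. Once the probes go unanswered, the next read on the socket fails, and select() reports the descriptor as ready so that the failure can be picked up.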

I searched for where SetHaveError() is used in the tcp.cpp file and found just two places: socket_struct::DoClose() and socket_struct::ClearPending().

I don't know what to do... I don't really want to start modifying system stuff.

Re: Questions about select().

Posted: Mon Feb 27, 2012 6:52 pm
by rnixon
Hi,

Your mention of modifying system files makes me think my comments were misread. I have some additional comments below. In no way am I suggesting changing a system file. Quite the opposite: I am suggesting adding code around your select() to handle the errors that TCP passes up. I see from your comments that you are comparing it to Linux and Windows, but those are significantly larger systems, and there may be some differences on a small embedded platform.
fanat9 wrote: Well... I just look at select() as a "high level" function. What I mean by that:

Clearly, to get data ready to read (ready bits set in "read_fds"), a lot of things have to be done. But it's all done behind the scenes (handling Ethernet frames, then the IP protocol on top of Ethernet, and finally TCP). So, on the NetBurner platform (or should I use the word uCOS?), we have the Ethernet driver thread/task with priority 38, the IP task with priority 39, the TCP task with priority 40, and finally the user tasks.

So I naturally expected at least some connection problems to be reported through the error file descriptor set: for example, an unplugged Ethernet cable, a packet from the remote system with the FIN bit set, or "no response from remote host" when the KeepAlive feature is turned on.

Only those errors reported by standard TCP, not including keepalive, are passed up. I understand you are saying it should be "standard" or included, but it is not. So unplugging a cable, an RST, or other problems will only be reported to select() if data is in transit when one of those events occurs.

fanat9 wrote: Now KeepAlive: yes, to get the status of a TCP connection you have to try to send something to the remote host. But that is part of the TCP protocol. It's called KeepAlive, which is basically an ACK segment with no new data sent to the remote system, and the remote system has to ACK it the same way it ACKs any other packet. So it can be used without any modifications to the remote system, as long as it complies with TCP.

I think keepalive is a good thing to have, and I did mention it in my previous post. However, it is optional in the TCP RFC, and following the RFC the default period is no less than 2 hours. Not very useful, so most implementations make it much shorter. Again, I am *not* disagreeing with you that keepalive is a good thing to have, and that it would be great to have select() automatically include it. I'm just trying to give you some info to implement it with the NetBurner system and make it very clear that I am not recommending changing system files.

**********
Under Transmission Control Protocol (TCP) keepalives are an optional feature, and if included must default to off.[1] The keepalive packet contains null data. In an Ethernet network, a keepalive frame length is 60 bytes, while acknowledge to this, also a null data frame, is 54 bytes. There are three parameters related to keepalive:

Keepalive time is the duration between two keepalive transmissions in idle condition. TCP keepalive period is required to be configurable and by default is set to no less than 2 hours.

Keepalive interval is the duration between two successive keepalive retransmissions, if acknowledgement to the previous keepalive transmission is not received.
**********

fanat9 wrote: And I use this socket option on MS Windows (http://msdn2.microsoft.com/en-us/library/ms740476.aspx) and Linux (http://tldp.org/HOWTO/html_single/TCP-Keepalive-HOWTO/). And I don't have to manually check TcpGetLastRxTime(): I just get error events/signals if the remote system doesn't respond.

No argument here either, but they are different operating systems.

fanat9 wrote: I searched for where SetHaveError() is used in the tcp.cpp file and found just two places: socket_struct::DoClose() and socket_struct::ClearPending().

I don't know what to do... I don't really want to start modifying system stuff.

Have you looked at the keepalive example? It seems like you could have the select() timeout process the keepalive for your connections. It depends on how long you can go for detection, but if you only need to detect a dead connection when data is not being transmitted, doing it once every 5 minutes or so might work. This doesn't require any system mods. Just suggesting a possible solution.

Re: Questions about select().

Posted: Mon Feb 27, 2012 7:55 pm
by fanat9
rnixon wrote: Only those errors reported by standard TCP, not including keepalive, are passed up. I understand you are saying it should be "standard" or included, but it is not. So unplugging a cable, an RST, or other problems will only be reported to select() if data is in transit when one of those events occurs.

My point: if the Ethernet task has already detected the cable being unplugged, why not set it in errors_fds? Or if the remote system sent a close-connection request, our TCP thread responded to it, and the connection is closed, why not set errors_fds? And so on.
KeepAlive is the same, except that it's an optional feature. But still, if it is enabled for a particular socket, why not make the TCP task responsible for it, meaning it handles KeepAlive and sets error_fds in case of any problems, just as it's implemented on other platforms?

rnixon wrote: Have you looked at the keepalive example? It seems like you could have the select() timeout process the keepalive for your connections. It depends on how long you can go for detection, but if you only need to detect a dead connection when data is not being transmitted, doing it once every 5 minutes or so might work. This doesn't require any system mods. Just suggesting a possible solution.

It works when I use it with ReadWithTimeout(), but not with select(). After a select() timeout I use the same piece of code to send a keep-alive packet using TcpSendKeepAlive() and read the timestamp of the last successfully sent packet, but for some reason TcpGetLastRxTime() returns an updated time even with no working connection to the remote system. I have to try to read() the socket after the select() timeout, and in that case I get a negative return, which means something is wrong with the socket.
But then again, I didn't try to send anything, just read, and the system already knows that we have no connection. And if it already knows, why not set errors_fds so it can be detected by select()?
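
The "try to read after a select() timeout" workaround can be sketched on a POSIX system as a non-blocking peek, which asks the stack what it already knows without consuming any data. This is an assumption-laden illustration, not NetBurner code: MSG_PEEK and MSG_DONTWAIT are POSIX/Linux recv() flags, and probe_socket and demo_probe_socket are hypothetical names. It only reveals a close or reset the local stack has already seen; a peer that vanished silently still looks idle.

```c
#include <assert.h>
#include <errno.h>
#include <sys/socket.h>
#include <unistd.h>

enum probe_result { PROBE_ERROR = -1, PROBE_CLOSED = 0, PROBE_DATA = 1, PROBE_IDLE = 2 };

/* Hypothetical helper: peek at the socket without blocking and without
 * consuming data, to learn what the local stack already knows. */
enum probe_result probe_socket(int fd)
{
    char byte;
    ssize_t n;

    assert(fd >= 0);
    n = recv(fd, &byte, 1, MSG_PEEK | MSG_DONTWAIT);
    if (n > 0)  return PROBE_DATA;     /* data is waiting to be read */
    if (n == 0) return PROBE_CLOSED;   /* peer closed its side       */
    if (errno == EAGAIN || errno == EWOULDBLOCK)
        return PROBE_IDLE;             /* nothing pending, no error  */
    return PROBE_ERROR;                /* e.g. connection reset      */
}

/* Usage sketch on a local socket pair; returns 0 if each step behaves
 * as the comments describe. */
int demo_probe_socket(void)
{
    int sv[2];
    char byte;
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0) return -1;

    int idle = probe_socket(sv[0]);        /* nothing sent yet       */
    ssize_t sent = write(sv[1], "x", 1);
    int data = probe_socket(sv[0]);        /* one byte pending       */
    ssize_t got = read(sv[0], &byte, 1);   /* consume it             */
    close(sv[1]);                          /* peer goes away         */
    int closed = probe_socket(sv[0]);      /* close is now visible   */
    close(sv[0]);

    return (idle == PROBE_IDLE && sent == 1 && data == PROBE_DATA &&
            got == 1 && closed == PROBE_CLOSED) ? 0 : 1;
}
```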