Issue/Bug in newest release

rsg
Posts: 54
Joined: Thu May 15, 2008 5:36 am


Post by rsg

I seem to have discovered an issue with the newest (Rel 2.5.2) tools. Because of the inter-dependencies between my code and an external $70,000 motion platform, I cannot readily generate a simple test case to submit, so I can only describe the problem in detail. I also have an acceptable work-around, so I'm not in need of help. I'm simply reporting this in the hopes that it may save someone some time, and may eventually help in resolving the root cause.

Earlier this month, I realized that the date of the newest tools, advertised as "Release Date: Jan 24, 2010", was actually 2011 - so I decided to upgrade from Rel24_rc2, which I had been using. Two weeks later, I got a bug report of a missed deadline, which we all attributed to some new code that I had been adding.

We are controlling an external system that sends UDP packets out at 60 Hz, or every ~17 ms; we need to respond with another UDP packet before the next packet is sent - i.e. within ~15 ms, or the external system will exhibit a noticeable "bump" in its trajectory. FWIW - this is a private network - no connections to the Internet or the company intranet.
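
To put numbers on that budget: at 60 Hz the period is 1000/60, or about 16.7 ms, so a ~15 ms response window leaves under 2 ms of slack. The following is a minimal C++ sketch (C++ to match the NetBurner tools) of checking logged receive and transmit times against that window. The `Exchange` struct and `count_missed` function are hypothetical, and the sketch assumes both timestamps come from one common clock, such as a hardware timer or a scope capture.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// At 60 Hz a packet arrives every 1000/60 ~= 16.67 ms.
constexpr double kPeriodMs = 1000.0 / 60.0;
// Response budget quoted above (~15 ms), leaving under 2 ms of margin.
constexpr double kBudgetMs = 15.0;

// One request/response pair; both times in ms on the same clock (hypothetical log format).
struct Exchange {
    double rx_ms;  // when the platform's packet arrived
    double tx_ms;  // when our reply was sent
};

// Count replies sent more than budget_ms after their request arrived.
std::size_t count_missed(const std::vector<Exchange>& log, double budget_ms = kBudgetMs) {
    assert(budget_ms > 0);
    std::size_t missed = 0;
    for (const Exchange& e : log) {
        if (e.tx_ms - e.rx_ms > budget_ms) {
            ++missed;
        }
    }
    return missed;
}
```

Logging both edges for every exchange, rather than only noticing the visible "bump", also shows how close to the limit the cycles that do not fail are running.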

I have a task (call it Feedback) that waits on the UDP packet, and calls OSFlagSet to send a signal to another task (called Command), which is responsible for calculating the next commands and sending the response. This used to work fine with the old tools, but now, with the new tools, something happens to delay the second task. It isn't clear whether a packet is lost, or a flag set is lost, but what definitely happens is a missed Command deadline.
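
On the "lost flag" possibility: an event flag records state, not a count, so if Feedback sets the same flag twice before Command gets to run, Command sees a single event and one cycle appears to vanish even though no packet was dropped. The sketch below is a host-side model, not the NetBurner or uC/OS API; `FlagGroup`, `CountingSem`, and `kPacketReady` are illustrative names, and it assumes OSFlagSet behaves like a typical event flag (setting an already-set bit has no cumulative effect). A counting semaphore is shown for contrast because it remembers every post.

```cpp
#include <cassert>
#include <cstdint>

// Simplified event-flag group: posting sets bits, and setting a bit that is
// already set has no additional effect.
struct FlagGroup {
    std::uint8_t bits = 0;
    void post(std::uint8_t mask) { bits |= mask; }
    // Non-blocking "pend and consume": true (and bits cleared) if all bits in mask were set.
    bool pend_consume(std::uint8_t mask) {
        assert(mask != 0);
        if ((bits & mask) == mask) {
            bits = static_cast<std::uint8_t>(bits & ~mask);
            return true;
        }
        return false;
    }
};

// Counting semaphore: every post is remembered until it is pended.
struct CountingSem {
    unsigned count = 0;
    void post() { ++count; }
    bool pend() {
        if (count == 0) {
            return false;
        }
        --count;
        return true;
    }
};

// Hypothetical bit that Feedback would set when a UDP packet arrives.
constexpr std::uint8_t kPacketReady = 0x01;
```

If two posts arrive before one pend, the flag group reports one event while the semaphore reports two.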

Today, when I hooked up a scope and confirmed the missed deadline, I decided to revisit my priorities. I had originally set the priorities of the Feedback and Command tasks higher than any of my other user tasks, but lower than the network tasks; I figured that I needed to keep the networking as responsive as possible. To (hopefully) prevent a missed flag event, I boosted these priorities to the highest levels; only interrupts are now higher. The problem still persists. To be sure there wasn't a system task that I was unaware of, I tried TaskScan, and found nothing unexpected. Stumped, I started to think about what could be the problem:

1. Only interrupts can preempt my tasks
2. Other tasks can only prevent my tasks from running if they prevent task switching
3. I only have one other task, and it never disables task switching itself
4. So could it be in a system task or interrupt handler?
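
Steps 1 to 3 rest on one property of a priority-preemptive kernel: the highest-priority ready task runs, unless the running task has disabled task switching. A toy C++ model of that rule, using the uC/OS convention that a lower number means a higher priority; `Task` and `next_to_run` are hypothetical, and the model ignores interrupts and time spent inside the kernel itself.

```cpp
#include <cassert>
#include <optional>
#include <vector>

// Toy model of a priority-preemptive scheduler (uC/OS convention:
// a lower number means a higher priority).
struct Task {
    int prio;
    bool ready;
};

// Priority of the task that runs next, or nothing if no task is ready.
// `sched_locked` models the running task having disabled task switching
// (for example with OSLock()): it keeps the CPU even if a higher-priority
// task has become ready. Interrupts are not modelled.
std::optional<int> next_to_run(const std::vector<Task>& tasks, int running_prio,
                               bool sched_locked) {
    if (sched_locked) {
        return running_prio;
    }
    std::optional<int> best;
    for (const Task& t : tasks) {
        if (t.ready && (!best || t.prio < *best)) {
            best = t.prio;
        }
    }
    return best;
}
```

With Main (50) running and Command (46) ready, Command runs next unless Main has locked the scheduler, which is why step 4 points at code that might lock it.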

When I got to 4, I finally remembered I had changed the tools. Rebuilding the application with the old tools eliminates the problem! So, it seems that somewhere, the system has changed...

As for me, I'll stick with the old tools, unless and until this issue is resolved; the old ones are good enough, and they work. I am willing to revisit this or experiment with ideas or fixes, time permitting.
rnixon
Posts: 833
Joined: Thu Apr 24, 2008 3:59 pm

Re: Issue/Bug in newest release

Post by rnixon

Can you provide a list of all tasks and priorities (including the system tasks)?
If you are using UDP, you don't want any task that uses UDP to be a higher priority (lower number) than the system IP task or any other system network related task.
rsg
Posts: 54
Joined: Thu May 15, 2008 5:36 am

Re: Issue/Bug in newest release

Post by rsg

Here is my original priority assignment:

Priority  Task
#63       Idle
#50       Main
#49       User (command receiver)
#47       User (Feedback)
#46       User (Command)
#45       HTTP
#40       TCPD
#39       IP
#38       Enet

I also experimented with making the Feedback and Command priorities 27 and 26, but that didn't change things.
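
rnixon's rule is easy to check mechanically against a priority table. In this hypothetical C++ sketch, the `uses_udp` flags are assumptions drawn from the description in the first post (Feedback receives the UDP packet, Command sends the reply). By that rule the original assignment is fine, while the 27/26 experiment places both UDP tasks above IP (39).

```cpp
#include <cassert>
#include <string>
#include <vector>

// One row of a task priority table (uC/OS convention: lower number = higher priority).
struct TaskPrio {
    std::string name;
    int prio;
    bool uses_udp;  // whether the task sends or receives UDP (an input you supply)
};

// rnixon's rule: no task that uses UDP should have a higher priority (lower
// number) than the system IP task. Returns the names of tasks that break it.
std::vector<std::string> udp_rule_violations(const std::vector<TaskPrio>& table, int ip_prio) {
    std::vector<std::string> offenders;
    for (const TaskPrio& t : table) {
        if (t.uses_udp && t.prio < ip_prio) {
            offenders.push_back(t.name);
        }
    }
    return offenders;
}
```

With IP at 39, Feedback at 47 and Command at 46 pass; at 27 and 26 both are reported.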

FWIW - your advice, "you don't want any task that uses UDP to be a higher priority (lower number) than the system IP task", was what I was thinking in my original assignment - great minds think alike?! ;)
lgitlitz
Posts: 331
Joined: Wed Apr 23, 2008 11:43 am
Location: San Diego, CA

Re: Issue/Bug in newest release

Post by lgitlitz

How often do you lose the packet, and is it easy for you to replicate? Have you modified any of the system files in either version? The first thing I would do is run a merge on C:\nburn\include\constants.h; differences in this file can have a major effect on performance. Are you using any peripherals? Are you using the SSL or SSH libraries?
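
A merge tool is the right way to review constants.h by eye. As a rough complement, here is a hypothetical C++ helper (not part of the NNDK) that lists #define values which differ between two copies of a header. It is deliberately simple: it reads one `#define NAME VALUE` per line, expects no space between `#` and `define`, keeps any trailing comment as part of the value, and does not interpret continuation lines or #if blocks, so treat its output as a starting point for the manual merge.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>
#include <utility>

using Defines = std::map<std::string, std::string>;

// Collect "#define NAME VALUE" lines from a header's text (see simplifications above).
Defines read_defines(const std::string& header_text) {
    Defines defs;
    std::istringstream in(header_text);
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream words(line);
        std::string directive, name, value;
        if (!(words >> directive >> name) || directive != "#define") {
            continue;
        }
        std::getline(words >> std::ws, value);  // rest of the line, may be empty
        defs[name] = value;
    }
    return defs;
}

// Every name whose value differs, or that exists in only one header,
// mapped to (old value, new value).
std::map<std::string, std::pair<std::string, std::string>>
diff_defines(const Defines& old_defs, const Defines& new_defs) {
    const std::string absent = "<absent>";
    std::map<std::string, std::pair<std::string, std::string>> changes;
    for (const auto& [name, value] : old_defs) {
        auto it = new_defs.find(name);
        const std::string& now = (it == new_defs.end()) ? absent : it->second;
        if (now != value) {
            changes[name] = {value, now};
        }
    }
    for (const auto& [name, value] : new_defs) {
        if (old_defs.count(name) == 0) {
            changes[name] = {absent, value};
        }
    }
    return changes;
}
```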

I just did a merge on the full system and mod5270 directories of the latest build and 24_rc2. There are a fair number of changes, but very few that would cause significant added delays. The file I am most suspicious of is ..\nburn\system\arp.cpp. Changes were made to this file to prevent a rare atomicity error; this was corrected by protecting the pointer modification with OSLock. While OS(Un)Lock is functionally correct, it is a bit code-heavy for atomic protection of a variable. I have changed all of the OSLock(); calls to USER_ENTER_CRITICAL(); and all OSUnlock(); calls to USER_EXIT_CRITICAL();. There should be 4 of each. Can you see if this fix works?
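
The shape of the suggested arp.cpp change is to guard only the pointer update with a short enter/exit pair instead of a scheduler lock. The C++ sketch below shows that shape on a PC. It does not model what USER_ENTER_CRITICAL()/USER_EXIT_CRITICAL() actually do on the target: `enter_critical`, `exit_critical`, `Entry`, and `push_front` are hypothetical, and a `std::mutex` merely stands in for the critical section so the code runs on a host.

```cpp
#include <cassert>
#include <mutex>

// Host-side stand-in for a short critical section. On the target the post uses
// USER_ENTER_CRITICAL()/USER_EXIT_CRITICAL(); here a std::mutex plays that role
// so the sketch runs on a PC. These helper names are hypothetical.
static std::mutex g_critical;
inline void enter_critical() { g_critical.lock(); }
inline void exit_critical() { g_critical.unlock(); }

// Hypothetical singly linked list standing in for the shared pointer that
// arp.cpp modifies.
struct Entry {
    int id;
    Entry* next;
};
static Entry* g_head = nullptr;

// Only the read-modify-write of g_head is protected, so the critical section
// stays a few instructions long.
void push_front(Entry* e) {
    assert(e != nullptr);
    enter_critical();
    e->next = g_head;
    g_head = e;
    exit_critical();
}
```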

My next step here would be to run a time-slice profiler to see where the CPU is spending all of its time. I would probably run this in your application built from both versions of the NNDK so you can just look for the differences. The following post has all the code you need for running a time-slice profiler: http://forum.embeddedethernet.com/viewt ... ?f=5&t=711
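
For readers unfamiliar with the technique: a time-slice profiler samples the program counter from a periodic timer interrupt, then tallies which function each sample falls in. The sketch below is not the code from the linked post; it is a hypothetical host-side C++ illustration of the analysis step only, assuming you already have the sampled addresses and a list of function start addresses (for example from the linker map). `bin_samples` attributes each sample to a function, and `share_delta` gives the change in each function's share of samples between two builds, which is the comparison suggested above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>
#include <map>
#include <string>
#include <vector>

// A function's name and start address, e.g. taken from the linker map.
struct Symbol {
    std::uint32_t start;
    std::string name;
};

using Histogram = std::map<std::string, unsigned>;

// Attribute each sampled program counter to the function with the greatest
// start address <= pc (samples above the last symbol are charged to it).
// `symbols` must be sorted by start address.
Histogram bin_samples(const std::vector<Symbol>& symbols, const std::vector<std::uint32_t>& pcs) {
    assert(std::is_sorted(symbols.begin(), symbols.end(),
                          [](const Symbol& a, const Symbol& b) { return a.start < b.start; }));
    Histogram hist;
    for (std::uint32_t pc : pcs) {
        auto it = std::upper_bound(symbols.begin(), symbols.end(), pc,
                                   [](std::uint32_t v, const Symbol& s) { return v < s.start; });
        if (it == symbols.begin()) {
            ++hist["<unknown>"];  // below the first known symbol
        } else {
            ++hist[std::prev(it)->name];
        }
    }
    return hist;
}

static double sample_count(const Histogram& h) {
    double n = 0;
    for (const auto& kv : h) n += kv.second;
    return n;
}

// Change in each function's share of samples, in percentage points
// (positive = the function takes a larger share of CPU time in `after`).
std::map<std::string, double> share_delta(const Histogram& before, const Histogram& after) {
    std::map<std::string, double> delta;
    const double nb = sample_count(before);
    const double na = sample_count(after);
    if (nb == 0 || na == 0) return delta;  // nothing to compare
    for (const auto& [name, count] : after) delta[name] += 100.0 * count / na;
    for (const auto& [name, count] : before) delta[name] -= 100.0 * count / nb;
    return delta;
}
```

Sorting the result of `share_delta` by value puts the functions that grew the most between the two NNDK builds at the top of the list.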

-Larry