You math gurus, could use some advise.

seulater · Post by **seulater** » Thu Oct 18, 2012 8:55 pm

I got this sample code that i tweaked a bit to fit my situation.
It is a screen saver for a lcd.
See video of it here to get a better idea of what it is doing.

http://www.youtube.com/watch?feature=pl ... x8SnwvDbKg

In the code below which works fine for me but is brutally slow. Its not the LCD library i am using as i get 25fps.

I have commented a line in the code "// Comment this out and these loops fly. "

for you math guys, does anything jump out at you that could make these loops faster in doing the math.

Thanks for you time.

Code: Select all


#define MAX(a,b) ((a) > (b) ? a : b)
#define MIN(a,b) ((a) < (b) ? a : b)

#define BALL_COUNT	3

struct MetaBall
{
  int x;
  int y;
  int r;
  int rsq;
  int dx;
  int dy;
};
struct MetaBall balls[BALL_COUNT];


float calculateThreshold(struct MetaBall *pBall, int x, int y);
int RLP_MetaBalls(void);


//*********************************************************************
//****************************  TASK 2  *******************************
//*********************************************************************
int RLP_MetaBalls(void)
{
  const int width = 320;
  const int height = 240;
  const int maxRadius = 35;
  const int leftLimit = 80;		//80
  const int rightLimit = width - 80;	//80
  const int topLimit = 60;		//60
  const int bottomLimit = height - 60;	//60
  const int surfaceWidth = rightLimit - leftLimit;
  const int surfaceHeight = bottomLimit - topLimit;

  WORD t=0;

  // Initialize the balls
  for (int i = 0; i < BALL_COUNT; ++i)
  {
    struct MetaBall *pBall = &balls[i];
    pBall->x = (rand() % surfaceWidth) + maxRadius;
    pBall->y = (rand() % surfaceHeight) + maxRadius;
    pBall->r = 20;
    pBall->rsq = pBall->r * pBall->r;
    pBall->dx = (rand() % 4) + 1;
    pBall->dy = (rand() % 4) + 1;
  }

  // Run the animation loop
  for (;;)
  {
    int minX = width;
    int minY = height;
    int maxX = 0;
    int maxY = 0;

    // Move the balls and find the extream ball locations
    for (int i = 0; i < BALL_COUNT; ++i)
    {
      struct MetaBall *pBall = &balls[i];

      if ((pBall->x <= leftLimit && pBall->dx < 0) || (pBall->x >= rightLimit && pBall->dx > 0)) pBall->dx = -pBall->dx;
      if ((pBall->y <= topLimit && pBall->dy < 0) || (pBall->y >= bottomLimit && pBall->dy > 0)) pBall->dy = -pBall->dy;

      pBall->x += pBall->dx;
      pBall->y += pBall->dy;


      minX = MIN(minX, pBall->x);
      minY = MIN(minY, pBall->y);
      maxX = MAX(maxX, pBall->x);
      maxY = MAX(maxY, pBall->y);
    }

    // Calculate bounding rect based on extream ball locations
    minX -= maxRadius;
    minY -= maxRadius;
    maxX += maxRadius;
    maxY += maxRadius;

    // Clear the back buffer
    memset(lcd_ram, 0, width * height * sizeof(short));

    // Iterate the pixels in the bounding rect
    // to calculate the influence of each of the balls
    // on each pixel


    for (int y = minY; y < maxY; y++)
    {

      for (int x = minX; x < maxX; x++)
      {
        // Calculate the sum of influence of each ball on the current pixel
        float sum = 0.0;

        for (int i = 0; i < BALL_COUNT; ++i)
        {
        	int dx = x - balls[i].x;
        	int dy = y - balls[i].y;

        	if (dx != 0 && dy != 0)
        	{
			// Comment this out and these loops fly. 
        		sum += balls[i].rsq / (float)((dx * dx) + (dy * dy));
        	}
        }

        // Check if the pixel should be lit up.
        if (sum > 0.9 && sum < 1.0)
        {
        	lcd_ram[x][y] = 0x001f;
        }

      }
    }

    ShowLcdRam();


  }
  return 0;
}

tod · Post by **tod** » Thu Oct 18, 2012 9:57 pm

I'm not a math guy but I can see a couple of things I would experiment with. First I would instrument this so that I could get some timing values out of the loop without relying on the algorithm producing correct results.
The easiest thing to try would be to de-reference balls.rsq, balls.x and balls.y once outside of the three nested loops since balls[] doesn't appear to be changing inside those loops. (The compiler might be optimizing this for you.)
Speaking of compiler optimization are you using -O2? Did you try -O3?

Then I would try changing sum to an int, get rid of the cast and see how much savings you get doing integer math vs floating point math. This is just for timing as the results won't be correct.
If that shows a worthwhile savings then I would look at scaling it up by an appropriate factor of 10 and keeping the math integer based. You have to make sure you don't end up with overflow. If your ints are so big you cant even scale by 10, you might see if uint64_t based ints are faster than the floating point.

I didn't look that closely at the code but if dx and dy have any kind of weighted distribution of limited values you could pre-calculate a dx-squared and dy-squared lookup table (I'm assuming it would be stored on the heap and you're using a chip with enough RAM to get some benefit) and when dx or dy hit one of the more common values you would just look up the square instead of calculating it.

And you can also unroll the for loop on i and save the time of the branch. I was just trying to optimize some SPI stuff and and I was surprised at the time savings I got (I was optimizing in the hundreds of nanoseconds time frame.)

Also if you know the problem is that one line, you could distill your question down and post on a site like StackOverflow and reach a wider audience.

Just some late at night thoughts.

ecasey · Post by **ecasey** » Fri Oct 19, 2012 5:25 am

You seem to have identified what I would say is the slow part, namely the line that calculates "sum".
Try changing sum to an int and then:

Code: Select all

sum += 100*balls[i].rsq / 100*((dx * dx) + (dy * dy));

Then the compare becomes:

Code: Select all

 if (sum > 90 && sum < 100)

As Tod points out, there is a lot of overhead in handling floats.

It might be even faster if you use binary math by shifting balls.rsq and ((dx * dx) + (dy * dy)) before the divide. The compare would have to change too to get the 90% and 100%.

Ed

seulater · Post by **seulater** » Fri Oct 19, 2012 6:37 am

Speaking of compiler optimization are you using -O2? Did you try -O3?
I have tried both with no observable results.

I tied much of what you said with the exception of changing sum to int value.
I gave this a quick try and needed to go to 64bit, but did not see uint64_t in the lib. uint32_t is there.

I did not post this elsewhere only because i wanted to keep it in a platform dependent forum.
I know architecture make a huge difference from processor to processor. But this code runs smoothly on their processor. Yet on ours running @ 250mhz it could not get out of its own way. I am beyond shocked that its taking it so long to do this floating math.

tod · Post by **tod** » Fri Oct 19, 2012 10:47 am

You have to include <stdint.h> for uint64_t, It's just
typedef unsigned long long uint64_t;

I recently looked up some the of clock cycles for integer math and a division (rems.l) instruction took 38 cycles a multiply (muls.l) took 8 (assuming I'm reading the table correctly). Hoist the three values for balls.rsq out of the loop and pre-calculate their inverse. Then in the loop try multiplying by the inverse. Of course the instructions I'm looking at are for integer operations. Might be interesting to try it on floats and see if there is a similar diff between float multiply and divides. I have no idea how the MAC factors into all this. When I recently took a look at optimizing some code I loaded a debug build and just used source/assembly view to get an idea of what the source was turned into.

joepasquariello · Post by **joepasquariello** » Tue Oct 23, 2012 2:48 pm

Just adding my "amen" to this chorus. In the post by "ecasey" above, he multiplied both the numerator and denominator by 100, and I think he meant that just the numerator should be multiplied by 100. That will result in 100 x greater values, suitable for comparison against 90 and 100, as opposed to 0.9 and 1.0.

I've done benchmarking (data below) of integer versus floating-point math on the 5213, and I think the ratios should be about the same for the newer and faster coldfire chips. For basic math operations, 32-bit integer is about 10 x faster than single-precision floating point. 64-bit integer (unsigned long long) works great, but for multiply and divide is only a little faster than 32-bit floats, so stick with 32-bit integers if you can.

Joe

unsigned long sum = 0;
for (int i = 0; i < BALL_COUNT; ++i)
{
int dx = x - balls.x;
int dy = y - balls.y;
if (dx != 0 && dy != 0)
{
sum += (100 * balls.rsq) / ((dx * dx) + (dy * dy));
}
}

// Check if the pixel should be lit up.
if (sum > 90UL && sum < 100UL)
{
lcd_ram[x][y] = 0x001f;
}

Bench5213 started.
Each time (s) is for 1M ops, so 1 op will be the same value (us).

1M s-p multiply = 6.83 s (22323.310547)
1M s-p divide = 7.25 s (0.999996)
1M s-p add = 3.44 s (1000001.000000)
1M s-p subtract = 3.46 s (1.000000)

1M d-p multiply = 22.12 s (22025.364508)
1M d-p divide = 14.54 s (1.000000)
1M d-p add = 5.21 s (1000001.000000)
1M d-p subtract = 5.36 s (1.000000)

1M UL multiply = 0.38 s (303415308)
1M UL divide = 1.42 s (17467)
1M UL add = 0.32 s (305026216)
1M UL subtract = 0.32 s (3989976004)

1M ULL multiply = 2.05 s (5916646100250)
1M ULL divide = 4.95 s (18090)
1M ULL add = 0.42 s (327175735)
1M ULL subtract = 0.46 s (18446744073382412051)

tony · Post by **tony** » Wed Oct 24, 2012 8:12 pm

You could try using fixed point math. There is a header file on http://sourceforge.net/projects/fixedptc/ that should work for you.

NetBurner Community Forum

You math gurus, could use some advise.

You math gurus, could use some advise.

Re: You math gurus, could use some advise.

Re: You math gurus, could use some advise.

Re: You math gurus, could use some advise.

Re: You math gurus, could use some advise.

Re: You math gurus, could use some advise.

Re: You math gurus, could use some advise.