[MPlayer-dev-eng] [OT] C-code Optimization Contest
Arpi
arpi at thot.banki.hu
Wed Jul 16 13:44:21 CEST 2003
Hi,
> > - rek_multiply(a,b,c, 0, 0, 0,0, 0,0, DIM);
> > + rek_multiply(a,b,c, 0, 0, 0,0, 0,0, NUM);
>
> Oh, the solution I posted was the one where DIM=NUM=512. Later, when I
> set DIM to 516 as I said, I of course changed the call too (and
> moved the loop variables i,j,k into rek_multiply). I also changed
> the limit at which it falls back to the simple matrix multiplication.
> It now looks like this:
>
> #define START_SIMPLE 4
I also made such changes, and more: I changed arrays+zeile+splate to
pointers, so it's simpler and faster, along with other tricks:
/* simple matrix multiply */
#include "multiply_d.h"

/* c += a*b for one 16x16 block; all three matrices use row stride DIM.
   The size argument is unused while SIZE is hard-coded below. */
static inline void n_multiply(double* a, double* b, double* c, int size){
    int i,j,k;
#define SIZE 16
//#define SIZE size
    for(i=0; i<SIZE; i++) {
        double* b2=b;   /* b2 walks down b, four rows per k step */
        for(k=0; k<SIZE; k+=4) {
            /* unrolled: 4 values of k and 8 columns of c per pass */
            for(j=0; j<SIZE; j+=8) {
                c[j] += a[k] * b2[j] + a[k+1] * b2[DIM+j] + a[k+2] * b2[2*DIM+j] + a[k+3] * b2[3*DIM+j];
                c[j+1] += a[k] * b2[j+1] + a[k+1] * b2[DIM+j+1] + a[k+2] * b2[2*DIM+j+1] + a[k+3] * b2[3*DIM+j+1];
                c[j+2] += a[k] * b2[j+2] + a[k+1] * b2[DIM+j+2] + a[k+2] * b2[2*DIM+j+2] + a[k+3] * b2[3*DIM+j+2];
                c[j+3] += a[k] * b2[j+3] + a[k+1] * b2[DIM+j+3] + a[k+2] * b2[2*DIM+j+3] + a[k+3] * b2[3*DIM+j+3];
                c[j+4] += a[k] * b2[j+4] + a[k+1] * b2[DIM+j+4] + a[k+2] * b2[2*DIM+j+4] + a[k+3] * b2[3*DIM+j+4];
                c[j+5] += a[k] * b2[j+5] + a[k+1] * b2[DIM+j+5] + a[k+2] * b2[2*DIM+j+5] + a[k+3] * b2[3*DIM+j+5];
                c[j+6] += a[k] * b2[j+6] + a[k+1] * b2[DIM+j+6] + a[k+2] * b2[2*DIM+j+6] + a[k+3] * b2[3*DIM+j+6];
                c[j+7] += a[k] * b2[j+7] + a[k+1] * b2[DIM+j+7] + a[k+2] * b2[2*DIM+j+7] + a[k+3] * b2[3*DIM+j+7];
            }
            b2+=DIM*4;
        }
        c+=DIM; a+=DIM;   /* next row of c and a */
    }
}

/* c += a*b, recursing on the four quadrants of each matrix */
static void rek_multiply(double* a, double* b, double* c, int size)
{
    int dsize;
    /* halve the block; stop recursing once the quadrants are 16x16 */
    size/=2;          /* side length of one quadrant */
    dsize=DIM*size;   /* pointer offset of the lower quadrants */
    if(size==16){
        n_multiply(a,b,c, size);
        n_multiply(a+size,b+dsize,c, size);
        n_multiply(a,b+size,c+size, size);
        n_multiply(a+size,b+dsize+size,c+size, size);
        n_multiply(a+dsize,b,c+dsize, size);
        n_multiply(a+size+dsize,b+dsize,c+dsize, size);
        n_multiply(a+dsize,b+size,c+dsize+size, size);
        n_multiply(a+dsize+size,b+dsize+size,c+dsize+size, size);
    } else
    {
        rek_multiply(a,b,c, size);
        rek_multiply(a+size,b+dsize,c, size);
        rek_multiply(a,b+size,c+size, size);
        rek_multiply(a+size,b+dsize+size,c+size, size);
        rek_multiply(a+dsize,b,c+dsize, size);
        rek_multiply(a+size+dsize,b+dsize,c+dsize, size);
        rek_multiply(a+dsize,b+size,c+dsize+size, size);
        rek_multiply(a+dsize+size,b+dsize+size,c+dsize+size, size);
    }
}

void multiply(double a[][DIM], double b[][DIM], double c[][DIM]) {
    rek_multiply(a[0],b[0],c[0], NUM);
}
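
For anyone who wants to sanity-check the quadrant offsets used above, here is a minimal sketch, not the contest code. It compares a recursive multiply that uses the same eight offset combinations against a plain triple loop. Since multiply_d.h is not shown, SK_DIM and SK_BASE are assumed stand-ins for DIM and the 16x16 cutoff, and every sk_* name is hypothetical. Unlike rek_multiply, it tests the size before halving and uses the plain loop as its base case instead of the unrolled kernel.

```c
#include <assert.h>

#define SK_DIM  64   /* assumed row stride, stands in for DIM */
#define SK_BASE 16   /* block size at which the recursion stops */

/* Reference: c += a*b for an n x n block with row stride SK_DIM. */
static void sk_naive(const double *a, const double *b, double *c, int n)
{
    int i, j, k;
    for (i = 0; i < n; i++)
        for (k = 0; k < n; k++)
            for (j = 0; j < n; j++)
                c[i * SK_DIM + j] += a[i * SK_DIM + k] * b[k * SK_DIM + j];
}

/* Recursive version: split each matrix into four quadrants. */
static void sk_rec(const double *a, const double *b, double *c, int n)
{
    int h, dh;
    assert(n > 0);
    if (n <= SK_BASE) {
        sk_naive(a, b, c, n);
        return;
    }
    h  = n / 2;        /* side length of one quadrant */
    dh = SK_DIM * h;   /* offset of the lower quadrants */
    sk_rec(a,          b,          c,          h);  /* C00 += A00*B00 */
    sk_rec(a + h,      b + dh,     c,          h);  /* C00 += A01*B10 */
    sk_rec(a,          b + h,      c + h,      h);  /* C01 += A00*B01 */
    sk_rec(a + h,      b + dh + h, c + h,      h);  /* C01 += A01*B11 */
    sk_rec(a + dh,     b,          c + dh,     h);  /* C10 += A10*B00 */
    sk_rec(a + dh + h, b + dh,     c + dh,     h);  /* C10 += A11*B10 */
    sk_rec(a + dh,     b + h,      c + dh + h, h);  /* C11 += A10*B01 */
    sk_rec(a + dh + h, b + dh + h, c + dh + h, h);  /* C11 += A11*B11 */
}

/* Fill a and b with small integers, multiply both ways and return the
   largest absolute difference between the two results. */
double sk_check(void)
{
    static double a[SK_DIM * SK_DIM], b[SK_DIM * SK_DIM];
    static double c1[SK_DIM * SK_DIM], c2[SK_DIM * SK_DIM];
    double worst = 0.0;
    int i;
    for (i = 0; i < SK_DIM * SK_DIM; i++) {
        a[i] = (double)(i % 7 - 3);
        b[i] = (double)(i % 5 - 2);
        c1[i] = 0.0;
        c2[i] = 0.0;
    }
    sk_naive(a, b, c1, SK_DIM);
    sk_rec(a, b, c2, SK_DIM);
    for (i = 0; i < SK_DIM * SK_DIM; i++) {
        double d = c1[i] - c2[i];
        if (d < 0.0) d = -d;
        if (d > worst) worst = d;
    }
    return worst;
}
```

The inputs are small integers, so every partial sum is exactly representable in a double and the two results should agree exactly even though the summation order differs. With real-valued data a small tolerance would be needed instead.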
Note: I got the best results (around 11x) when switching to the normal
matrix multiply at 16x16; both 8x8 and 32x32 ran slower. By the way, it's
still slower than my P4 pointer-magic version, which runs at ~19x.
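
The cutoff comparison can be repeated with something like the sketch below. It is only a rough harness under stated assumptions: TB_DIM, the tb_* names and the use of clock() are all hypothetical choices, the base case is a plain loop rather than the unrolled kernel, and the timings depend on compiler flags and CPU, so it will not reproduce the 11x figure.

```c
#include <assert.h>
#include <stdio.h>
#include <time.h>

#define TB_DIM 128   /* assumed matrix size and row stride (power of two) */

static double tb_a[TB_DIM * TB_DIM], tb_b[TB_DIM * TB_DIM], tb_c[TB_DIM * TB_DIM];

/* Base case: c += a*b for an n x n block with row stride TB_DIM. */
static void tb_block(const double *a, const double *b, double *c, int n)
{
    int i, j, k;
    for (i = 0; i < n; i++)
        for (k = 0; k < n; k++)
            for (j = 0; j < n; j++)
                c[i * TB_DIM + j] += a[i * TB_DIM + k] * b[k * TB_DIM + j];
}

/* Recursive multiply with a run-time cutoff instead of a fixed 16. */
static void tb_rec(const double *a, const double *b, double *c, int n, int base)
{
    int h, dh;
    if (n <= base) {
        tb_block(a, b, c, n);
        return;
    }
    h  = n / 2;
    dh = TB_DIM * h;
    tb_rec(a,          b,          c,          h, base);
    tb_rec(a + h,      b + dh,     c,          h, base);
    tb_rec(a,          b + h,      c + h,      h, base);
    tb_rec(a + h,      b + dh + h, c + h,      h, base);
    tb_rec(a + dh,     b,          c + dh,     h, base);
    tb_rec(a + dh + h, b + dh,     c + dh,     h, base);
    tb_rec(a + dh,     b + h,      c + dh + h, h, base);
    tb_rec(a + dh + h, b + dh + h, c + dh + h, h, base);
}

/* Run one multiply with the given cutoff. Returns elapsed CPU seconds and
   stores the trace of the result in *trace as a cheap correctness check. */
double tb_time(int base, double *trace)
{
    clock_t t0, t1;
    int i;
    assert(base > 0 && base <= TB_DIM);
    for (i = 0; i < TB_DIM * TB_DIM; i++) {
        tb_a[i] = (double)(i % 7 - 3);
        tb_b[i] = (double)(i % 5 - 2);
        tb_c[i] = 0.0;
    }
    t0 = clock();
    tb_rec(tb_a, tb_b, tb_c, TB_DIM, base);
    t1 = clock();
    *trace = 0.0;
    for (i = 0; i < TB_DIM; i++)
        *trace += tb_c[i * TB_DIM + i];
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}

/* Print one line per cutoff; the traces should all be identical. */
void tb_report(void)
{
    static const int bases[] = { 8, 16, 32 };
    int i;
    for (i = 0; i < 3; i++) {
        double trace;
        double secs = tb_time(bases[i], &trace);
        printf("base %2d: %.4f s (trace %.0f)\n", bases[i], secs, trace);
    }
}
```

A single 128x128 multiply is too short for clock() to resolve reliably; for real measurements one would raise TB_DIM or repeat the call many times and take the best run.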
A'rpi / Astral & ESP-team
--
Developer of MPlayer G2, the Movie Framework for all - http://www.MPlayerHQ.hu