by Spike » Tue May 08, 2012 7:10 pm
#define makevec4(a,b,c,d,out) (((long long *)(out))[0] = *(long long *)&(a), ((long long *)(out))[1] = *(long long *)&(c)) /* assumes a,b and c,d are adjacent floats */
mwahaha. evil though.
the function call overhead will kill you if that stays a real function call. the sse inside is also less likely to be optimised by the compiler (that is, it'll never no-op elements that could be trivially overwritten due to known values).
sse2 at least struggles with dotproducts.
enabling auto-sse2 optimisations actually reduced framerates for me with fte - quake has a lot of dotproducts for things like culling etc.
as mh says, sse is good for bulk stuff like 'multiply this vec4 by 4, now add 2'. It's also quite good for memcpys, if you can avoid alignment issues and non-multiple-of-16 data blocks.
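To make the 'multiply by 4, now add 2' case concrete, here is a rough sketch using the standard SSE intrinsics from xmmintrin.h. The function name scale_bias_sse and its parameters are made up for illustration and are not from any engine. It uses unaligned loads and stores to sidestep the alignment issue, and falls back to a scalar loop for a trailing block that isn't a multiple of 4 floats.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: out[i] = in[i] * 4 + 2 for `count` floats.
   Unaligned loads/stores are used, so the buffers do not need to be
   16-byte aligned. */
static void scale_bias_sse(const float *in, float *out, size_t count)
{
    const __m128 four = _mm_set1_ps(4.0f);  /* 4 in all four lanes */
    const __m128 two  = _mm_set1_ps(2.0f);  /* 2 in all four lanes */
    size_t i;

    /* bulk part: four floats per iteration */
    for (i = 0; i + 4 <= count; i += 4) {
        __m128 v = _mm_loadu_ps(in + i);
        v = _mm_add_ps(_mm_mul_ps(v, four), two);
        _mm_storeu_ps(out + i, v);
    }

    /* scalar tail for the leftover 0..3 floats */
    for (; i < count; i++)
        out[i] = in[i] * 4.0f + 2.0f;
}
```

If the buffers are known to be 16-byte aligned, the aligned _mm_load_ps/_mm_store_ps variants could be swapped in. Whether that actually helps is something to measure rather than assume.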
However, while 'multiply each element in this vec3' is fast enough, adding those 3 elements together is more painful than doing the whole thing with just x87 instructions, and that's your basic dotproduct that is used all over the place in quake!
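For comparison, a minimal sketch of both versions of a vec3 dotproduct, assuming plain SSE intrinsics only (no sse3/sse4 horizontal instructions). The names dot3_scalar and dot3_sse are illustrative and not taken from fte or quake. The multiply is a single instruction, but note how many shuffles it takes just to add the three lanes together afterwards.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Plain scalar dotproduct, the way quake-style code usually writes it. */
static float dot3_scalar(const float a[3], const float b[3])
{
    return a[0]*b[0] + a[1]*b[1] + a[2]*b[2];
}

/* SSE version: one multiply, then shuffles to sum across the lanes. */
static float dot3_sse(const float a[3], const float b[3])
{
    /* _mm_set_ps takes lanes high-to-low; the 4th lane is padded with 0
       so we never read past the end of a vec3. */
    __m128 va = _mm_set_ps(0.0f, a[2], a[1], a[0]);
    __m128 vb = _mm_set_ps(0.0f, b[2], b[1], b[0]);

    __m128 m  = _mm_mul_ps(va, vb);    /* lanes: x*x', y*y', z*z', 0 */
    __m128 hi = _mm_movehl_ps(m, m);   /* lane0 = z*z', lane1 = 0 */
    __m128 s  = _mm_add_ps(m, hi);     /* lane0 = x*x'+z*z', lane1 = y*y' */
    __m128 y  = _mm_shuffle_ps(s, s, _MM_SHUFFLE(1, 1, 1, 1)); /* broadcast lane1 */

    s = _mm_add_ss(s, y);              /* lane0 = x*x' + z*z' + y*y' */
    return _mm_cvtss_f32(s);           /* pull lane0 back out as a float */
}
```

The two versions add the products in a different order, so results can differ in the last bit for general inputs. Getting the result back out of the sse register and into whatever code uses it is a further cost the scalar version doesn't pay.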
maybe sse3 fixes something? I've not checked; all I know is that optimising for sse2 was a loss for me, but then I don't have a software renderer. Benchmark it!