qbismSuper8 builds

by **qbism** » Tue May 08, 2012 4:48 am

Not really a release, but a test build:
The exe is optimized for MMX, SSE, SSE2, and SSE3 (Pentium 4/ Athlon 64). Hopefully it will run on a wider range of older machines, and I didn't notice any slowdown compared to the icore-optimized build.
Framerate is improved in fog due to fewer depth samples. It tests every other pixel, but still blends them all.
Also added SSE vector math posted by Reckless here. I don't know how much faster it is, but it's probably not slower.
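
A minimal sketch of the "test every other pixel, blend them all" idea for one 8-bit scanline. It is not taken from the qbismSuper8 source: `fog_span`, `fog_level`, `FOG_LEVELS`, the flat `fogmap` lookup table and the depth-to-level mapping are all hypothetical names and assumptions made for illustration.

```c
#define FOG_LEVELS 4   /* hypothetical number of fog strengths */

/* Hypothetical mapping from a 16-bit depth sample to a fog level
   in the range 0..FOG_LEVELS-1 (here: the top two bits). */
static int fog_level(unsigned short z)
{
    return z >> 14;
}

/* Fog one scanline of palettised pixels in place.
   fogmap is assumed to be a FOG_LEVELS x 256 colour lookup table,
   stored flat as fogmap[level * 256 + colour]. */
static void fog_span(unsigned char *pix, const unsigned short *zbuf,
                     int width, const unsigned char *fogmap)
{
    int level = 0;
    int x;

    for (x = 0; x < width; x++) {
        if ((x & 1) == 0)
            level = fog_level(zbuf[x]);      /* read depth on even pixels only */
        pix[x] = fogmap[level * 256 + pix[x]]; /* but still blend every pixel */
    }
}
```

Odd pixels reuse the fog level of the pixel to their left, so the depth buffer is read half as often while every pixel still goes through the blend table.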

by mh » Tue May 08, 2012 6:00 pm

I doubt if you're going to get much from SSE-izing those operations - where SSE works best is on long chains of float4 data. Using SSE instructions on single float3s seems highly unlikely to give anything extra.

by **revelator** » Tue May 08, 2012 6:42 pm

by **Spike** » Tue May 08, 2012 7:10 pm

#define makevec4(a,b,c,d,out) ((long long *)out)[0] = *(long long *)&a,((long long *)out)[1] = *(long long *)&c
mwahaha. evil though.
the function call overhead will kill you with your function. the sse inside is less likely to be directly optimised (that is, it'll never no-op elements that can be trivially overwritten due to known values).

sse2 at least struggles with dotproducts.
enabling auto-sse2 optimisations actually reduced framerates for me with fte - quake has a lot of dotproducts for things like culling etc.

as mh says, sse is good for bulk stuff like 'multiply this vec4 by 4, now add 2'. It's also quite good for memcpys, if you can avoid alignment issues and non-multiple-of-16 data blocks.

However, while 'multiply each element in this vec3' is fast enough, adding those 3 elements together is more painful than doing the whole thing with just x87 instructions, and that's your basic dotproduct that is used all over the place in quake!
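
To make the point concrete, here is a rough sketch comparing a plain scalar dot product with a hand-written SSE one. The function names `dot3_scalar` and `dot3_sse` are made up for this example, and it assumes an x86 compiler that provides `<xmmintrin.h>`. The multiply is a single instruction, but summing the lanes takes a move, a shuffle and two adds.

```c
#include <xmmintrin.h>  /* SSE1 intrinsics */

/* Plain scalar version: three multiplies, two adds. */
static float dot3_scalar(const float *a, const float *b)
{
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}

/* SSE version. The vec3s are packed into 4 lanes by hand so that
   nothing is read past the end of a 3-float array. */
static float dot3_sse(const float *a, const float *b)
{
    __m128 va = _mm_set_ps(0.0f, a[2], a[1], a[0]);
    __m128 vb = _mm_set_ps(0.0f, b[2], b[1], b[0]);
    __m128 m  = _mm_mul_ps(va, vb);       /* lanes: x*x, y*y, z*z, 0 */

    /* Horizontal sum: this is the part that costs extra work. */
    __m128 hi = _mm_movehl_ps(m, m);      /* lanes 0,1 <- lanes 2,3 of m */
    __m128 s  = _mm_add_ps(m, hi);        /* lane0 = x+z, lane1 = y+0 */
    __m128 y  = _mm_shuffle_ps(s, s, _MM_SHUFFLE(1, 1, 1, 1));
    s = _mm_add_ss(s, y);                 /* lane0 = x+z+y */
    return _mm_cvtss_f32(s);
}
```

Whether this wins or loses against the scalar version depends on the compiler and CPU, so it is worth timing both in the actual engine.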

maybe sse3 fixes something? I've not checked, just that optimising for sse2 was a loss for me, but then I don't have a software renderer. Benchmark it!

by mh » Tue May 08, 2012 8:32 pm

Auto-SSE made DirectQ slower too. The only place I remain using SSE is in my matrix operations, which D3D does automatically for me anyway (yayyy! no work!) and which are called so relatively few times by comparison to so much other more important stuff, so it doesn't really count for much in the general case (useful for IQM though, even if I do have to pad them to 4x4 - no big deal).

I believe that the DarkPlaces matrix library is set up to be friendly for compiler auto-vectorization, but I've already got the D3D library and it works natively with the rest of the code, so why bother? (I even use this library with OpenGL where possible - it works, I know it, I'm comfortable with it, why not?)

Everywhere else I get much more performance by putting this kind of calculation on the GPU where it counts. Not an option for a software engine of course.......

(Random mad thought - use CUDA/OpenCL/DirectCompute to accelerate these ops but still render in software).

by **Spike** » Tue May 08, 2012 9:04 pm

gcc actually has a built in vector type. You can do your c = a*b; etc stuff with it and it'll automatically be converted to sse or altivec (depending on cpu, obviously) for you without any of the unreadable gibberish intrinsic names.
explicit vector types allow the compiler to align things properly, etc, which will help save a few cycles even with auto-vectorisation, and because it's not instruction-set-specific, you can always change the cpu target and you get a working build with the same code for an entirely different cpu.
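
A small sketch of the built-in vector type, using the `vector_size` attribute that GCC (and Clang) provide as an extension. The names `vec4f`, `vec4_madd` and `vec4_scale` are invented for this example.

```c
/* GCC/Clang extension: a 16-byte vector holding four floats. */
typedef float vec4f __attribute__((vector_size(16)));

/* Lane-wise a*b + d. The compiler picks SSE, AltiVec, or plain
   scalar code depending on the target it is building for. */
static vec4f vec4_madd(vec4f a, vec4f b, vec4f d)
{
    return a * b + d;
}

/* Scale by a scalar. The scalar is broadcast by hand here so the
   code does not rely on newer compilers accepting vector * scalar. */
static vec4f vec4_scale(vec4f v, float s)
{
    vec4f sv = { s, s, s, s };
    return v * sv;
}
```

For instance, `vec4_madd(v, (vec4f){4,4,4,4}, (vec4f){2,2,2,2})` is the "multiply this vec4 by 4, now add 2" case, with no intrinsic names in sight.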

by **revelator** » Wed May 09, 2012 3:12 am

there's an asm version of memcpy in quake2xp; unfortunately it does not work with gcc's assembler (it uses masm syntax).
one switch you can try with gcc is -ftree-vectorize to enable the vectorizing compiler. i had a few problems with some sources though when using this.

by **qbism** » Wed May 09, 2012 5:31 am

by **revelator** » Wed May 09, 2012 12:15 pm

seems the absolute winner as for calls is strcmp :mrgreen:

by **Baker** » Wed May 09, 2012 12:24 pm

Hmmmm ....

by **qbism** » Wed May 09, 2012 5:08 pm

q_strcmp: 2 billion calls/ 6500 frames = 300,000+ calls/frame :shock:

Maybe flipscreen is the place to post-process fog in 32bit color.

by **revelator** » Wed May 09, 2012 5:39 pm

comparing strings is probably not the biggest hit on performance normally but 2 billion calls... ouch.

by **mankrip** » Wed May 09, 2012 8:26 pm

I looked at D_DrawZSpans a few weeks ago, and couldn't really find a way to optimize it.

R_DrawSurfaceBlock8_mip* obviously needs to be optimized, and should be easy to. Limiting the number of lightmap updates per second could also help.

Q_strcmp usage should be easy to optimize in some way, maybe by finding ways to reduce the number of calls to it.

GetEdictFieldValue seems to be the villain in QuakeC's performance. Good to know, it should be the starting point if I try to optimize the QC VM.

Draw_Fill has a surprisingly high hit on performance.

by **Spike** » Thu May 10, 2012 1:38 am

the whole reason D_DrawZSpans exists is that a 386 doesn't have enough registers to interpolate/write depth at the same time as colours.
with the sse or amd64 instruction sets, you might get enough spare registers (possibly mmx too if you don't use floats).
Modern CPUs have more aggressive instruction pipelines. If you don't use the value from mem reads instantly, you may well be able to get away with using a little memory (stack?) instead of registers. Memory writes should at least be cacheable also.
Back when FTE still had a software renderer (much of which used C instead of asm) I found that combining D_DrawSpans8 and D_DrawZSpans gave a couple of percent speedup. But like I say, no asm so it's more a case of the compiler not managing to use all registers efficiently.

GetEdictFieldValue contains a loop and strcmps. It's not pretty. The offsets should be directly cacheable anyway.
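
One way the offset caching could look, as a sketch only. The types and functions below (`fielddef_t`, `fieldcache_t`, `find_field_offset`, `get_field`) are simplified stand-ins invented for this example, not the engine's real structures: the name is resolved with strcmp once, and later lookups reuse the stored offset.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified field definition: a name and an index
   into an entity's array of float fields. */
typedef struct {
    const char *name;
    int offset;
} fielddef_t;

/* Slow path: linear search with strcmp. Returns the offset, or -1. */
static int find_field_offset(const fielddef_t *defs, int count,
                             const char *name)
{
    int i;
    for (i = 0; i < count; i++)
        if (strcmp(defs[i].name, name) == 0)
            return defs[i].offset;
    return -1;
}

/* One cache slot per field the engine asks for by name. */
typedef struct {
    const char *name;
    int offset;
    int resolved;   /* 0 until the first lookup has been done */
} fieldcache_t;

/* Fast path: strcmp only on the first call for a given slot. */
static float *get_field(float *entfields, const fielddef_t *defs,
                        int count, fieldcache_t *c)
{
    if (!c->resolved) {
        c->offset = find_field_offset(defs, count, c->name);
        c->resolved = 1;
    }
    return c->offset < 0 ? NULL : &entfields[c->offset];
}
```

The cache slots would need their `resolved` flag cleared whenever a different progs is loaded, since the offsets can change between mods.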

by **qbism** » Thu May 10, 2012 5:07 pm
