2D can actually be a surprisingly huge slowdown - it's quite fillrate- and overdraw-intensive, and batching up calls can help a lot. I remember getting a fright when Draw_Character turned out to be my biggest bottleneck on a VMWare test machine - I had to start thinking about batching calls after that.
The way I handle it is by using a common input/vertex layout for everything (position/colour/texcoords, even if not all are needed), which helps keep buffer switches down. Instancing gets the vertex size down to under half that needed for a full quad, although in practice that's not a bottleneck - I just wanted to experiment with instancing. I then sniff for state changes and issue the current batch if one happens; otherwise I just add the specified quad to a vertex buffer. The state changes are generally only texture (the scrap system helps a lot here) and shader, although I've also got the ability to set a new ortho matrix if needed, and D3D's separation of sampling parameters from the texture object means that I also need to watch out for some textures that need to clamp, some that need to wrap, etc. I've a nice state filtering system that can take a callback to be executed before state changes, so I shove my Draw_Flush into that and everything happens automatically. One final flush of anything left over at the end of the frame and it's done.
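For anyone who wants to see the shape of it, here's a rough sketch of the sniff-and-flush idea in plain C. It's not my actual code: only Draw_Flush is a real name from my engine, everything else (drawstate_t, Draw_Quad, Draw_EndFrame, the integer handles, the draw call counter standing in for the real API submit) is made up for illustration, and it compares state directly rather than going through the state filter callback.

```c
#include <assert.h>

#define MAX_BATCH_QUADS 1024

/* One common layout for every 2D draw: position, colour, texcoords. */
typedef struct {
    float x, y;
    unsigned int colour;    /* packed RGBA */
    float s, t;
} drawvert_t;

/* Hypothetical bundle of the state that forces a flush when it changes. */
typedef struct {
    int texture;            /* made-up texture handle */
    int shader;             /* made-up shader handle */
    int clamp;              /* sampler addressing: 1 = clamp, 0 = wrap */
} drawstate_t;

static drawvert_t batch_verts[MAX_BATCH_QUADS * 4];
static int batch_numquads = 0;
static drawstate_t batch_state = { -1, -1, -1 };
static int num_drawcalls = 0;   /* stand-in for the real API draw call */

/* Issue whatever is pending; a real renderer would submit the buffer here. */
void Draw_Flush(void)
{
    if (batch_numquads == 0)
        return;
    num_drawcalls++;
    batch_numquads = 0;
}

static int Draw_StateChanged(const drawstate_t *s)
{
    return s->texture != batch_state.texture ||
           s->shader != batch_state.shader ||
           s->clamp != batch_state.clamp;
}

/* Add one quad, flushing first if the state changed or the batch is full. */
void Draw_Quad(const drawstate_t *state, float x, float y, float w, float h,
               float sl, float tl, float sh, float th, unsigned int colour)
{
    if (Draw_StateChanged(state) || batch_numquads == MAX_BATCH_QUADS) {
        Draw_Flush();
        batch_state = *state;
    }
    assert(batch_numquads < MAX_BATCH_QUADS);

    drawvert_t *v = &batch_verts[batch_numquads * 4];
    v[0].x = x;     v[0].y = y;     v[0].s = sl; v[0].t = tl;
    v[1].x = x + w; v[1].y = y;     v[1].s = sh; v[1].t = tl;
    v[2].x = x + w; v[2].y = y + h; v[2].s = sh; v[2].t = th;
    v[3].x = x;     v[3].y = y + h; v[3].s = sl; v[3].t = th;
    for (int i = 0; i < 4; i++)
        v[i].colour = colour;
    batch_numquads++;
}

/* One final flush of anything left over at the end of the frame. */
void Draw_EndFrame(void)
{
    Draw_Flush();
}
```

With something like this, a whole console line of characters from the same texture goes out as a single draw call, and the only time you pay for a submit is when the texture, shader or sampler mode actually changes.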
Another surprisingly big bottleneck is drawing the gun model. Right now I've got assumptions in my code that it's going to be the last thing drawn in a frame (big mistake, that), but as soon as I work them out I'm going to try moving it to first and see if the GPU's early-Z can help at all (it should be able to quickly reject a lot of world and other polys before the pixel shader runs). Changing the order shouldn't be an issue - Q2 just draws it mixed in with the regular ents and it works fine there.
Bleagh - verbal diarrhea.
