_never_ read from video memory.
if you need to read back what you drew, keep it in some other buffer in system memory (like quake's water warp buffer) and read from that instead, a bit like you would an fbo in gl.
video memory is generally mapped as uncached (or write-combined) memory, so every single read stalls the cpu while it waits on the result, hence the slowness.
DIBs are blitted from system memory each time, which is why you'd want to use directdraw when you can.
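
a rough sketch of the idea in c, not taken from any real engine: all the read-modify-write work happens in a system memory buffer, and video memory only ever gets one sequential write per frame. `fake_vram`, `backbuf`, `darken`, `present` and `draw_frame` are made-up names for illustration, and `fake_vram` is just a plain array standing in for whatever pointer you'd actually get from locking a surface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define WIDTH  320
#define HEIGHT 200

/* stand-in for mapped video memory (assumption: in a real program this
   would be the pointer you get back from locking a surface) */
uint8_t fake_vram[WIDTH * HEIGHT];

/* ordinary cached system memory: cheap to read back from */
uint8_t backbuf[WIDTH * HEIGHT];

/* a read-modify-write effect, done entirely in system memory */
void darken(uint8_t *buf, size_t count)
{
    for (size_t i = 0; i < count; i++)
        buf[i] = (uint8_t)(buf[i] / 2);   /* this read hits cached ram */
}

/* one write-only, sequential pass into video memory; nothing is read
   from the destination */
void present(uint8_t *vram, const uint8_t *src, size_t count)
{
    assert(vram != NULL && src != NULL);
    memcpy(vram, src, count);
}

/* one frame: draw and post-process in system memory, then copy out */
void draw_frame(void)
{
    memset(backbuf, 200, sizeof backbuf);   /* pretend this is rendering */
    darken(backbuf, sizeof backbuf);
    present(fake_vram, backbuf, sizeof backbuf);
}
```

you'd call something like `draw_frame()` once per frame from your main loop. the point is only the access pattern: reads stay in `backbuf`, and the video memory side is write-only.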
