Using an Asynchronous Blitter in DirectDraw

By Deano Calver

This article describes a software queue. Some video cards have hardware queues that will do this job for you. The polling would also probably do well placed in a high-priority thread. For DirectDraw, check the hardware BLTQUEUE caps flag; if it is set and the queue is big enough, just use that, otherwise use a modified, more advanced version of the method below.

Use, abuse, copy or do whatever you want with this article or the source code. If you do use it, or just want some info, mail me at

The key to using an asynchronous blitter and the CPU together is having both run at full speed without either having to wait for the other. You will never get this to work 100% of the time, but with a bit of luck and some clever software design they will run concurrently most of the time. While this example applies to any async blitter, it is intended for DirectDraw systems.

The first thing is to start thinking in terms of two operations at once: the blitter is not just a fast copier, it runs simultaneously with the CPU. That is, when you start copying with the blitter, control returns to the CPU immediately, without waiting for the blit to happen. If you try to use the blitter again straight away, it is probably still doing the first blit, so you have to wait. This is a CPU stall and an incredible waste of processing time. On the other hand, if you start a blit and then do a CPU-intensive task, the blit will probably have finished and the blitter will be waiting for the next blit. This is a blitter stall, which means you are not using your graphics hardware to its full potential.

These tables show the duel between the CPU and the blitter. Each row is one time slot; Blit1 lasts 3 blitter slots, Blit2 lasts 2.

CPU Stall

CPU            Blitter
Start Blit1    Blitting 1
Wait           Blitting 1
Wait           Blitting 1
Start Blit2    Blitting 2
Wait           Blitting 2

Blitter Stall

CPU            Blitter
Start Blit1    Blitting 1
CPU Working    Blitting 1
CPU Working    Blitting 1
CPU Working    Wait
Start Blit2    Blitting 2
CPU Working    Blitting 2

One technique for using the hardware to full advantage is to write the program so that it starts a blit, does a CPU task that takes exactly as long as the blit, and then blits again. On a fixed computer platform like the Amiga 500 this was indeed possible, but it is all but impossible to be that exact on a modern PC. You can run into problems just profiling the CPU exactly (Quake has this problem on the Cyrix 6x86, as it was programmed for an exact FPU / IPU overlap that is different on the Cyrix), and that is with just a handful of CPUs; try doing the same with the number of video cards on the market.

Exact Overlap

CPU            Blitter
Start Blit1    Blitting 1
CPU Working    Blitting 1
CPU Working    Blitting 1
Start Blit2    Blitting 2
CPU Working    Blitting 2
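The three tables above can be reproduced with a small slot-counting model. This is only a sketch for illustration: the Overlap struct and the overlap() function are names invented here, and the model assumes that starting a blit costs the CPU one slot during which the blitter is already working, as the tables show.

```cpp
#include <algorithm>
#include <cassert>

// Result of overlapping one blit with some CPU work, measured in slots.
struct Overlap {
    int cpuWait;      // slots the CPU spends waiting for the blitter
    int blitterIdle;  // slots the blitter sits idle waiting for the CPU
};

// blitSlots:    how long the blit takes (the start slot counts as the first).
// cpuWorkSlots: CPU work done after starting the blit, before the next start.
inline Overlap overlap(int blitSlots, int cpuWorkSlots) {
    assert(blitSlots >= 1 && cpuWorkSlots >= 0);
    // The CPU is ready to start the next blit after one start slot plus
    // its own work; the blitter is ready once the blit has run its length.
    const int cpuReady = 1 + cpuWorkSlots;
    const int blitterReady = blitSlots;
    return { std::max(0, blitterReady - cpuReady),
             std::max(0, cpuReady - blitterReady) };
}
```

With Blit1 lasting 3 slots, overlap(3, 0) gives the CPU stall case, overlap(3, 3) the blitter stall case, and overlap(3, 2) the exact overlap, where neither side waits.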

One system that can be used to good effect on a PC is a Blitter Queue. The queue holds all pending blits; when the blitter finishes one, the next is started. Ideally the Blitter Queue is supported in hardware as some kind of FIFO; some video cards may have this capability, but most do not (yet). The next best thing would be some kind of blitter-finished interrupt, but no PC video card I know of supports this (probably due to limited IRQs), so to implement it on the PC we have to use perhaps the oldest style of hardware programming: polling. With polling you check regularly to see whether the event you are interested in has happened, and if it has, you perform the appropriate routine. In a typical PC implementation of a Blitter Queue, a blit is no longer started directly but placed into the queue, and the CPU continues on its merry way, placing other blits into the queue whenever it likes. The poll determines whether the blitter is free and, if so, starts the next blit. The main synchronization needed is before a flip, to make sure all blits have finished.

Blitter Queue table assuming perfect polling

CPU            Blitter       Blitter Queue
Post Blit1     Blitting 1    Start Blit 1
Post Blit2     Blitting 1    Do Nothing
CPU Working    Blitting 1    Do Nothing
CPU Working    Blitting 2    Start Blit 2
CPU Working    Blitting 2    Do Nothing
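A polled software queue of this kind might be sketched as follows. This is not the downloadable class; every name here (FakeBlitter, BlitRequest, BlitQueue) is hypothetical. FakeBlitter stands in for the hardware so the sketch can run anywhere. Under DirectDraw the busy test would presumably be a call such as the surface's GetBltStatus, and starting a blit an asynchronous Blt call.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Hypothetical stand-in for the hardware blitter: a blit simply takes
// a number of time slots, and tick() lets one slot pass.
struct FakeBlitter {
    int remaining = 0;                        // slots left on the current blit
    bool busy() const { return remaining > 0; }
    void start(int slots) { assert(!busy()); remaining = slots; }
    void tick() { if (remaining > 0) --remaining; }
};

// Stand-in for the real blit parameters (surfaces, rectangles, flags).
struct BlitRequest {
    int slots;
};

class BlitQueue {
public:
    explicit BlitQueue(FakeBlitter& blitter) : blitter_(blitter) {}

    // Queue a blit and return at once; nothing is started here.
    void Post(const BlitRequest& request) { pending_.push_back(request); }

    // If the blitter is free and work is waiting, start the next blit.
    // Returns true if a blit was started.
    bool Poll() {
        if (blitter_.busy() || pending_.empty()) return false;
        blitter_.start(pending_.front().slots);
        pending_.pop_front();
        return true;
    }

    // Block until every queued blit has finished; call this before a flip.
    void Empty() {
        while (!pending_.empty() || blitter_.busy()) {
            Poll();
            blitter_.tick();  // real code would spin or yield here instead
        }
    }

    std::size_t Pending() const { return pending_.size(); }

private:
    FakeBlitter& blitter_;
    std::deque<BlitRequest> pending_;
};
```

The CPU side only ever calls Post(), which is cheap; all contact with the blitter happens inside Poll(), so where and how often Poll() is called decides how well the two overlap.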

The key to making it work is the polling itself: poll too often and you waste valuable time checking whether the blitter has finished; poll too rarely and the blitter will be idle for too long. A simple way is to put polls at regular intervals throughout your code; another simple method is to put the poll on a fast timer. Both work well, but big blits get polled several times and small blits finish too soon (before the next poll). These methods work for most code and give a good speed increase over simply waiting for each blit.

Blitter Queue table using a 2 slot timer poll

CPU            Blitter       Blitter Queue               Timer
Post Blit1     Blitting 1    Blitter Free Start Blit1    Check Blitter
Post Blit2     Blitting 1    Do Nothing                  Do Nothing
CPU Working    Blitting 1    Blitter Busy                Check Blitter
CPU Working    Wait          Do Nothing                  Do Nothing
CPU Working    Blitting 2    Blitter Free Start Blit2    Check Blitter
CPU Working    Blitting 2    Do Nothing                  Do Nothing

There is a 1-slot wait in the above example, but this is the worst case that can happen, and a dynamic timer system can cut down even on this wait. The most important thing, though, is that the CPU routines don’t have to slow down, except for the polling overhead, which should be small.
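The effect of the timer period can be checked with a small simulation of the table above. This is a sketch under simplifying assumptions: idleSlots() is a name invented here, time is counted in whole slots, the timer fires at slot 0 and every `period` slots after that, and every blit has already been posted by the time the blitter could start it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Counts the slots in which the blitter sits idle although a blit is
// waiting, when the queue is polled only every `period` slots.
// `blits` holds each blit's length in slots (each assumed to be >= 1).
inline int idleSlots(const std::vector<int>& blits, int period) {
    assert(period >= 1);
    int idle = 0;
    int remaining = 0;      // slots left on the blit in flight
    std::size_t next = 0;   // index of the next blit to start
    for (int slot = 0; next < blits.size() || remaining > 0; ++slot) {
        const bool pollNow = (slot % period == 0);
        if (pollNow && remaining == 0 && next < blits.size()) {
            remaining = blits[next++];  // blitter free: start the next blit
        }
        if (remaining > 0) {
            --remaining;                // blitter works through this slot
        } else if (next < blits.size()) {
            ++idle;                     // free, but nobody has polled yet
        }
    }
    return idle;
}
```

For the blits in the table, idleSlots({3, 2}, 2) counts the single wait slot shown, while a poll on every slot (period 1) leaves no idle slot. In this model the blitter never waits more than period - 1 slots for the next poll.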


This is quite a simple blitter queue that doesn’t handle several important cases. The most important is when the CPU wants access to a surface after a blit. In the above version there is no support for this, as a lock could happen before the queued blit has actually been carried out. The solution is to have another mechanism that locks the surface and then calls a callback routine; you can improve the Poll to allow this to occur once the relevant blits have been done. Auto-deletion of temporary memory can also be added. A dynamic poller can be implemented by profiling the blitter and then calculating an estimated time until it has finished; the blitter profiling has to be done at setup on each particular machine, as every make of blitter runs at a different speed.
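One way the callback idea could look is to let the queue hold two kinds of entry, blits and callbacks, and to run a callback only when the blitter has gone idle. This is a guess at the shape of the idea, not the code from the download; FakeBlitter and OrderedQueue are hypothetical names, and the actual surface lock is left to whatever the callback does.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <utility>

// Hypothetical stand-in for the hardware blitter, counting time in slots.
struct FakeBlitter {
    int remaining = 0;
    bool busy() const { return remaining > 0; }
    void start(int slots) { assert(!busy()); remaining = slots; }
    void tick() { if (remaining > 0) --remaining; }
};

class OrderedQueue {
    // An entry is a callback if `callback` is set, otherwise a blit.
    struct Item {
        int slots;
        std::function<void()> callback;
    };

public:
    explicit OrderedQueue(FakeBlitter& blitter) : blitter_(blitter) {}

    void PostBlit(int slots) { items_.push_back({slots, nullptr}); }

    // Run `fn` on the CPU once every blit posted before it has finished,
    // for example to lock a surface and read what those blits wrote.
    void PostCallback(std::function<void()> fn) {
        items_.push_back({0, std::move(fn)});
    }

    void Poll() {
        // Entries are taken strictly in order, so an idle blitter means
        // every blit ahead of the front entry has completed.
        while (!blitter_.busy() && !items_.empty()) {
            Item item = std::move(items_.front());
            items_.pop_front();
            if (item.callback) {
                item.callback();             // nothing earlier is in flight
            } else {
                blitter_.start(item.slots);  // usually ends the loop
            }
        }
    }

private:
    FakeBlitter& blitter_;
    std::deque<Item> items_;
};
```

Because the callback sits in the same queue as the blits, blits posted after it are held back until it has run, so the CPU's access to the surface cannot overlap a later blit to it.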

Source Code 

Download Blit Queue Class 

A Visual C++ class implementing the above method is available for download. It has the callback idea implemented, but the callback hasn’t been extensively tested, so beware. I removed the auto-deletion feature, as it is likely to be specific to my code. It is NOT optimised at all, but it still allows your code to blit in the background without much reworking.

Assuming you allocated a CBltQueue class as BltQueue when initialising.

To Post a blit

struct SBob *Bob;

Bob = BltQueue->GetBob();

// Fill in Bob structure.


Then all you have to do is call BltQueue->Poll() regularly and, just before flipping, BltQueue->Empty().

