Never heard of it, right?

Well, it's one of my favorite things on the PS3, and it gets little love because it's one of those tiny features that make life so much nicer.

Atomic operations rest on one of the most important principles in multithreading: a processor must be able to read, modify and write a value without another thread interrupting it (hence "atomic"). Without atomicity, multi-core systems are much harder to program, if not nearly impossible.
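To make "read/modify/write without interruption" concrete, here is a minimal sketch in portable C11 rather than SPU code (the function name `atomic_add_rmw` is made up for illustration). The compare-exchange only writes if the value is still what was read; if another thread got in first, it refreshes `expected` with the current value and the loop simply tries again.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical helper: atomically add `delta` to `*counter` with an explicit
   read/modify/write loop. If another thread changed the value between our
   read and our write, the compare-exchange fails, `expected` is reloaded
   with the current value, and we retry. */
static uint32_t atomic_add_rmw(_Atomic uint32_t *counter, uint32_t delta) {
    uint32_t expected = atomic_load(counter);    /* read */
    uint32_t desired;
    do {
        desired = expected + delta;               /* modify */
    } while (!atomic_compare_exchange_weak(counter, &expected, desired)); /* write or retry */
    return desired;
}
```

The same read, modify, try-to-write, retry-on-conflict shape is what the SPU hardware gives you, just at the granularity of a whole cache line instead of one word.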

The ACUs are part of each SPU and allow atomic updates to happen very quickly. The design seems fairly simple: each SPU has 512 bytes of cache (yes, contrary to what you might have heard, SPUs do have a tiny bit of cache). The 512 bytes are divided into four 128-byte lines. The MFC (DMA) unit can bring a cache line in from memory and mark it reserved. If another processor writes to the same bit of memory, the reservation is lost, and you know to repeat the read/modify/write cycle to guarantee atomicity.
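The reserve-and-retry cycle can be modelled on an ordinary PC. The sketch below is a hypothetical host-side simulation, not Cell SDK code: every name in it (`get_reserve`, `put_conditional`, `atomic_add_slot` and so on) is invented, and a per-line write counter stands in for the hardware that notices when a reserved line is written. The reservation covers the whole 128-byte line, so a write to any word in it forces a retry.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical simulation; a write counter stands in for hardware snooping. */
#define LINE_SIZE 128

typedef struct {
    _Alignas(LINE_SIZE) uint8_t bytes[LINE_SIZE];
} line_t;

typedef struct {
    line_t data;      /* the line as it sits in main memory */
    uint64_t writes;  /* bumped by every store to any byte of the line */
} mem_line_t;

typedef struct {
    uint64_t writes_seen; /* value of `writes` when we took the reservation */
} reservation_t;

/* "Get and reserve": copy the whole line to local store and note its state. */
static void get_reserve(const mem_line_t *m, line_t *ls, reservation_t *r) {
    memcpy(ls, &m->data, sizeof *ls);
    r->writes_seen = m->writes;
}

/* Unconditional whole-line store, e.g. by another SPU or the PPU. */
static void put_line(mem_line_t *m, const line_t *ls) {
    memcpy(&m->data, ls, sizeof *ls);
    m->writes++;
}

/* "Put conditional": succeeds only if nobody wrote the line since get_reserve. */
static bool put_conditional(mem_line_t *m, const line_t *ls, const reservation_t *r) {
    if (m->writes != r->writes_seen)
        return false;  /* reservation lost */
    put_line(m, ls);
    return true;
}

static uint32_t load_u32(const line_t *ls, size_t slot) {
    uint32_t v;
    memcpy(&v, ls->bytes + slot * sizeof v, sizeof v);
    return v;
}

static void store_u32(line_t *ls, size_t slot, uint32_t v) {
    memcpy(ls->bytes + slot * sizeof v, &v, sizeof v);
}

/* Optional hook so a demo can inject a competing write mid-cycle. */
typedef void (*interfere_fn)(mem_line_t *m, int attempt);

/* Atomically add to one 32-bit slot of the line; returns attempts taken. */
static int atomic_add_slot(mem_line_t *m, size_t slot, uint32_t delta,
                           interfere_fn interfere) {
    line_t ls;
    reservation_t r;
    int attempts = 0;
    do {
        attempts++;
        get_reserve(m, &ls, &r);                            /* read + reserve */
        store_u32(&ls, slot, load_u32(&ls, slot) + delta);  /* modify in LS */
        if (interfere)
            interfere(m, attempts);                         /* someone else writes? */
    } while (!put_conditional(m, &ls, &r));                 /* write, or start over */
    return attempts;
}
```

If another "SPU" writes slot 1 while we are updating slot 0, our first conditional put fails; the second attempt re-reads the line (picking up the other write) and succeeds, so neither update is lost. On the real hardware, the MFC's get-lock-line-and-reserve and put-lock-line-conditional commands play the roles of `get_reserve` and `put_conditional` here.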

All good, but what's really clever is how it's implemented. If another SPU asks for that bit of memory, it gets it from the other SPU's cache, if it's in there. So you effectively have a fast, SHARED 512 bytes. When an SPU writes, the other SPUs only have to read from the writing SPU's cache, rather than the data being DMAed back to main memory and then DMAed into their LS. That cuts down a lot of memory traffic. I even abuse it and just use it as a conventional cache and a communication channel between SPUs. You can push a lot of data around with a fast 128-byte path.
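A toy model helps show why this saves memory traffic. The sketch below assumes a deliberately simplified protocol and is not a description of the chip's real coherence logic: each simulated SPU holds four 128-byte lines, a miss is served by copying another SPU's line when one exists, a write invalidates every other copy, and main memory is touched only on a miss nobody else can serve or when a modified line is evicted. All names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Toy model with a simplified protocol, not the real Cell logic. */
#define NUM_SPUS  6
#define NUM_LINES 4              /* 4 x 128 bytes = 512 bytes per SPU */
#define LINE_SIZE 128

typedef struct {
    bool valid, dirty;           /* dirty: newer than main memory */
    uint64_t ea;                 /* address of the line in main memory */
    uint8_t bytes[LINE_SIZE];
} cline_t;

typedef struct {
    cline_t c[NUM_SPUS][NUM_LINES];
    unsigned next_victim[NUM_SPUS];               /* round-robin replacement */
    unsigned mem_reads, mem_writes, spu_to_spu;   /* traffic counters */
    uint8_t memory[4096];        /* fake main memory; ea must be line aligned */
} system_t;

static int find_line(const system_t *sys, int s, uint64_t ea) {
    for (int i = 0; i < NUM_LINES; i++)
        if (sys->c[s][i].valid && sys->c[s][i].ea == ea)
            return i;
    return -1;
}

/* Claim a slot in SPU s, writing a dirty victim back to memory first. */
static cline_t *alloc_line(system_t *sys, int s, uint64_t ea) {
    cline_t *v = &sys->c[s][sys->next_victim[s]];
    sys->next_victim[s] = (sys->next_victim[s] + 1) % NUM_LINES;
    if (v->valid && v->dirty) {
        memcpy(sys->memory + v->ea, v->bytes, LINE_SIZE);
        sys->mem_writes++;
    }
    v->valid = true;
    v->dirty = false;
    v->ea = ea;
    return v;
}

/* SPU s's copy of line ea; on a miss, copy it from a peer SPU if one has it. */
static cline_t *line_for(system_t *sys, int s, uint64_t ea) {
    int i = find_line(sys, s, ea);
    if (i >= 0)
        return &sys->c[s][i];
    cline_t *dst = alloc_line(sys, s, ea);
    for (int p = 0; p < NUM_SPUS; p++) {
        if (p == s)
            continue;
        int j = find_line(sys, p, ea);
        if (j >= 0) {                                /* cache-to-cache transfer */
            memcpy(dst->bytes, sys->c[p][j].bytes, LINE_SIZE);
            sys->spu_to_spu++;
            return dst;
        }
    }
    memcpy(dst->bytes, sys->memory + ea, LINE_SIZE);  /* nobody has it: memory */
    sys->mem_reads++;
    return dst;
}

static void read_line(system_t *sys, int s, uint64_t ea, uint8_t out[LINE_SIZE]) {
    memcpy(out, line_for(sys, s, ea)->bytes, LINE_SIZE);
}

static void write_line(system_t *sys, int s, uint64_t ea, const uint8_t in[LINE_SIZE]) {
    cline_t *mine = line_for(sys, s, ea);
    for (int p = 0; p < NUM_SPUS; p++) {             /* snoop: drop stale copies */
        if (p == s)
            continue;
        int j = find_line(sys, p, ea);
        if (j >= 0)
            sys->c[p][j].valid = false;
    }
    memcpy(mine->bytes, in, LINE_SIZE);
    mine->dirty = true;
}
```

With two SPUs taking turns incrementing a counter 100 times each, this model reads main memory once and never writes it; the other 199 loads in the loop are served from the other SPU's line.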

And the nicest thing about it… it just works! All the cache snooping, routing, etc. just happens magically inside the chip. You say ATOMIC_GET and ATOMIC_SET and treat yourself to a 512-byte shared cache.

So, for example, for some of the army stuff I need to keep statistics (things like how many are dead, KO’ed, etc.).

These are 128-byte structures that each SPU reads and writes as necessary. At first glance it looks like it would be really slow, and it would be, if not for the ACU. Every time I need to add a number I’m doing two 128-byte DMAs (one read, one write), but because the line is sitting in an SPU's cache most of the time, it ends up being just EIB ring traffic. And that's fast, really, really fast.
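One way to lay such a record out is sketched below. The field names are hypothetical (only "dead" and "KO’ed" come from the text above), and the reserve/retry wrapper is left out. The point is that the record fills exactly one 128-byte, 128-byte-aligned line, so a single read/write pair moves all of it, and `stats_accumulate` is the "modify" step done on the local copy in between.

```c
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: one record == one 128-byte, line-aligned cache line. */
typedef struct {
    alignas(128) uint32_t dead;
    uint32_t knocked_out;
    uint32_t other[30];          /* room for further counters: 32 x 4 = 128 bytes */
} army_stats_t;

_Static_assert(sizeof(army_stats_t) == 128, "stats must fill exactly one line");
_Static_assert(alignof(army_stats_t) == 128, "stats must be line aligned");

/* The "modify" step: runs on the local-store copy between the reserving
   read and the conditional write of the whole line. */
static void stats_accumulate(army_stats_t *line, const army_stats_t *delta) {
    line->dead += delta->dead;
    line->knocked_out += delta->knocked_out;
    for (size_t i = 0; i < sizeof line->other / sizeof line->other[0]; i++)
        line->other[i] += delta->other[i];
}
```

Each update moves 2 × 128 = 256 bytes; at, say, 500 updates a frame (an illustrative figure), that is 128,000 bytes per frame, which is why it matters that the traffic mostly stays on the EIB instead of going out to main memory.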

I just have a shared counter statistics system that all works, even though I can be making hundreds of atomic updates per frame.

Nice one… Whoever at STI added that bit of hardware deserves a beer from me :D