PS3 mumblings...
I've been trying to wrap my mind around the PS3 hardware specs for some time now, and somehow I don't get the point.
In the beginning, when there was just the Cell in the game, it all made sense. The Cell was able to achieve an impressive vertex processing rate, at least compared to a traditional CPU. So Sony would use the Cell to implement a very flexible vertex processing pipeline and connect it to a relatively simple graphics chip, which would basically just be a rasterizer... this could result in a relatively simple and cheap system, right? But then, surprisingly late, Sony announced that the PS3 would contain a traditional nvidia-made GPU, and everything suddenly made less sense. Now there was a powerful and power-hungry CPU able to process a lot of vertex data per second, and on top of that a powerful and power-hungry GPU, also able to process a lot of vertex data per second. That's the main reason I don't get it. Why are there two completely different and completely separate vector stream processors in the PS3, which must be programmed in completely different ways? Of course there must be a secret master plan behind all this, soon to be revealed to us mere mortals. Or could it be that the Cell couldn't match a modern GPU in terms of vector processing power and Sony had to fall back on an emergency plan?
The PS3 is basically the Cell CPU, connected to 256 MB of main RAM at roughly 25 GB/s, and the RSX GPU, connected to 256 MB of video RAM at 22 GB/s and to main RAM at 15..20 GB/s. That's at least what the publicly available specs say. There's not much information on how fast the Cell can read from and write to GPU RAM. According to this, write speed is about 4 GB/s, while read speed is 16 *MB*/s. Now the latter is not as dramatically bad as the Inquirer article implies. Reading from video RAM is generally a bad idea on any architecture, because it stalls the GPU. But the 4 GB/s write speed isn't something to write home about either.
CPU-to-system-RAM bandwidth is pretty good compared to a modern PC, and GPU-to-video-RAM bandwidth seems to be about on par with nvidia 6800 reference boards.
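To put those bandwidth figures into game terms, here is a quick back-of-the-envelope sketch that converts them into a per-frame transfer budget. The 60 fps target and the decimal units (1 GB = 1e9 bytes) are my assumptions, and the numbers are theoretical peaks, not measured throughput.

```python
# Per-frame transfer budget derived from the bandwidth figures quoted above.
# Assumptions (not from the specs): 60 fps target, decimal units (1 GB = 1e9 bytes).

FPS = 60

def per_frame_bytes(bandwidth_bytes_per_s, fps=FPS):
    """Bytes that can cross a link during one frame at peak bandwidth."""
    return bandwidth_bytes_per_s / fps

# Peak bandwidths in bytes per second, as quoted in the text.
links = {
    "Cell <-> main RAM": 25e9,
    "RSX <-> video RAM": 22e9,
    "Cell write -> video RAM": 4e9,
    "Cell read <- video RAM": 16e6,
}

budgets = {name: per_frame_bytes(bw) for name, bw in links.items()}

for name, budget in budgets.items():
    print(f"{name:24s} {budget / 1e6:10.2f} MB per frame")
```

At 60 fps the Cell could write roughly 67 MB per frame into video RAM, but read back only about a quarter of a megabyte.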
The Cell runs at 3.2 GHz and has one general-purpose PowerPC core and 7 specialized stream processing units. Just as with the Xbox360, the 3.2 GHz is slightly misleading. The Xbox360 and Cell cores are stripped-down PowerPC cores that lack an out-of-order instruction scheduler, which seems to yield real-world performance comparable to a 1.8 GHz P4 (that's hearsay, but it sounds reasonable). The Xbox360 makes up for this by having 3 identical cores. But on the Cell, there's only one of those cores.
So without the 7 additional stream processors of the Cell, the PS3 is basically comparable to a PC from 2001 (albeit with very good memory bandwidth) with a 2004/2005 era GPU. Graphics-wise, that's ok. Consoles have the advantage of hardware that is set in stone and a very thin software layer, so console programmers can play some tricks that are impossible on the PC with its many hardware and software configurations.
What concerns me is the weak general processing power of the Cell. By raw numbers, the Cell is a vector processing monster thanks to its 7 (on PS3) specialized stream processing units (SPEs). But if one starts to look at the details, the Cell looks more like a solution in search of a problem. Generally, an SPE should roughly be capable of processing about 51 GB/s (3.2 GHz x 128 bit, assuming one vector operation per cycle). According to this, this is exactly the bandwidth to the local memory of an SPE. The bandwidth to the interconnect bus for a single SPE is 25 GB/s, and the Cell seems to be optimized especially for exchanging data between SPEs. There only seems to be a single channel to main memory, with 25 GB/s. So the optimal scenario seems to be to chain the SPEs into a pipeline and to stream a single data set through that pipeline. Feeding all SPEs in parallel from main memory would more than saturate the single 25 GB/s data channel.
The second problem is the small local memory per SPE. SPEs don't have conventional caches for main memory, but 256 KB of embedded memory for both code and data. It looks like data transfers in and out of the SPE must be handled manually through a DMA engine. 256 KB is really not much. Assuming 64 bytes per vertex in a 3D model, that's just enough room for about 4000 vertices. And then the code must also fit somewhere into those 256 KB.
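The arithmetic behind those numbers, spelled out as a small sketch. All inputs are the figures quoted above. The one-vector-operation-per-cycle rate and the 64-byte vertex are rough assumptions, and the variable names are just made up for illustration.

```python
# Back-of-the-envelope numbers for one SPE and for all SPEs together.
# Assumptions: one 128-bit vector operation per cycle, 64 bytes per vertex,
# decimal GB (1e9 bytes) for bandwidth, binary KB (1024 bytes) for local memory.

CLOCK_HZ = 3.2e9
VECTOR_BYTES = 128 // 8            # one 128-bit vector = 16 bytes
NUM_SPES = 7                       # SPEs available on the PS3
MAIN_MEMORY_BW = 25e9              # single channel to main memory, bytes/s
LOCAL_STORE_BYTES = 256 * 1024     # embedded memory per SPE (code + data)
BYTES_PER_VERTEX = 64

# Data one SPE could chew through per second at peak.
spe_throughput = CLOCK_HZ * VECTOR_BYTES

# What all SPEs together would ask for if each were fed straight from main memory.
aggregate_demand = NUM_SPES * spe_throughput

# How far that exceeds the single main-memory channel.
oversubscription = aggregate_demand / MAIN_MEMORY_BW

# Vertices that fit into the local memory if it held nothing but vertex data.
max_vertices = LOCAL_STORE_BYTES // BYTES_PER_VERTEX

print(f"per-SPE throughput : {spe_throughput / 1e9:.1f} GB/s")
print(f"aggregate demand   : {aggregate_demand / 1e9:.1f} GB/s")
print(f"oversubscription   : {oversubscription:.1f}x")
print(f"vertices per SPE   : {max_vertices}")
```

So seven SPEs running flat out would want about 14 times what the main memory channel can deliver, which is why the pipeline arrangement looks like the only way to keep them all fed.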
Now let's quickly compare the Cell to a modern graphics chip. A GPU is basically made of lots and lots of little shader units, each of which is comparable to an SPE. In terms of flexibility, a modern GPU shader isn't that much different from an SPE: both can be programmed in a C-like high-level programming language and can do branches and loops. A GPU has many more shader units than a Cell has SPEs (40..60 shader units compared to 7..8 SPEs on a Cell), but a GPU is clocked much lower than a Cell (~0.5 GHz compared to ~3 GHz). The big difference is in the programming model. While the Cell exposes all of its internal complexity to the programmer, a GPU just looks like a very simple linear stream processor from the outside. A GPU consumes a single dataset, and all the complex parallelization happens inside the GPU, completely hidden from the programmer. From a programmer's point of view it doesn't make a difference whether there's only one shader unit in the GPU or whether there are hundreds of them. Compare that to all the hoopla that's needed on the Cell. Data must be shuffled in and out of every SPE via DMA, there's an internal size limit on the data that can be processed at once, there are all types of bandwidth limitations to think of, and so on and on... Why oh why didn't the Cell designers take some inspiration from the graphics guys?
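To illustrate the difference in bookkeeping, here is a toy sketch in plain Python. It is not real Cell or GPU code: `transform`, `gpu_style` and `spe_style` are hypothetical names, the 64 KB code reserve and the double-buffering split are assumptions of mine, and the "DMA" is just list slicing.

```python
# Toy contrast of the two programming models. Nothing here talks to real hardware.

LOCAL_STORE_BYTES = 256 * 1024   # per-SPE embedded memory, from the specs
CODE_RESERVE_BYTES = 64 * 1024   # hypothetical space set aside for the SPE program
BYTES_PER_VERTEX = 64            # assumed vertex size

def transform(v):
    """Stand-in for the per-vertex work (a shader or an SPE kernel)."""
    return v * 2.0

def gpu_style(vertices):
    # One call over the whole dataset; how the work is spread over
    # the shader units is invisible to the caller.
    return [transform(v) for v in vertices]

def spe_style(vertices):
    # The caller has to size chunks to fit the local memory. Half of the
    # remaining space is assumed to hold the next chunk (double buffering).
    chunk = (LOCAL_STORE_BYTES - CODE_RESERVE_BYTES) // 2 // BYTES_PER_VERTEX
    out = [None] * len(vertices)
    transfers = 0
    for start in range(0, len(vertices), chunk):
        local = vertices[start:start + chunk]       # simulated "DMA in"
        result = [transform(v) for v in local]      # work on local memory only
        out[start:start + len(result)] = result     # simulated "DMA out"
        transfers += 2
    return out, transfers

vertices = [float(i) for i in range(10000)]
gpu_result = gpu_style(vertices)
spe_result, transfers = spe_style(vertices)
print(gpu_result == spe_result, transfers)
```

Both paths produce the same result, but in the second one the chunk size, the transfers and the memory layout are all the programmer's problem, and that is before any SPE scheduling or synchronization enters the picture.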
So what to do? There's only so much need for stream processing in a typical game, and that's already mostly handled by the GPU. One could build a pretty flexible and fast vertex processing pipeline with the Cell, but that's quite pointless, because the RSX already does that, and GPUs have been specialized for this type of work for years. One could try to "prepare" vertices for the GPU, especially stuff that can't be done on the GPU, maybe some advanced dynamic LOD system, algorithmic geometry generation, etc... But then one also needs to continuously feed the vertex data to the GPU, and from the available bandwidth it looks like the Cell could hardly keep the GPU busy. And frankly, IMHO there's not much need anymore for doing any per-vertex stuff on the CPU. For years, graphics hardware has been pushed in the direction of moving vertex processing off the CPU and onto the GPU.
So what else could keep the SPEs busy? Encoding and decoding several video and audio streams in parallel would be a perfect task. But on a game console?
The only area that comes to mind where the Cell could shine in a game context is probably physics. Physics needs a lot of floating point processing power, and doesn't have to interact with the rest of the game code too much (basically writing updates from the game world into the physics world, evaluating the physics interactions, and reading back the results into the game world). So maybe we will see some great physics in PS3 games, but is there really a need for it? Look how successful the Ageia physics accelerator has been...
Somehow, the hype surrounding the Cell reminds me of the Itanium hype of the '90s. By today, we all ought to be sitting in front of cheap Itanium workstations with processing power unheard of from mere PCs. The Itanium boasted very impressive theoretical peak performance numbers. The problem was that the Itanium relied on heavy parallelization at the instruction level, but most software just didn't have enough of that instruction-level parallelism to keep all execution units of the Itanium busy. The Cell could suffer from the same problem, just on a different level. There may just not be enough need for vector processing in games besides 3D graphics, which is already better handled by the GPU.