|)U(|(0F|)3@TH
July 22nd, 2003, 05:29 PM
Found this over at B3D.com. It's pretty dense, but interesting if you can get through it. It's also been taken off the internet; the B3D poster had to get it from Google cache.
http://www.beyond3d.com/forum/viewtopic.php?t=6947
The Technology of PS3
Eddie Edwards, April 2003
Foreword
Recent news articles have explained that the patent application for the technology on which PS3 will assumedly be based is now available online. I've spent some time examining the patent and I have formed some theories and educated guesses as to what it all means in practice. This document describes the patent and outlines my ideas. Some of these guesses are informed by my knowledge of PS2 (I was one of the VU coders on Naughty Dog's Jak & Daxter although I do not work for Sony now). You may wish to refer to Paul Zimmons' PowerPoint presentation which has diagrams that might make some of this stuff clearer. Also, until I get told to take it down, I have made the patent itself available in a more easily downloadable form (a 2MB ZIP containing 61 TIF files).
The technology of PS3 is based on what IBM call the "Cell Architecture". This architecture is being developed by a team of 300 engineers from Sony, IBM and Toshiba. PS2 was developed by Sony and Toshiba. Sony appear to have designed the basic architecture, while Toshiba have figured out how to implement it in silicon. The new consortium includes IBM, who for PS3 will use their advanced fabrication technologies to build the chips faster and smaller than would otherwise have been possible. In addition, the effort is supposedly a holistic approach whereby tools and applications are being developed alongside the hardware. IBM have particular expertise in building applications and operating systems for massively parallel systems - I expect IBM to have significant input into the software for this system.
There is a lot of PS2 in the Cell Architecture. It is the PS2 flavour that is most apparent to me when I read the patent. However, IBM must be bringing a significant amount of stuff to the table too. The patent for instance refers to a VLIW processor with 4 FPUs, rather than a dual-issue processor with a single SIMD vector FPU. Does this imply that the chips are based on an IBM-style VLIW ALU set? Or does it just mean that it's a fast VU with a "very long instruction word" of only 2 instructions? Furthermore, note that IBM have been making and selling massively parallel supercomputers for several decades now. IBM experts' input on the programming paradigms and tool set is going to be invaluable. And the host processor finally drops the MIPS ISA in favour of IBM's own PowerPC instruction set. But we may not get to program the PPCs inside the PS3 anyway.
I have had to make assumptions. Forgive them. If anyone with insight or knowledge wishes to enlighten me, please do.
Contents
* Foreword
* Cells
* APUs
o Instruction Width
* Winnie the PU
* PEs
* The Broadband Engine
* Visualizers
* Will the Real PS3 Please Stand Up?
* Memory : Sandboxes
* Memory : Producer / Consumer Synchronization
* Memory : Random Access, Caches, etc.
* Forward and Sideways Compatibility
* Graphics
o Modelling
* Programming PS3
* Jazzing with Blue Gene
* Stream Processing
* Readers' Comments
* Links and References
Cells
(There is some confusion as to what a "cell" is in this patent. The media is generally using the term "cell" for what the patent calls a "processing element" or "PE". In the patent, the term "cell" refers to a unit of software and data, while the term "PE" refers to a processing element that contains multiple processing units. I will use that nomenclature here.)
Cells are central to the PS3's network architecture. A cell can contain program code and/or data. Thus, a cell could be a packet in an MPEG data stream (if you were streaming a movie online) or it could be a part of an application (e.g. part of the rendering engine for a PS3 game). The format of a cell is loosely defined in the patent. All software is made up of cells (and here I use software in its most general sense to include programs and data). Furthermore, a cell can run anywhere on the network - on a server, on a client, on a PDA, etc.
Say, for instance, that a website wanted to stream a TV signal to you in their new improved format DivY. They could send you a cell that contained the program instructions for decoding the DivY stream into a regular TV picture. Then they send you the DivY-encoded picture stream. This would work if you had a PS3 or if you had a digital TV, or even if you had a powerful enough PDA - assuming their design followed the new standard.
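To make the DivY example concrete, here is a toy sketch in Python. The patent only loosely defines the cell format, so the `Cell` class, its fields and the `Device` class are all hypothetical names invented for illustration, and a plain Python function stands in for real APU program code.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Cell:
    """Hypothetical software cell: program code and/or data (not the patent's format)."""
    program: Optional[Callable[[bytes], bytes]] = None  # stands in for APU code
    data: bytes = b""


class Device:
    """Any cell-capable device: a PS3, a digital TV, a PDA..."""

    def __init__(self) -> None:
        self.decoder: Optional[Callable[[bytes], bytes]] = None

    def receive(self, cell: Cell) -> Optional[bytes]:
        # A cell carrying a program installs it; a data cell is run through it.
        if cell.program is not None:
            self.decoder = cell.program
        if cell.data and self.decoder is not None:
            return self.decoder(cell.data)
        return None


def divy_decode(payload: bytes) -> bytes:
    """Placeholder for the made-up DivY decoder: it just reverses the bytes."""
    return payload[::-1]


# The website first sends the decoder cell, then the encoded picture stream.
tv = Device()
tv.receive(Cell(program=divy_decode))
picture = tv.receive(Cell(data=b"emarf"))
print(picture)  # b'frame'
```

The point of the sketch is only that the receiving device needs no prior knowledge of DivY: the decoder arrives as a cell like any other.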
Depending on how "open" Sony make this it might be easy or impossible to program your own PS3 just by sending it data packets you want it to run. (Note that Sony's history in this respect is interesting - their PSX Yaroze and PS2 Linux projects do show some willingness to open their machines up to hobbyists.)
APUs
Cells run on one or more "attached processing units" or APUs (I pronounce this after the character in the Simpsons!). An APU is architecturally very similar to the vector unit (VU) found in PS2, but bigger and more uniform:
* 128-bit processor
* 1024-bit external bus
* 128K (8192 x 128-bit words) of local RAM
* 128 x 128-bit registers
* 4-way floating point vector unit giving 32GFLOPS
* 4-way integer vector unit giving 32GIOPS
(Compare this to the VU's 128-bit external bus, 16K of code RAM, 16K of data RAM, 32 x 128-bit registers, single way 16-bit integer unit, and only 1.2GFLOPS.)
The APU is a very long instruction word (VLIW) processor. Each cycle it can issue one instruction to the floating point vector unit and one to the integer vector unit simultaneously. It is much more similar to a traditional DSP than to a CPU like the Pentium III - it does no dynamic analysis of the instruction stream, no reordering. The register set is imbued with enough ports that the FPU and the IPU can each read 3 registers and write one register on each cycle. Unlike the VU, the integer unit on the APU is vectorized: each vector element is a 32-bit int (the VU's was only 16-bit), and the register set is shared with the FPU (the VU has a smaller dedicated integer register set). The APU should therefore be somewhat easier to program and much more general-purpose than the VU.
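The dual-issue behaviour described above can be sketched as a toy model. This is not the APU's real instruction set: the bundle layout, the `fmac` and `iadd` operations and the `issue` function are assumptions made up for illustration. It only shows one float op and one integer op sharing a 128-entry, 4-lane register file in a single cycle.

```python
# Toy model of one APU cycle: a bundle issues one float op and one integer op
# against a shared file of 128 registers, each holding 4 lanes.
# The bundle layout and op names are illustrative assumptions, not the patent's.

NUM_REGS, LANES = 128, 4
regs = [[0] * LANES for _ in range(NUM_REGS)]


def fmac(a, b, c):
    """4-way float multiply-accumulate: a * b + c per lane."""
    return [x * y + z for x, y, z in zip(a, b, c)]


def iadd(a, b, _c):
    """4-way 32-bit integer add with wraparound (third input unused)."""
    return [(x + y) & 0xFFFFFFFF for x, y in zip(a, b)]


def issue(bundle):
    """Execute one bundle of (func, dst, src1, src2, src3) tuples.

    Every op reads its 3 sources before any op writes its result,
    mimicking the 3-read/1-write register ports per unit.
    """
    results = []
    for func, dst, s1, s2, s3 in bundle:
        results.append((dst, func(regs[s1], regs[s2], regs[s3])))
    for dst, value in results:
        regs[dst] = value


regs[1] = [1.0, 2.0, 3.0, 4.0]
regs[2] = [0.5, 0.5, 0.5, 0.5]
regs[3] = [10.0, 10.0, 10.0, 10.0]
regs[4] = [1, 2, 3, 0xFFFFFFFF]
regs[5] = [1, 1, 1, 1]

# One cycle: float MAC into r0 and integer add into r6, issued together.
issue([(fmac, 0, 1, 2, 3), (iadd, 6, 4, 5, 5)])
print(regs[0])  # [10.5, 11.0, 11.5, 12.0]
print(regs[6])  # [2, 3, 4, 0]
```

Because there is no reordering hardware, it would be up to the programmer or compiler to pair the two ops in each bundle.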
Unlike the VU, which used a Harvard architecture (separate program and data memories), the APU seems to use a traditional (von Neumann) architecture where the 128K of local RAM is shared by code and data. The local RAM appears to be triple-ported so that a single load or store can occur in parallel with an instruction fetch, mitigating the von Neumannism (the other port is for DMA). The connection is 256 bits wide (2 x 128 bits), so only one load or store can occur per cycle - it seems reasonable to assume therefore that the load/store instructions only occur on the integer side of the VLIW instruction, as was the case on the VU. Since there is no distinction between integer and floating point registers this works out just fine. The third RAM port attaches the APU to other components in the system and allows data to be DMAed in or out of the chip 1024 bits at a time. These DMAs can be triggered by the APU itself, which differs from the PS2 where only the host processor could trigger a DMA.
Note that the APU is not a coprocessor but a processor in its own right. Once loaded with a program and data it can sit there for years running it independently of the rest of the system. Cells can be written to use one or more APUs, thus multiple APUs can cooperate to perform a single logical task. A telling example given in the patent is where three APUs convert 3D models into 2D representations, and one APU then converts this into pixels. The implication is that PS3 will perform pure software rendering.
The declared speed of these APUs is awesome - 32GFLOPS + 32GIOPS (32 billion floating-point operations and 32 billion integer operations per second). I expect Sony consider a 4-way vectorized multiply-accumulate instruction to be 8 FLOPs, so the clock speed of the APU is 4GHz, as has been reported elsewhere in the media. This is very much faster than the PS2's sedate 300MHz clock - by about 13 times. I presume that the FPUs are pipelined (i.e. you can issue one instruction per cycle but it takes, say, four cycles to come up with the answer). But if PS2 had a 4-stage pipeline for the multipliers at 300MHz, what's the pipeline depth going to be at 4GHz? 8 stages? 16 stages? The details of this will depend on the precise design of the APU and this is not covered by the patent, but it is worth noting that naked pipelines are hard to code for at a depth of 4; at a depth of greater than this it may simply be infeasible to write optimal code for these parts.
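The clock speed estimate is simple arithmetic, spelled out here as a quick check. The 8-FLOPs-per-instruction figure is the guess made above, not something stated in the patent, and the variable names are just for illustration.

```python
# Back-of-envelope check of the clock speed estimate.
# Assumption (the article's guess): a 4-way multiply-accumulate counts as 8 FLOPs.
peak_flops = 32e9          # 32 GFLOPS per APU, as declared
lanes = 4                  # 4-way vector unit
flops_per_lane = 2         # one multiply plus one accumulate
flops_per_instruction = lanes * flops_per_lane

apu_clock_hz = peak_flops / flops_per_instruction
ps2_clock_hz = 300e6       # the PS2's roughly 300MHz clock

print(apu_clock_hz / 1e9)                      # 4.0 (GHz)
print(round(apu_clock_hz / ps2_clock_hz, 1))   # 13.3
```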
Note: the APUs may instead be using an IBM-style VLIW architecture where each ALU (4 floating point and 4 integer) is operable independently from different parts of the instruction word. However, the word size of the registers is 128, so each floating point unit must access part of the same register. This seriously limits the effectiveness of a VLIW architecture and makes it rather difficult to program for. I therefore assume that the ALUs are acting like typical 4-way vector SIMD units.
One interesting departure from PS2 is that all software cells run on APUs. On PS2 there were two VUs but also one general-purpose CPU (a MIPS chip). This chip was the only chip in the system capable of the 128-bit vector integer operations (necessary for fast construction of drawlists), and this functionality is now subsumed into the APU. There is a non-APU processor in the new system but it only runs OS code, not cells, so its precise architecture is irrelevant - it could be anything, and the same software cells would still run on the APUs just fine.
Instruction Width
Given 128 registers, it takes 7 bits to identify a register. Each instruction can have 3 inputs and 1 output which is 28 bits. I am presuming they are keeping the extremely useful vector element masks which would add 4 bits to the FPU side. Only in the case of MAC (multiply-accumulate) are 3 inputs actually needed, but say you specify a MAC on both the IPU and FPU - that's 60 bits for register specifications alone. I therefore doubt that the instruction length is 64 bits - I think the VLIW on the APU must be 128 bits wide, which is reasonable since that's the word length and since there is bandwidth to read 128 bits out of memory per cycle as well as do a load/store to/from memory at the same time. But this is probably going to mean code is not overly compact - only 8,192 instructions will fit into the whole of APU RAM, with no room for data in that case.
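The bit counting in this paragraph can be checked in a few lines. The 128-bit instruction width is the guess made above rather than a figure from the patent, and the variable names are purely illustrative.

```python
# Register-specifier bits and code capacity, following the reasoning above.
num_registers = 128
reg_bits = (num_registers - 1).bit_length()      # bits to name one register
operands_per_side = 3 + 1                        # 3 inputs, 1 output
side_reg_bits = operands_per_side * reg_bits     # per instruction (FPU or IPU)
fpu_mask_bits = 4                                # vector element mask, FPU side
register_spec_bits = 2 * side_reg_bits + fpu_mask_bits

local_ram_bytes = 128 * 1024                     # 128K of APU local RAM
instruction_bytes = 128 // 8                     # guessed 128-bit VLIW word
max_instructions = local_ram_bytes // instruction_bytes

print(reg_bits)             # 7
print(side_reg_bits)        # 28
print(register_spec_bits)   # 60
print(max_instructions)     # 8192
```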
On the other hand, 128 bits is a lot of bits for an instruction given that only 60 are used so far. Assuming 256 distinct instructions per side (which is very very generous) that's 8 bits per side making 76. My guess is they may have another 16 bits to mask integer operations, just as 4 bits mask the FPU operations. 16 bits enables you to isolate any given byte(s) in the register. That's 92.
Another cool feature they might employ is conditional execution like on the ARM - 4 bits would control each instruction's execution according to the standard condition codes. I was surprised not to see this on the VU in PS2 (perhaps ARM have a patent?) because it helps to avoid a lot of petty little branches. If the PPC is influencing the design, they may just throw a barrel shifter in after every instruction too (that would be quite ARM-like as well). So even without unaligned memory accesses you can isolate any field in a 128-bit word in a single mask-and-shift instruction. Another 7 bits there too (integer only) ... that's still only 99 bits - 29 bits are still available.
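The running tally over the last three paragraphs can be laid out as a small table in code. Every field size is this article's speculation, not a figure from the patent. Note that the total of 99 does not count the condition-code bits; the last line shows what would remain if 4 such bits were added on each side, which is an extra assumption on my part.

```python
# Running tally of the guessed 128-bit instruction word.
# All field sizes are speculation from the text above, not patent figures.
fields = [
    ("register specifiers (2 sides x 4 x 7 bits)", 56),
    ("FPU vector element mask", 4),
    ("opcodes (2 sides x 8 bits)", 16),
    ("integer byte mask", 16),
    ("barrel shift amount (integer side)", 7),
]

total = 0
for name, bits in fields:
    total += bits
    print(f"{name:45s} {bits:3d}  running total {total}")

bits_left = 128 - total
print("bits left of 128:", bits_left)   # 29

# Assumption: 4 condition-code bits on each of the two sides.
left_with_conditions = bits_left - 2 * 4
print("left if condition codes are added:", left_with_conditions)   # 21
```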