AMD Bulldozer Core Patents & Diagram: 4-way Core, CMT and AMD's Turbo?

dattimr

Distinguished
Apr 5, 2008
665
0
18,980
Interesting topic from Mechromancer at XtremeSystems, relating to a German interpretation (by Dresdenboy) of the patents AMD filed for Bulldozer:

http://www.xtremesystems.org/forums/showthread.php?t=223148

by Dresdenboy @ 2009-04-15 – 10:37:15 am

It's time for my first blog post after publishing my thoughts in several forums for years.

I want to start with a graphic (first published on planet3dnow) showing what some of AMD's patent applications from last year presented as an exemplary MPU architecture. It's worth noting that this architecture fits nicely with a rumor brought up by Charlie at the Inquirer.

Another hint is this old AMD presentation, which mentions future developments like "Throughput Architecture" and "Cluster-based Multi-threading" without explicitly stating their planned use. Both sources, however, point to clusters of some kind. This is what appeared in the patent applications:



Additionally, there are many other interesting bits hidden in various patent applications (long numbers) and granted patents (shorter numbers):
- clustered multithreading with 2 int clusters with each of them having:
  - 2 ALUs, 2 AGUs
  - one L1 data cache
  - scheduler, integer register file (IRF), ROB
  (see 20080263373*, 20080209173, 7315935)
- a trace cache, not to make cheaper decoders but to quickly recover from a mispredicted branch (7197630 and many others)
- read port arbitration for a faster IRF (7315935)
- shared FPU supporting ADD, MUL, FMAC etc. and 64 or 128 bit max. operand width (20080263373)
- FPU may run in full bit or reduced bit modes to save power (20080209185)
- 32 byte fetch, 4-way Decoder - multithreaded round robin or depending on queue saturation (20080263373, EP1244962)
- fine grained power management (token based, 20080263373) for optimal usage of given TDP/ACP
- a lot more speculation (data speculation, cache way prediction, see 7024537, 7028166 and many others)
- 2 loads from L1 D$ per cycle per cluster (7502914)
- maybe 2 cycle effective L1 D$ latency instead of 4 thanks to replaying (7502914)
- possibly a shared L2 (7502914)
- loop detectors (7130991)
- dynamically scalable cache architecture to save power by switching off cache portions or levels (20080104324)
- AMD's turbo mode (running cores faster if others are less utilized, 7490254, filed 2005/08/02)
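
To make the "round robin or depending on queue saturation" point a bit more concrete, here is a minimal Python sketch of how a shared decoder might choose between the two threads each cycle. It is only an illustration of the general idea, not what the patent applications actually describe: the function names, the `occupancy` metric and the tie-breaking rule are all assumptions made up for this example.

```python
def pick_thread(occupancy, last_picked, policy="round_robin"):
    """Return the thread (0 or 1) that gets the shared decode slot this cycle.

    occupancy   -- entries waiting in each cluster's queue (hypothetical metric)
    last_picked -- thread chosen in the previous cycle
    """
    alternate = 1 - last_picked
    if policy == "round_robin":
        return alternate
    if policy == "saturation":
        # Feed the cluster whose queue is emptier; alternate on a tie.
        if occupancy[0] == occupancy[1]:
            return alternate
        return 0 if occupancy[0] < occupancy[1] else 1
    raise ValueError(f"unknown policy: {policy}")


def simulate(occupancies, policy):
    """Run the selector over a list of per-cycle occupancy pairs."""
    picks, last = [], 1  # start so that thread 0 goes first
    for occ in occupancies:
        last = pick_thread(occ, last, policy)
        picks.append(last)
    return picks
```

With plain round robin the two threads simply alternate, while the saturation policy keeps feeding whichever cluster has more room in its queue, so a stalled thread does not waste decode bandwidth.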

Even if only some of these points turn out to be true for Bulldozer, it will be a very interesting MPU.
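
The token-based power management and turbo mode items in the list above boil down to the same intuition: a fixed power budget gets shifted toward the cores that are busy. Below is a rough Python sketch of that intuition only. The function `distribute_tokens`, the `min_tokens` floor and the proportional split are assumptions for illustration and are not taken from the cited patents.

```python
def distribute_tokens(utilization, total_tokens, min_tokens=1):
    """Split a fixed pool of power tokens across cores by utilization.

    utilization  -- per-core load between 0.0 and 1.0 (hypothetical input)
    total_tokens -- the chip-wide budget, standing in for TDP/ACP
    min_tokens   -- floor each core keeps even when idle (assumed)
    """
    n = len(utilization)
    if total_tokens < n * min_tokens:
        raise ValueError("budget too small for the per-core floor")
    spare = total_tokens - n * min_tokens
    total_util = sum(utilization)
    if total_util == 0:
        # Nothing is busy: leave the spare tokens unspent.
        return [min_tokens] * n
    # Busy cores receive spare tokens in proportion to their load,
    # rounded down so the budget is never exceeded.
    return [min_tokens + int(spare * u / total_util) for u in utilization]
```

For example, `distribute_tokens([1.0, 0.0], 10)` hands nine tokens to the busy core and leaves the idle one at its floor, which is the "run cores faster if others are less utilized" behavior in miniature.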

3663732_9bc35365d1_l.png


Bulldozer_Core_uArch_0.4.png
 

dattimr

Just saw your post in the "AMD and Intel" thread (linking to http://citavia.blog.de/), jaydee. The first time I saw this info was at XtremeSystems. Perhaps it can be further discussed here, though.