Here is another paper that discusses cache faults and how they are handled. If you read AMD's BIOS programming guide, you can also see where the cache size is set by special Model-Specific Registers (MSRs). Redundant cache is likely implemented: say one targets a 2 MB cache design, you physically build 2.5 MB, and then at startup faulty blocks can be marked as bad and left unused.
AMD's BIOS Guide:
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/26094.PDF
ftp://ftp.cs.wisc.edu/markhill/Papers/toc93_faults_original.pdf
A manufacturing defect causes a fault in a cache if it impairs the correct operation of the cache. We will study those faults that make a bit in the cache unable to retain the value written to it, but that do not otherwise perturb the operation of the cache (e.g., do not cause an electrical short circuit). A fault causes an error if it causes the system to enter a logical state other than the one intended. We can prevent faults in an on-chip cache from causing errors by (1) discarding chips with such faults, (2) using redundant memory, or (3) disabling cache blocks that contain faults. The advantage of discarding chips, method (1), is that it works for any defect. Its disadvantage, however, is that by reducing yield it increases chip cost.
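The yield point in that excerpt is easy to see with some back-of-the-envelope arithmetic. This is a toy independent-fault model I put together, not anything from the paper; the block counts and fault rate are invented purely for illustration. A chip is usable if the number of faulty blocks does not exceed the number that can be mapped out:

```python
import math

def chip_yield(total_blocks, spare_blocks, p_block_fault):
    """Probability a chip is usable when up to `spare_blocks`
    faulty blocks can be disabled (independent-fault model)."""
    # The chip survives if the number of faulty blocks is <= spare_blocks;
    # sum the binomial probabilities for 0..spare_blocks faults.
    return sum(
        math.comb(total_blocks, k)
        * p_block_fault**k
        * (1 - p_block_fault)**(total_blocks - k)
        for k in range(spare_blocks + 1)
    )

# Illustrative numbers (not from the paper): 1024 blocks, 1% block fault rate.
no_spares = chip_yield(1024, 0, 0.01)     # method (1): chip must be defect-free
with_spares = chip_yield(1024, 32, 0.01)  # method (3): up to 32 blocks disabled
```

With those made-up numbers, requiring a defect-free die yields well under 1% of chips, while allowing a few dozen blocks to be disabled recovers nearly all of them, which is exactly the cost argument the excerpt is making.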
Those who are older may remember that hard drives initially had to have a low-level format, usually accomplished by a routine programmed into the BIOS. Before partitioning and high-level formats, one would low-level format the HD, during which bad sectors were found and marked in the drive's allocation table. As HD quality got better, low-level formatting became a non-issue and is now likely done at the factory. Nonetheless, I am postulating, with limited information, that cache initializes and functions similarly.
That is, upon startup the BIOS bootstrap code initializes the cache memory and tests it up to the actual size needed or specified by the MSRs. As the processor initializes, if a block is found bad or faulty, it is simply marked as such and never used. The cache continues its memory check through all the physically available SRAM until the specified amount of good cache is found (the amount set by the MSR configuration).
So say Intel has a 4 MB part but sets up the MSRs to identify it as a 2 MB part. The BIOS initialization then tests and validates cache until 2 MB of good cache has been allocated, then quits. This would make the most sense: with that many megabits of data that could be bad, there is no way to physically pin out hard logic in the package to account for every word line and bit line.
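The startup scan I'm describing above can be sketched in a few lines. To be clear, this is my guess at the mechanism, not documented firmware behavior; the fault model (stuck-at-0 bits), the test patterns, and all the names here are invented for illustration. The idea is: walk the physical blocks, run a write/read-back test on each, mark failures bad, and stop once the amount of good cache the MSRs call for has been allocated:

```python
# Alternating-bit patterns so every bit is exercised as both 0 and 1.
TEST_PATTERNS = (0x55, 0xAA, 0x00, 0xFF)

def block_ok(block, patterns=TEST_PATTERNS):
    """Write each pattern and read it back; a stuck bit fails the test."""
    for p in patterns:
        # Model a faulty block: bits in stuck_mask cannot retain a 1.
        block["stored"] = p & ~block["stuck_mask"]
        if block["stored"] != p:
            return False
    return True

def allocate_cache(physical_blocks, wanted_good):
    """Return (good, bad) block index lists, stopping at wanted_good."""
    good, bad = [], []
    for i, blk in enumerate(physical_blocks):
        if block_ok(blk):
            good.append(i)
            if len(good) == wanted_good:
                break  # enough good cache found; remaining SRAM is untouched
        else:
            bad.append(i)  # marked bad, never used
    return good, bad

# 10 physical blocks; blocks 2 and 5 have stuck bits. The (simulated)
# MSR configuration asks for 6 good blocks.
blocks = [{"stuck_mask": 0, "stored": 0} for _ in range(10)]
blocks[2]["stuck_mask"] = 0x08
blocks[5]["stuck_mask"] = 0x01
good, bad = allocate_cache(blocks, 6)
```

Note the scan stops as soon as six good blocks are collected, so the spare SRAM beyond that point is never even tested; that matches the 4-MB-part-sold-as-2-MB scenario.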
I am still reading and learning so that I can confidently bring back to the forum the methods employed to accomplish such a task. This is a detail I do not have a great deal of knowledge of.
Jack
The way I see it, this is perhaps one of the most important issues in chip manufacturing, next to process & litho; it influences all aspects of manufacturing, from sand-to-chip/yield, literally.
I'm still reading through & searching (good AMD paper, btw) in order to get a glimpse of a sequential order for logic/cache error detection/correction techniques, at both the soft- and hardware levels. So far, I've only found out that everything, from wafer quality grade through defect-detection precision tools to Design For Testing (DFT) procedures & [on-chip] Logic/Memory Built-In Self Test (BIST,
http://www.mentor.com/products/dft/memorytest/index.cfm), contributes to initial, mid-process & final binning decisions, that is, to the means by which wafer batches are selected to provide the best end results, i.e., higher yields.
For instance, even the kind/type of defects is taken into account at the process level & graded according to its "severity", some defects being considered "normal" (sort of inoffensive). I've also found out that cache redundancy & ECC are just some of the ways of dealing with defective patterning, and that Model-Specific Registers (MSRs) can "store" defective cache bit pointers within the core logic (not in the cache itself). Also interesting is the use of small portions of flash memory (see my previous post above), for a reason (of course!): MSRs behave as ROM as long as only warm resets are performed; once a cold reset of the processor is made, the MSRs are erased, and so are the defective cache bit pointers (perhaps this is not the most adequate wording; anyway...). As known, flash memory retains its content even when powered off: a 'perfect' ROM for storing/maintaining the relevant info on the defective cache coordinates. (I wonder how this might work with shared cache, since both logic cores can address all the cache blocks... maybe defect management has just evolved enough, but not that much, in order to link both L1 caches... I'm just guessing.)
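Here's a toy model of the warm/cold-reset behavior described above, just to make my reading of it concrete. This is not a documented mechanism, and every name in it is invented: the point is simply that MSR contents survive a warm reset but not a cold one, so a small persistent flash store would be needed to repopulate the defective-block pointers after power loss:

```python
class Processor:
    """Toy model: MSRs are volatile across cold resets; flash is not."""

    def __init__(self):
        self.msr_bad_blocks = set()    # defective-cache pointers held in MSRs
        self.flash_bad_blocks = set()  # persistent copy in on-package flash

    def record_bad_block(self, index):
        self.msr_bad_blocks.add(index)
        self.flash_bad_blocks.add(index)  # mirror into flash for persistence

    def warm_reset(self):
        pass  # MSRs behave as ROM across warm resets: contents retained

    def cold_reset(self):
        self.msr_bad_blocks = set()                   # MSRs erased by cold reset
        self.msr_bad_blocks |= self.flash_bad_blocks  # restored from flash

cpu = Processor()
cpu.record_bad_block(17)       # block 17 found defective at some point
cpu.warm_reset()
after_warm = set(cpu.msr_bad_blocks)   # pointer survives the warm reset
cpu.cold_reset()
after_cold = set(cpu.msr_bad_blocks)   # MSRs wiped, then refilled from flash
```

Without the flash mirror, the cold reset would leave `msr_bad_blocks` empty and the defective blocks would have to be rediscovered (or would silently come back into use), which is presumably the reason for the flash in the first place.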
I'll try to collect, understand & post in some order, all the relevant data I can find on the subject, as it seems so darn interesting from every angle I look at it.
@1Tanker: I know you've addressed this issue before; I just missed it. Well, seems that you've got me hooked, this time around!
(Just a few illustrative links):
http://www.freescale.com/files/technology_publications/doc/Papers/Eintell5170ARTICLE.pdf - On typical litho defectivity
http://www.nikonprecision.com/immersion/media/Immersion_Defectivity_D6996.pdf - On Nikon's immersion litho
http://www.sematech.org/meetings/archives/other/20001030/08_Test_Screen_Shirley.pdf - Elements of defect management
http://userpages.umbc.edu/~abhishek/link_docs/defect_based_test.pdf - Intel Automatic Test Pattern Generation (ATPG)
Cheers