L3 does take up some die space, but not an enormous amount on most desktop chips. The typical "core" of logic + L1 + L2 takes up much more space than the L3, especially if you have a reasonable amount of L2 cache. Video games often do like L3 cache, as the Athlon II vs. Phenom II tests show. Encoding, not so much. Yes, the performance difference between L3-less Athlon IIs and L3-equipped Phenom IIs is often in that 10-15% range. But "enthusiasts" often say that one chip "absolutely slaughters" another when the performance delta is far less than 10-15%. I would expect enthusiasts to be up in arms if Intel or AMD decided to drop L3 from all of their chips, claiming it makes them "glacially slow" or some other hyperbole...
Comparing a 3.6 GHz i5-2500K to a 3.6 GHz i7-3930K and saying they differ only in cache size is like comparing apples to oranges. You have a four-core, four-thread chip with a 128-bit IMC versus a six-core, 12-thread chip with a 256-bit IMC. Any program that runs similarly on both chips is single-threaded or very close to it, so it essentially just measures single-core throughput and maximum Turbo Boost clock speed. The number of such programs is shrinking these days and will continue to shrink in the future.
Also, die space isn't really that big of a concern. The general argument about larger dies being bad essentially boils down to the higher cost of making them. That is true, to a point: larger dies have higher intrinsic costs because fewer candidate dies fit on a wafer, and because a smaller percentage of those larger dies come out defect-free. However, it costs a lot of money to make a separate smaller-die mask and production line, and most people forget that. It is often LESS expensive to use the great big old server die with the mountain of L3 than to make a special smaller mask with a small or no L3 that gives overall similar performance. How do I know this? We have seen many, many parts from both makers where this is the case. That i7-3930K you mentioned above is a 435 mm^2 monstrosity of a die, and a decent chunk of the die is actually fused off, as the die physically has 8 cores and 20 MB of L3 cache onboard! Other big wastes of space on desktop chips are the massively wide memory interfaces seen in the LGA2011 desktop chips and the HT/QPI interfaces that are disabled on everything but the multi-socket server versions of those chips. At least the L3 may help performance a little; disabled parts do nothing at all. IGPs also do essentially nothing on enthusiast chips, just ask anybody with an unlocked LGA1155 Sandy Bridge. Who buys an i7-2700K to use it with the crappy onboard IGP? But it is there because Intel uses the same parent die for their quad-core mobile chips, and many of those really do just have the on-die IGP paired with the CPU.
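To put rough numbers on the die-cost argument above, here is a minimal Python sketch. It uses the common textbook approximation for candidate dies per wafer and a simple Poisson yield model, which is one of several yield models and not necessarily what either maker uses. Every figure in it (wafer cost, defect density, mask-set cost, unit volume, the 400 and 200 mm^2 die areas) is a made-up placeholder for illustration, not real Intel or AMD data, and the function names are my own.

```python
import math


def dies_per_wafer(die_area_mm2, wafer_diameter_mm=300.0):
    """Approximate candidate dies on a round wafer (area term minus edge loss)."""
    radius = wafer_diameter_mm / 2.0
    by_area = math.pi * radius ** 2 / die_area_mm2
    edge_loss = math.pi * wafer_diameter_mm / math.sqrt(2.0 * die_area_mm2)
    return int(by_area - edge_loss)


def poisson_yield(die_area_mm2, defects_per_cm2):
    """Poisson model: probability that a die of this area has zero defects."""
    area_cm2 = die_area_mm2 / 100.0
    return math.exp(-area_cm2 * defects_per_cm2)


def cost_per_good_die(die_area_mm2, wafer_cost, defects_per_cm2,
                      fixed_cost=0.0, units=1):
    """Wafer cost spread over good dies, plus any one-off cost spread over units."""
    good_dies = dies_per_wafer(die_area_mm2) * poisson_yield(
        die_area_mm2, defects_per_cm2)
    return wafer_cost / good_dies + fixed_cost / units


if __name__ == "__main__":
    # All inputs below are hypothetical placeholders.
    wafer_cost = 5000.0        # cost of one processed wafer
    defect_density = 0.2       # defects per cm^2
    mask_set_cost = 5_000_000  # one-off cost of a dedicated smaller-die design
    desktop_units = 50_000     # units over which that one-off cost is spread

    # Big server die reused for the desktop part: no extra one-off cost.
    reuse_big = cost_per_good_die(400.0, wafer_cost, defect_density)

    # Dedicated small die: cheaper silicon, but it carries its own mask cost.
    dedicated_small = cost_per_good_die(
        200.0, wafer_cost, defect_density,
        fixed_cost=mask_set_cost, units=desktop_units)

    print(f"reused 400 mm^2 die:    {reuse_big:8.2f} per good die")
    print(f"dedicated 200 mm^2 die: {dedicated_small:8.2f} per good die")
```

Which option comes out cheaper depends entirely on the volume you plug in: at low volume the one-off cost dominates and the reused big die wins, at high volume the small die wins. The sketch also treats any die with a defect as scrap, so it is pessimistic for big dies. Selling partly defective dies with cores or cache fused off, as described above, recovers some of that loss.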