Yes, I'd read those posts. The answer I'm getting is that variability due to TIM application is not tested, but is not thought to be a big variable so long as it's done right.
On one of my Opteron boxes, I ran a TIM reapplication test quite a few times. In general, after idling 4 hours a day for 3 or 4 days, the idle temps would bottom out. At that point, I'd run dual Prime95 for an hour or so to get a load temp. After a few more days of collecting data, I'd pull the HSF, clean all the surfaces, and do the test again. When I first started doing this, the day-3 or day-4 idle temps varied by a couple of degrees C.
To back up a bit, I did this in a temperature-controlled box that could typically hold temps within 1 C of the set point. Rather than just eyeball the idle temp, I'd log the tempered air temp (set point 25 C) and the CPU temp as a function of time, plot the difference, and then integrate over a 15-minute window once temps stabilized. So, like I said, early on I was seeing about a 2 C variation; a couple of weeks later, I was seeing 0 to 1 C variations more often than 2 C variations. Again, this was at idle. At load, variations of 1 to 2 C were typical. If you try this with no control over your ambient temps, the results will have more scatter, and monitoring the delta T becomes more important.
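For what it's worth, the averaging step is easy to script if you're logging temps to a file. This is just a sketch under my own assumptions about the log format (time-ordered samples of seconds, ambient C, CPU C); the function name and sample layout are made up for illustration:

```python
# Hypothetical sketch: given time-ordered (t_seconds, ambient_C, cpu_C)
# samples, average the CPU-minus-ambient delta over the final 15-minute
# window, i.e. after temps have stabilized.

def mean_delta_t(samples, window_s=15 * 60):
    """samples: list of (t_seconds, ambient_C, cpu_C) tuples, time-ordered."""
    t_end = samples[-1][0]
    deltas = [cpu - amb for t, amb, cpu in samples if t >= t_end - window_s]
    return sum(deltas) / len(deltas)

# Synthetic hour-long log sampled every 60 s: 25 C ambient, CPU settled
# at 12 C above ambient.
log = [(t, 25.0, 37.0) for t in range(0, 3600, 60)]
print(round(mean_delta_t(log), 2))  # → 12.0
```

Averaging over a window rather than grabbing a single reading is the whole point: it smooths out the sensor jitter and the small ambient swings, so run-to-run differences of 1 to 2 C actually mean something.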
In the one case where a run came out 5 C hotter than the rest, I could actually see a difference in how the TIM had spread when I removed the HSF.
With practice, you can get to the point that your TIM application will be very reproducible. Find a scheme that works and stick with it. OTOH, most people apply TIM as seldom as possible and this is probably meaningless to them.