[H] defending real gameplay vs benchmarks

The main thing for me is that despite errors and issues within [H]'s own reviews, instead of looking to improve on them, they attack the criticism or the other methods first (what did Anand do to draw their ire personally?), and only later, if it becomes blatantly obvious to everyone else, do they even start to consider an option that is not of their own making. When it was pointed out that there might be IQ problems with some drivers that they may have missed, the problem became people focusing on 'such a small issue', not that they missed it, and they continued to use beta drivers while still offering 'they are just beta drivers' as the excuse.

I like to use both [H]'s benchies and other sites, because they never match up. Which one is correct is all a point of view. [H] is subjective (no question, it's the style they chose) and strictly valid only for the setup they have and the runs they ran. I'm not sure how that gives more information to someone not using a similar setup, nor to someone who isn't playing the small handful of games they chose at the settings they prefer (Oblivion with little/no grass, but shadows on everything including grass?).

I find that instead of moving the art of benchmarking forward, this article is more of a defense of their methods and an attempt to condemn all those who don't follow this narrow path.

If someone were looking for a be-all and end-all single source of information, it would need to have more information and be more consistent than their current reviews, which pretend to be more than they are.

If anyone can explain to me why their histograms don't match their min/avg/max numbers in the 'settings', that would also help a lot.

Personally I prefer looking at the histograms of the apples-to-apples tests, because they tell me more than subjective opinions. However, if the histograms too are questionable, then how can you validate what is essentially a 'look & feel' review rather than something reproducible?
 


Couldn't agree with you more.

I don't care if they use Voodoo and the number of dead chickens a card will kill as their benchmark; as long as I have a large selection of people testing, it should give me an overall view of whatever I'm looking for.

To think that any one review/method tells the whole story is to be ignorant of the complexities of this hardware. :pfff:
 
Wow; been gone all day...this thread really took off. Thx for all the input.

@ TGGA - I think [H] lashed out at Anand because readers kept throwing Anand's 3870X2 links at Kyle in the forum for their HD3870X2 review. 😉 It really does seem like this article is just their attempt to defend their testing methods, no matter what. I agree with you; unless Anand made some comments I missed, it seems low to pick on them like that. Anyway, this [H] piece seemed worth reading, though.

I do like having real-world gameplay results available to add to other benchmarks when forming an overall opinion. But yeah, I wish they did real-world apples-to-apples tests at various resolutions, instead of deciding one card needs medium this, no grass, etc. Take FiringSquad's settings (typically max settings at 3 resolutions, sometimes with and without FSAA) and take the time to "real world" bench all of them. That would be a lot more useful.

I've noticed in my own FRAPS benching that the min very often doesn't match the per-second fps that gets put in the histogram. It seems the min/max/avg usually reports a lower minimum than gets logged each second. I like the histogram myself: if both cards average the same and hit 15 fps as a min, but card A drops into the teens a lot (bigger fluctuation high/low) while card B only does it once, obviously B offers better gameplay.
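A minimal sketch (with made-up, hypothetical frame times) of why those two numbers can legitimately disagree, assuming FRAPS derives its reported minimum from the single slowest frame while the per-second log averages every frame completed in that second:

```python
# Hypothetical capture: ~1 second of 20 ms frames with one 100 ms hitch.
frame_times_ms = [20.0] * 45 + [100.0] + [20.0] * 4

# Per-second log value: frames completed divided by wall-clock time.
# The single hitch is averaged away by the 49 fast frames around it.
total_time_s = sum(frame_times_ms) / 1000.0
per_second_fps = len(frame_times_ms) / total_time_s   # ~46.3 fps

# Frametime-based minimum: the instantaneous rate of the slowest frame.
min_fps = 1000.0 / max(frame_times_ms)                # 10.0 fps

print(round(per_second_fps, 1), round(min_fps, 1))
```

Under that assumption the summary can honestly report a 10 fps minimum while no bucket in the per-second histogram ever dips below the mid-40s, which matches the mismatch described above.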
 


Yeah, in my comment in the thread I mention that one of my biggest issues (other than questionable internal validity) is the idea that there is more benefit in seeking out or creating differences between the two, because if there is equality then there is only 1 dataset, while a review like Xbit Labs or FiringSquad may have 6-9+ for that one game. And how do you decide that 1920x1200 without AA is better than 1280x1024 with 4-8X AA? That's information you lose in a 'this is our favourite settings' benchmark/review. I preferred Beyond3D's testing that was restricted to a single IHV, no Red vs Green; it was just about the hardware, and the main comparison was last generation vs current generation, or high end vs mid-range, and thus much of the questionable stuff was removed, including questionable beta drivers and floptimizations.

I've noticed in my own FRAPS benching that the min very often doesn't match the per-second fps that gets put in the histogram. It seems the min/max/avg usually reports a lower minimum than gets logged each second. I like the histogram myself: if both cards average the same and hit 15 fps as a min, but card A drops into the teens a lot (bigger fluctuation high/low) while card B only does it once, obviously B offers better gameplay.

Yep, and also if you look at some of the apples-to-apples tests, I think you'd have a tough time convincing me that either card is significantly better at full settings, yet one 'required' being brought from high to medium shader quality? That implies that one is totally unusable at that setting while the other is the only one playable, which is not really borne out by the dataset provided.

Overall I relied more on [H] in the past; now I treat them like any other 2nd-tier information source, because I don't think they provide enough information with which to get a good picture of limitations and such. I rarely get much from one review, but if one review has a lot of data I may get an idea of some areas that may benefit or restrict a design, and then I look to more reviews to see whether they give credence to this view.

FiringSquad, TechReport and Xbit are usually the ones that give me this large granular snapshot from which I can pull some insight; Digit-Life used to do that more so than now. I can also get a lot of background and tech information from both here at Tom's and also at Beyond3D or 3Dcentre, and then after that I start simply adding to the information by looking for anomalies from places like [H], Digit-Life, Bit-Tech, Hexus, EliteBastards, Anand (rarely though) and others.

Anywhoo, I think this is more about [H] defending themselves, but I also think the way they're doing it, by attacking everyone else and seeing no value in their tests, just hurts them a little more in the process too.
 
They show themselves to be Nvidia fanboys on the last page. Neither the 8800GTS nor the 3870X2 can run Crysis with those settings. The 8800GTX comes out "on top" when settings are at medium, but others pointed out driver issues.

Besides, Crysis is an FPS, and I bought my 3870X2 based on The Witcher. That was not a canned demo; Anandtech used a real playthrough and showed a 3870X2 getting 47 fps vs. 30 fps for a single 3870. The 3870X2 also beat the 8800GTX in that game:

http://www.anandtech.com/video/showdoc.aspx?i=3209&p=10

Only the 8800GT in SLI, at the highest resolution tested, beat the 3870X2. Since I'll be playing at around 1600x1200 when I get a 20" LCD, that's fine by me.

Crysis favors Nvidia, especially if Nvidia's optimizations that hurt Crysis image quality are taken into account. Other games favor the 3870X2. I chose to buy it based on a real-world FRAPS playthrough. I doubt that the people at [H] can test the 3870X2 without bias.

Supporting an FX 5800 on Screen Savers! Very funny bit.

I miss that show. There used to be Tech TV, now that channel's Dreck TV.

This Friday I'm ordering an Antec Neo 650 with a 6+2 PCIe cable. My Neo 550 doesn't have one, and the factory-overclocked 3870X2 I have needs it. So I'll get to FRAPS-benchmark it at 1280x1024 on a 17" ViewSonic A71f. Can't get that 20" LCD until next month.

Will I be CPU limited at my CRT's resolution with an X2 4600+ Windsor? That is the question.
 
My problem is, you get 2 people using this method and you'll have 2 different sets of results. THAT is a no-brainer. You simply CANNOT have apples to apples using this, much like a 250-pound man hitting an 8-ball vs. a 5-yr-old: coordination, height, strength, etc. are all different. We need to ask a scientist about these methods to see if they're credible. I think not, IMHO.
 
As far as benchmarks go, I like to see the 3DMark ones (just out of curiosity) and definitely like to see real-world ones to back up the synthetics. I don't always believe all benchmarks from one place, but I do end up looking at several places to verify or get an opinion about a product.
I'm usually more interested in the mainstream products, since I can't afford the top end stuff. Here is what I'd like to see in reviews of GPU's:

Mobo, memory, case, PSU, CPU HSF, and HD all the same for every test.
Test #1
I'd like to see the CPU changed out for every test, but don't change the GPU. Use the high end GPU (3870 x2 or 8800gts (g92)) and complete all tests.

Test #2
I'd like to see the GPU changed out (2600xt/3650/3850/3850 512mb/3870/8800gt/8800gt 256mb/8800gts (g92)/8800gtx/8800gtx Ultra) for every test, but don't change the CPU out (using an e2140 or AMD x2 3800).

Then perform the test above (#2), but with a mainstream CPU (e6750/X2 5000 BE). Follow this up with the same test, but with a high-end CPU, etc. You see where I'm going with this.

This of course would be nice to have but is probably not financially feasible; it would definitely let the consumer know what kind of performance gains/losses you would see in a given situation. Most of these review sites use $1k CPUs to test a 2600XT or some other low-end GPU, a pairing no one with the money for a $1k CPU would actually run. This just doesn't make sense. I know they do this so you can see the maximum FPS you would probably get with the given GPU, but c'mon!
 


3DMark will get more interesting when a game is developed by Futuremark. Otherwise, it's eye candy that tests GPUs, but not under real-world conditions. My midrange GPUs all did better in games than in 3DMark. I'll get to see if I'm CPU limited next week when I set up the 3870X2.



What I find amusing are all the graphics card tests with very high end processors. I'd love to know which card ends up being CPU limited by which processor. We know that a Pentium C2D should limit an 8800gts, but where does that limitation stop on the Intel side? Where does it stop on the AMD side?

So, I like the idea of testing each GPU for review with a value processor, a mainstream processor and an enthusiast processor from each company. That would give a fairer real world comparison. Of course, most enthusiasts would say that no one buys a high end GPU with a low end system, but it happens. Plus, there's that great middle not represented in the reviews at all.
 

Exactly what I would like to see: low/mid/high-end CPUs with a high-end GPU. Then one could figure out if an e2140 or an e6550 would be all you need to play your particular game. But we know that none of these review sites would do this, since they are paid to push the latest technology. It would be nice to see, but I'm just dreaming here.
 
It's been a while, but the last time I've seen this done was with the 939s, using an FX-60. All they did was lower or raise the multiplier and used different cards. At the time, 2.2GHz was considered the "bottleneck" point for the 1900XTX. I'd like to see this again, at least once for each CPU/GPU generation. The 1900XTX was considered the top card at the time, though they also tested lower and the prior gen of cards as well.
 
Unless the results are reproducible by a 3rd party, I would consider them false.

Real-world gameplay is WAY too subjective. As an example: what if 5 frags go off with explosions during one playthrough and 10 go off during another? This is why we have canned GPU benchmarks.

Of COURSE it's not going to mimic real world framerates. Does the Crysis Bench have 20 people shooting at you while 15 explosions are going off?

However if you take a Video card and one gets 30 FPS on the Benchmark and the other gets 20 FPS, I would venture a guess to say the former is stronger.

This *IS* however why you don't just use a single benchmark. You use multiple ones from several games and include 3D Mark and take an average.

Unless [H]'s results are 100% repeatable (I'd say within 5-10%, and that's lenient) on the same system by a 3rd party, they can be regarded as false. Unless the timedemos are exactly the same and used by both cards on the exact same system, they are false.

Apples to apples... it means reproducible results. In general, most sites did find the 3870X2 (the card that started all this controversy) to be better in most situations. Somehow [H] didn't... yet they are the only ones telling the "truth."

Edit: I'd also like to add that they should run a minimum of 12 benches and throw out the highest and lowest. It is difficult, though, as several options exist.
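The "12 benches, throw out the highest and lowest" suggestion is just a trimmed mean. A quick sketch with hypothetical run results shows how it blunts one-off outliers:

```python
# Hypothetical average-FPS results from 12 benchmark runs of one card;
# 64.2 and 52.1 stand in for fluke runs (background task, loading hitch).
runs = [58.1, 57.9, 58.4, 58.0, 57.7, 64.2,
        58.2, 57.8, 58.3, 52.1, 58.0, 58.1]

# Drop the single highest and single lowest run, then average the rest.
trimmed = sorted(runs)[1:-1]
trimmed_mean = sum(trimmed) / len(trimmed)

plain_mean = sum(runs) / len(runs)
print(round(trimmed_mean, 2), round(plain_mean, 2))
```

With these made-up numbers the trimmed mean lands at 58.05, essentially the value of the ten consistent runs, while the two flukes would otherwise pull the plain average around.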
 
Any site using Crysis as the defining benchmark is doing its readers a disservice. Crysis is still in beta as far as I'm concerned.
 
I see no problem in how they test the cards, as long as the tests are consistent. [H] is a consumer's tool with which to judge the cards. They removed some of the games that they used to use because the test time was too long, which I think is wrong. More games in the test, since there are a lot using the same engine, would give us a more detailed view of what [H] is seeing. I guess my point is they need to use the top ten games to better define these cards; yes, it would take longer for them to test, but I think it's more fitting to their methods.
 


It's not that they test that way; it's that they have both ATI and Nvidia cards on medium settings in Crysis on the next-to-last page, and then on the final page, where they reveal their bias against ATI, they show the frames per second the 3870X2 gets on high settings. It's not like the 8800GTS can handle Crysis on high settings; maybe R770 and G100 will manage to, but not today's cards.

The difference between the two pages' settings is bad amateur journalism, pure and simple. Both the 3870X2 and the Nvidia 8800GTS do similarly on medium settings, with a slight nod to Nvidia. Also, the games they tested are mostly, if not all, FPS. I'm not bothering to check all their titles because I don't have them bookmarked on this new install, but they leave out other genres where the difference is not in Nvidia's favor.

They just wish they were Anandtech. Perhaps they aren't getting the samples to bench that Anandtech gets. When a smaller site attempts to start a row with a major site, then motives must be questioned. I certainly do, and I'm not even posting at Anandtech all that much. I just read it, along with Tom's Hardware and Xbit Labs as my 3 favored benchmarking sites.
 


Since validation against hard results is out of the question, they need to produce error estimates for their results.


That would require a great number of runs (3 is not enough), and plotting of the deviation in FPS between the runs. If the variation is large, more runs are needed until a bell curve can be extracted and the average calculated.
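The error-estimation idea above can be sketched in a few lines, using hypothetical per-run averages and an arbitrary 5% spread threshold as the stopping rule:

```python
import statistics

# Hypothetical average FPS from repeated runs of the same real-world pass.
runs = [38.0, 45.2, 41.9, 44.7, 39.0]

mean = statistics.mean(runs)
stdev = statistics.stdev(runs)   # sample standard deviation across runs
rel_spread = stdev / mean        # coefficient of variation

# Simple stopping rule (assumed threshold): if run-to-run spread exceeds
# 5% of the mean, the usual three-run habit is not enough; keep running.
need_more_runs = rel_spread > 0.05

print(round(mean, 2), round(stdev, 2), need_more_runs)
```

With these made-up numbers the spread is nearly 8% of the mean, so the rule says to keep adding runs before quoting an average, which is exactly the objection to small-n real-world passes.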

A method of ensuring consistency between runs is also needed, like YouTube videos of the runs nearest the average.



[H]'s current method would be laughed out of a peer-review assessment if that helps.

 
Actually, n=3 can be enough if the difference between means is great enough and the standard error is sufficiently small. Power analysis can help determine the sample sizes required to achieve statistical significance (p<0.05, i.e. a 95% confidence level). Obviously more runs or samples are better, because that reduces your error and thus increases your statistical power.
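A back-of-envelope version of that power analysis, using the standard two-sample normal approximation and hypothetical FPS numbers (the 3.0 FPS run-to-run standard deviation and 2.0 FPS detectable gap are assumptions, not [H]'s actual figures):

```python
import math

sigma = 3.0     # assumed run-to-run standard deviation, in FPS
delta = 2.0     # smallest FPS gap between cards we want to detect
z_alpha = 1.96  # two-sided alpha = 0.05
z_beta = 0.84   # power = 0.80

# Runs needed per card under the two-sample normal approximation:
# n = 2 * ((z_alpha + z_beta) * sigma / delta)^2
n_per_card = math.ceil(2 * ((z_alpha + z_beta) * sigma / delta) ** 2)
print(n_per_card)  # 36
```

This is a rule of thumb rather than an exact t-based calculation, but it makes both points at once: resolving a 2 FPS gap against 3 FPS of run-to-run noise takes dozens of runs, while a wide gap (say delta = 10 FPS with the same sigma) brings the same formula down to n = 2, which is why n=3 can sometimes suffice.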

The problem with [H]ardOCP's testing methods is that there is way too much uncertainty in the real-time gaming environment, so Amiga500 is right: more samples (or runs) would be needed to establish any sort of statistical significance.
 
They just wish they were Anandtech. Perhaps they aren't getting the samples to bench that Anandtech gets. When a smaller site attempts to start a row with a major site, then motives must be questioned.
[H]ardOCP is no small site by any means. Their forums and Folding team are huge. As a matter of fact, in Folding they are #1 by far, so they obviously have quite the loyal crowd of readers. http://folding.extremeoverclocking.com/team_list.php?s=
 
The reason the EPA can't do real-world testing is that it would have to be done outside. Let's take two cars, A and B.

Was Temperature exactly the same ?
Was Humidity exactly the same ?
Was Elevation exactly the same ?
Was Traffic exactly the same ?
Was Tire Pressure exactly the same ?
Was Road Surface Condition exactly the same ?
Was Driver's Mood exactly the same ?
Was the Tension in Driver's shoelaces exactly the same ?
Was Wind Direction exactly the same ?
Was Road Grade exactly the same ?

I could keep going :)

So the tests are done on a dynamometer, where every condition is exactly the same and the answers will be exactly the same every time. The EPA didn't change their methodology; they just changed the fudge factor which accounts for air resistance and other factors, so as to provide an "approximation" that better reflected usage in real-world conditions.

If ya want subjective reviews, go to PC Ragazine. My favorite was the comparison they once did between Microsoft Access and Lotus Approach: Approach got 11 "Excellent" ratings and 1 "Good" in 12 categories, while Access got 7 Excellents / 5 Goods. They "tied" for Editor's Choice.

There's a more recent one where they reviewed PC protection suites. The Zone Alarm product and the Norton product scored equally in most categories, but in 2 or 3 of them the Zone Alarm product trounced Norton by something like a 4-to-2 margin. The Norton product edged out one category by something like 4.5 to 4.0, but when you read the text of the article relating to that category, it said that the Zone Alarm product "performed the best". That's why subjective reviews can't be accepted: too often the winner is chosen before the test begins, and then the test is skewed to produce the desired result.



 
I can't accept [H]'s review of the 3870 X2 because I don't get where they are coming from at all.

First they say that a single 8800GT isn't playable at 1680x1050 with everything on high in Crysis. WTF? I have built 2 systems quite recently (one e6850 and one e8400) with EVGA Superclocked 8800GTs, and both customers play Crysis on high @ 1680 and love it.

I've watched my buddy play through all of Crysis @ 1920x1200 on mixed high/very high with a single 8800 Ultra and a 6850 @ 3.3 with 8 gigs on Vista 64. It ran great and was more than playable throughout the whole game. Now [H] would say even triple Ultras would be poor at this... WTF??? I can't believe my own eyes?
 
Scientific? Who CARES!

When I get a new game I do exactly what they do - find the best balance between FPS and quality.

Oh and FYI, this is scientific. It's called qualitative testing.
In the case of video cards it is VERY IMPORTANT.

I can understand wanting to compare performance with a standard, but would argue this method only tells part of the story. Admittedly, a mix of both graphs would be nice.


-Cheese


EDIT/PS: I won't defend their actual conclusion (necessarily), but I will defend their methods.