I run one pass of each benchmark to "warm up" the GPU after launching the game, then I run at least two passes at each setting/resolution combination. Some games don't require a restart between settings changes, but many do. If the two runs are basically identical (within 0.5% of each other), I use the faster of the two. If the difference is larger than that, I run the test twice more and try to figure out what "normal" performance is supposed to be.
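To make the run-selection rule above concrete, here is a minimal Python sketch. It is not my actual tooling: `run_benchmark` is a hypothetical callable that returns an average fps figure, the 0.5% gap is measured relative to the faster run, and taking the median of all four runs is just one assumed way to settle on "normal" performance when the first two disagree.

```python
from statistics import median


def pick_result(run_benchmark, tolerance=0.005):
    """Return a representative fps figure for one setting/resolution combo.

    run_benchmark: hypothetical callable that runs one pass and returns fps.
    tolerance: maximum relative gap (0.005 = 0.5%) for two runs to agree.
    """
    run_benchmark()  # warm-up pass, result discarded

    runs = [run_benchmark(), run_benchmark()]
    fast, slow = max(runs), min(runs)

    # Runs agree: keep the faster of the two.
    if (fast - slow) / fast <= tolerance:
        return fast

    # Runs disagree: run twice more and estimate "normal" performance.
    # Using the median here is an assumption, not a fixed rule.
    runs += [run_benchmark(), run_benchmark()]
    return median(runs)
```

In practice the last step is a judgment call rather than a formula, so treat the median as a stand-in for looking at the four numbers and deciding which ones are believable.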
While that might seem prone to mistakes, I've been doing this for years, and the first two runs (plus warm-up) are usually a good indication of performance. I also look at all the data, so when I test, for example, the RTX 3070, RTX 3060 Ti, and RTX 3070 Ti, I know they're all generally going to perform within a narrow range: the 3070 Ti is about 5% (give or take) faster than the 3070, which is about 5% (again, give or take) faster than the 3060 Ti. If, after I put all the data together, I see games with clear outliers (i.e. performance is more than 10% higher for the cards I just mentioned), I'll go back and retest whichever cards show the anomaly and confirm either that the result is correct due to some other factor, or that my earlier test results are no longer valid.
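The outlier check described above could be sketched roughly like this in Python. Everything here is illustrative: `flag_outliers` and its parameters are hypothetical names, the inputs are assumed to be dictionaries mapping a game name to average fps, and flagging a gap in either direction (not only "higher") is my assumption for the sketch.

```python
def flag_outliers(faster_card, slower_card, threshold=0.10):
    """List games where two similar cards differ by more than `threshold`.

    faster_card, slower_card: dicts of {game name: average fps}.
    threshold: relative gap that counts as an anomaly (0.10 = 10%).
    Returns a list of (game, gap in percent) tuples worth retesting.
    """
    flagged = []
    for game, fps in faster_card.items():
        if game not in slower_card:
            continue  # game not tested on both cards
        # Relative lead of the nominally faster card in this game.
        gap = fps / slower_card[game] - 1.0
        if abs(gap) > threshold:
            flagged.append((game, round(gap * 100, 1)))
    return flagged
```

A flagged game isn't automatically a bad result. It just marks the cards to retest so I can tell a real difference (some other factor) from stale data.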
Given each card requires about eight hours to run all the tests (at least for a card where I run all four resolutions/settings), there's obviously going to be a lag between the latest drivers and game patches and when a card gets tested. I've started the new hierarchy tests with the current generation cards, and I'm now going through previous generation cards. Some cards were tested with 511.65 and 22.2.1, others with 511.79 and 22.2.2; I updated drivers mostly because I added Total War: Warhammer 3 to the suite. When I've finished testing all the previous generation GPUs in the next month or so, I'll go back and retest, probably the RTX 3080 and RX 6800 XT, on the latest drivers and game patches and check for any major differences in performance. If there's clearly a change for the better, I'll start running through the potentially impacted GPUs again. This is usually limited to one or two games that improve, though, so I don't have to retest everything every couple of months.