Question why would GPU performance keep declining, but is not due to thermal throttling?

rjohnson98126

Jan 2, 2019
Hello, and thank you for any consideration you can give to my problem.

Situation: I'm using my GPU to grind away at neural network training, which requires days/weeks of constant processing. Short story: it's open source software that is well-regarded, has been used successfully by many people, and has solid dev support.

Problem: every time I start or restart, my GPU begins at the expected processing rate (say 50 examples per second), but then immediately starts degrading, such that after 3 hours it's at 50% of the starting rate (25 examples per second). It will continue to drive itself into the ditch if I leave it, so I have to keep restarting. It restarts just fine, but immediately starts degrading in exactly the same manner, every time. It's not thermal throttling, and the devs do not know why my GPU is doing this; it is not something they've seen before, although it is clearly happening.

Details: my GPU is a GeForce 2060, which is considered a sufficient lower-midrange card for this kind of processing. There are people who successfully use older, less powerful GPUs to run this same neural network training (somewhat slower), yet even they don't see this degradation. As mentioned, there appears to be no thermal throttling occurring. I watch MSI Afterburner while processing: the GPU temp stays consistently around 55-60C, and there's no weird behavior in the other metrics. I've also run the Unigine Heaven benchmarking tool, and my 2060 benchmarks about the same as any other 2060. I've let it sit on Unigine for long periods, and the performance and temp remain consistent.
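For a log that covers the full multi-hour run (rather than watching Afterburner live), nvidia-smi, which ships with the NVIDIA driver, can dump the same metrics to CSV on an interval; the query fields below are standard nvidia-smi options, and the little parser is just my own sketch for reading the output back:

```python
# Collect a long-running GPU log with (run in a separate terminal):
#
#   nvidia-smi --query-gpu=timestamp,utilization.gpu,temperature.gpu,power.draw \
#              --format=csv,noheader -l 5 > gpu_log.csv
#
# Each line then looks like: "2019/01/02 12:00:00.000, 18 %, 48, 28.50 W"

def parse_gpu_sample(line):
    """Parse one CSV line of nvidia-smi output into plain numbers."""
    timestamp, util, temp, power = [field.strip() for field in line.split(",")]
    return {
        "timestamp": timestamp,
        "util_pct": float(util.rstrip(" %")),   # "18 %"    -> 18.0
        "temp_c": float(temp),                  # "48"      -> 48.0
        "power_w": float(power.rstrip(" W")),   # "28.50 W" -> 28.5
    }

sample = "2019/01/02 12:00:00.000, 18 %, 48, 28.50 W"
print(parse_gpu_sample(sample))
```

Lining that log up against the examples/sec curve would at least show whether anything at all on the card changes as the rate decays.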

Other Specs: my motherboard and processor are old, but I'm told (and can see) that very little processing is done by the CPU. It's all GPU.

B350 PC Mate
Ryzen 5 1600
32 GB DDR4
GeForce RTX 2060 (6GB)
I have not overclocked anything.
I'm really stumped as to what's happening or how to fix it. Thank you for any assistance you can give.
 

rjohnson98126
Hi, thanks for the response.

for the past two hours of training i get the following:

CPU usage across the 12 threads varies between 25-50%. CPU temperature is steady at 57C.

GPU usage is stable at 15-25%. GPU memory utilization is constant at 5.8GB. GPU temperature held at 48C for the entire two hours.

SSD usage is effectively zero.

Power supply: EVGA 700W Bronze. Total power consumption is consistent at around 28W, with occasional troughs and spikes between 10W and 80W.

All of these metrics remained consistent during the entire 2 hours, while the training calculation rate steadily dropped from 100% to about 60% over the same period.

thank you again for looking into this.
 
The first thing that stands out as odd to me is that your GPU is not being used to its full potential at any point. I would expect the usage to start around 90% and slowly drop, instead of holding at a consistent ~20%.

Is the program you are using running at full speed, and demanding enough from your GPU? Usually, when given a large number of tasks, the GPU will complete them as fast as possible, which leads to either GPU or CPU usage being maxed out. As this is not the case for you, I suspect the program is actually not asking the GPU to do enough work. For some reason, it appears, the program assigns your GPU 50 tasks per second (using your example) and slowly lowers this amount over time. If 50 tasks per second is 25% of your GPU, ignoring the CPU for a second, then I would expect you to be processing 200 tasks per second at 100% GPU usage.
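Putting rough numbers on that argument (these figures are just the ones from your posts, not measurements of mine):

```python
# If the observed rate only occupies a fraction of the GPU, the implied
# peak rate at full utilization is observed_rate / utilization.
observed_rate = 50.0     # examples per second at the start (your figure)
gpu_utilization = 0.25   # ~25% GPU usage you reported

implied_peak = observed_rate / gpu_utilization
print(implied_peak)  # 200.0 examples/sec if the GPU were kept fully fed
```

So something upstream of the GPU appears to be the limiting factor, and that something is getting slower over time.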

Not knowing anything about what you are actually doing makes this a bit harder, but double-check your program data and what you are trying to compute, and make sure there isn't a limit on the processing built into the code or the program setup.

Here is the part where I am truly grasping in the dark, and have no idea if this is by any means helpful or realistic: if the falloff is truly exponential, then maybe the software is completing tasks, then somehow dropping half of them and running through the other half again. If the number of tasks it runs each iteration is halved, that is by definition exponential decay. It never stops, but it does less and less until it is functionally not doing anything. Restarting the program puts all the tasks back in the queue, and it starts with all of them again.
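To make that guess concrete, here is a toy model (purely hypothetical, not anything Faceswap actually does) of what "half the work survives each pass" would look like:

```python
# Toy model: if only a fixed fraction of tasks survives each iteration,
# the throughput decays exponentially -- it never hits zero, but it
# quickly becomes negligible. A restart resets it to the full count.
def simulate(initial_tasks, keep_fraction, iterations):
    tasks = initial_tasks
    history = []
    for _ in range(iterations):
        history.append(tasks)
        tasks *= keep_fraction
    return history

rates = simulate(50.0, 0.5, 6)
print(rates)  # [50.0, 25.0, 12.5, 6.25, 3.125, 1.5625]
```

If your measured curve looked like that sequence (halving on a fixed interval), it would point at something like this; if it drops by a fixed amount per hour instead, it's a different class of bug.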
 

rjohnson98126
The questions you ask are good ones, and I don't have the answers. Perhaps I will relay them to the SW devs. I think you might be onto something w.r.t. the GPU utilization.

I don't mean to seem cryptic about the program; I just didn't want to distract from the issue. If it helps, the program is Faceswap ... https://faceswap.dev/ ... I am apparently the only one of hundreds of people having this issue, so I have to assume the problem is somewhere in my hardware or hardware settings, and not in their well-established software.

Also, I spoke incorrectly: previously I thought the rate was decreasing exponentially, but now that I look at it, it's pretty linear. Here's a screenshot of the in-program analysis; the end result is that the number of "tasks" completed per unit of time is decaying, and it should remain constant.

rate-decay.png
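For what it's worth, there is a quick way to tell the two decay shapes apart from a rate log like the one in the screenshot (this is a generic check of mine, not output from Faceswap): linear decay has constant differences between evenly spaced samples, while exponential decay has constant ratios.

```python
# Classify evenly spaced rate samples as linear or exponential decay.
# Linear: successive differences are (nearly) constant.
# Exponential: successive ratios are (nearly) constant.
def classify(samples, tol=1e-6):
    diffs = [b - a for a, b in zip(samples, samples[1:])]
    ratios = [b / a for a, b in zip(samples, samples[1:])]
    if max(diffs) - min(diffs) < tol:
        return "linear"
    if max(ratios) - min(ratios) < tol:
        return "exponential"
    return "neither"

print(classify([50, 45, 40, 35, 30]))   # linear (drops 5 per step)
print(classify([50, 25, 12.5, 6.25]))   # exponential (halves per step)
```

A clean linear drop would argue against the "half the tasks get dropped each pass" idea and point more toward something accumulating steadily, like a growing queue or leaking memory.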
 