Hello, and thank you for any consideration you can give my problem.
Situation: I'm using my GPU for neural network training, which requires days/weeks of constant processing. Short version: it's a well-regarded open-source program that many people use successfully, and it has solid developer support.
Problem: every time I start or restart, my GPU begins at the expected processing rate (say 50 examples per second), but then immediately starts degrading exponentially, such that after 3 hours it's at 50% of the starting rate (25 examples per second). It will keep driving itself into the ditch if I leave it, so I have to keep restarting. It restarts just fine, but immediately begins degrading in exactly the same way, every time. It's not thermal throttling, and the devs don't know why my GPU is doing this; it's not something they've seen before, although it's clearly happening.
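To pin down whether the slowdown really is exponential (a constant half-life points at something compounding, like fragmentation or a growing queue, rather than a one-off event), here is a small sketch that fits a half-life to a throughput log. The numbers below are made up to mirror the reported symptom (50/s at start, 25/s at 3 hours); swap in your own (hours, examples/sec) measurements.

```python
import math

# Hypothetical throughput log: (hours since start, examples/sec).
# These values are invented to match the symptom described above.
samples = [(0.0, 50.0), (1.0, 39.7), (2.0, 31.5), (3.0, 25.0)]

def half_life_hours(samples):
    """Least-squares fit of log(rate) = b - k*t; returns ln(2)/k.

    If the decay is truly exponential, the same half-life should come
    out of every run regardless of where you start measuring.
    """
    n = len(samples)
    xs = [t for t, _ in samples]
    ys = [math.log(r) for _, r in samples]
    mx = sum(xs) / n
    my = sum(ys) / n
    k = -sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return math.log(2) / k

print(round(half_life_hours(samples), 2))  # ~3.0 for this synthetic data
```

If the fitted half-life is stable across restarts, that's a strong hint the cause is deterministic and inside the software stack (allocator, dataloader, logging) rather than the hardware.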
Details: my GPU is a GeForce RTX 2060, considered a sufficient lower-midrange card for this kind of processing. People successfully run this same training on older, less powerful GPUs (somewhat slower), yet even they don't see this degradation. As mentioned, there appears to be no thermal throttling. I watch MSI Afterburner while processing: the GPU temperature stays consistently around 55-60 °C, with nothing odd in the other metrics. I've also run the Unigine Heaven benchmark; my 2060 scores about the same as any other 2060, and if I let it sit in Unigine for long stretches, performance and temperature stay consistent.
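One thing worth logging alongside temperature is the SM clock over time, since power-limit or driver throttling can drop clocks without the temperature looking unusual. You can dump it with something like `nvidia-smi --query-gpu=timestamp,temperature.gpu,clocks.sm,clocks.mem --format=csv -l 60` and then scan the log. The sketch below parses that CSV format and flags rows where the SM clock sags below 90% of the starting value; the sample data is invented for illustration.

```python
import csv
import io

# Hypothetical sample of nvidia-smi CSV output (invented values).
sample = """timestamp, temperature.gpu, clocks.sm [MHz], clocks.mem [MHz]
2024/01/01 00:00:00.000, 56, 1680, 7000
2024/01/01 01:00:00.000, 58, 1665, 7000
2024/01/01 02:00:00.000, 57, 1275, 7000
"""

def flag_clock_drops(csv_text, threshold=0.9):
    """Return timestamps where clocks.sm fell below threshold * initial value."""
    rows = list(csv.DictReader(io.StringIO(csv_text), skipinitialspace=True))
    base = float(rows[0]["clocks.sm [MHz]"])
    return [r["timestamp"] for r in rows
            if float(r["clocks.sm [MHz]"]) < threshold * base]

print(flag_clock_drops(sample))  # flags the 02:00 row in this sample
```

If clocks hold steady while throughput still falls, the GPU itself is probably fine and the bottleneck is elsewhere (host-side data loading, VRAM, or the training loop).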
Other specs: my motherboard and processor are old, but I'm told (and can see) that very little of the work falls on the CPU. It's all GPU.
B350 PC Mate
Ryzen 5 1600
32 GB DDR4
GeForce RTX 2060 (6GB)
I have not overclocked anything.
I'm really stumped about what's happening or how to fix it. Thank you for any assistance you can give.