News AMD’s Lisa Su steps in to fix driver issues with new TinyBox AI servers — Tiny Corp calls for AMD to make its GPU firmware open source, points to i...

Status
Not open for further replies.

Admin

Administrator
Staff member
Sometimes a fellow tech company is needed to give the big boys a good PR kicking to "sort out their sh!t" when you have been patient, they fail to deliver, and they are screwing around with your company and customers. So, thumbs up to Tiny Corp, and let's see if Lisa Su has big enough shoes on her feet to kick asses.
 
…so a company built their business plan around using gaming/consumer cards in an enterprise workflow instead of springing for the data center products, then threw a tantrum when AMD agreed to help them with that (and acknowledged Nvidia never would) but didn’t solve everything immediately?

Did they not build a prototype and validate their workflow before taking preorders and starting their production run?
 
…so a company built their business plan around using gaming/consumer cards in an enterprise workflow instead of springing for the data center products, then threw a tantrum when AMD agreed to help them with that (and acknowledged Nvidia never would) but didn’t solve everything immediately?

Did they not build a prototype and validate their workflow before taking preorders and starting their production run?

This. Basically they want to use gaming GPUs for data center shenanigans and "demand" that AMD open source its firmware? And it's a startup, no less? Sounds to me like a marketing stunt more than anything.
 
In many ways, this is TinyBox's fault too (not just AMD's).

You don't release a product without first vetting it.

At a former job we had a salesman close a $1 million+ sale to a lab on the promise that the software to control said hardware was ready. We spent 80 hours per week fixing his blunder for over two months, and he was fired, IIRC (because he was seeing $$$ and making promises without consulting the company).

And asking for firmware is a good way to spill company secrets, including things like:
1. Settings that should never be changed
2. Secret coding routines that give them a performance advantage
3. Incomplete beta code that reveals the company roadmap
4. Alpha/beta code never intended for release
5. DRM keys, which could cost AMD their licensing
6. Attack vectors for hackers

ROCm API support for the 7900XTX and 7900XT is less than six months old. Of course there are going to be bugs, as the architecture was never designed for this.

And yelling like this is a good way to get your company blacklisted. The Chihuahua doesn't tell the Mastiff what to do. They should know better.

Company president isn't the sharpest pencil in the box. I bet he's young.
 
Last edited:
…so a company built their business plan around using gaming/consumer cards in an enterprise workflow instead of springing for the data center products, then threw a tantrum when AMD agreed to help them with that (and acknowledged Nvidia never would) but didn’t solve everything immediately?

Did they not build a prototype and validate their workflow before taking preorders and starting their production run?
This!
 
  • Like
Reactions: PEnns
So you pay for a Toyota and want it to perform like a Ferrari. Hmmmm… Jokes aside, I do think the startup is pushing it: other companies are paying a lot for AI/data center hardware, while they are buying a bunch of $999 GPUs and expecting data center quality treatment. They should be fortunate the CEO even bothered with them.
 
  • Like
Reactions: ravewulf and PEnns
Doesn't TinyBox just want ROCm to work as advertised?
It doesn't sound like they want server support, just advertised features to work.
https://www.amd.com/en/products/software/rocm.html
If AMD is locking out ROCm functionality from consumer GPUs (that they claim work with ROCm) with firmware, maybe they should be honest and state that ROCm support might be coming to those GPUs sometime in the future.

Why all of the hate for TinyBox when they seem to just be guilty of the crime of trusting AMD?
 
All of the above. Also, this discussion should have happened behind closed doors, during their testing phases, not publicly, unless that already happened and AMD didn't solve the issues.

However, AMD should assign a person or two dedicated to talking with TinyBox and helping them solve the issue as quickly and efficiently as possible. That's the kind of support I think Nvidia provides, which could be one reason there are so many more Nvidia-sponsored games and software.
 
  • Like
Reactions: evdjj3j
…so a company built their business plan around using gaming/consumer cards in an enterprise workflow instead of springing for the data center products, then threw a tantrum when AMD agreed to help them with that (and acknowledged Nvidia never would) but didn’t solve everything immediately?

Did they not build a prototype and validate their workflow before taking preorders and starting their production run?
My thoughts exactly. Everyone misses the point that these guys built their business model on assumptions. But that's AI right now. Speculative VC money is being thrown at satisfactory answers to "how fast can you get to market", not "how thoroughly have you validated that it's even possible". They should thank their lucky stars anyone at AMD even acknowledged them, given the absolutely insane demand for data center class AI products right now. AMD is behind Nvidia, but they are still in the mix because demand is insatiable.
 
Thanks for the ignorant and unrelated comment, Tom's Hardware definitely lacks those /s
It’s not irrelevant. Building a boxed solution that hinges on the stability of AMD drivers? Did none of these people push newest-gen AMD cards in a PC before? They have a reputation, as much as I still buy their older, cheaper mid-range stuff.
 
Doesn't TinyBox just want ROCm to work as advertised?
It doesn't sound like they want server support, just advertised features to work.
https://www.amd.com/en/products/software/rocm.html
If AMD is locking out ROCm functionality from consumer GPUs (that they claim work with ROCm) with firmware, maybe they should be honest and state that ROCm support might be coming to those GPUs sometime in the future.

Why all of the hate for TinyBox when they seem to just be guilty of the crime of trusting AMD?
Because AMD can do no wrong, and it must be Nvidia's fault... because reasons?

Ironically, you can use CUDA on your standard GTX/RTX gaming GPU just fine, and it's supported - and by actual support staff rather than having to yell at the CEO on Twitter. There's a EULA restriction against doing so at datacentre scale (and you won't get management or integration tools), but no actual hardware or software limitations on running CUDA.
 
  • Like
Reactions: rluker5
Doesn't TinyBox just want ROCm to work as advertised?
It doesn't sound like they want server support, just advertised features to work.
https://www.amd.com/en/products/software/rocm.html
If AMD is locking out ROCm functionality from consumer GPUs (that they claim work with ROCm) with firmware, maybe they should be honest and state that ROCm support might be coming to those GPUs sometime in the future.

Why all of the hate for TinyBox when they seem to just be guilty of the crime of trusting AMD?

They formed a business model on untested software that wasn't intended for mainstream parts in the first place. So now they are crapping their pants that there are bugs in software that was just released (making it new), when the stable, proven software stack has been out for years.

Let me give you an example of why this is stupid.

Ideally you want a vector comparison that returns the candidate closest to a similarity of 1. Well, you might have thousands of vectors to test to find the closest one. You might have a software stack item that supports comparing 256 vectors at a time (internally), because DEDICATED AI hardware can perform a bitroll on 256 vectors and resolve which one is greatest in log2(256) = 8 clock ticks. But consumer hardware might only be able to compare 16 vectors at a time. This creates a tournament of multiple passes, which takes more clock ticks. The internal stack might not take this into consideration, and if the next uOp operates on the previous result on the assumption that it will take 8 clock ticks, you get an incorrect answer because the multiple-round tournament isn't complete.

When you have quite literally hundreds or even over a thousand uOps, comparing the timing of all instructions becomes effectively an NP-complete problem, and it takes time to solve, because the domain space for testing grows exponentially with the number of permutations of instruction orderings. Each instruction can break down into dozens of uOps, and just because one instruction breaks down into steps x1, y1, z1, the exact same instruction may next time break down into x2, y3, z20 (even though the uOps are the same) because the resources are busy with x1, y1, z1. Then you have to verify they complete in the order expected (data-dependency race conditions). Some uOps can take a variable amount of execution time, and it's up to the OOE scheduler to account for this (by locking an outside register/memory read on an incomplete uOp).

Now, GP GPUs don't have OOE schedulers, but their software stack can emulate them. In fact, GPUs have pretty dumb schedulers, as anything smarter adds considerably to the transistor count for each stream/wave. This is one of the reasons shaders have to be compiled for each machine: you want the optimal execution order for each piece of hardware's unique abilities and quirks. The order of instructions gets optimized then, and this is why we often see glitches that require patches, because the shader compiler sometimes gets it wrong.

As architectures age, they get more stable because more combinations have been tested. This is why AMD said "there is really no more gas left to optimize GCN": they realize it would be incredibly rare to find a uOp combination that hasn't already been patched, as GCN is a very mature architecture. But that also means any future software written for it should have a very, very high degree of compatibility out of the box. The 7000 series is a new architecture to ROCm; there are bound to be bad combinations of instructions they haven't caught.

Having a really good architectural engineer who understands bit-level math like this is a RARE resource. He has to physically write out the way the bits SHOULD execute, and then compare it bit for bit down at the internal register level using a GPU stepper in debug mode. And there can be a lot of variables to keep track of. I used to do this when debugging firmware, and it can be a nightmare with just a dozen registers to track. (Once it turned out to be a missing delimiter in the command stream that indicated the end of the data stream, so the CPU kept parsing into unknown memory. I had to hand-parse over a megabyte to find it.)
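The tournament-round arithmetic above can be sketched in a few lines of Python. To be clear, the 256-wide and 16-wide comparator widths are the illustrative numbers from this post, not real hardware specs:

```python
import math

def reduction_rounds(n_vectors: int, width: int) -> int:
    """Tournament passes needed to reduce n_vectors candidates to a
    single winner when the hardware can compare `width` vectors per pass."""
    rounds = 0
    remaining = n_vectors
    while remaining > 1:
        # each pass collapses groups of `width` candidates to one survivor
        remaining = math.ceil(remaining / width)
        rounds += 1
    return rounds

# A 256-wide comparator settles 256 vectors in a single pass of
# log2(256) = 8 internal clock ticks.
print(reduction_rounds(256, 256))  # 1

# A 16-wide comparator needs two passes for the same 256 vectors,
# so a downstream uOp that assumes the fixed 8-tick latency reads
# its input before the second round has finished.
print(reduction_rounds(256, 16))   # 2
```

This only counts passes; real latency also includes scheduling overhead between passes, which is exactly the timing assumption described above that can go wrong.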
 
Last edited:
They formed a business model on untested software that wasn't intended for mainstream parts in the first place. So now they are crapping their pants that there are bugs in software that was just released (making it new), when the stable, proven software stack has been out for years.

Let me give you an example of why this is stupid.

Ideally you want a vector comparison that returns the candidate closest to a similarity of 1. Well, you might have thousands of vectors to test to find the closest one. You might have a software stack item that supports comparing 256 vectors at a time (internally), because DEDICATED AI hardware can perform a bitroll on 256 vectors and resolve which one is greatest in log2(256) = 8 clock ticks. But consumer hardware might only be able to compare 16 vectors at a time. This creates a tournament of multiple passes, which takes more clock ticks. The internal stack might not take this into consideration, and if the next uOp operates on the previous result on the assumption that it will take 8 clock ticks, you get an incorrect answer because the multiple-round tournament isn't complete.

When you have quite literally hundreds or even over a thousand uOps, comparing the timing of all instructions becomes effectively an NP-complete problem, and it takes time to solve, because the domain space for testing grows exponentially with the number of permutations of instruction orderings. Each instruction can break down into dozens of uOps, and just because one instruction breaks down into steps x1, y1, z1, the exact same instruction may next time break down into x2, y3, z20 (even though the uOps are the same) because the resources are busy with x1, y1, z1. Then you have to verify they complete in the order expected (data-dependency race conditions). Some uOps can take a variable amount of execution time, and it's up to the OOE scheduler to account for this. Now, GP GPUs don't have OOE schedulers, but their software stack can emulate them.
You have a lot of specifics there that I'm guessing are intended to show that the 7900XTX is not capable of doing what AMD is promising it can do with their stated support. AMD doesn't have footnotes to limit or qualify their support claim. If AMD's support of the 7900XTX is limited in this regard, then AMD should make it clear. Whether the cause is software or hardware doesn't matter; otherwise it is false advertising.
AMD's recent statements almost make you think you can run AI on a handheld, just at a slower rate.
Nvidia makes it clear.
 
Doesn't TinyBox just want ROCm to work as advertised?
It doesn't sound like they want server support, just advertised features to work.
https://www.amd.com/en/products/software/rocm.html
If AMD is locking out ROCm functionality from consumer GPUs (that they claim work with ROCm) with firmware, maybe they should be honest and state that ROCm support might be coming to those GPUs sometime in the future.

Why all of the hate for TinyBox when they seem to just be guilty of the crime of trusting AMD?
Doesn’t sound like AMD is locking out ROCm, just that there’s some form of bug in firmware related to TinyBox’s particular combination of multi-GPU setup and the software running in it that AMD is attempting to fix?

All of the above. Also, this discussion should have happened behind closed doors, during their testing phases, not publicly, unless that already happened and AMD didn't solve the issues.

However, AMD should assign a person or two dedicated to talking with TinyBox and helping them solve the issue as quickly and efficiently as possible. That's the kind of support I think Nvidia provides, which could be one reason there are so many more Nvidia-sponsored games and software.
The thing is that TinyBox is buying consumer cards instead of datacenter models, but seems to expect a datacenter SLA. They got revised firmware within 6 hours of the earliest linked tweet and a call with engineering the next day, which is more than I'd think a startup buying a bunch of gaming cards would qualify for when there are actual paying enterprise clients. And yet they appear to be having a public meltdown because AMD isn't doing more for a startup with fewer than 100 servers built and none yet shipped (while expressly avoiding pro GPUs).

As for Nvidia, their EULA forbids using GeForce products for datacenter CUDA applications, so they’d definitely not be dedicating anyone to talk to TinyBox in this situation, except maybe a lawyer.
 
You have a lot of specifics there that I'm guessing are intended to show that the 7900XTX is not capable of doing what AMD is promising it can do with their stated support. AMD doesn't have footnotes to limit or qualify their support claim. If AMD's support of the 7900XTX is limited in this regard, then AMD should make it clear. Whether the cause is software or hardware doesn't matter; otherwise it is false advertising.
AMD's recent statements almost make you think you can run AI on a handheld, just at a slower rate.
Nvidia makes it clear.

Well, what do you honestly expect? The 7900XTX is not a high-value platform for AMD. Support will take longer, as there are higher-paying customers (i.e., government/universities/large corporations) that take precedence for dedicated AI hardware. And as I said, the kind of resources you are asking for are limited. The kind of talent that can fix these issues is limited.

It's not to say that AMD won't fix it. But by the way TinyBox is acting (like a petulant child), they are asking for trouble. And in the end, AMD may just bail altogether if they decide the payout isn't worth the headaches.

BTW: Using common off-the-shelf hardware has other issues if the third-party OEM doesn't 100% adhere to AMD's reference design. Simple things like a 5% overclock can break the ordering of some micro-ops. This is why Intel pulled the overclocked PIII: it was crashing during GCC compiles because the electronics couldn't propagate signals fast enough for some uOps.

While exposing everyday developers and AI startups to entry-level hardware is a good thing (it establishes a low-cost user base to erode Nvidia's locked ecosystem of dedicated AI hardware and CUDA), it's going to come with challenges, including entry-level support.

And for the record, I was one of the advocates for AMD supporting ROCm on the 7000 series. But I never expected it to be bug free.
 
Last edited:
The article is about AMD's terrible drivers but my comment is ignorant and unrelated?
Yes, completely irrelevant. Claiming that all AMD drivers have been bad for 25 years because a tech startup is cobbling together a bunch of consumer GPUs in never-before-used ways and encountered a bug is completely illogical.
 
I'm not sure what they expected; ATI/AMD GPU drivers have been awful for 25 years.

Funny, the only problem I have had in over 12 years of AMD graphics cards was related to an OpenGL driver in one game. And that was fixed with a simple rollback until a patch came out.

I'm not saying they didn't have their stumbling blocks or glitches. But NVIDIA has them too. Reliability is actually quite high.
 