
I got:

| System | Geekbench v6 single | Geekbench v6 multi | Watts at outlet, screen off | Watts at outlet, screen on | HWinfo CPU idle, screen on | Watts with WASM at outlet | Watts WASM, HWinfo CPU | WASM score |
|---|---|---|---|---|---|---|---|---|
| Ryzen 5800U | 1879 | 6297 | 3.5-5 | 7.5-9 | 2.6 | 25-21 | 15-12 | 52.95 |
| Ryzen 7435HS | 2017 | 9868 | 5 | forgot | sorry | 26 | 13 | 53.57 |
| Pentium N6005 | 565 | 1661 | 7.5 | 10-11 | 4.5 | 29-26 | 17-15 | 18.26 |
Firefox @ Linux: 21.04 for my N97. I left the power limits at stock 12/25W for PL1 and PL2. During the turbo window, it boosted only to 13.5W (self-reported package power), running all 4 cores at 2.9 GHz. Once the turbo limit expired, it dropped to 12W and ran a solid 2.78 GHz on all 4 cores.

Nothing else was running. I didn't want to take the machine down to put the A/C power meter inline. I might do that later, but it's an industrial board and doesn't have the lowest idle power. Also, the 32 GB DIMM I mentioned is burning some juice, as is the 90x14mm Noctua fan that was running near max RPM (I have it in a very small case and the stock heatsink was not good). I bought mine before ODROID launched their H4 lineup, so mine was made by a Taiwanese company called Jetway.

I actually went ahead and bought an ODROID-H4 Ultra, which features the N305. However, I have yet to build that machine and there are a few projects in the queue before it.

I noticed some clown put in a CPU name of "Intel 4004" for the fastest single-threaded result. It's actually suspiciously fast, making me wonder if they might've slowed down their real-time clock or something. In case anyone doesn't get the joke: that was Intel's first CPU from 1971 and processed only 4 bits at a time (QLC! 😀).


BTW, until Skymont, the Intel E-cores are at a pretty big disadvantage on floating-point. So, this makes them look artificially bad (unless you're primarily interested in floating point workloads).
 
Just ran it in Chromium and got a score of only 12.51. The cores again initially boosted to 2.9 GHz, but this time burned only about 12.4 W, until the turbo limit expired and then it dropped to <= 12.0 W.

BTW, in case you didn't notice, the main page of that benchmark says (right near the top):

"(HINT: Firefox is usually the winner)"

I guess you couldn't take a hint!
:D
 
Yet the Tiger Lake NUC11 (Tiger Lake and Jasper Lake were NUC11 with Intel...) with 96EU and without any eDRAM help gave linear performance improvements, 4x the NUC10 iGPU, which seemed pretty near impossible!
Tiger Lake uses Xe (Gen 12) iGPU, while Jasper Lake uses Gen 11.

a competent 4k with a 60Hz minimum includes video at the same resolution, fluid scrolling across large PDFs and web sites and just enough 3D to satisfy the info-worker.

And Google's 3D maps are about the best optimized 3D graphics presentation there is.
This is a mismatched standard. You're saying you care about 2D performance, yet you're measuring 3D performance. The latter is much harder than the former. When I run Google Earth on my 4k monitor at work, it runs like trash on my i5-1250P with 96 EU iGPU (Win 11). However, when I run it on another Alder Lake laptop with Nvidia A1000 that I have hooked up to my 120 Hz 1440p monitor, it's smooth as glass. Same corporate Win 11 OS image.

Yet, even the i5-1250P is fine for desktop graphics use. In fact, as I mentioned, my ancient Sandybridge iGPU is fine for regular desktop graphics use, and it's probably just a few % as fast as any of these GPUs.

It is so thoroughly optimized there is just no excuse not to do it, because it runs really well even at 4k on something as lowly as the Orange PI5 and a Jasper Lake Atom.
Again, you're comparing Linux vs. Windows and they each have different 3D backends which might enable/disable certain features in Google Earth.

The jump from 1440 to 4k more than doubles the pixels. It's a cut off point for quite a lot of graphics hardware, which does pretty well with the former, yet fails with the latter: there is a reason 1440p is one of the most popular gaming resolutions today.
Every single thing you're talking about is way faster than my old Sandybridge iGPU. Like way more than 2x. So, if it was fast enough for smooth desktop graphics for me on 1440p, then just about anything but maybe a Raspberry Pi should be rock-solid at 4k. I can't really vouch for the Pi 5, because I haven't set up mine and they don't publish specs on the iGPU (probably because it would be too embarrassing).

But then the overhead of corporate watchdog software is hard to overestimate; I still have one of those machines as well, and it's not only a dog, but also still a 10th-gen quad core.
Agreed, you will want a beefy CPU for corporate laptops that are laden with corporate spyware and other security software.

I would have said the same, especially since quad rank isn't even officially supported by Zen.

But I believe this came from Wendel, who's pushing for 256GB on the desktop, so whatever the issue, I'd advise caution and testing.
Quad-rank DDR5 UDIMMs are literally not a thing! The way he proposes to reach 256 GB is by using 4x dual-rank 64 GB DIMMs @ 2DPC (two DIMMs per channel), which you can do using the new 32 Gb chips. So, 2x dual-rank DIMMs per channel looks a little bit like a single quad-rank DIMM, but the difference is that the corresponding quad-rank DIMM doesn't exist. The ones that do exist are RDIMMs and only supported on server CPUs.

I know, I know and I try to contain the risk, and live with what's left.
But I also drive way too fast when the Autobahn is empty (150 mph). Good thing it's rare...
To keep your overall risk low, you should compensate for taking a risk in one area by reducing risks in another. If you increase your risk-taking across the board, that greatly increases the likelihood of a negative outcome. Also, the thing you do regularly is the worst place to take risks, because the likelihood of a bad outcome goes up a lot faster in such cases.

Who can you trust?
The Hardkernel people on their forums seem very honest and direct. They exposed the IB-ECC option, because we asked for it, but they don't even advertise it on the product page. That does not sound to me like a shady manufacturer. Their hardware seems pretty solid and they're pleasant to deal with. They've been in the ARM SBC game longer than most, and I think there are reasons for that.

And you could try Row Hammer to see if it triggers ECC. Because the guys who "invented" Row Hammer told me that not even ECC DIMMs protect against modern variants of it; it just takes a little longer...
The PassMark Memtest86 has a Rowhammer test. At the default duration, I've never had any memory fail it. Modern CPUs have hardware mitigations against it, but I don't know when those first appeared.
 
This is a deeply flawed argument and ignores each team's belief in their own technology. In particular, the Firefox team tries to write everything in Rust, while Chrome is C++ (last I checked) and would switch to Go if anything. Also, both browsers have their own JavaScript engines. Firefox has SpiderMonkey, which explicitly says it also handles WebAssembly.

Chrome uses the V8 engine, which likewise explicitly says it handles WebAssembly.
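To underline that point, here's a minimal sketch (my own illustration, not taken from either project's docs) of the standard WebAssembly JS API that both SpiderMonkey and V8 implement natively; the 8-byte buffer is just the header of an empty module:

```typescript
// The same WebAssembly namespace exists in Firefox and Chrome/Chromium/Brave.
const emptyModule = new Uint8Array([0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00]); // "\0asm" + version 1
if (typeof WebAssembly === "object") {
  console.log("native WASM support:", WebAssembly.validate(emptyModule)); // true in both engines
}
```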


It's right in the name!!! WebAssembly!
Names can be misleading or not tell the full story. In this case it's perhaps a bit of both.

Assembly language is human-readable machine code. An assembler turns it into binary machine code in a strict 1:1 mapping of mnemonic to instruction, because the two sit at the same machine abstraction level; it is not a high-level language.

Famously, there have been different assembly language formats, e.g. for the 8080/Z80 the Intel and Zilog mnemonics, which resulted in culture wars back then, but the resulting machine code was identical if the assembly code used the same instructions, even with different mnemonics.

Now in the case of WASM, it isn't actually typically represented in such a human-readable form, even if it represents machine code instructions for the WASM abstract machine, mostly for efficiency. And of course, one could translate a single WASM instruction into several functionally identical target variants that perform differently, if only by inserting random NOPs all over the place.

But if that were to produce vastly different performance results, it would violate the purpose of WASM, and the developers should have a look at the differences.
Wikipedia even cites a source to support the claim that its primary stated goal is to accelerate scripted content in web pages!
"The main goal of WebAssembly is to facilitate high-performance applications on web pages, but it is also designed to be usable in non-web environments."​
This is exactly the sort of case that a good compiler should help with. Hence, no reason to assume all implementations perform equally.
Compiler has an even bigger range of meanings than assembler. But WASM is typically the target result of compilation, not its source. The goal is to get as close as possible to binary target machine language without actually being the target ISA.

If there are vastly different performance figures for the same WASM on the same target, somebody didn't do their job properly.

WASM uses around 200 op codes for an abstract ISA, and yes, it now also includes some SIMD vector operations.

These op codes are at the level of a typical machine op code, so that in most cases they can be translated near 1:1 to a target ISA. Of course, one could still do that job badly, but WASM as such is designed to make it straightforward and efficient.

Where I was clearly wrong is in assuming that the browsers all use the same assembler/transpiler/compiler code. Evidently they each do their own variant, which might have a measurable impact on the speed of the compiler, not the resulting code.

But again, WASM to my understanding was designed to operate with little more than a single pass to both translate and verify that it doesn't contain harmful side effects: it's not duplicating LLVM with all its optimization logic, but taking the WASM and transforming it into x86, ARM or whatnot with little more than a mapping table. And the bigger effort is then in checking the resulting code for safety before running it, via static validations.
Because that's not how the software world works. Every browser has its own infrastructure, environment, security model, JIT engine, and a way for scripts to invoke platform library functionality. The incremental cost of adding a front-end for WebAssembly is small, while the lift of trying to integrate a foreign, 3rd-party implementation in your browser is quite large; meanwhile, you've now externalized a good deal of your attack surface. Not to mention the whole issue of language wars that I mentioned above.
Whether you code a WASM translator in JavaScript, Go, Rust, C or whatever shouldn't have much of an impact on the run-time performance of the generated machine code: it's the transformation logic that dictates the performance, and since the result is native machine code, it doesn't matter whose sandbox it runs in.

That said, WASM has seen relatively recent evolution in the support of multi-threading, which was considered potentially harmful in light of SPECTRE, and SIMD support, which is an area where CPU vendors go bonkers to catch up with GPUs. So if one variant already supports multi-threading or SIMD, while another doesn't, that should have measurable impact.

Where there could also be significant differences between implementations is the I/O model, where WASM interacts with the outside, receives and sends data etc.: there the capability-based security model may see vastly different implementations in terms of security and performance.

A compute performance benchmark might want to stay away from that; on the other hand, not everyone is happy to just compute Pi all day.
You wouldn't say that about CPUs. GCC and LLVM don't consider anything less than 2x insignificant. The SpiderMonkey and V8 teams surely care about smaller than 2x differences in performance.

You just pulled this figure out of nowhere, and I can pretty much guarantee that you wouldn't accept a 2x performance difference in many performance contexts. Like, even your Google Earth benchmark, which you admit isn't based on any fundamental need to use Google Earth as part of your job, but just an arbitrary line in the sand you decided to draw.
Sure it's personal, I don't claim to be 100% objective in what I buy or use. But a lot of it is based on feedback from the people who get IT services from me, both corporate and family/friends.

Twice the performance or value at the same price is usually a call for at least an investigation; anything less, and I might be looking for excuses to stay put, especially if the financial or human impact isn't felt. Halving the effort is a computer science classic, really: logarithmic instead of linear, or god forbid exponential.

And 2x seems to work for quite a few more people, too, an easy sell. When I propose a 2-5% performance increase, my bosses rarely care, unless it's also on the bottom line. That's because we're not a hyperscaler. For AWS a 1% difference can pay a million $ bonus to whoever makes it happen.

Firefox went Rust, because globally those inner loops in a browser are the most executed code on the planet. A 1% difference could light up a small country. But mostly it needed to be safe and fast.

Absolutely not. It's basically just a modern alternative to Java bytecode.
Both were driven by very similar motives and in my view the precursor is UCSD p-code....
...while these days not even x86 CPUs execute their machine code directly, but do a similar WASM-like transformation on the fly.

The "modern" aspect of WASM is just that: the need for native binary speed.

If they were happy with what Java or JavaScript can offer, they'd have stuck to it. Few people get paid if software just looks more modern.

"absolutely wrong" clearly isn't right, when native speed was the stated primary goal of WASM and portability the necessity.
I think you should stick to supportable facts, and not try to make factual claims that simply align with your values and imagination.
Like any good LLM (and evidently they all learned from me), I also hallucinate: they can't help but pick up our habits.

Which means I use System 1, in the sense of Daniel Kahneman's "Thinking, Fast and Slow", when System 2 would have given the more objective answer but cost too much time and energy.

It also means I abuse you to check my facts, ...which you actually seem to enjoy a bit.

Unfortunately it only means we might reach a consensus, not necessarily discover the real truth.
 
Just ran it in Chromium and got a score of only 12.51. The cores again initially boosted to 2.9 GHz, but this time burned only about 12.4 W, until the turbo limit expired and then it dropped to <= 12.0 W.

BTW, in case you didn't notice, the main page of that benchmark says (right near the top):
"(HINT: Firefox is usually the winner)"​

I guess you couldn't take a hint!
:D
As a fan of Firefox I was initially glad to see that remark (one for the team, go Firefox!), but it's also quite old and might have been more true when WASM support was still young and evolving. And then you win even with a small difference.

As I said, both multi-threading and SIMD support are relatively recent additions to WASM, and any I/O gets handled by the sandbox, which might be vastly different.

As a consequence of that hint (which I took seriously) I had done some A/B testing between Firefox and Chrome-derivatives and the differences turned out much smaller than that statement implied (or I might have hoped for, as a Firefox fan).

That got me thinking back then, and the result of the deliberation was that the WASM design goal was that there should be no difference.

In practical terms, Chrome/Brave often just didn't discover all the available cores on my big machines, and that could be the effect of it trying to avoid fingerprinting, as it actually defaults to a randomized number of threads, including odd numbers. So I had to manually adjust the thread count to get similar results, as sketched below.
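For reference, a quick sketch (my own illustration, with "worker.js" as a placeholder name) of what such a benchmark page most likely does to pick its default worker count, and why anti-fingerprinting settings can throw it off:

```typescript
// navigator.hardwareConcurrency is the only portable way for a page to ask how
// many logical cores exist; privacy settings may clamp or randomize the value,
// which would explain odd default thread counts like 5 or 7.
const reported = navigator.hardwareConcurrency;
console.log("logical cores reported to the page:", reported);

// One Web Worker per reported core ("worker.js" is a hypothetical worker script).
const workers = Array.from({ length: reported }, () => new Worker("worker.js"));
```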

And actually, on Chromium, using much higher thread counts than there are cores closes the gap to Firefox a bit, so indeed, in terms of multi-thread (Spectre?) safety measures, there seem to be differences between the two browser engines.

That said, I've just done a comparison on my main 24x7 desktop/server, a 5950X running Windows, and the difference between Firefox 140.2 and Brave 1.80.113 (Chromium 138...) was significant, although not like yours.

After adjusting Brave from 32 to 112 threads the difference got smaller: 155 instead of 115 for Chromium vs. 168 on Firefox, where any number above the 32 threads that CPU offers doesn't change things much.

But there is also a difference in single-thread performance: 9.21 Megarads for Brave, 9.35 for Firefox.

Without digging deeper into the amount of I/O, that's not easily explained, but it's far from the 21:12.5 results you were seeing, which definitely meet my 2x threshold for investigation.

But coming back to the original reason we went into all of this: if you compare Firefox to Firefox (or Chrome to Chrome), the numbers in this benchmark may give you a good-enough hint at the relative performance of the CPU underneath, even across ISAs (well, SIMD could make that tricky), where other benchmarks are harder to run or impossible to obtain.

You could consider it the poor man's Cinebench, and it avoids all the shenanigans that Geekbench has put into their benchmarks to make modern mobile chips look better via extremely short sprint runs or a strong focus on certain ISA extensions.

Both APUs, Zen 3 and Zen 4, are finally due to arrive today, so I'll dig into them and spare you my bla bla for a while.
 
Names can be misleading or not tell the full story. In this case it's perhaps a bit of both.
I'm done giving you the benefit of the doubt. You either know the lineage of their implementation, in which case you need to provide hard evidence, or you don't. I already wasted too much time on this nonsense.

Assembly language is human-readable machine code. An assembler turns it into binary machine code in a strict 1:1 mapping of mnemonic to instruction, because the two sit at the same machine abstraction level; it is not a high-level language.
I know exactly what assembly language is. WebAssembly is not that, because it's code for an abstract machine. Not only is there no direct mapping of the instructions, but its instructions are stack-based, whereas a real CPU uses registers.

These two facts mean a WASM implementation must do a lot more than a normal assembler. Basically, it transforms WASM from a true assembly language into something that's simply a low-level programming language. That means you must use a full compiler, and it opens the door for lots of compiler optimizations, which is surely among the reasons there was so much variation among the 10 different implementations in that benchmark I linked.

But if that were to turn out vastly different performance results, that would violate the purpose of WASM
Nope.

Compiler has an even bigger range of meanings than assembler. But WASM is typically the target result of compilation, not its source.
That's because people need a way to execute code in web pages. You could write JavaScript code, but it's neither the nicest nor the fastest programming language. Tools exist for letting you write in a different language and transforming it to JavaScript, but the resulting output is big and slow. So, what the industry wanted was some lower-level target that was more dense and easier to convert into fast machine code. We already had this, in Java bytecode, but Java has fallen out of favor.

These OP codes are at the level of a typical machine op code, so that in most cases they can be translated near 1:1 to a target ISA.
Did you ever look at WASM code? Have you ever written traditional assembly? The WASM virtual machine is a stack machine. The last example I saw of that is the 80x87. You cannot translate WASM 1:1 into target machine code, because almost no target machines are stack-based. Real machines use registers, which a WASM compiler must allocate, generating spills to the real hardware stack when the WASM virtual machine has more state than it can map to real ISA hardware registers.
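To make the stack-machine point concrete, here's a minimal hand-encoded module (the canonical "add" example, my own sketch) whose body is nothing but stack operations; the engine's compiler has to map that operand stack onto real registers before any of it runs natively:

```typescript
// A tiny WASM module exporting add(a, b) = a + b, encoded by hand.
const bytes = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00,       // "\0asm", version 1
  0x01, 0x07, 0x01, 0x60, 0x02, 0x7f, 0x7f, 0x01, 0x7f, // type: (i32, i32) -> i32
  0x03, 0x02, 0x01, 0x00,                               // one function of that type
  0x07, 0x07, 0x01, 0x03, 0x61, 0x64, 0x64, 0x00, 0x00, // export it as "add"
  0x0a, 0x09, 0x01, 0x07, 0x00,                         // code section, one body, no locals
  0x20, 0x00,                                           // local.get 0  (push a)
  0x20, 0x01,                                           // local.get 1  (push b)
  0x6a,                                                 // i32.add      (pop, pop, push)
  0x0b,                                                 // end
]);

WebAssembly.instantiate(bytes).then(({ instance }) => {
  const add = instance.exports.add as (a: number, b: number) => number;
  console.log(add(2, 3)); // 5, after the engine has register-allocated that stack code
});
```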

Evidently they each do their own variant, which might have a measurable impact on the speed of the compiler, not the resulting code.
Look at the benchmarks. First, the one I linked. Then, the results I got for Firefox and Chromium.

So if one variant already supports multi-threading or SIMD, while another doesn't, that should have measurable impact.
The benchmark you had me run is multithreaded. Their results database clearly indicates that many implementations support it, and that suggests your information on this subject is bad.

Where there could also be significant differences betweeen implementations is the I/O model, where WASM interacts with the outside, receives and sends data etc.: there the capability based security model may see vastly different implementations in terms of security and performance.
The benchmark you had me run is a simple compute benchmark. I didn't check the source code, but there's no way it should be making any API calls, other than occasionally copying a tile of results to the canvas.
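For context, the only browser API such a compute benchmark plausibly needs is something like the following, i.e. pushing a finished tile of pixels into the page's canvas (the tile size and element lookup are my assumptions, not taken from the benchmark's source):

```typescript
// Hypothetical sketch: copy a computed 64x64 RGBA tile into the page's canvas.
const canvas = document.querySelector("canvas") as HTMLCanvasElement;
const ctx = canvas.getContext("2d")!;
const pixels = new Uint8ClampedArray(64 * 64 * 4); // would be filled by the WASM compute kernel
ctx.putImageData(new ImageData(pixels, 64, 64), 0, 0); // cheap compared to the computation itself
```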

And 2x seems to work for quite a few more people, too, an easy sell. When I propose a 2-5% performance increase, my bosses rarely care,
That's because you're not in a race. If you built any kind of product that had easily measurable metrics which could be directly compared to those of competitors and represented the core value of the product, that story would probably change.

Firefox went Rust, because globally those inner loops in a browser are the most executed code on the planet. A 1% difference could light up a small country.
Their motivation for adopting Rust was security and reliability.

...while these days not even x86 CPUs execute their machine code directly, but do a similar WASM like transformation on the fly.
No, it's not like WASM. CPUs do not generate spills. They always have enough physical registers to support the ISA registers, so the only thing they have to do is wait for a physical register to become available, if they're all in use.
 
And actually on Chromium, using much higher thread numbers than there are cores, closes the gap to Firefox a bit, so indeed in terms of multi-thread (SPECTRE?) safety measures, there seem to be differences between the two browser engines.
Did you ever look at your CPU utilization when doing this? Maybe the browser limits how many threads are available to a WASM sandbox, so it can't hose your entire machine. Just a guess.

Also, if you have other processes running, then increasing the number of threads beyond what's physically available can usually crowd out other processes, although that depends on the thread scheduler. There's basically no other reason I can see why increasing threads beyond the number of physical hardware threads can improve performance on a compute-bound workload. If this were I/O bound, that would be a different story.

When I ran it, my machine was < 1%. Two back-to-back runs gave me exactly the same score, with Firefox. I didn't try this with Chromium.

the numbers in this benchmark may give you a good-enough hint into the relative performance of the CPU underneath and even across ISAs (well SIMD could make that tricky), where other benchmarks are harder to run or impossible to obtain.
As I said, it's doing floating-point computations, which makes these E-cores look much worse than an integer benchmark. I'd expect you to know that your benchmark should align well with your intended workload. So, are you using these machines for lots of floating point computations, then?
 
Did you ever look at your CPU utilization when doing this? Maybe the browser limits how many threads are available to a WASM sandbox, so it can't hose your entire machine. Just a guess.
When I bench, I observe pretty near everything: the full set of HWinfo sensors with the highlights as graphs on the second screen for Windows, perhaps also the Windows Task Manager or its resource monitor, and btop, s-tui or similar on Linux.

With fewer than the available threads, the CPU wasn't maxed out in utilization, which is no surprise, while of course it might still exhaust its thermal budget even with fewer threads.

I've got the anti-fingerprinting settings set to the max on Brave (Chrome-based), and I guess an artificially reduced number of cores may be one effect, especially with odd numbers like 5 or 7.
Also, if you have other processes running, then increasing the number of threads beyond what's physically available can usually crowd out other processes, although that depends on the thread scheduler. There's basically no other reason I can see why increasing threads beyond the number of physical hardware threads can improve performance on a compute-bound workload. If this were I/O bound, that would be a different story.
Same understanding here: it doesn't make sense.

But sometimes observations just don't match your expectations.

And if you can't find any obvious bugs in your methodology, you can evolve your preconception and test for new ones, or just go and have a beer.

I tend to dig down, at least a bit, because it's part of my job. But only as far as is useful for me (I might be challenged by people like you).

For me it looks like a bit of a performance bug in WASM thread synchronization on Chromium, but the fact that extra threads bring the overall result difference to the same relation as the single-core results (~5%) means that fundamentally Firefox and Chromium results remain comparable with that small fix; no personal need to dig in.

And that was, after all, my main purpose: being able to compare relative machine computing power across different ISAs, operating systems and browsers, without worrying too much about comparing apples with potatoes.

Compiled benchmarks just aren't that useful anymore; I've tried to use crypto-mining code in the past as some of the best-optimized code there is.

But a lot of that stuff turns out to be really hand-optimized inline x86 assembly code, even if they also provide a plain C alternative. At that point it becomes pointless for comparison, because the high-level code from the compiler evidently can't go near the hand-crafted routines even on x86 itself. The financial gains from bespoke code first got you hand-crafted optimizations, later even more expensive ASICs.

Good comparison benchmarks become more and more difficult to find when the trend toward ever more specialized ISA extensions becomes not only mainstream but a key differentiator, and vendor-provided, hand-optimized libraries the norm.
When I ran it, my machine was < 1%. Two back-to-back runs gave me exactly the same score, with Firefox. I didn't try this with Chromium.
Apart from heat soak, that benchmark's repeatability makes it useful.

And I'd dare say that when your results in Chromium were half of what Firefox had, CPU load might have shown some idle, even if TDP budget might have been exhausted on such a small machine.

That was one of the challenges I tried to explore with the mixed P/E Alder Lake mobile system, seeing which mix of P and E cores would yield the required/predicted compute throughput at the different TDP settings to come up with a setting that offered the minimal consumed energy for that workload.
As I said, it's doing floating-point computations, which makes these E-cores look much worse than an integer benchmark. I'd expect you to know that your benchmark should align well with your intended workload. So, are you using these machines for lots of floating point computations, then?
In my mind you're asking a question I don't care for, because I am not interested in E-cores or P-cores per se.

I am interested in relative performance and in being able to adjust my infrastructure's energy consumption to deal with a given workload at a low price, both purchase price and energy consumption.

Differentiating between BIG and little cores in that context is already a complication I don't really care about. I see that as technical debt incurred by chip vendors who couldn't design CPU cores that offered linear energy-performance scaling; it should be their problem to solve (e.g. via the OS scheduler), not mine.

Differentiating not only between BIG and little cores, but between different types of big and little depending on the instruction mix of the workload, is even worse. It's an even heavier technical debt incurred by Intel, who had to resort to pre-existing Atom cores to fill a gap underneath P-cores too hungry to operate in a low-power environment.

Intel seems to agree it's a crutch and they work hard on making that "FP weakness" of the new E-cores go away, e.g. on the "E-only" Xeons and I guess also on other "new E" cores. FP costs lots of transistors, but shrinks well, which makes this much less of an issue than with the first Atoms, where every transistor counted.

That doesn't stop the opposing trend of ISA extension differentiation, which is also going on, especially with the hyperscalers. Intel has also been putting crypto and network IP blocks on their weaker CPU cores, to support specific front-end use cases. We have a Cambrian explosion of new CPU species, brought about by hyperscaling and the differentiation and economies of scale it provides.

General purpose computing is becoming niche and in the cloud we may end up with one infrastructure per workload.

But for my lab use cases, general purpose is what I need most, so I mostly don't want to have to differentiate. But if the incentives are big enough (far beyond single digits), I use an accelerator, because there we are again talking about ratios even I find interesting enough to work on.

And AMD gets to be a big favorite of mine for not molesting me with all that extra complexity, especially since it doesn't come at the price of worse performance, efficiency, or economy. Even with the C and non-C cores AMD has introduced to make better use of chip area in energy-constrained environments, they don't have me worried about the instruction mix of a given workload. Instead, the more tightly packed C-cores just won't ever clock higher than their share of the TDP, as cores #5-12, would allow them anyway.
 
Compiled benchmarks just aren't that useful anymore; I've tried to use crypto-mining code in the past as some of the best-optimized code there is.

But a lot of that stuff turns out to be really hand-optimized inline x86 assembly code, even if they also provide a plain C alternative. At that point it becomes pointless for comparison, because the high-level code from the compiler evidently can't go near the hand-crafted routines even on x86 itself.
SPEC2017 is interesting for this. I guess I had assumed they avoid any inline assembly, but I know x264 is included and I know that has ASM for x86 and some other ISA targets. I believe they also froze the software versions, so no one adding new assembly would invalidate old results. However, now I wonder if the build scripts they use might explicitly disable hand-written ASM, since SPECbench is often seen almost as much a compiler benchmark as a hardware benchmark.

In any case, there are plenty of common workloads, like those included in SPEC2017, which do not include hand-written ASM. I'd regard GCC as one such instance, which does happen to be included in SPEC2017.

One could use Phoronix Test Suite to do some of this testing. 7zip and Linux kernel compilation are probably two good ways to characterize scalar integer performance.

Apart from heat soak, that benchmark's repeatability makes it useful.
Eh, well it's somewhat long-running and that makes it less susceptible to the initial conditions of the turbo state machine. I also made sure my machine was basically idle and I took care to let it cool down between runs. So, in my case, it would've been weird if it didn't return almost the same result both times. Also, my machine stayed just shy of throttling territory.

And I'd dare say that when your results in Chromium were half of what Firefox had, CPU load might have shown some idle, even if TDP budget might have been exhausted on such a small machine.
Is that a question? Because it's a testable hypothesis and I just tested it. So, the specific Chromium process went from using 0% of CPU time to 98.5% of the machine's compute capacity, for the duration of the benchmark. Furthermore, the machine was running at 99.5% user time and 0.5% system (kernel). So, basically all cores were busy on the benchmark, for its entire duration.

It got 12.45 this time (12.51 last time) - not sure why the difference, but 0.5% variability doesn't particularly concern me, especially when I'm running a desktop environment and not going particularly far out of my way to control background services. I guess one difference between this time and last is that I had a window open and running top, so I could monitor CPU utilization. However, I doubt that could've used 0.5% all on its own.

In my mind you're asking a question I don't care for, because I am not interested in E-cores or P-cores per se.
Well, we started talking about this in the context of Gracemont vs. Zen 3. In such a context, I was regarding Zen 3 as the P-core.

I am interested in relative performance and in being able to adjust my infrastructure's energy consumption to deal with a given workload at a low price, both purchase price and energy consumption.
Yeah, and unless your infrastructure is focused primarily on floating-point, I'd say this is not a good benchmark, as it doesn't characterize non-floating-point performance well.

General purpose computing is becoming niche and in the cloud we may end up with one infrastructure per workload.
Well, I do like that AMD at least gives us the option of buying the same (non-C) compute chiplets for desktop that they put in their server CPUs. Intel makes you buy a Xeon W, if you want their server cores in a desktop machine.
 
Anyhow with that I got (new Mini-ITX at the bottom):

| System | Geekbench v6 single | Geekbench v6 multi | Watts at outlet, screen off | Watts at outlet, screen on | HWinfo CPU idle, screen on | Watts with WASM at outlet | Watts WASM, HWinfo CPU | WASM score |
|---|---|---|---|---|---|---|---|---|
| Ryzen 5800U laptop | 1879 | 6297 | 3.5-5 | 7.5-9 | 2.6 | 25-21 | 15-12 | 52.95 |
| Ryzen 7435HS laptop | 2017 | 9868 | 5 | forgot | sorry | 26 | 13 | 53.57 |
| Pentium N6005 NUC | 565 | 1661 | 7.5 | 10-11 | 4.5 | 29-26 | 17-15 | 18.26 |
| Ryzen 5825U Mini-ITX | 2006 | 8463 | 9.5 | 10 | 2 | 110-90 | 44-37 | 95.53 |
| Ryzen 8845HS Mini-ITX | 2661 | 13718 | 15 | 20 | 3.6 | 105-91 | 65-54 | 118 |
I've since done a first round of testing with both the €191.59 8-bay NAS AMD Ryzen 7 5825U board with fan and its slightly more modern AMD 9-bay NAS Ryzen 8845HS with dual SFF-8643 and fan for €371.70 (last two lines).

These were tested with a be quiet! 400 Watt ATX PSU, which may not be the best choice for the lowest power consumption at the wall: if you want to go really low there, you'll need something else. But if you actually use these as NAS boards, you'll also need something able to power the drives.

Both boards allow a broad range of control over the TDP settings; their BIOS is completely open, and not all settings may actually work or be safe to use. The 5825U comes with a 15 Watt default, but I settled on the 25 Watt setting, because the very small copper fan that came with the unit managed to keep the heat down even with the most aggressive benchmarks, and idle consumption remained the same.

The 8845 comes with a 45 Watt default, which I left on. Both will boost much higher for a second or two, then do a 3-minute turbo after that, and then settle on those "TDP" settings for sustained load. It's the typical "P-core" turbo sprint behavior introduced on Intel laptops, mostly designed for fast interactive responses that can't be sustained for lack of battery power or cooling. It might look like cheating when you pay desktop prices for it, but when you get it at Atom budgets, I can only see it as a benefit. And it can still be turned off via TDP settings in the BIOS.

So it's in fact 44-37-25 Watts on the 5825U and 65-54-45 on the 8845HS as measured by HWinfo for the APU; power consumption at the wall outlet is correspondingly higher. Both boards ran with 64GB of RAM, a 2TB Samsung 970 Evo+ NVMe and one on-board 2.5 Gbit Ethernet connected, display active at 4k HDR 120/144 Hz (except for the screen-off column).

The 5825U officially supports ECC, so I went ahead and ordered a set of DDR4-3200 ECC sticks at €300 (still en route), which seems a bit out of line for the base price, but if you want to trust your family's digital treasures to a NAS, I tend to go with RAID6 or the matching ZFS variant and ECC to match. Not sure I'll keep them, unless I'm fairly certain ECC works.

It's sold by Topton, which I've had issues with when it came to DRAM compatibility. This board initially also didn't want to POST until I went down to a single stick of DDR4-2400. Turns out it just requires a reset after an initial POST attempt with a changed RAM configuration, but then it accepted the 64GB of DDR4-3200 (no ECC) I still managed to grab for €100.

The typical BIOS options for ECC support are there, with background scrubbing etc.; whether they actually work I'll try to find out as best as I can. The BIOS reports a date of 2012, some 10 years older than Barcelo, which isn't encouraging. And of course there are no updates; not sure it's even legal.

The 8845 doesn't have an ECC option; AMD is definitely moving ECC into luxury territory, which is what I used to 'love' Intel for. The memory worked flawlessly and at top DDR5-5600 speeds with the Kingston sticks I gave it. There are tons of settings in the BIOS, but these mobile APUs really aren't designed for overclocking by AMD. The main dial is the ability to set the base TDP, and that gives me all the options I want from this class of devices. You can play with the individual settings, but there is a prepared eco setting for 35 Watts instead of the 45 Watt default, which might be well tested and optimized.

In theory the iGPU should be able to freely share RAM with the OS; in practice many things work better with a hard allocation for the iGPU. With 64GB to go around I've chosen "game optimized", which translates to 4GB of UMA reserved for my tests, which included a bit of gaming on Windows.

The 5825U came with a small copper fan included, which is really just fine for the power envelope and runs very quiet.

The 8845HS comes with the usual copper shim to fit it to LGA1700-type coolers. This one comes with a tiny backplate for the shim that allows for pretty near anything cooler makers come up with these days. I've just used a small Noctua NH-L9, instead of the fan which came in the package, because I had it lying around. It's very quiet and cools it perfectly. Duronaut paste all around, mostly because all my Noctua paste leftovers have been exhausted.

Windows gaming on both is impressive enough even with the Vega, and the 780M RDNA3 even has some muscle at 1080p: I had to try, but that's not how I'll use them. Still, with a modern mid-range dGPU these certainly would work pretty well, since CPU performance really isn't that much of an issue, unless you're fixated on top frame rates instead of just ~60-100Hz @1440p. I try to keep potential use cases for my hardware open...

My favorite data visualization benchmark, Google Maps in 3D at 4k, worked buttery smooth on both; the Vega topped out at 120Hz on the HDMI port of my monitor, the RDNA3 went up to 144Hz, both with HDR, 10 bits per pixel and the full RGB range.

Both are NAS designs without Wi-Fi, but with on-board upright USB2 ports (for boot sticks) and the ability to (auto-)restart on power-on. PCIe lanes are limited and mostly used by the onboard devices: 8/9 SATA ports via ASMedia and 2/4 2.5Gbit Intel NICs cost lanes. The 5825U is strictly PCIe v3 and has 8 lanes left in an open-ended x8 slot, with a single M.2 slot for NVMe or 10Gbit Ethernet. Another M.2 slot only offers a single PCIe v3 lane, which could fit more SATA ports, Wi-Fi or a leftover M.2 drive that's better there than in a drawer.

The 8845HS is PCIe v4, also with 8 electrical lanes on an x16 connector, and its two M.2 slots are physically split into x2 each: you can't get x4 performance out of your NVMe drive(s). It's a NAS move and could be unexpected. But it matches the fact that a lot of entry-level NVMe drives offer v4 bus speeds but only v3 x4 performance.

The USB4 port on the 8845 was able to drive my Aquantia Thunderbolt 10Gbase-T network adapter at full 950MByte/s, even with an OWC TB4 dock in between, which was unexpected and a relief, because it means that the x8 PCIe port remains free for something awesome, without compromising on connectivity.

Now if only it also did real ECC...
 
SPEC2017 is interesting for this. I guess I had assumed they avoid any inline assembly, but I know x264 is included and I know that has ASM for x86 and some other ISA targets. I believe they also froze the software versions, so no one adding new assembly would invalidate old results. However, now I wonder if the build scripts they use might explicitly disable hand-written ASM, since SPECbench is often seen almost as much a compiler benchmark as a hardware benchmark.

In any case, there are plenty of common workloads, like those included in SPEC2017, which do not include hand-written ASM. I'd regard GCC as one such instance, which does happen to be included in SPEC2017.

One could use Phoronix Test Suite to do some of this testing. 7zip and Linux kernel compilation are probably two good ways to characterize scalar integer performance.


Eh, well it's somewhat long-running and that makes it less susceptible to the initial conditions of the turbo state machine. I also made sure my machine was basically idle and I took care to let it cool down between runs. So, in my case, it would've been weird if it didn't return almost the same result both times. Also, my machine stayed just shy of throttling territory.
My main interest at this point isn't comparing ISAs or the finer points of their implementations as much as comparing what type of computing performance you can get for a given Wattage and price.

Atoms and ARM SBCs gave me what I needed for functional testing and they were nice on the ear and the power meter.

And I have my big machines for the heavy loads, but I can't run those 24x7 and survive the heat they generate.

So I am curious about these new in-between machines, which combine a cost just slightly above the Atom range with near-desktop performance, certainly at peak. They seem to offer the best of both worlds, and you can tune or configure them along a line that starts just above the Atom at 15 Watts or even less yet extends to 90 Watts peak and 55 Watts sustained, at an idle that's "unnoticeable" in terms of noise and heat.

And those two boards (see above) represent interesting intermediate points: the 5825U, at less than €200 for the board, is certainly a great drop-in replacement for my N6005 Atoms, while it has vastly superior capabilities, potentially even including full ECC support, if it were to be recycled as a NAS later.

The 8845 isn't that much cheaper than the Minisforum BD790i which offers 16 full Zen 4 cores at €450 and 24 lanes of PCIe v5, but unfortunately also without ECC support, which plainly sucks.

All three represent an incredible amount of functionality within a price range that doesn't seem to leave much space for a price/performance competitor.
 
Both boards allow a broad range of control over the TDP settings; their BIOS is completely open, and not all settings may actually work or be safe to use. The 5825U comes with a 15 Watt default, but I settled on the 25 Watt setting, because the very small copper fan that came with the unit managed to keep the heat down even with the most aggressive benchmarks, and idle consumption remained the same.
I wonder if the easiest way to get a low-power but still capable home server is grabbing a NUC, MinisForum system, or similar, and either casemod or transplant the guts to a larger case – but definitely something with a bigger heatsink than those tiny form-factor systems come with.
 
I wonder if the easiest way to get a low-power but still capable home server is grabbing a NUC, MinisForum system, or similar, and either casemod or transplant the guts to a larger case – but definitely something with a bigger heatsink than those tiny form-factor systems come with.
Nothing is really easy in this space, I'm afraid, once you want to modify what vendors deem product worthy.

Mini-ITX Atoms had a short moment in history and then Intel decided to do NUCs.

Those were great in that they were first to allow powerful mobile CPUs in a "not laptop" via this new form factor of a NUC.

But while they were cute and small, they didn't offer the expandability of the slightly bigger Mini-ITX: it was either perfectly suited to your task or you were out of luck.

No matter, marketing and design over function won a few rounds and Mini-ITX didn't get those speedy mobile SoCs it could have really thrived on.

NUCs had Intel salivating, because they were selling a laptop without many of the expensive bits at the same price...

...only to have many of them wind up in warehouses, because few were as enthusiastic about that deal as Intel's marketing.

And that started NUCs, especially gamer variants, selling at far below original MSRP. Some really excellent deals could be made there from Intel bleeding: I loved those deals as much as Intel probably hated them, ...what was left of it.

NUC mainboards still suffer from a form factor that's really cute, but who cares when growing flat screens hide much bigger attachments?

There was a time when various companies offered passive chassis variants for NUCs (Akasa?), but those did cost enormous amounts of money and couldn't really solve the problem that at 15 Watts or more passive cooling runs into a wall put up by physics, while many users would have been happy with "unnoticeable" over "passive/silent".

I'd say that Mini-ITX still offers a reasonable compromise between flexibility/expandability and the ability to glue it to the backside of your screen.

In China and on AliExpress they vote with their feet and offer Mini-ITX, NUCs with eGPU options, and various other form factors, including truly tiny ones. Everywhere else they seem far less nimble, or prefer to get bogged down in ideological debates.
 
The 5825U officially supports ECC, so I went ahead and ordered a set of DDR4-3200 ECC sticks at €300 (still en route), which seems a bit out of line for the base price, but if you want to trust your family's digital treasures to a NAS, I tend to go with RAID6 or the matching ZFS variant and ECC to match. Not sure I'll keep them, unless I'm fairly certain ECC works.
I haven't delved into this, but you can find some info on how to check that the system is properly configured and the correct driver is loaded for ECC error reporting. I think that all depends on having enabled it in the BIOS configuration menus.
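As a starting point, here's a hedged sketch (assuming a Linux EDAC driver actually registers the 5825U's memory controller, which is the open question) of where those error counters show up; if the directory is missing, ECC reporting almost certainly isn't active:

```typescript
// Sketch for Node.js: list EDAC memory controllers and their corrected (ce) /
// uncorrected (ue) error counters from sysfs. The paths are standard Linux EDAC.
import { existsSync, readdirSync, readFileSync } from "node:fs";

const edacRoot = "/sys/devices/system/edac/mc";
if (!existsSync(edacRoot)) {
  console.log("No EDAC memory controller registered; ECC error reporting is not active.");
} else {
  for (const mc of readdirSync(edacRoot).filter((d) => d.startsWith("mc"))) {
    const read = (file: string) => readFileSync(`${edacRoot}/${mc}/${file}`, "utf8").trim();
    console.log(`${mc}: corrected=${read("ce_count")} uncorrected=${read("ue_count")}`);
  }
}
```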

The 8845 doesn't have an ECC option; AMD is definitely moving ECC into luxury territory, which is what I used to 'love' Intel for.
Most socketed Intel i3 CPUs support it, I guess because their Xeon E lineup doesn't usually overlap with it. So, I had a fileserver with a Haswell i3, which I put in an ASRock Rack mini-ITX board. On the downside, that CPU ran a bit hot. I'm pretty sure my current Ryzen 5800X server idles at the same or lower power.

Duronaut paste all around, mostly because all my Noctua paste leftovers have been exhausted.
Yesterday, I just used Thermal Grizzly's KryoSheet graphene-based product, for the first time. It's working pretty well. Nearly as good as the Arctic MX-6 I had previously used with the same CPU and Noctua cooler. The main benefit of KryoSheet is that it's supposed to be maintenance-free and lasts forever. I actually put it in that 5800X server I mentioned. I'll probably post up a thread with some head-to-head measurements I took vs. MX-6.

KryoSheet installation requires care, but it was overall probably about the easiest application on a CPU that I've ever done. They now include some silicone oil to help hold it in place, which worked brilliantly.

My favorite data visualization benchmark, Google Maps in 3D at 4k, worked buttery smooth on both
I'm the opposite. My bulk/backup server has an AST2500 BMC and uses software rendering for all graphics. There exists a Mesa driver for the BMC, but Ubuntu doesn't include it, for some reason. If they did, I know it'd be dog slow. But, since the machine is mainly headless, I really don't care either way. If I can find my old HD 5450 card, which is passively-cooled, I might throw it in there just to get dual-monitor support if/when I ever need to login to it, but it's totally usable as is.

The USB4 port on the 8845 was able to drive my Aquantia Thunderbolt 10Gbase-T network adapter at full 950MByte/s
How did you measure that? I just used iperf3 and could not coax more than 9.2 Gbps out of my external loopback setup. Not that I really care, but it does make me curious.
 
I wonder if the easiest way to get a low-power but still capable home server is grabbing a NUC, MinisForum system, or similar, and either casemod or transplant the guts to a larger case – but definitely something with a bigger heatsink than those tiny form-factor systems come with.
How many ports do you need and what LAN speed do you want? If you can deal with 4x SATA ports, 2.5 Gbps Ethernet, and one PCIe 3.0 x4 NVMe drive, HardKernel has an N305 board you can use (passively-cooled, even) with a mini-ITX conversion kit.


It supports in-band ECC, as I mentioned. This will replace my mini-ITX N97 as a microserver. I actually have it, but haven't yet gotten around to transitioning from my N97 mini-ITX board.


I wouldn't recommend that board, mostly due to its inadequate heatsink. Otherwise, it's fine, but idle power is a couple of Watts higher than I'd wish. I was able to get a BIOS version supporting in-band ECC by inquiring with the distributor (mitxpc.com), who got the image from the manufacturer. MITX was actually great to deal with, BTW. Given that the board is also more expensive than the ODROID H4 Ultra, has half the cores, and has only an x2 M.2 port, pretty much the only marks in its favor would be the second Ethernet port, all the built-in serial ports (none of which I need) and its pedigree as an "industrial" board (not sure if that counts for anything).

No matter, marketing and design over function won a few rounds and Mini-ITX didn't get those speedy mobile SoCs it could have really thrived on.
LattePanda Mu also has a mini-ITX base board option. So, mini-ITX isn't as dead as you think for these "Atoms". It's also one of the OEMs which supports in-band ECC on them.

There was a time when various companies offered passive chassis variants for NUCs (Akasa?), but those did cost enormous amounts of money and couldn't really solve the problem that at 15 Watts or more passive cooling runs into a wall put up by physics, while many users would have been happy with "unnoticeable" over "passive/silent".
HardKernel's ODROID-H4 can support passive operation. However, if you put it in "Unlimited Mode", you'll have to add a fan for it to avoid throttling.

I should add that they're talking about their smaller form-factor cases, not even a regular mini-ITX setup. And I believe that the default mode still adheres to the stock TDP settings. The H4 has a really big heatsink!


I think your sample size is too small. You seem to assume that everything in existence will show up on AliExpress, but they do not, in fact, have everything.
 
Yesterday, I just used Thermal Grizzly's KryoSheet graphene-based product, for the first time. It's working pretty well. Nearly as good as the Arctic MX-6 I had previously used with the same CPU and Noctua cooler. The main benefit of KryoSheet is that it's supposed to be maintenance-free and lasts forever. I actually put it in that 5800X server I mentioned. I'll probably post up a thread with some head-to-head measurements I took vs. MX-6.

KryoSheet installation requires care, but it was overall probably about the easiest application on a CPU that I've ever done. They now include some silicone oil to help hold it in place, which worked brilliantly.
I've dabbled with it; the main issue was that my setup didn't generate enough pressure, because the base plate and the CPU cooler weren't compatible. When I inspected it, it tore up very easily, so in the end it was another €25 spent on experimentation.

In the meantime I managed to find a cooler for that 1700 format which actually screws in from the top and allows using the custom baseplate which came with that mobile-on-desktop shim.

Not quite as good as the interim with liquid metal, but likely to last much longer. I still have the phase-change material to test, but quite honestly I'm back to paste, because it just works, is easiest to use, and I don't want to keep track of which material I used on which system.

And on most systems I'm far from pushing top heat. Just got a big tube of Duronaut for peace of mind and it's emptying faster than I would have thought.
How did you measure that? I just used iperf3 and could not coax more than 9.2 Gbps out of my external loopback setup. Not that I really care, but it does make me curious.
iperf3 -p <myport> -c <my-server> -P 4

Nearly all of my systems run Aquantia NICs and most are first-generation AQC107 or their Sabrent TB3 equivalent, which are actually just reported as AQC107 on Linux. 9.5 Gbit/s is the typical result, same with a lone Intel I also run, but it's only got Aquantias to talk to.

The -P 4 helps get better results via parallel streams, and you could call that cheating or a better match for server loads. In my case I just wanted to make absolutely sure that it's actually not bottlenecking anywhere.

It's unclear just how well the USB part of the APU is connected internally and how it would react with the dock and an extra 4k display over it, etc. HWinfo 'sees' an x1 link somewhere inside the APU, which wouldn't be enough at PCIe v4, but is probably just wrong reporting.

Thunderbolt can be tricky even on Intel, and it's not packet switching but using bandwidth allocations, where plugging and unplugging changes things. I've had plenty of issues when I actually tried to use both TB ports on Intels which have them.

My Sabrent TB3 adapters are invariably seen as AQC107 on my Intel systems on Linux and work as a PCIe v3 x2 device there, just good enough for 10Gbit Ethernet.

It's only with the AQC113 series that Marvell mentions USB4 support, so I really didn't think it would work, especially with the TB4 dock in between...

But there you go, sometimes you do get lucky, but only if you already own a TB 10GBase-T adapter. They currently run €200, near twice what I paid for them, while normal AQC107 were more like €60-80 when I bought them.

I've been wishing for affordable 10GBase-T Ethernet for at least a decade, yet somehow vendors managed to always make it way too expensive.

Now 5Gbit USB3.2 Ethernet NICs are coming down to €25 from Realtek, and there should really be a €50 option for 10GBase-T before it becomes just too slow to consider, with ordinary NVMe storage already at 10x that speed.

I can't quite see myself going to a 25/40/100 Gbit network, but with a free x8 slot, there is at least still some potential.
 
iperf3 -p <myport> -c <my-server> -P 4
Thanks. --parallel|-P is the one thing I didn't really mess with. I tried -P 2, but it didn't seem to help and the output was a lot more cluttered. I was already thinking maybe I should go back and re-test with more parallelism, since it'd be nice to see just how close I can really get.

BTW, I have two 10GBase-T ports, which meant I could do a --bidir test. I'm consistently hitting 9.0 Gbps in each direction, simultaneously.

In exchange, I'll give you a tip: try disabling interrupt coalescing. You can use ethtool -c <interface> to see if any coalescing is currently enabled. If so, see the manpage on how to disable. I recommend doing this only for benchmarking purposes. I wouldn't leave it disabled, since doing so is too taxing on your CPU and has virtually no practical advantages, for most home users.

I've been wishing for affordable 10GBase-T Ethernet for at least a decade, yet somehow vendors managed to always make it way too expensive.
Yeah, the price of switches has seemed to be the biggest barrier for a while now.

Now 5Gbit USB 3.2 Ethernet NICs from Realtek are coming down to €25, and there should really be a €50 option for 10GBase-T before that's just too slow to even consider, given that ordinary NVMe storage runs at ten times those speeds.
I'm running mostly 2.5Gbps, right now. Would love to upgrade to 5 Gbps, if/when switches get cheap.
 
I've been wishing for affordable 10GBase-T Ethernet for at least a decade, yet somehow vendors managed to always make it way too expensive.
This is why I've never gone away from SFP (at the time I first went 10Gb I just got a pair of "cheap" X520s and direct connected) as there are cheap switch options on the table if I want them (~$130 gets you 8x SFP 10Gb ports). I ended up getting a switch with 2x SFP 10Gb, 2x RJ45 2.5Gb and 8x RJ45 1Gb because at the time anything that had 2x SFP 10Gb with 8x 2.5Gb was over double the cost. Even though I'd rather have a better balance this still works for me as my server box is 10Gb, as is my primary, my router box is 2.5Gb and everything else is 1Gb or 100Mb.
 
This is why I've never gone away from SFP (at the time I first went 10Gb I just got a pair of "cheap" X520s and direct connected) as there are cheap switch options on the table if I want them (~$130 gets you 8x SFP 10Gb ports). I ended up getting a switch with 2x SFP 10Gb, 2x RJ45 2.5Gb and 8x RJ45 1Gb because at the time anything that had 2x SFP 10Gb with 8x 2.5Gb was over double the cost. Even though I'd rather have a better balance this still works for me as my server box is 10Gb, as is my primary, my router box is 2.5Gb and everything else is 1Gb or 100Mb.
"copper" SFP cables didn't exist (or cost their weight in diamonds) when I started on 10Gbit Ethernet, it was real fibre initially and 10GBase-T became an option only some time later. Initial NICs and ports were over 10 Watts per port just for the complex PHY modulations, so early NICs either had fans or required a server chassis with air flow to have survive.

In the corporate lab I used a lot of cross-connects from the typically dual-port NICs, because ordering a 48-port 10GBase-T switch for some reason took ages, while the NICs came within weeks. Evidently somebody had decided that anything faster than Gbit networking in the company had to be optical...

Only with NBase-T, and Aquantia's 10GBase-T as part of it, did the whole thing really become affordable in both power consumption and cost; no idea where the cost crossover with SFP would be today.

In the home-lab I didn't want to mix cable types, and I got plenty of leftover Cat-7 cables from my corporate colleagues in the network trenches. Once 10GBase-T switches dropped to below €50/port I just got myself a pair of those and use them as my home-lab backbone. Since 2.5GBase-T became cheap, I run a few of those in the house for everything else to connect to. I just got my last one the day before I saw 5Gbit offered at similar prices...

I went back to cross-connects using 100Gbit Mellanox NICs in a later corporate lab implementation; at the time Mellanox promised to support both Ethernet and InfiniBand personalities without the need for a switch!

Once Nvidia bought them, switchless IB got dropped very quickly, which was a bit nasty, because CUDA at the time supported IB routing.

All long gone now; 100Gbit is "sneaker net" compared to what modern NVLink or similar fabrics do.

In the home-lab I'll probably draw the line at 10Gbit. It's really mostly VM copying and migration, and while Gbit networks will have me shouting and yelling, 10Gbit is an invitation to another cup of something, gladly accepted.

It's mostly mental and terribly subjective.
 
  • Like
Reactions: thestryker
... so early NICs either had fans or required a server chassis with airflow to survive.

...

Only with NBase-T, and Aquantia's 10GBase-T as part of it, did the whole thing really become affordable in both power consumption and cost; no idea where the cost crossover with SFP would be today.
Other than relatively high idle power consumption, even the X520s (these were Intel's first 10GBase-T cards, if I'm remembering right) never had an issue running. I'm sure that wouldn't have been the case if I used transceivers and RJ45 instead of SFP copper (or even one of Intel's RJ45 cards, as those use way more power). I'm using X710s now and they're better, though not down to Aquantia idle power, but closer under load. The biggest issue with SFP over copper is a more limited distance, but fortunately that's a non-issue for me.
In the home-lab I'll probably draw the line at 10Gbit.
Same here, though I dream of 25/100Gb just for something new. I originally got 10Gb because I just needed something faster than 1Gb and there were no 2.5Gb consumer products. That's how I ended up with two X520s and an SFP cable running between them, and I never looked back. I've never felt the need to update more systems, but I'm sure I'll feel differently the first time I need to copy a lot of data.
 
  • Like
Reactions: abufrejoval
Other than relatively high idle power consumption, even the X520s (these were Intel's first 10GBase-T cards, if I'm remembering right) never had an issue running. I'm sure that wouldn't have been the case if I used transceivers and RJ45 instead of SFP copper (or even one of Intel's RJ45 cards, as those use way more power). I'm using X710s now and they're better, though not down to Aquantia idle power, but closer under load. The biggest issue with SFP over copper is a more limited distance, but fortunately that's a non-issue for me.
10Gbit wasn't just a speed shift but an attempt to redefine networking into unified fabrics and smart NICs.

So you had Ethernet and Fibre Channel vendors pushing converged adapters, which veered away from the very classic Ethernet assumptions towards Token Ring arbitration and Fibre Channel reservations.

The first 10 Mbit Ethernet at West-Coast Xerox was CSMA/CD on that thick wire: everyone was free to transmit at any time, and stations could just detect that something had gone wrong, e.g. with two or more accessing the shared medium at the same time, and then use a random back-off period. I basically hear the Beach Boys writing about this: a very laissez-faire attitude around communications, harking back to the origins of ARPANET. Or CB radio.

And that's just fine when there isn't a lot of traffic going on and that traffic isn't critical or time-dependent and can be retried via TCP/IP, etc.

In other words: not storage.

Fibre Channel and Token Ring were a) redundant, with a ring that always had an alternate path if one direction failed, and b) collision-free, by having participants wait for an arbitration token. By contrast, a very formal, suit-and-tie corporate approach, with the likes of IBM behind it.

In other words, degradation was less likely to happen and was managed in a timely manner, which made storage use possible, where timeouts are considered hard failures and thus needed to be avoided by the fabric sitting on top of disks that were bad enough on their own.

So initial 10Gbit ASICs and NICs came from storage companies like Emulex, but also from Ethernet companies like Broadcom, who tried to unify networking and SANs and also offered matching ASICs for the switch side of the equation.

Add on top of that the early days of virtualization, when it was thought that virtual machines needed hardware assists to overcome crippling virtualization overhead, giving every VM its own virtual NIC and direct access to its data buffers and control interfaces.

That's something Intel specialized in after their VMware shock, as they were neither in the SAN nor the converged camp, but very much wanted to push an Intel tax on virtualization.

That bloodbath lasted several years, and it was Aquantia who broke it via an economic breakthrough: they went with the lowest common denominator of Ethernet variants, taking only the 1Gbit functionality and scaling it across 1/2.5/5 and 10 Gbit. Their other key ingredient was low-power PHYs, using advanced fab nodes and power negotiation on the wire to reduce the power cost from more than 10 Watts to less than 3 Watts per port, allowing for passive cooling even without server-class airflow.

But that breakthrough got muddied very quickly by whoever survived the smart-NIC ASIC wars or was able to rule over motherboards, resulting in Aquantia flaming out a bit.

No idea if Netflix will ever take up this story; it may just be too boring for almost everyone, but I've been an involuntary participant for at least a decade.
 
Last edited:
  • Like
Reactions: thestryker
"copper" SFP cables didn't exist (or cost their weight in diamonds) when I started on 10Gbit Ethernet, it was real fibre initially and 10GBase-T became an option only some time later. Initial NICs and ports were over 10 Watts per port just for the complex PHY modulations, so early NICs either had fans or required a server chassis with air flow to have survive.
Well, I think DAC (Direct-Attach Copper) cables have been around for at least as long as SFP+, which is like 15 years.

10GBase-T is energy-intensive, for sure. I was doing that loopback testing I mentioned as a way to stress-test the controller on my server board. It's an Intel X550-AT2, which is a decade-old chip made on 28 nm and spec'd to consume up to 12W for 2 ports. What makes it challenging to cool is that its operating temperature range only goes up to 55 C, for some reason. I wonder if that's related to electrical noise?

Since 2.5GBase-T became cheap, I run a few of those in the house for everything else to connect to.
Same. I recently downgraded switches to one with 4x 2.5G ports + 2x 10G (SFP+). A second switch is just gigabit. The switch it replaced had 2 ports at 5 Gbps and one 10G port that was 10GBase-T, but it had a fan and used like 3x as much power.

All long gone now; 100Gbit is "sneaker net" compared to what modern NVLink or similar fabrics do.
Yeah, but that's also mostly for memory-to-memory data sharing, not served from storage. Unless you're doing supercomputing or large-scale AI, there's no way you can use that kind of bandwidth.

In the home-lab I'll probably draw the line at 10Gbit. It's really mostly VM copying and migration, and while Gbit networks will have me shouting and yelling, 10Gbit is an invitation to another cup of something, gladly accepted.
I never really embraced VMs. I skipped right past them to containers. Those are more space-efficient, as well as being lighter-weight at runtime.

I've never felt the need to update more systems, but I'm sure I'll feel differently the first time I need to copy a lot of data.
Given that my bulk data is on hard disks, I have no real need of anything super-fast. It also helps that NFS has client-side caching.
 
Last edited:
But that breakthrough got muddied very quickly by whoever survived the smart-NIC ASIC wars or was able to rule over motherboards, resulting in Aquantia flaming out a bit.
Networking feels like it's been bashed really hard by consolidation. Aquantia ended up being bought by Marvell, Mellanox by Nvidia (Intel failed with their bid here and had been underinvesting in networking for years), and Broadcom by Avago (who renamed themselves to Broadcom to try to wash away the stink) and then tried to get Qualcomm, while Bigfoot and Atheros were bought by Qualcomm.

Realtek has been driving lower cost on the consumer side, but we're just now seeing 5Gb from them. Intel technically has the Killer E5000, but it's using the RTL8126 controller, so they haven't even made their own 5Gb controller yet.
 
  • Like
Reactions: bit_user
Intel failed with their bid here and had been underinvesting in networking for years
Are they now completely out of the sector?

That would be quite ironic, since Nvidia and AMD have recently both gone into networking, big time. AMD with Pensando (and maybe a couple of other acquisitions?).

Realtek has been driving lower cost on the consumer side, but we're just now seeing 5Gb from them. Intel technically has the Killer E5000, but it's using the RTL8126 controller, so they haven't even made their own 5Gb controller yet.
LOL, when Realtek is our savior!

BTW, my Intel X550 can do NBase-T. I'm currently running it in 2.5G mode. The HDDs behind it could support closer to 5Gbps, but the two main systems that are backing up to it only have 2.5 gigabit NICs, so running it faster is currently pretty pointless.
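In case it's useful: on Linux, ethtool will show what an NBase-T port has negotiated and can pin the advertised rate. This is a sketch with eno1 as a placeholder interface name; the hex mask is meant to be the ethtool link-mode bit for 2500baseT/Full, so double-check it against your ethtool version before using it:

# show supported/advertised link modes and the current speed
ethtool eno1

# restrict advertisement to 2.5G only (2500baseT/Full)
ethtool -s eno1 advertise 0x800000000000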

I wish 2.5 gig had come along like 10 years ago. That's when we needed it. It was just starting to go mainstream in 2019, and then the pandemic hit and threw everything into turmoil. If it had gone mainstream 5-10 years ago, I'd bet 5 Gbps would be fully mainstream by now.