Question: Live backup PC

Aug 29, 2024
I operate a business that depends mostly on design by simulation, which means constant operation at very high CPU utilization on big multi-core PCs. Because of the high utilization, I seem to kill them on a fairly routine basis: three Dell 7820 Xeon Golds in the last two years, which is worse than my average; I probably kill one every second year on average. We can talk about what's dying separately, it doesn't matter here; the issue at hand is DOWN TIME.

We have a robust backup solution, first to NAS and then up to cloud, but the trouble is that it takes Dell a good 3 - 6 weeks to build and ship us a new PC each time I kill one, and that's a huge interruption in business. There are many times when even 2 hours is going to cost us a lot of money, and so I'm thinking I'd do well to at least have one backup PC ready to drop in place anytime I manage to kill the one I'm operating. This backup doesn't have to be an ideal permanent replacement, in fact I don't want the cost of always leapfrogging two of the latest and greatest, it just has to be good enough to keep basic operations going while the replacement gets built and shipped.

I can pick up reasonably good workstations off eBay for this purpose, ~$1500 for a Dell Precision 7820 with dual Xeon Gold 6000-series, sufficient RAM and an SSD. That'll do for this purpose. But how do I set it up to constantly or hourly replicate my primary PC, with both running Win11? I'm thinking that's going to be a lot faster in a crunch than pulling a backup down off my NAS from the primary PC to the emergency PC, at least that's my assumption.

I presently use only local login with all Windows PCs; I really dislike the constant nudging to store files on MS's cloud services, due both to speed (large simulation files = NVMe drives) and to ease of backup, security, etc. Much of the work we do is export controlled, and some of it is military / ITAR.
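As for the replication itself, the crudest thing I could probably script on my own is an hourly robocopy mirror of just the active project folders over to the spare, something like the sketch below (the folder paths and the spare's share name are placeholders, not my real setup). I suspect there are better tools for replicating the whole profile and the applications, which is really what I'm asking about.

```python
"""Hourly mirror of the active simulation folders to the spare workstation.

Sketch only: SOURCE_DIRS, DEST_ROOT and the spare's share name are placeholders,
not my real paths. Schedule this with Windows Task Scheduler to run every hour.
"""
import subprocess
from pathlib import Path

SOURCE_DIRS = [r"D:\Simulations\Active", r"C:\Users\me\Documents\CST"]  # placeholders
DEST_ROOT = r"\\SPARE-7820\Sims"   # SMB share on the standby 7820 (hypothetical name)
LOG_FILE = r"D:\Logs\spare_mirror.log"

def mirror(src: str) -> int:
    dest = str(Path(DEST_ROOT) / Path(src).name)
    cmd = [
        "robocopy", src, dest,
        "/MIR",          # mirror: also deletes destination files that were removed at the source
        "/FFT",          # 2-second timestamp tolerance, safer across filesystems
        "/R:2", "/W:5",  # 2 retries, 5 s apart, so a locked results file doesn't stall the run
        "/MT:16",        # multithreaded copy helps with lots of small solver output files
        "/NP",           # no per-file progress spam in the log
        f"/LOG+:{LOG_FILE}",
    ]
    # robocopy exit codes 0-7 mean success or partial copy; 8 and above are real errors
    return subprocess.run(cmd, check=False).returncode

if __name__ == "__main__":
    worst = max(mirror(d) for d in SOURCE_DIRS)
    raise SystemExit(0 if worst < 8 else worst)
```

Scheduling that hourly would at least keep the working data warm on the spare, but it does nothing for the profile, applications, or licenses.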
 
Do your simulations use GPUs as well as CPU or is it strictly CPU simulation?
 
...Much of the work we do is export controlled, and some of it is military / ITAR.
This severely restricts what you can and cannot do. Relying on random suggestions, some of which may originate in prohibited countries, may place your government contracts in jeopardy. Plus, it may well be prohibited to even reveal such information. You need to consult with your corporate security office before continuing down this road.
 
One pathway to consider proposing is having your work files on a server.
Then it does not matter which system you access them with; no having to sync things between the systems, except for the applications.
 
Overall your workload is a better fit for a server. Running it as a VM would allow you to easily move it between 2 hosts AND if a host fails it could restart automatically on the 2nd host. You can also have the VM replicated to AWS instances that are ITAR compliant for added redundancy via Veeam.
 
This severely restricts what you can and cannot do. Relying on random suggestions, some of which may originate in prohibited countries, may place your government contracts in jeopardy. Plus, it may well be prohibited to even reveal such information. You need to consult with your corporate security office before continuing down this road.


My guess would be that, simply by asking, they already understand the clearance issues. My thoughts in at least this situation are that this individual more than likely can't source a system for use from outside qualified vendors anyway. It makes a neat thought exercise, but the reality is he probably needs to ask the rubber stamp to keep a capable spare system from "Dell" on hand, to protect at least that contract. Given the other details, it must be fairly lucrative on its own if downtime has this level of priority.
 
My first recommendation is that you should have an on-site spare. That way you don't have downtime, and you have a test host for software and firmware updates. You never want to roll software or firmware to "operations" without a test platform. Your test platform becomes your temporary operations resource while you get a replacement.
I don't know if your application can run on a VM, but if so, that REALLY opens your options. You can have automatic restart of the application on a different physical host.
 
Be extremely careful with such blanket statements. There's a rather large number of things that can detect their environment and will not run if a VM environment is detected.
Those aren't usually going to be applications like the OP is running. More likely those are applications with trial licenses so you cannot reinstall the VM and get a new trial license.
 
Wow... great replies guys. I guess I shouldn't have waited 12 hours to check back! This is an active forum.

I'll try to answer some key points:

Do your simulations use GPUs as well as CPU or is it strictly CPU simulation?
Right now I'm running on CPU only, no hyperthreading, and paralleling all cores. I actually used to run GPUs, and even parallel multiple machines, but that puts more limits on how the software can be used, as only certain solver processes can run on GPU or in parallel computing. Now that individual machines with multiple multi-core CPUs are getting fast enough, and also because some of the new solvers are more efficient, there's less need for all of that. Right now I'm just running dual 12-core Xeon Gold CPUs in a single machine, and it does pretty well with most problems.

This severely restricts what you can and cannot do. Relying on random suggestions, some of which may originate in prohibited countries, may place your government contracts in jeopardy. Plus, it may well be prohibited to even reveal such information. You need to consult with your corporate security office before continuing down this road.
Not in this case. I've been doing this for more than 25 years, and know the rules well. I am not revealing any controlled data. Actually, very little of what I do is controlled by ITAR anymore, which is good, but it is controlled by BIS (old DoC).

One pathway to consider proposing is having your work files on a server.
Then it does not matter which system you access them with; no having to sync things between the systems, except for the applications.
I tried this at one point, many years ago, but ran into two problems:

1. Speed. Even my local PCIe NVMe drives are slower than I would like when saving or loading results.
2. Volume. I got a nasty call from the IT guys at my prior employer, when I essentially dragged down a whole section of their network, reading and writing massive files while running and saving simulations.

For this reason, I keep all simulations on local NVMe drives, and backup nightly to the NAS. I am not sure how this would impact or dictate the solution for a backup PC.

My guess would be that, simply by asking, they already understand the clearance issues. My thoughts in at least this situation are that this individual more than likely can't source a system for use from outside qualified vendors anyway. It makes a neat thought exercise, but the reality is he probably needs to ask the rubber stamp to keep a capable spare system from "Dell" on hand, to protect at least that contract. Given the other details, it must be fairly lucrative on its own if downtime has this level of priority.
My current venture is a self-funded startup. While another $10k new Dell system wouldn't kill me, I was really thinking more in terms of buying one identical to the one I bought just two years ago, as a <$2k refurb. Again, if I ever use the thing at all, it's only going to be for a few days or weeks, waiting on Dell to rebuild or replace my primary system.

Oh, one other thing I caught in reading the replies: I'm not producing software. I'm producing hardware, but each design requires roughly 3 weeks of simulation time to complete. That's where the computing power comes in, although at some point my ability to feed the software new inputs becomes the limiting factor on project speed. If I have to fall back on a somewhat-slower backup computer, and it takes 25% or even 50% longer to finish that phase of the project, that's still something from which I can recover. But I can't be running it on my laptop or the kids' spare computer... that'd take months!
 
1. Speed. Even my local PCIe NVMe drives are slower than I would like when saving or loading results.
2. Volume. I got a nasty call from the IT guys at my prior employer, when I essentially dragged down a whole section of their network, reading and writing massive files while running and saving simulations.
That mostly just needs a better network infrastructure.
 
Set up a 100 Gbps connection between a PC and a server/NAS

View: https://www.youtube.com/watch?v=-LytcXun4hU

You can use another 7820 as a backup server/NAS.

The Dell Precision 7820 has two Gen3 PCIe x16 slots and one Gen3 PCIe x8 slot into which you can plug 100 Gbps network cards; the third card's 100 Gbps performance, however, will be limited by its x8 lanes.
https://www.dell.com/support/manual...df22e3-9703-4e17-bf9b-4c6d02dddb24&lang=en-us
One Gen3 PCIe lane ≈ 1 GB/s

All workstation PCs connected to the server directly, no switch required.
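On the slot bandwidth point, the rough line-rate numbers (ignoring protocol overhead, so real throughput will be a bit lower) work out like this:

```python
# Back-of-envelope line rates for the 7820's slots vs. a 100 Gbps NIC.
# PCIe 3.0 runs 8 GT/s per lane with 128b/130b encoding, so ~0.985 GB/s per lane.
GBYTES_PER_GEN3_LANE = 8 * 128 / 130 / 8    # GB/s per lane, ~0.985

for lanes in (8, 16):
    gb_per_s = GBYTES_PER_GEN3_LANE * lanes
    print(f"Gen3 x{lanes}: ~{gb_per_s:.1f} GB/s (~{gb_per_s * 8:.0f} Gbps)")

print(f"100GbE line rate: {100 / 8:.1f} GB/s")
# => the x8 slot caps a 100 Gbps card at roughly 63 Gbps; the x16 slots can feed it fully.
```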

See if this ViceVersa Pro works for the backup/replication
https://www.tgrmn.com/web/features.htm
https://www.tgrmn.com/web/file_replication.htm
 
Right now I'm running on CPU only, no hyperthreading, and paralleling all cores. I actually used to run GPUs, and even parallel multiple machines, but that puts more limits on how the software can be used, as only certain solver processes can run on GPU or in parallel computing. Now that individual machines with multiple multi-core CPUs are getting fast enough, and also because some of the new solvers are more efficient, there's less need for all of that. Right now I'm just running dual 12-core Xeon Gold CPUs in a single machine, and it does pretty well with most problems.


How parallel are these workloads? Is this software able to take as many cores as you can throw at it, or does it trail off at 24 cores? Does this software support instructions like AVX2 or AVX512? I ask because those Cascade Lake Xeons are old now, and going with something like 4th Gen Epyc gets you a good 50% higher IPC, which could easily cut your processing time. 5th Gen Epyc comes out in a couple of months and will offer even higher IPC and faster AVX512 performance. Going from 3 weeks of processing to 2 weeks or less could easily be profitable.
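If you want to see what instruction sets your current CPUs expose, the third-party py-cpuinfo package (an assumption on my part; install it with pip) makes for a quick check. Note it only reports what the hardware offers, not what the solver actually uses:

```python
# Quick check of which instruction sets the CPU exposes.
# Requires the third-party py-cpuinfo package: pip install py-cpuinfo
import cpuinfo

info = cpuinfo.get_cpu_info()
flags = set(info.get("flags", []))

print(info.get("brand_raw", "unknown CPU"))
for isa in ("avx", "avx2", "avx512f", "avx512dq", "avx512bw"):
    print(f"{isa:>9}: {'yes' if isa in flags else 'no'}")
```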
 
Set up a 100 Gbps connection between a PC and a server/NAS

I think maybe I didn't do a good enough job of explaining my requirement, or maybe even of thinking it through myself before posting. Bottom line: to me this option looks like a lot of work, with maybe little or no benefit to my application, but that's probably my fault for not really thinking through my own needs well enough.

The files are already backed up on my current NAS, so if it were just about accessing project files, I already have that in the bag. I wouldn't really need to restore the hundreds of GB presently on the workstation, only the files in use at the time of failure, probably a few GB maximum. The idea to have a second PC is really more about having access to a second big CPU unit, and having the profile, applications, etc. already there and ready to put into use.

Maybe even that is more than I need, as it honestly would only take me half a day to get a machine set up and running with the key applications I need, and I could even be restoring from backup the key projects I'm working on that month while installing some of the applications. It's not ideal; having it ready to rock before the failure would be nice, but then I have to deal with constantly updating software on two machines, which might be more work than it's even worth.

I could have the project files I'm using in a given week moved in a few minutes from NAS to secondary PC. This is more about replicating the entire profile and all applications, so that I need to only re-issue node-locked license files to enable the software on the new machine.
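On the "constantly updating software on two machines" worry above, one piece I could probably script is exporting the list of winget-managed packages from the primary and replaying it on the spare. It wouldn't cover CST or the license files, just the routine utilities, it assumes winget is on both machines, and the share path below is a placeholder:

```python
"""Keep the spare's routine tools roughly in sync with the primary via winget.

Sketch only: CST and the node-locked licenses are NOT covered by this, and the
export path is a placeholder. Assumes winget is installed on both machines.
"""
import subprocess

EXPORT_FILE = r"\\NAS\Backups\winget-packages.json"   # placeholder path

def export_packages() -> None:
    # Run on the primary workstation (e.g. nightly, via Task Scheduler).
    subprocess.run(
        ["winget", "export", "-o", EXPORT_FILE, "--accept-source-agreements"],
        check=True,
    )

def import_packages() -> None:
    # Run on the spare when bringing it up to date.
    subprocess.run(
        ["winget", "import", "-i", EXPORT_FILE,
         "--accept-source-agreements", "--accept-package-agreements"],
        check=True,
    )

if __name__ == "__main__":
    export_packages()
```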


How parallel are these workloads? Is this software able to take as many cores as you can throw at it, or does it trail off at 24 cores?
The software I run has multiple different solvers, and each has slightly different behavior, but the primary solver I use most (time-domain electromagnetic FDTD) will peg all cores at 100% simultaneously and continuously. Most problems take only 3 to 30 minutes to solve, and then the software pauses to write results to the drive(s) before retrieving the next and pegging the processors again. But during optimizations or parameter sweeps, I'll run literally thousands of these 3 to 30 minute iterations, sometimes for several straight days.

I've tested parallelization on up to 96 cores, and that's the behavior; I don't know whether there's some parallelization limit above that... but my present machine is only 24 cores, so I'm not hitting it now. That particular process is also very memory-bound; solution time is very much governed by memory speed.

Some of the other solvers (frequency-domain or thermal) are more bound up by other system limitations, with hard disk speed being a big one for the thermal solver, mostly when interacting with the completed project and trying to load or post-process 3D field plots.
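To put a very rough number on the memory-bound behavior, here's the kind of back-of-envelope estimate I mean; every figure is an illustrative guess, not a measurement from my projects:

```python
# Crude estimate of why the FDTD runs end up memory-bandwidth bound.
# Every number below is an illustrative assumption, not a measurement.
cells = 100e6                   # mesh cells in the model
bytes_per_cell_per_step = 96    # ~6 E/H field components plus update coefficients, read + write
timesteps = 20_000
mem_bw_gb_per_s = 2 * 140       # dual Xeon Gold 6226R: ~140 GB/s of DDR4-2933 per socket

traffic_gb_per_step = cells * bytes_per_cell_per_step / 1e9
seconds = timesteps * traffic_gb_per_step / mem_bw_gb_per_s
print(f"~{traffic_gb_per_step:.1f} GB moved per timestep, "
      f"~{seconds / 60:.0f} min per solve if purely bandwidth-limited")
```

That lands in the same ballpark as the 3 to 30 minute solves I actually see.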

Does this software support instructions like AVX2 or AVX512? I ask because those Cascade Lake Xeons are old now, and going with something like 4th Gen Epyc gets you a good 50% higher IPC, which could easily cut your processing time. 5th Gen Epyc comes out in a couple of months and will offer even higher IPC and faster AVX512 performance. Going from 3 weeks of processing to 2 weeks or less could easily be profitable.
Thanks for asking, and I guess this is something I should know, but I don't! I'm an electromagnetics guy, designing high power RF/microwave hardware. Ironically, I did my undergrad degree in computer engineering, but that was decades ago, shortly before the first Xeon processor (Pentium II Xeon 400) was even released. I don't even know what "AVX" means in your world, as in my world they're a capacitor company!

But I would be interested. The primary software I'm running is Dassault Systemes SIMULIA CST. It's been around for years, previously under the CST name; they were only acquired by Dassault and moved under the SIMULIA product line a few years ago.

https://www.3ds.com/products/simuli...2MC4wLjA.*_gcl_au*OTE1MDEyMTk1LjE3MjgwMDk3MzM.
 
Oh, one other thing I caught in reading the replies: I'm not producing software. I'm producing hardware, but each design requires roughly 3 weeks of simulation time to complete. That's where the computing power comes in, although at some point my ability to feed the software new inputs becomes the limiting factor on project speed. If I have to fall back on a somewhat-slower backup computer, and it takes 25% or even 50% longer to finish that phase of the project, that's still something from which I can recover. But I can't be running it on my laptop or the kids' spare computer... that'd take months!
What prevents you from renting compute resources (AWS, for example) to compensate when you have a local hardware failure, so you continue to have the same available compute resources? Or even to "cloud burst" into AWS when you have an unexpected surge in requirements?
 
I don't even know what "AVX" means in your world, as in my world they're a capacitor company!
AVX for computers stands for Advanced Vector Extensions. These are SIMD (single instruction, multiple data) instructions that increase floating-point throughput on a CPU. Looking at the software, I think it supports the AVX512 instructions. There was a set of benchmarks comparing 3rd and 4th Gen Epycs; at the same core counts, using the high-frequency CPUs, the 4th Gen came out about 50% faster than the 3rd Gen. Some of that will be IPC plus higher base clocks; other parts of it could be the AVX instructions. Based on the uplift from 3rd to 4th Gen, you could easily double your current performance just by getting current-generation equipment. The 4th and soon 5th Gen Epycs also support 12 RAM channels, double your current per-socket channel count, so you get the added RAM bandwidth and capacity. Storage-wise, I know that the Dell hosts have a new version of their RAID card that does NVMe. If you did something like that and put 4x PCIe 4.0 drives into a RAID10, you would have A LOT of storage bandwidth. I know this wouldn't be cheap, but time is money, and cutting simulation time in half is nothing to sneeze at.
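To put rough numbers on the memory side, peak theoretical bandwidth per socket is just channels times transfer rate times 8 bytes. Assuming DDR4-2933 on your current Xeon Gold 6226R sockets and DDR5-4800 on 4th Gen Epyc:

```python
# Peak theoretical memory bandwidth per socket: channels x transfer rate x 8 bytes.
def peak_gb_per_s(channels: int, mega_transfers: int) -> float:
    return channels * mega_transfers * 8 / 1000

print(f"Xeon Gold 6226R, 6ch DDR4-2933: ~{peak_gb_per_s(6, 2933):.0f} GB/s per socket")
print(f"4th Gen Epyc,   12ch DDR5-4800: ~{peak_gb_per_s(12, 4800):.0f} GB/s per socket")
```

For a solver that is governed by memory speed, that gap alone is a big chunk of the uplift.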
 
What prevents you from renting compute resources (AWS, for example) to compensate when you have a local hardware failure, so you continue to have the same available compute resources? Or even to "cloud burst" into AWS when you have an unexpected surge in requirements?
That's a good idea. I'll have to check the software licensing though, as there are a lot of restrictions on that. At my price point ($50k), each license is node-locked to a physical PC by MAC address. They also offer network/floating licenses, I did that once several years ago when I was operating an environment with several users, but the floating licenses run about $75k per seat. Since I'm now a single user and self-funded, a spare $2k refurb PC looks pretty attractive against that pricing.

If all you need is speed and volume for local NVMe drives, then use an x4 bifurcation card in an x16 slot and create a RAID. PCIe 3.0 x16 ≈ 128 Gbps ≈ 16 GB/s.

View: https://www.youtube.com/watch?v=5-OZYOsh6BU

ASUS Hyper M.2
https://www.amazon.com/ASUS-M-2-X16-V2-Threadripper/dp/B07NQBQB6Z
I don't need more or faster storage. I need a spare PC, for when the power supply or CPU or motherboard fails on the one I'm using.

AVX for computers stands for Advanced Vector Extensions. These are SIMD (single instruction, multiple data) instructions that increase floating-point throughput on a CPU. Looking at the software, I think it supports the AVX512 instructions. There was a set of benchmarks comparing 3rd and 4th Gen Epycs; at the same core counts, using the high-frequency CPUs, the 4th Gen came out about 50% faster than the 3rd Gen. Some of that will be IPC plus higher base clocks; other parts of it could be the AVX instructions. Based on the uplift from 3rd to 4th Gen, you could easily double your current performance just by getting current-generation equipment. The 4th and soon 5th Gen Epycs also support 12 RAM channels, double your current per-socket channel count, so you get the added RAM bandwidth and capacity. Storage-wise, I know that the Dell hosts have a new version of their RAID card that does NVMe. If you did something like that and put 4x PCIe 4.0 drives into a RAID10, you would have A LOT of storage bandwidth. I know this wouldn't be cheap, but time is money, and cutting simulation time in half is nothing to sneeze at.
Wow. I'll definitely have to talk with you, next time I'm looking to replace my primary system. I usually just buy what the software manufacturer recommends, as they do a lot of benchmark testing specifically with their software. Right now, I'm running a dual Xeon Gold 6226R, which of course has a lower clock speed than some of the options available when I purchased (mid-2022), but actually benchmarked better than some of the systems running higher clock speeds.

But right now, I'm not looking to replace this system, just looking to set up a cheap spare to handle failures.
 
But right now, I'm not looking to replace this system, just looking to set up a cheap spare to handle failures.
That makes a lot of sense. For your goal of not having to pull backups down from the NAS, you might want to talk with a company like Veeam. I know they can replicate VMs, and they offer physical-device backups as well, but they might also be able to replicate from one workstation to another. They also offer, at least for VMs, continuous data protection, so you would have very little downtime if this were to work. You would need some switches installed so the workstations can talk to each other, and for that amount of replication you would want at least 10GbE.
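For sizing that link, wire-speed transfer times for a given amount of changed data work out roughly like this (real throughput will be somewhat lower once protocol and disk overheads are in the picture):

```python
# Wire-speed transfer time for one replication pass, given how much data changed.
def transfer_seconds(delta_gb: float, link_gbps: float) -> float:
    return delta_gb * 8 / link_gbps

for delta_gb in (5, 50, 500):       # GB changed since the last pass (illustrative sizes)
    for link_gbps in (1, 10, 25):   # link speed in Gbps
        print(f"{delta_gb:>4} GB over {link_gbps:>2} GbE: "
              f"~{transfer_seconds(delta_gb, link_gbps):,.0f} s")
```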
 
That's a good idea. I'll have to check the software licensing though, as there are a lot of restrictions on that. At my price point ($50k), each license is node-locked to a physical PC by MAC address. They also offer network/floating licenses, I did that once several years ago when I was operating an environment with several users, but the floating licenses run about $75k per seat. Since I'm now a single user and self-funded, a spare $2k refurb PC looks pretty attractive against that pricing.
Without the floating license, you have to transfer your node-locked license to bring backup hardware online? THAT sounds like fun...
 
That makes a lot of sense. For your goal of not having to pull backups down from the NAS, you might want to talk with a company like Veeam. I know they can replicate VMs, and they offer physical-device backups as well, but they might also be able to replicate from one workstation to another. They also offer, at least for VMs, continuous data protection, so you would have very little downtime if this were to work. You would need some switches installed so the workstations can talk to each other, and for that amount of replication you would want at least 10GbE.
Thanks! I'll definitely check out Veeam. And I don't really need real-time replication. Something that replicates on a several-minute or hourly delay would be more than sufficient, and I could even live with overnight replication, which would lessen the Ethernet speed requirement.

Obviously, the shorter the window, the less I lose when a machine goes down. I don't mind so much losing the hour of work, and could even live with losing a day. But what sometimes happens is that mistakes creep in, when you've forgotten all the changes you made to the design in the prior day, and assume the backup you're operating from has fixes in it that were inside the time-loss window.

Without the floating license, you have to transfer your node-locked license to bring backup hardware online? THAT sounds like fun...
Yeah, it's a PITA. Since this software is so expensive, they basically make you sign away your life and first-born, promising you're decommissioning the old machine before moving to the new, and won't try to run two licenses at once. They have systems to detect and block this, when you're internet-connected, but they also know there are ways around it for those who aren't so honest.

I have two software packages that have to be re-licensed each time this happens, but they usually handle that within a day. When I do planned system replacement on 2- or 4-year increments, this is no big deal. But each of my last three Dell Precision workstations has died within weeks of installation, and that makes it a slightly bigger hassle.

Side note: why does Dell take something like three tries to build a system that doesn't suffer infant mortality anymore? I've been running this type of software for almost 25 years, but the number of brand-new Dell Precisions that have died almost straight out of the box went from zero before 2014 to almost 100% in the last 10 years. I can't remember the last time I bought a new system that didn't need either a motherboard or power supply replacement in the first month. After that, they seem to have a nice long and stable life, which is the only reason I haven't totally abandoned Dell, but they sure have some major hiccups in getting to that point.
 
Yeah, it's a PITA. Since this software is so expensive, they basically make you sign away your life and first-born, promising you're decommissioning the old machine before moving to the new, and won't try to run two licenses at once. They have systems to detect and block this, when you're internet-connected, but they also know there are ways around it for those who aren't so honest.
I worked on high performance computing systems for a defense contractor. I was responsible for purchasing a LOT of hardware and software. I think you need to look at your business plan and decide if having floating licenses can be justified by completing more work.
You should also consider whether RAM disks could be beneficial to your workflow. TBs of RAM are now not uncommon. The Xeon Golds you're running have six memory channels per socket, so maximum memory performance on a dual-socket system requires 12 DIMMs; twelve 64 GB DIMMs is 768 GB of RAM, and stepping up to 128 GB DIMMs takes you past 1.5 TB. That is something to think about.
Also, you should benchmark your workflows with hyperthreading enabled and disabled. You might find that your performance with hyperthreading disabled is higher.
A few insights from someone who has lived in the HPC world.
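A crude way to sanity-check the hyperthreading question before touching BIOS settings is to time a fixed amount of dummy floating-point work split across N worker processes, once with N equal to physical cores and once with N equal to logical cores. This is only a stand-in for the real solver, but the measurement pattern is the same:

```python
# Toy scaling test: time a fixed amount of floating-point work split across N processes.
# Run it with N = physical cores and N = logical cores (HT on) and compare wall time.
import multiprocessing as mp
import os
import time

def burn(iterations: int) -> float:
    x = 0.0
    for i in range(iterations):
        x += (i % 7) * 0.333
    return x

def run(workers: int, total_iterations: int = 200_000_000) -> float:
    per_worker = total_iterations // workers
    start = time.perf_counter()
    with mp.Pool(workers) as pool:
        pool.map(burn, [per_worker] * workers)
    return time.perf_counter() - start

if __name__ == "__main__":
    logical = os.cpu_count() or 1
    physical = max(1, logical // 2)            # assumes 2 threads per physical core
    for n in sorted({physical, logical}):
        print(f"{n:>3} workers: {run(n):6.1f} s")
```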