Computer hangs at any random time

If I were you I would also test under real load. Use the Prime95 torture test in Custom mode. There, set the memory to use to your whole RAM minus 1.5 GB (left for Windows and other programs) and let it test for several hours. If your computer is still running after several hours AND the Prime95 icon is still green, then I'd say yes, your rig is stable.
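A minimal sketch of that arithmetic, assuming the Custom dialog expects the value in megabytes (check the unit your Prime95 version shows). The function name and the 16 GB figure are made up for illustration:

```python
def torture_test_memory_mb(total_ram_gb: float, reserve_gb: float = 1.5) -> int:
    """Return the amount of RAM (in MB) to hand to the torture test.

    reserve_gb is the headroom left for the operating system; 1.5 GB is the
    figure suggested in the post, not a value required by Prime95.
    """
    if reserve_gb >= total_ram_gb:
        raise ValueError("reserve must be smaller than the installed RAM")
    return int((total_ram_gb - reserve_gb) * 1024)


if __name__ == "__main__":
    # Example figure only: a machine with 16 GB installed.
    print(torture_test_memory_mb(16))  # 14848
```

If the machine starts swapping heavily during the test, a larger reserve is probably the safer choice.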
 
Hello again. I ran MPrime and it found 2 errors. That means that the timings are incorrect. I tried different combinations of frequency and timings, and none of them ran flawlessly.
I tried these combinations:
1) 3700MHz + timings auto-auto-auto-10 - no overclock, everything works, errors are found in 1.5 hours.
2) 4200MHz + auto-auto-auto-10 - my usual overclock, computer restarts on power-on. I discarded this variant.
3) 4200MHz + auto-auto-auto-20 - starts fine, errors are found very quickly.
4) 4200MHz + 9-9-9-10 - doesn't start at all. Need to press the MemOK button.
5) 4200MHz + 9-9-9-20 - starts, errors are found very quickly.
6) 3700MHz + 9-9-9-10 - starts, errors are found very quickly.
7) 3700MHz + 9-9-9-20 - starts, errors are found very quickly.
8) 3700MHz + 9-9-9-27 - starts, errors are found very quickly.
9) 3700MHz + 10-10-10-10 - doesn't start.
10) 3700MHz + 10-10-10-20 - errors are found very quickly.
11) 3700MHz + 10-10-10-24 - errors are found very quickly.
12) 3700MHz + 10-10-10-27 - errors are found very quickly.
13) 3700MHz + 9-9-9-24 - errors are found very quickly (wow, you're kidding me).

So, I have no idea what to try next. Every variant I try gives errors. I don't know whether anything still freezes, because I have been testing this until now. So, what do we have? Memtest at 3.7 GHz with auto-auto-auto-10 (which I think was really 9-9-9-24, and I have no idea why, because I did set this to 10) didn't give any errors, but MPrime found them in 1.5 hours.
Any other combination of frequency and timings gives errors in 5 minutes or less. What's wrong with this machine?
I touched the DIMMs and they're not hot, so I don't think it's overheating. The CPU also has a normal temperature: 70 C when set to 4.2 GHz and 55 C at 3.7 GHz.

And I forgot to mention that AISuite shows me that the manufacturer of DIMM2 is different from the others. I don't know if this is important.
 

That is a false conclusion, and after the above explanations, you should know better. Timings would have been incorrect if slower timings had fixed the problem. Since they did not, this problem is not timing related.


Well, how about trying next what I suggested to you before: "Test one stick at a time to determine which of your sticks is faulty."

This problem is either a defective or incompatible module. If all these modules used to work nicely, one of them has become defective. These things do happen to memory modules sometimes.
 
Hmm, well, I unplugged the module that had a different manufacturer, and MPrime ran fine for 2 hours. It still found an error in the end, but 2 hours is better than 5 minutes :)
At this moment I have several options: really test each module until I find the defective one, swap them for two 8GB DIMMs and see how they behave, or try to set the timings again and continue the tests with three of them. I think I'll try the first option.

P.S. Oh, I've got a totally unexpected thing: Linux showed me a kernel panic during shutdown after the tests. It's becoming interesting.
 
Currently 1 dimm with timings manually set to 9-9-9-24 is being tested. And does anyone know what DRAM command mode is? It's set to 1 but when all the dimms were plugged, it was 2.
 
Probably same as Command rate

• Command Rate: Also called CPC (Command Per Clock). The amount of time in cycles when the chip select is executed and the commands can be issued. The lower (1T) the faster the performance, but 2T is used to maintain system stability.

Found in what I linked above.
 

IMHO it is not, because it means that your machine is still not running stable. To me it would matter little whether I get the error after 1 min or 1 day. I want my machine to function properly. To the contrary, getting the error after 1 min is better, because that makes it easier to conduct more tests and pinpoint the cause.



Good decision, seeing that it is the only professional approach. You've got a problem, so you need to track down the exact cause. Anything else would be blundering.


I would write that off as a consequential fault. Fix the Prime errors, then there will be no more kernel panics to worry about.


That can be explained. Timings are easier to meet with fewer modules attached, because there are fewer loaded signal lines and less capacitance that can disturb data transfer. Your BIOS may be aware of this and use a faster command mode with only one DIMM attached.

That being said, overall performance is still better with pairs of 2 identical DIMMs inserted, because then the BIOS can activate dual-channel access (access both modules simultaneously for faster data transfer). However, performance is out of scope for you right now. Before you can think about performance and optimizing stuff for speed, you need a stable platform to start from. Losing even 10% speed in return for stability would be a good deal (it will be less than that). You cannot track down such an error if you are not prepared to reset every setting to its most conservative value until you have pinpointed the real cause. You do not want multiple causes to interfere; it would render fixing your problem next to impossible.
 
So, I've been testing DIMMs for the whole day and the results are unexpected for me. I thought that the DIMM whose manufacturer differs from the others was the root of the evil, but it passed the torture test successfully. And one of the other three DIMMs failed. Now I'm testing the DIMM from the different manufacturer again, to exclude the possibility of a slot or motherboard failure or any other random event that could lead to a failure. If it passes again, I'll retest the failed one for the same reason.
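Keeping a log of every run makes this kind of retesting easier to reason about. A rough sketch, in which the module labels, slot names and results are all hypothetical placeholders and not the results from this thread:

```python
from collections import defaultdict


def summarize_runs(runs):
    """Classify each module from a list of (module, slot, passed) records.

    "stable"       - every run passed
    "faulty"       - every run failed
    "inconsistent" - results differ between runs, so retest before deciding
    """
    by_module = defaultdict(list)
    for module, _slot, passed in runs:
        by_module[module].append(passed)

    verdicts = {}
    for module, results in by_module.items():
        if all(results):
            verdicts[module] = "stable"
        elif not any(results):
            verdicts[module] = "faulty"
        else:
            verdicts[module] = "inconsistent"
    return verdicts


def failing_slots(runs):
    """Return the slots in which at least one run failed."""
    return sorted({slot for _module, slot, passed in runs if not passed})


if __name__ == "__main__":
    # Hypothetical example data.
    runs = [
        ("DIMM-A", "A1", True),
        ("DIMM-B", "A1", False),
        ("DIMM-A", "A1", False),
        ("DIMM-C", "A1", True),
    ]
    print(summarize_runs(runs))
    print(failing_slots(runs))
```

An "inconsistent" verdict is the interesting one: it suggests either a marginal module or a cause outside the module, which is why a second run in a different slot is worth the time.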
 
You see, that is the right way to track such an error down. Not impatient guessing and blundering. You will probably just have to replace that module and be fine. The downside is that you will no longer be able to use dual-channel-mode unless you manage to obtain a module with exactly identical architecture. Unfortunately even two modules having exactly the same part number is no guarantee for this.
 
I have no words. I have a great urge to say something impolite and uncensored to the computer. The module that previously passed the tests has now failed. I can't say I'm much surprised: if things go wrong, they go wrong until the end. So, if the module from the different manufacturer failed, and one of the other three failed too, I think I should check whether the remaining modules are stable at least. If one of them passes this test for three hours, I'll be confident that it's not a motherboard problem. It's easier to replace two modules than the whole computer anyway...[strike] If everything goes the bad way, I'll check the last module, and if even that one fails, I'll be very frustrated, because it would mean I need to
1) run the small FFT test to see whether the problem is processor-related. I did this before, but who knows what could have changed.
2) recheck the first failed module in the same slot to see if it's still failing.
3) recheck any other module to see if it's a systematic problem.
4) if it is, check the other slots by doing the same tests.
5) if it fails again, replace all the modules with two 8GB ones, check, and take it easy if problems still persist.
[/strike]
Added: after 4 hours of testing, no errors were found. Will retest the first failed module.
 
So, the final results are: 2 DIMMs pass the torture test for 3 hours, 2 do not. That means I have two options. The first is to replace those two with similar Kingston DIMMs of the same frequency and capacity. The other is to replace all DIMMs with two 8GB Kingston modules. If I choose the first one, how should I plug them in to get the maximum performance, or lose the least? Is it better than replacing all of them with two 8GB modules?
 

Why same frequency and capacity, and why "similar" modules (whatever that may mean)? In order to offer dual-channel access, you need pairs of identical modules. Note: The two modules that make up a pair need to be identical for this; the two pairs, however, may differ from each other as much as they like. The only restriction is that all modules will be accessed with the same timings, meaning that the slower pair of DIMMs will define how fast your whole memory can be accessed.
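To illustrate the "slower pair defines the speed" rule: timings are counted in clock cycles, so a higher number is slower, and the shared setting has to be at least as slow as the slowest module in every field. A minimal sketch of that principle only; how a real BIOS picks values from the modules' SPD data is not modelled here, and the timing tuples are example values:

```python
def common_timings(*pairs):
    """Return the timing set that every module could run at.

    Each argument is a tuple of timings in clock cycles, for example
    (CL, tRCD, tRP, tRAS). The shared value per field is the maximum,
    because a module rated for 10 cycles cannot be driven at 9.
    """
    if not pairs:
        raise ValueError("need at least one timing tuple")
    if len({len(p) for p in pairs}) != 1:
        raise ValueError("all timing tuples must have the same length")
    return tuple(max(field) for field in zip(*pairs))


if __name__ == "__main__":
    # Example values: one pair rated 9-9-9-24, the other 10-10-10-27.
    print(common_timings((9, 9, 9, 24), (10, 10, 10, 27)))  # (10, 10, 10, 27)
```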

The two DIMMs that tested faulty may be either defective or incompatible with your mainboard. Personally, I have had good experiences with Kingston modules, so these are indeed a good idea. In addition, many DIMM manufacturers, including Kingston, publish compatibility lists that show which of their DIMMs they have successfully tested in which mainboards. Pick DIMMs from their list for your mainboard, and compatibility will be manufacturer-guaranteed.

Not that I have ever seen a Kingston module fail to run in any mainboard I have had my hands on.


IMHO a pointless move. Why discard the modules that have served you well so far? You can simply get yourself another pair of modules. You can go for 2x4GB, or you can even get 2x8GB on top of the working 2x4GB you already have. (But consult your mainboard manual for details on which combinations of different-sized modules are allowed and which slots they must be plugged into. Some mainboards are a little picky about this.)

Also keep in mind that Win7 Home Premium does not support more than 16GB (Home Basic stops at 8GB).
 
DeathAndPain
Just because I have an opportunity to. I really don't want to spend what is quite a lot of money for a Russian student on the same modules, which could also turn out to be incompatible or faulty. I can either exchange all the modules for two 8GB ones, so that all my modules go to the place the 8GB ones are taken from, or exchange the two faulty modules for a pair of Kingston ones that have the same parameters as mine: frequency, capacity, timings (that's why "similar"). In that case my two modules go to the same place. I just want to know which option gives the best performance. If it weren't a question of money, I'd surely take a completely new set of DIMMs, and my old ones could be used later in some other computer. Or I'd take a pair of Hynix and see what happens. But no money, no honey, so I have to use the opportunities I have.
 
Well, if you have access to a pair of 8GB modules in another computer so you can simply swap these against the modules you currently have in your system, then that is most obviously an excellent solution.
 
So, I tested the configuration of 2 4GB Hynix and 2 4GB Kingston modules, plugged into A1-A2 for Kingston and B1-B2 for Hynix. I haven't caught any errors or warnings after 4 hours of testing, so I think this configuration is stable and I can use it. Thank you all, especially DeathAndPain, whose advice helped me to solve my problems and understand the main principles of testing and finding faulty devices.
 
Hello again. The problem still seems to be unsolved. After I changed the RAM DIMMs, the hangs seemed to have gone, but after a day or two everything repeated. The first thing I did was test everything with MPrime for 4 hours: no errors. Yesterday it hung three times while I was downloading a large torrent to a disk that doesn't have any OS on it. I decided to test it with MPrime for the whole night (about 8 hours): no errors again. So, what kind of problem can it be now, if CPU and RAM pass the tests? To me it looks like some kind of I/O problem, but the thing is that it has never hung while I was moving or copying anything anywhere.
 
First thing would be to check the cables to and from Harddisks, are they seated? Don't just look, pull em out and insert again. Also check with diagnostic tools from manufacturers that your Harddisks are OK (SMART-stuff).
Second would be to not overclock and see if it still happens. Maybe you've done this already, I just didn't read all posts.
3rd culprit could be the PSU.

I don't use Prime, don't particularly like it. Have you tried benching your CPU in OCCT? That program gives an error almost instantly if your system isn't stable, sometimes even says which core caused the error. 5 minute run should suffice. Keep an eye on temperatures, does push the machine a lot.
 
Well, I've checked the cables and they're OK. SMART is OK for all disks too. There is no real overclock set: I decided to set only the turbo boost speeds to 4200MHz, and it looks like Linux doesn't know how to use turbo boost for some reason.
I found out that everything hangs when I download something very big with qbittorrent and doesn't hang when another client is used. Moreover, it doesn't depend on which HDD or logical disk is used.
Also, I've just tested the system with OCCT and can say this tool is odd. It keeps stopping the tests after a few minutes because its monitor says that some core of my CPU (core #1 on the first run, core #0 on the second) has a max temperature of 127 C and a min temperature of 1.5 C. That's weird, isn't it? I have never got such values on any other system with other programs. To see if OCCT is doing something wrong, I was monitoring temperatures with AIDA64, and its monitor was working well. And as OCCT doesn't let me change the monitoring tool, I can't trust it and its results.
 
I use OCCT with HWMonitor. Trust me, those temps are correct. OCCT is a torture test. If your comp can stand 5 minutes of it, it should be stable. You can lower the amount of threads the test runs if you are afraid of the heat it causes.

And if OCCT complains about a core, you have a heating issue, a voltage issue or a faulty CPU.
Only time I get core errors is when I overclock, either too low voltage or too much heat.

When you overclock, you should never have voltage on Auto. Put Load line calibration on high or extreme, whatever exists on your board. Might have to step back a bit on the voltage with extreme-setting. Because it really gives extreme voltages from the get go.
I use extreme setting

My settings:

AMD FX-8350 4GHz @ 4.4GHz
Corsair and Crucial memory, 12 gigs, all modules CL9 1600 MHz (timings 10-10-10-30, haven't played around with them much, find it dull. Default 9-9-9-24 didn't work.)
Gigabyte GA-970A-DS3P motherboard (FSB 210, multiplier = +1 or something)

Ram voltage: 1.57volt
CPU volt: +0.025 v (lands at 1.4v, can go up to 1.5 at load, extreme does that)
northbridge volt: +0.025 (lands at 1.20 volt, up to 1.27 at load)
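For reference, the 4.4GHz figure is just reference clock times multiplier. A small sketch, assuming the FX-8350's stock values of 200 MHz x 20 and reading "+1" as a multiplier of 21 (the exact multiplier is not stated above, so treat it as a guess):

```python
def core_clock_mhz(reference_mhz: float, multiplier: float) -> float:
    """Core clock in MHz from the reference clock ("FSB") and CPU multiplier."""
    return reference_mhz * multiplier


if __name__ == "__main__":
    print(core_clock_mhz(200, 20))  # assumed stock setting: 4000 MHz
    print(core_clock_mhz(210, 21))  # assumed overclock: 4410 MHz, about 4.4GHz
```

Raising the reference clock usually raises other clocks derived from it as well, which is one reason an overclock of this kind can affect stability beyond the CPU cores.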

I have only found 2 reasons when my system is unstable and hangs, gives blackscreens.
1. Too low voltage
2. Too much heat (VRMs, CPU or NB)
 
Wow, wow. You kidding? 1.5 °C is correct? I wouldn't get that temperature in my flat even if I opened all the windows wide. Jumping from 60 °C to 127 °C in a second doesn't seem to be true either. If I disable all the limits in OCCT I can run those tests forever.

Just to prove my words I attach this image. It's AIDA64 running in parallel with OCCT. You can see the actual temps.
In this picture ЦП means CPU, Системная плата means Motherboard and Ядро means Core. I don't know exactly, but it looks like OCCT has some incompatibility with my sensors.
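The two checks I'm applying by eye (nothing under load can be colder than the room, and a core can't plausibly jump by tens of degrees between two samples) can be written down as a simple filter. A rough sketch: the 20 °C ambient and 30 °C per-sample step limits are assumptions picked for illustration, and the sample list is made up to resemble the readings described, not copied from a log:

```python
def implausible_readings(samples, ambient_c=20.0, max_step_c=30.0):
    """Return the indices of temperature samples that look like sensor glitches.

    A sample is flagged if it is below ambient, or if it differs from the
    last plausible sample by more than max_step_c. Comparing against the last
    plausible value keeps one spike from also flagging the sample after it.
    """
    flagged = []
    last_good = None
    for i, temp in enumerate(samples):
        too_cold = temp < ambient_c
        too_jumpy = last_good is not None and abs(temp - last_good) > max_step_c
        if too_cold or too_jumpy:
            flagged.append(i)
        else:
            last_good = temp
    return flagged


if __name__ == "__main__":
    # Made-up samples in degrees Celsius, one per second.
    readings = [60.0, 61.0, 127.0, 62.0, 1.5, 60.0]
    print(implausible_readings(readings))  # [2, 4]
```

A filter like this cannot tell a glitch from a genuine sudden overheat, so a second monitoring tool running in parallel is still the better cross-check.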

And given that MPrime ran successfully for 8 hours, SMART is normal, the SATA cables are firmly plugged in, and all these weird things happen with qbittorrent, I assume that the problem is not related to either CPU or RAM. And I have no idea what it is related to. The first idea that comes to mind is the SATA controller or the south bridge. If anyone knows how to test them (or one of them), I'll be thankful.
 
That is strange. When I run OCCT and have both HWmonitor gadget and Aida64 Gadget running, both report exact same temps. Probably is like you say the sensors acting weird.

Have you tinkered with the BIOS? Particularly stuff that has to do with hard drives, like AHCI/IDE mode (choose one or the other. If you change it, you have to reinstall Windows, maybe Linux too... unless you change it back to the previous value). IOMMU, if you have that. Can it be some USB device?

I've had 2 strange things happen. One is no USB support in Linux unless I enabled IOMMU. No clue why.
Second is that I can choose to run Sata slot 5 and 6 on my mobo in either IDE or Sata mode. My DVD-burner is connected to one of those slots. It does NOT like Sata mode. I get I/O errors instantly.
DVD spins up really fast and then BOOM, BSOD.

I'm mentioning these because I spent weeks trying to find an answer on the net, nowhere to be found.