Ask HN: What are some Linux tools to diagnose a system which keeps hanging?
I have a laptop which runs Ubuntu. After running for a while (sometimes 5-10 mins, sometimes more), the system hangs and stops responding to keyboard or mouse inputs.
Before I open the system up or take it to a technician who can look at hardware issues, I was wondering if there are ways I can diagnose the system using Linux tools.
Can the issue happen due to bad blocks in the partition? The issue started after I used fdisk on the hard drive. Are there ways to confirm this?

- Have you switched to a console screen instead of X (Control+Alt+F1) to see if it's actually X that is hanging and not the OS? Hanging 5 to 10 minutes after using fdisk sounds like a red herring. If you are able to switch, check dmesg. Just type dmesg. Then tac /var/log/Xorg.0.log | less
- Do you have SysRq enabled, and if so, does the system respond to an emergency sync, then emergency unmount (remount read-only), then emergency reboot? [0] You can check with: sysctl kernel.sysrq
- If you switch to run level 2, does this still occur? [1]
- Is there anything interesting in the logs in /var/log? Most notably Xorg.0.log and messages.
- Do you have lm-sensors installed to watch temperature? sensors | grep -Ei ^temp Also check the temperature of your drives: smartctl -x /dev/sda | grep -Ei ^temp assuming your drive is sda. OpenIPMI can also get this data and fan speeds. If it's hot, do you have a dog or cat or something else that sheds hair? If so, you may have to power it off, remove the battery, open it up, use a non-metallic attachment on a vacuum to remove hair, then compressed air, then vacuum again with the non-metallic attachment.
That would be my starting point. It could go a million directions from there.
[0] - https://en.wikipedia.org/wiki/Magic_SysRq_key
[1] - https://www.tecmint.com/change-runlevels-targets-in-systemd/

Happened to me on what, if I recall correctly, was a ThinkPad T61. In the end it was due to an outdated BIOS. Updating it stopped the PC from hanging. Perhaps check that your BIOS is the latest version first. And by the way, also run a scan of the RAM with memtest.

The memtest86 routinely fails ... I guess that would be an issue with the RAM?

If it lists addresses that have failed, yes. Generally, when the same addresses reliably fail over and over, you might get away with configuring the OS at boot to ignore certain memory addresses, or a range if that's easier. Otherwise replace the RAM that's bad.

First, I'd watch dmesg from boot to see if something useful pops up: open a console and type watch -n 1 "dmesg | tail -n 40" to watch the last 40 lines reported by dmesg (some distros require you to be root to do this by default, so if you get an error, try that).
Also, you may find information in system logs. A traditional (non-systemd) distro will store text files under /var/log - often /var/log/messages is what you're looking for - and you can watch the tail as above: tail -f -n 40 /var/log/messages If your system is running systemd, you would do: journalctl -f -n 40 (again, you may need to be root to do this).
I recently had lockups that seem to have been thermal in nature. There weren't a lot of related messages - just once I saw something about hitting a thermal limit, but that was a bit before the lock-up, so I dismissed it as just the behavior of modern CPUs, which run full-tilt until they reach thermal throttling. The machine was sent in for repair and had liquid metal thermal paste re-applied, and temps are down by about 10 degrees - really not that much, and temps are still sitting in the 90s - but so far, over the last 1.5 days, no lockup. I'd need to run for a while longer to verify.
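
If you want to keep an eye on the temperature angle without staring at the full sensors output, something like the following might help. It is only a rough sketch: hot_temps is a made-up helper name, the 85 degree default is an arbitrary assumption rather than a recommended limit, and it assumes sensors prints lines shaped like "Core 0:  +45.0°C  (high = ...)", which can vary by chip and driver.

```shell
#!/bin/sh
# Hypothetical helper: read `sensors`-style lines on stdin and print every
# reading at or above a limit in degrees C (default 85, an arbitrary choice).
hot_temps() {
    limit="${1:-85}"
    awk -F: -v limit="$limit" '
        # Only consider lines whose value part starts like "+45.0°C";
        # this skips adapter names, fan speeds, and voltages.
        NF >= 2 && $2 ~ /^[ \t]*[+-]?[0-9]+(\.[0-9]+)?[^0-9 \t]*C/ {
            split($2, parts, " ")   # parts[1] is e.g. "+45.0°C"
            temp = parts[1] + 0     # awk keeps the leading numeric prefix
            if (temp >= limit) printf "%s: %.1f\n", $1, temp
        }'
}
```

With lm-sensors installed you would pipe into it, e.g. sensors | hot_temps 90, and could wrap that in the same watch -n 1 approach mentioned above. Readings in parentheses (high/crit limits) are ignored; only the first value on each line is compared.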