Server randomly locked up. Trying to find out why

https://lemmy.world/u/ch00f posted on Feb 27, 2026 17:55

Woke up today to the homeserver being unresponsive. Couldn’t SSH, no video out when I connected a monitor, and even the reset button didn’t do anything. Had to hold the power button to shut it down.

/var/log/syslog doesn’t show anything interesting, other than that the issue happened just after 4am. Log:

2026-02-27T03:55:01.481794-08:00 blackbox CRON[1743418]: (www-data) CMD (/usr/bin/php8.3 /mnt/MONSTERDRIVE/pixelfeddata/pixelfed/artisan schedule:run >> /dev/null 2>&1)
2026-02-27T04:00:00.198504-08:00 blackbox smartd[2126]: Device: /dev/sdd [SAT], CHECK POWER STATUS spins up disk (0x81 -> 0xff)
2026-02-27T04:00:00.291853-08:00 blackbox systemd[1]: Starting sysstat-collect.service - system activity accounting tool...
2026-02-27T04:00:00.298344-08:00 blackbox systemd[1]: sysstat-collect.service: Deactivated successfully.
2026-02-27T04:00:00.298523-08:00 blackbox systemd[1]: Finished sysstat-collect.service - system activity accounting tool.
2026-02-27T04:00:00.299608-08:00 blackbox kernel: kauditd_printk_skb: 8 callbacks suppressed
2026-02-27T04:00:00.299613-08:00 blackbox kernel: audit: type=1130 audit(1772193600.298:798916): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=unconfined msg='unit=sysstat-collect comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
2026-02-27T04:00:00.299615-08:00 blackbox kernel: audit: type=1131 audit(1772193600.298:798917): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=unconfined msg='unit=sysstat-collect comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
2026-02-27T04:00:01.923610-08:00 blackbox kernel: audit: type=1101 audit(1772193601.922:798918): pid=1744810 uid=0 auid=4294967295 ses=4294967295 subj=unconfined msg='op=PAM:accounting grantors=pam_permit acct="www-data" exe="/usr/sbin/cron" hostname=? addr=? terminal=cron res=success'
2026-02-27T04:00:01.923614-08:00 blackbox kernel: audit: type=1103 audit(1772193601.922:798919): pid=1744810 uid=0 auid=4294967295 ses=4294967295 subj=unconfined msg='op=PAM:setcred grantors=pam_permit,pam_cap acct="www-data" exe="/usr/sbin/cron" hostname=? addr=? terminal=cron res=success'
2026-02-27T04:00:01.923615-08:00 blackbox kernel: audit: type=1006 audit(1772193601.922:798920): pid=1744810 uid=0 subj=unconfined old-auid=4294967295 auid=33 tty=(none) old-ses=4294967295 ses=50544 res=1
2026-02-27T04:00:01.923615-08:00 blackbox kernel: audit: type=1300 audit(1772193601.922:798920): arch=c000003e syscall=1 success=yes exit=2 a0=7 a1=7fff81d75200 a2=2 a3=0 items=0 ppid=2654 pid=1744810 auid=33 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=50544 comm="cron" exe="/usr/sbin/cron" subj=unconfined key=(null)
2026-02-27T04:00:01.923616-08:00 blackbox kernel: audit: type=1327 audit(1772193601.922:798920): proctitle=2F7573722F7362696E2F43524F4E002D66002D50
2026-02-27T04:00:01.924259-08:00 blackbox CRON[1744811]: (www-data) CMD (/usr/bin/php8.3 /mnt/MONSTERDRIVE/pixelfeddata/pixelfed/artisan schedule:run >> /dev/null 2>&1)
2026-02-27T04:00:01.924614-08:00 blackbox kernel: audit: type=1105 audit(1772193601.923:798921): pid=1744810 uid=0 auid=33 ses=50544 subj=unconfined msg='op=PAM:session_open grantors=pam_loginuid,pam_env,pam_env,pam_permit,pam_umask,pam_unix,pam_limits acct="www-data" exe="/usr/sbin/cron" hostname=? addr=? terminal=cron res=success'
2026-02-27T04:00:01.925610-08:00 blackbox kernel: audit: type=1110 audit(1772193601.924:798922): pid=1744811 uid=0 auid=33 ses=50544 subj=unconfined msg='op=PAM:setcred grantors=pam_permit,pam_cap acct="www-data" exe="/usr/sbin/cron" hostname=? addr=? terminal=cron res=success'
2026-02-27T04:00:02.357616-08:00 blackbox kernel: audit: type=1104 audit(1772193602.356:798923): pid=1744810 uid=0 auid=33 ses=50544 subj=unconfined msg='op=PAM:setcred grantors=pam_permit acct="www-data" exe="/usr/sbin/cron" hostname=? addr=? terminal=cron res=success'
2026-02-27T09:23:35.786375-08:00 blackbox systemd-modules-load[904]: Inserted module 'dm_multipath'
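One way to confirm exactly where the log goes silent is to scan for large jumps between consecutive timestamps. This is a minimal sketch, assuming every line starts with an RFC 3339 timestamp as in the excerpt above; `find_gaps` and the 10-minute threshold are illustrative choices, not part of any existing tool, and the first sample line is abbreviated.

```python
from datetime import datetime, timedelta

def find_gaps(lines, threshold=timedelta(minutes=10)):
    """Return (last_before, first_after, gap) for every silence longer than threshold."""
    gaps = []
    previous = None
    for line in lines:
        stamp = line.split(" ", 1)[0]
        try:
            current = datetime.fromisoformat(stamp)
        except ValueError:
            continue  # skip lines that do not begin with a timestamp
        if previous is not None and current - previous > threshold:
            gaps.append((previous, current, current - previous))
        previous = current
    return gaps

# Last line before the hang and first line after the reboot (first one abbreviated).
sample = [
    "2026-02-27T04:00:02.357616-08:00 blackbox kernel: audit: type=1104 ...",
    "2026-02-27T09:23:35.786375-08:00 blackbox systemd-modules-load[904]: Inserted module 'dm_multipath'",
]

for before, after, gap in find_gaps(sample):
    print(f"silent from {before} to {after} ({gap})")
```

To run it on the real file, pass an open file handle, e.g. `find_gaps(open("/var/log/syslog"))`. The last line before the gap only shows when the system last managed to write to disk, which is not necessarily the moment it hung.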

Would something like this be a direct hardware failure? Like a power supply hiccup or something? It happening at 4am coincides with my electric car starting to charge, but the server is on a dedicated 20A circuit and behind a battery backup. I also don’t see any power issues on my Sense monitor at that time though it has limited resolution.

Mainboard is a Supermicro H13SAE-MF and I’m using ECC RAM.

I’ve been running this hardware for over a year and never had this issue, but I’m running out of places to look.

Might be time to finally get IPMI working.

https://lemmy.world/post/43647217
https://programming.dev/u/CameronDev posted on Feb 27, 2026 18:12
In reply to: https://lemmy.world/post/43647217

Reset button not working, but power button working is quite odd.

Is it just the once that this happened? Can you reliably trigger it with the car charger? If yes, it may be worth plugging in a monitor while you trigger it to see what happens.

Are the server and chargers close to each other? Some kind of EMP effect? Seems unlikely, but who knows.

https://programming.dev/comment/22429715
https://slrpnk.net/u/poVoq posted on Feb 27, 2026 18:15
In reply to: https://lemmy.world/post/43647217

Ghost in the machine 🤷

Impossible to tell and it sometimes happens.

https://slrpnk.net/comment/20970464
https://lemmy.world/u/zewm posted on Feb 27, 2026 18:15
In reply to: https://lemmy.world/post/43647217

Memory leak eating all your RAM, then locking up? Is it a one-time thing or a regular occurrence?

https://lemmy.world/comment/22381328
https://lemmy.world/u/ch00f posted on Feb 27, 2026 18:21
In reply to: https://programming.dev/comment/22429715

> Reset button not working, but power button working is quite odd.

Yeah makes me think something hardware level.

> Are the server and chargers close to each other? Can you reliably trigger it with the car charger?

No. The car charges every night. This is the first time this has happened.

https://lemmy.world/comment/22381407
https://lemmy.world/u/ch00f posted on Feb 27, 2026 18:22
In reply to: https://lemmy.world/comment/22381328

I think it’s the first time it’s happened since I upgraded my hardware over a year ago. 64 gigs of RAM and I rarely use more than 30% of it.

https://lemmy.world/comment/22381427
https://lemmy.world/u/9tr6gyp3 posted on Feb 27, 2026 18:27
In reply to: https://lemmy.world/post/43647217

Solar flares and even the occasional random neutron hitting your equipment can cause some weird issues. If it’s just a one-time occurrence and it doesn’t happen again, I wouldn’t worry too much about it.

https://lemmy.world/comment/22381526
https://lemmy.dbzer0.com/u/empireOfLove2 posted on Feb 27, 2026 18:34
In reply to: https://lemmy.world/post/43647217

The reset button is basically just a signal to the CPU/BIOS that it should wipe memory and begin the boot process from scratch. If it was not working, that indicates the CPU was hard locked and not responding to any sort of input, not just an OS fault. The power button sends an actual trigger signal to the PSU through the ATX connector, so it bypasses any mainboard lock.

Random shit happens, see if it does it again.
My go-to for random stability issues is to always run a full deep memtest to look for bad RAM, and then a CPU stress test to see if it’s a random thermal or core issue. More often than not I find stability problems with just these two steps.

https://lemmy.dbzer0.com/comment/24662623
https://swg-empire.de/u/bjoern_tantau posted on Feb 27, 2026 18:36
In reply to: https://lemmy.world/comment/22381427

I still use swap for those rare moments I run out of RAM after all. Who knows, maybe some heavy cronjobs will clash or whatever.

https://swg-empire.de/comment/9044622
https://lemmy.today/u/tal posted on Feb 27, 2026 18:39
In reply to: https://lemmy.world/post/43647217

If you have Magic Sysrq enabled, you can do Magic Sysrq-t, which may give you some idea of what the system is doing, since you’ll get stack traces. As long as the kernel can talk to the keyboard, it should be able to get that.

https://en.wikipedia.org/wiki/Magic_sysrq
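Whether the keyboard combination is honoured depends on the `kernel.sysrq` setting, which is either 0 (off), 1 (everything allowed), or a bitmask. A rough sketch of checking it, assuming the documented bit 0x8 ("debugging dumps of processes") is the one that covers SysRq-t; the function names are made up for illustration.

```python
SYSRQ_ENABLE_DUMP = 0x8  # bitmask bit that permits debugging dumps, including 't'

def sysrq_allows_task_dump(value):
    """True if this kernel.sysrq value lets the keyboard combo SysRq-t run."""
    return value == 1 or bool(value & SYSRQ_ENABLE_DUMP)

def read_sysrq(path="/proc/sys/kernel/sysrq"):
    """Read the current kernel.sysrq value from procfs."""
    with open(path) as f:
        return int(f.read().strip())

if __name__ == "__main__":
    try:
        current = read_sysrq()
        print(current, "-> SysRq-t allowed:", sysrq_allows_task_dump(current))
    except OSError as err:
        print("could not read kernel.sysrq:", err)
```

If this reports that SysRq-t is not allowed, the setting has to be changed before the next lockup for the key combination to be of any use.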

You maybe can’t see anything on your monitor, but if the system is working enough to generate the stack traces and log them to the syslog on disk (like, your kernel filesystem and disk systems are still functional), you’ll be able to view them on reboot.

If it can’t even do that, you might be able to set up a serial console and then, using another system running screen or minicom or something like that linked up to the serial port, issue Magic Sysrq to that and view it on that machine.

Some systems have hardware watchdogs, where if a process can’t constantly ping the thing, the system will reboot. That doesn’t solve your problem, but it may mitigate it if you just want it to reboot if things wedge up. The watchdog package in Debian has some software to make use of this.
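For a feel of how the watchdog mechanism works: on Linux, opening `/dev/watchdog` arms the timer, and every write to it counts as a keepalive. If the writes stop because the system has wedged, the hardware resets the machine. The sketch below is only an illustration, assuming a watchdog driver is loaded and the script runs as root; `feed_watchdog`, the interval, and the beat count are hypothetical, and in practice the `watchdog` daemon (or systemd's `RuntimeWatchdogSec=`) would do this job.

```python
import time

def feed_watchdog(device="/dev/watchdog", interval=10.0, beats=6):
    """Write a keepalive to the watchdog device every `interval` seconds."""
    with open(device, "wb", buffering=0) as wd:
        for _ in range(beats):
            wd.write(b"\0")      # any write resets the countdown
            time.sleep(interval)
        wd.write(b"V")           # "magic close": ask the driver to disarm on close

# Example (needs root and real watchdog hardware; not run here):
# feed_watchdog(interval=10.0, beats=6)
```

The final `V` only disarms the timer on drivers that support magic close, and not at all if the kernel was built with `CONFIG_WATCHDOG_NOWAYOUT`, so expect a reboot if you experiment with this on a live server.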

https://lemmy.today/comment/22546940
https://feddit.org/u/lemmlinger posted on Feb 27, 2026 19:28
In reply to: https://lemmy.world/post/43647217

Most likely memory issues

https://feddit.org/comment/11754857
https://anarchist.nexus/u/Wildmimic posted on Feb 27, 2026 19:31
In reply to: https://lemmy.world/comment/22381526

Wanted to say that - Random shit does happen, even to the most stable systems. There’s a cutoff in consumer hardware where selecting for more stability, such as radiation hardening, simply isn’t worth the cost. Best you can do is ECC RAM.

https://anarchist.nexus/comment/2832731
https://lemmy.world/u/k2r posted on Feb 27, 2026 19:38
In reply to: https://lemmy.world/post/43647217

Full disk maybe?
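A quick way to rule this out, sketched with Python's standard library. The list of paths is an assumption: `/mnt/MONSTERDRIVE` is taken from the cron line in the original post, and anything that does not exist on the machine is skipped.

```python
import os
import shutil

def percent_used(path):
    """Percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total

for mount in ("/", "/var/log", "/mnt/MONSTERDRIVE"):
    if os.path.isdir(mount):
        print(f"{mount}: {percent_used(mount):.1f}% used")
```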

https://lemmy.world/comment/22382659
https://lemmy.world/u/ch00f posted on Feb 27, 2026 19:59
In reply to: https://feddit.org/comment/11754857

In ECC memory?
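ECC does not rule out bad RAM, but it should leave a trail: on Linux, corrected and uncorrected error counts are exposed under sysfs when an EDAC driver is loaded for the memory controller. A minimal sketch for reading them, assuming the standard `/sys/devices/system/edac/mc` layout; whether a driver is loaded for this particular board is not something the thread confirms.

```python
from pathlib import Path

def edac_error_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: (corrected, uncorrected)} from the EDAC counters."""
    base = Path(root)
    counts = {}
    if not base.is_dir():
        return counts  # no EDAC driver loaded, or not a Linux system
    for mc in sorted(base.glob("mc[0-9]*")):
        try:
            corrected = int((mc / "ce_count").read_text())
            uncorrected = int((mc / "ue_count").read_text())
        except (OSError, ValueError):
            continue  # skip controllers with unreadable counters
        counts[mc.name] = (corrected, uncorrected)
    return counts

print(edac_error_counts() or "no EDAC memory controllers found")
```

A corrected count that keeps climbing would point at a failing DIMM. All zeros would make RAM a less likely suspect, though not an impossible one.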

https://lemmy.world/comment/22383042
https://lemmy.nz/u/amorangi posted on Feb 28, 2026 06:56
In reply to: https://feddit.org/comment/11754857

I’ve traced issues like this to RAM maybe twice, vs. maybe 8 or 10 times to a faulty PSU. It’s also a PITA to pin down.

https://lemmy.nz/comment/20408825
https://lemmy.world/u/Treczoks posted on Feb 28, 2026 08:48
In reply to: https://lemmy.world/post/43647217

If it randomly locks up, try memtest86. It can often be found as a boot option on Linux installation images, but it is probably also available standalone.

https://lemmy.world/comment/22391753
https://lemmy.world/u/EarMaster posted on Feb 28, 2026 12:14
In reply to: https://lemmy.world/comment/22381427

That’s the beauty of a memory leak. Even plenty of RAM can get filled up to the brim…
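To tell a slow leak from a sudden hang, it helps to have memory usage on record from before the lockup. A small sketch that parses `/proc/meminfo`-style text; the sample numbers are invented to match the "64 gigs, about 30% used" figure from the thread, and the function names are illustrative.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            values[key.strip()] = int(fields[0])
    return values

def used_percent(meminfo):
    """Share of RAM not available for new allocations, based on MemAvailable."""
    total = meminfo["MemTotal"]
    return 100.0 * (total - meminfo["MemAvailable"]) / total

# Made-up sample; on a real system use open("/proc/meminfo").read()
sample = "MemTotal:       65536000 kB\nMemAvailable:   45875200 kB\n"
print(f"{used_percent(parse_meminfo(sample)):.1f}% used")
```

Run from cron and appended to a file, this would show whether usage was creeping up before 4am. Since the posted log shows `sysstat-collect` running, `sar -r` may already have that history.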

https://lemmy.world/comment/22393865
https://lemmy.world/u/AtHeartEngineer posted on Feb 28, 2026 14:54
In reply to: https://lemmy.world/post/43647217

If it keeps on happening, test your RAM. Any weird computer/hardware issue I’ve ever had (and I’ve done a lot of wild stuff for work) has been either a memory failure or a grounding problem.

https://lemmy.world/comment/22396021
https://lemmy.zip/u/motruck posted on Feb 28, 2026 17:19
In reply to: https://lemmy.world/post/43647217

Look at dmesg?

Look at the kernel log from earlier boots: journalctl -o short-precise -k -b -2
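The number after `-b` is a boot offset: 0 is the current boot, -1 the one before it, and so on, so which offset holds the lockup depends on how many times the machine has rebooted since. A small sketch that builds the same command for any offset; `kernel_log_cmd` and `show_kernel_log` are hypothetical helper names, and it assumes the journal is persistent, otherwise earlier boots are not kept.

```python
import subprocess

def kernel_log_cmd(boots_back):
    """journalctl arguments for kernel messages from `boots_back` boots ago (0 = current)."""
    if boots_back < 0:
        raise ValueError("boots_back must be 0 or greater")
    return ["journalctl", "--no-pager", "-o", "short-precise",
            "-k", "-b", str(-boots_back)]

def show_kernel_log(boots_back):
    """Run journalctl and return its output as text."""
    result = subprocess.run(kernel_log_cmd(boots_back),
                            capture_output=True, text=True, check=False)
    return result.stdout

print(" ".join(kernel_log_cmd(1)))
```

`journalctl --list-boots` shows which offsets exist and when each boot started and ended, which makes it easier to pick the one that covers 4am.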

https://lemmy.zip/comment/24940784
https://lemmy.world/u/essell posted on Mar 1, 2026 17:28
In reply to: https://lemmy.world/post/43647217

Had this problem for the last three months. Tested hard drives, memory and everything else.

Turned out to be a PSU with an intermittent issue where it would drop voltage. Very annoying!

https://lemmy.world/comment/22416527
https://kbin.melroy.org/u/SharkAttak posted on Mar 1, 2026 17:42
In reply to: https://anarchist.nexus/comment/2832731

So you’re saying there’s a market for lead lined PC cases? 🤔

https://kbin.melroy.org/m/selfhosted@lemmy.world/t/1538266/-/comment/11347775
https://anarchist.nexus/u/Wildmimic posted on Mar 1, 2026 18:55
In reply to: https://kbin.melroy.org/m/selfhosted@lemmy.world/t/1538266/-/comment/11347775

That might introduce more issues than it solves. If high-speed particles hit your shielding, you might get a “particle shower” from the impact onto your electronics. Radiation hardening is part of the design of the chips, mainly creating less dense structures with bigger transistors, because they don’t flip as easily as the very small gates on an H200. That’s also the reason why most space-based computers have the processing power of a system from around 2005.

https://anarchist.nexus/comment/2859523