Server Crashing Experience

Last Friday, punknix.com went down. I noticed because I usually access the server remotely
(mail and ssh) from the office. When I got home that night, the server had hung. After a
reboot it hit a kernel panic, and after several more reboots it still would not boot.
At that moment I was very upset and afraid that all the data on the disk would be
lost. I shut down the machine, detached the hard disk, and tried some bootable CDs
such as Debian, Mandrake and Windows NT Server, and they all hung at some stage.

Finally, I plugged the hard disk into my desktop PC and booted from it directly. Boom, it booted
fine, with no kernel panic, but it hung when running the init process. I was upset to see this,
but at least it seemed that the PC could still access the disk, because it booted. I
then set up the PC with both its original disk and the disk from punknix, and ran VMware with a
Debian guest. I configured the VM to use the punknix disk as its secondary IDE drive. After Debian
booted in VMware, I mounted the disk and examined it. Fortunately, the disk mounted. But when I
looked into the / directory I was stunned! Both /etc and /usr were linked to /usr/bin/telnet.netkit.
That was definitely why the disk would not boot!
At that moment I was thinking "Damn! It has been hacked!"
If you had looked at my face, you would have seen it go all grey.
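
For anyone repeating this kind of check on a rescued disk, here is a minimal Python sketch that lists which top-level entries under a mount point are symbolic links and where they point. The mount point /mnt/punknix and the function name are assumptions made up for illustration; the post does not say where the disk was mounted or which commands were used.

```python
import os


def find_top_level_symlinks(mount_point):
    """Return {name: target} for every symlink directly under mount_point."""
    links = {}
    for name in sorted(os.listdir(mount_point)):
        path = os.path.join(mount_point, name)
        # islink() does not follow the link, so dangling links are reported too
        if os.path.islink(path):
            links[name] = os.readlink(path)
    return links


if __name__ == "__main__":
    # Hypothetical mount point, not taken from the post
    mount_point = "/mnt/punknix"
    if os.path.isdir(mount_point):
        for name, target in find_top_level_symlinks(mount_point).items():
            print(f"/{name} -> {target}")
```

On a healthy system /etc and /usr are real directories, so seeing them in this output at all would be a red flag.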

Later on, I looked into some of the log files in detail and found that the machine had started
behaving abnormally at around 25 Aug 2003 04:05 HKT.
The kernel syslog had no entries after 04:05, the httpd access log showed its last entry at 04:03,
and the tomcat log even reported a runtime exception. The home directory was still there. I didn't
look at it in detail, but I concluded it was safe. Seeing this I was relieved, and I started to
believe it was a pure hardware failure rather than a break-in, because a hacker would not be so
merciful as to leave all the files undeleted. But at that moment I didn't know whether it was the
CPU or the motherboard that had corrupted the disk. I ran fsck under Debian/VMware and it found
numerous duplicate nodes.
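
Finding the moment a machine died usually comes down to reading the last entry of each log. The following is a rough Python sketch of that step, not the procedure actually used here. The log paths are hypothetical examples under an assumed /mnt/punknix mount point, and the helper only looks at the final few kilobytes of each file.

```python
import os


def last_line(path, tail_bytes=4096):
    """Return the last non-empty line of a text file, or None if there is none.

    Only the final tail_bytes of the file are read, so a last line longer
    than that would come back truncated at the front.
    """
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        size = f.tell()
        f.seek(max(0, size - tail_bytes))
        tail = f.read().decode("utf-8", errors="replace")
    lines = [ln for ln in tail.splitlines() if ln.strip()]
    return lines[-1] if lines else None


if __name__ == "__main__":
    # Hypothetical log locations on the rescued disk, for illustration only
    candidates = [
        "/mnt/punknix/var/log/syslog",
        "/mnt/punknix/var/log/apache/access.log",
    ]
    for path in candidates:
        if os.path.isfile(path):
            print(path, "=>", last_line(path))
```

Comparing the timestamps of those final lines across several logs gives a narrow window for when the system stopped writing.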

The next day, I got a Slot 1 PII 233 and another Celeron 300A CPU and plugged each into the board;
the same kernel panic occurred again. So I concluded that the motherboard was faulty.
Finally, I decided to rebuild punknix. I went to the computer shops and bought what I needed.
Actually, I had been thinking about a dual Xeon or Opteron, but that was still too expensive for me at the moment.
After some serious thought I bought the following:

  • AMD Athlon XP 2500+ (333MHz FSB)
  • MagicPro MP-K7N-L Motherboard (nVIDIA nForce2 400 chipset)
  • 256MB DDR RAM (PC2700 - 333MHz)

This was the cheapest solution I could find. It gives me better price/performance than a P4.
I originally searched for a replacement Slot 1/Socket 370 board (punknix
was running a Celeron 700MHz), but I refused to pay around HK$250 for a second-hand board and I was
unable to find anything cheaper. In future, when I have the budget to upgrade to a dual Xeon or Opteron,
this hardware can replace the desktop PC (Celeron 1.3GHz) I am using now. So I now have a better upgrade path.

After going through the Debian installation, I restored the home directory and did the tedious setup
of services such as exim, smbd and imapd-ssl. The kernel was also rebuilt and upgraded to 2.4.20;
otherwise, it would hang under very heavy load in samba.
Punknix is now running, as you can see from this page, and I am slowly rebuilding the other missing pieces.
But the installation and setup process made me very tired, so not again!