Fri, 25 Feb 2005
Our experience with Dell Thermal Event error - 22:18
This suggests something like fan failure allowing the case to become too hot, so last Friday I looked closer and noticed the power supply fan in one of the systems was indeed dead. We replaced the power supply and the machine appeared to work, at least initially. We rang Dell and asked them to rectify the problem with a few replacement power supplies (4 machines had failed at this point).
On Monday we discovered the machine with the replacement power supply was dead again; it would not stay up for more than 5 minutes after booting before overheating for some reason. All this time I had been googling around a fair bit trying to find out whether anyone had any real suggestions about what could cause this (more than one machine failing in so short a time was unlikely to be a simple hardware failure; it was too much of a coincidence).
Google turned up some Dell support pages which were no real help: if you see the error "Previous shutdown due to thermal event.", check that both fans are operating. Well yeah, they were not, but replacing them did not permanently fix it, and we had already been through that. A few comments in various user forums suggested some sort of motherboard problem, but they were not more specific and only said that in some instances Dell had replaced the boards and the problem had gone away.
Dell phone support suggested sending a few power supplies and new CPUs with new fan units, and lined up a tech to come over (a day or two later than the time by which our support contract stipulated they should have fixed the problem).
The tech rocked up with one replacement power supply and, fortunately, a replacement motherboard (just in case), even though we had reported 4 failures at this point. As soon as we described what was happening, the tech said "Oh, the capacitors on the board near the CPU have failed, they will be leaking or bloated". Apparently this has been happening with a large number of these Dell machines and other similar models. A worrying thing to find out, given that we have approximately 120 Dell OptiPlex GX270s in the department.
We had not even thought to look at the capacitors; fan failure and overheating did not suggest to us that they could be the problem. Neither Google searches nor Dell tech support suggested this as a possible cause either, which is why I am writing this now, in the hope that if it happens to someone else they can read this account of a possible cause for the error "Previous shutdown due to thermal event." in Dell desktop machines and other similar hardware.
I suppose it should possibly have occurred to us to look at the capacitors, as we have had large scale capacitor failure in the past across many nodes of our 192 CPU Bunyip Beowulf Cluster. The capacitors on many boards blew up, leaving large black holes around where they were mounted on the motherboard. (Bob has some photos, though I could not find them just now after a quick search.)
The failures in the cluster came after prolonged periods of running nothing but SSE2 instructions (by prolonged I mean a few days or even weeks at a time); that sort of constant current load was not initially factored into the board design by the manufacturer (Epox). Fortunately in that case Epox replaced all the motherboards with boards that could handle the high current for sustained periods.
In the case of these Dell desktops, most of them have not been working too heavily. Sure, many run intensive integer stuff for one of our researchers (a computer farm) out of hours, and they are all turned on 24/7, but this is not particularly high usage. It has been suggested by some other people on campus that
Dell is only the latest in a long line of affected electronics manufacturers. MSI (used by Protech), Gigabyte, ABIT and ASUS have been affected over the past 2 years. Motherboards, video cards, TV tuners, apparently even some standalone DVD players and other home electronics - anything these capacitors have been used in - have been playing up as well.
Since the Dell tech mentioned it had been happening a lot, we have started opening up a large number of the computers in the department and have found many capacitors bloated or leaking, just waiting to fail. If you have an OptiPlex GX270, you may want to have a look at the motherboard; the large capacitors near the CPU are the main culprits.