Saturday 10 November 2012

Overcoming Powerouts

Regarding the power cut-outs: on the main IO board inside the server are three jumpers which alter system behaviour. J3 controls how the system deals with the two PSUs (power supply units): they can be used in dual mode or in redundancy mode. Dual mode is required for driving more than two CPUs or more than one tower of disks. We only have a single tower, and we only have 2x CPUs inside, so the very presence of the second PSU is a bit puzzling. Perhaps they had more than 2x CPUs at one time?

Nevertheless, enabling the J3 jumper sets PSU redundancy mode, where the second PSU is used only if the first fails. I set this jumper and the machine powered up again. Hmm, does this indicate a problem with PSU #1? I swapped them over and... got exactly the same behaviour. Hmm x2.
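For my own sanity, the J3 rule as I understand it boils down to a two-condition check. Here's a minimal sketch in Python; the function name and the "dual"/"redundant" labels are my own shorthand, not anything taken from DEC's documentation, and the thresholds are just the ones described above.

```python
def psu_mode_needed(cpus: int, disk_towers: int) -> str:
    """Return which PSU mode the configuration calls for.

    Hypothetical helper: encodes the rule that dual mode is required
    for more than two CPUs or more than one tower of disks.
    """
    if cpus > 2 or disk_towers > 1:
        return "dual"       # both PSUs needed to carry the load
    return "redundant"      # second PSU can sit as a standby (J3 set)


# Our machine: 2x CPUs and a single disk tower.
print(psu_mode_needed(cpus=2, disk_towers=1))
```

By that logic our configuration should be perfectly happy in redundancy mode, which is what setting J3 gives us.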

There are two LEDs on each PSU visible at the rear of the machine. In the event of these auto power-offs, the top LED of each PSU remains lit but its lower LED goes out. I can't find a reference to this via Google, but it appears that the lower LED is extinguished when the PSU trips. Why they trip is another question (possibly answered below).

However, with J3 enabled, the system will now power up OK some of the time. The rest of the time, I have to first reset PSU #2, and then the system stays powered.

Some googling resulted in this nugget from the comp.sys.dec newsgroup (generally accessible via Google here):
Please don't assume that any given switched mode PSU (which this
surely will be?) will operate correctly without a realistic load. It
is entirely possible for SMPSUs to not start or otherwise misbehave if
they have no load.

So, given that we're not booting properly, it could just be that the PSUs are bored and decide to take a nap. That'd be a nice outcome. Then I wouldn't have to order any replacement PSUs once the boot-up problem is fixed. What's that? We only found that comp.sys.dec reference after ordering a replacement PSU from eBay? Ah well, it's always good to have a spare. Or two.

The other two jumpers on the IO board control the Fail-Safe Loader (FSL) mode for the SRM firmware boot, which allows a minimal console to boot in the event of corruption of the main SRM firmware. Jumper J6 enables the FSL mode, and jumper J5 enables write access to the FSL when you want to update it to a later firmware version. I enabled J6 and powered on (twice, sigh) but saw no change on the OCP display and no console on the serial port.

One of the friendly souls on the #vms IRC channel tried to help out by uploading a CD ISO image of the 5.3-2 SRM firmware, which is the latest that supports the AlphaServer 2100, in an attempt to get the SRM updated via the CD. No joy, though: we need to boot to FSL or SRM to update the SRM, and we're not even getting past the OCP diagnostics yet.

Another random doc found via google said clearly:
The FSL for the 2100 and 2100A are not compatible.
The same is true for the FSL loader for the EV4 and EV5 CPUs.

So, it is possible this system has CPU modules on which the installed firmware simply can't execute. I have sent a mail to the original owner to ask whether he knows if this system has ever run OK with these CPUs inside.

Another possibility is a failure on one of the circuit boards. There are really only 4 in play:

  • the backplate motherboard which houses the CPU and memory boards, the IO board and other PCI/EISA boards 
  • the IO board which contains the SRM firmware and provides floppy and SCSI interfaces
  • the OCP module board which contains on/off/halt/reset buttons and the diagnostic startup display
  • the Fan Module board which controls the fans and halts the system in the event of thermal problems

Of these, it seems to me that the most likely candidate for failure is the IO board (assuming the system previously ran OK with the current CPUs and firmware). I found a (quality unknown) IO board for cheap on eBay, so bought it. Of course, it could be in the same position: containing firmware for EV4 CPUs. Is the only fix then to obtain a 4/275 CPU module just to get the SRM running?

So, now awaiting a 'new' PSU and IO board. While waiting, I stripped down all the cables from inside the machine and scrubbed the mould away.


