Monday 3 December 2012

Developments!

The courier broke the base on the VT510, but the unit works great!


I had to fabricate an MMJ to MMJ cable. I used 6-core telephone wire from Maplins, and two normal phone jacks. Each phone jack underwent simple surgery to allow it to fit into the funny MMJ socket. The wires were hand-crimped into the jacks using a razorblade and a hammer. A continuity test with a multimeter on both ends of the cable proved successful, and use of the H8571-J to connect the terminal to a DS10L's serial port worked first time! Yay for small victories!

For the AlphaServer 2100 itself, Peter in Ireland has been incredibly helpful recently. He measured the voltages at the OCP connector during use:
1. 12V when system is on. 0V when system is off.
2. 5.2V when system is on. 0V when system is off.
3. 0V
4. 5.2V when system is on. 0V when system is off.
5. 0V
6. 5.2V when system is on. 0V when system is off.
7. 3.5V when system is on, dipping momentarily when reset is pressed.
8. 4.8V when system is on, varying when the display changes.
9. 4.8V when system is on, varying when the display changes.
10. 0.5V when power switch and door interlock are engaged, otherwise 2V.

When I did the same, I found that my pin 7 was 0.3V, corresponding to the permanent SYSTEM RESET message on the OCP. This (to cut a long story short) allowed us to conclude that my existing two IO boards were faulty. He's been having fun himself, getting his second CPU to show up in VMS but that will come later in our story.
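The comparison that exposed pin 7 can be sketched in a few lines of Python. This is only an illustration: the reference values are Peter's readings listed above, the 0.5V tolerance is my own assumption (not a DEC figure), and the function name is made up for the example.

```python
# Peter's readings from a working machine with the system on (volts).
# Pins 8 and 9 vary with the display, and pin 10 assumes the power switch
# and door interlock are engaged, so treat those values as approximate.
REFERENCE_ON = {1: 12.0, 2: 5.2, 3: 0.0, 4: 5.2, 5: 0.0,
                6: 5.2, 7: 3.5, 8: 4.8, 9: 4.8, 10: 0.5}

TOLERANCE_V = 0.5  # assumed margin, not taken from any DEC documentation


def suspect_pins(measured, reference=REFERENCE_ON, tolerance=TOLERANCE_V):
    """Return sorted pin numbers whose reading is off by more than the tolerance."""
    return sorted(pin for pin, volts in measured.items()
                  if abs(volts - reference[pin]) > tolerance)


# Only the pin 7 reading from my machine is recorded in this post.
my_readings = {7: 0.3}
print(suspect_pins(my_readings))  # prints [7]
```

Any pin left out of the measured readings is simply not checked, so a partial set of measurements still gives a usable answer.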

Shortly after concluding the IO boards were dead, the third IO board arrived and: it worked! It showed TEST CPU messages on the OCP, then TEST MEM messages! Hoorah!

Then FAIL MEM! Ah. Bugger.

I tried every combination of CPU and memory board with the new IO board, and concluded the following:

  • the original CPUs are dead
  • the original memory boards are dead

The new CPU is the only one which will allow the OCP to show TEST CPU/MEM etc. But all memory boards cause the FAIL, and the machine will go no further.

Peter very kindly offered to send some spare memory boards (2x128MB). I have a very good feeling about these!

STOP PRESS: I've just been told there are some more AlphaServer 2100 components available in Bristol, and I'm off to pick them up today! This includes 1x B2040 CPU (5/300), 1x B2022-FA 512MB memory board, IO board, 2x PSU, OCP panel, Fans and control board, remote ports board.

With this + Peter's memory, SURELY we can now breathe some life into this thing!

Friday 23 November 2012

Keyboard and terminal serial adapter

To complement the AlphaServer, I've acquired a DEC LK450 keyboard with a PS/2 connector. It arrived pretty filthy, but cleaned up nicely, and works fine when tested on a Linux box:


Also bought a DEC H8571-J DB9 RS232 to MMJ adapter:


This will be used by a DEC VT510 which is also due soon. Hopefully it won't be here for too long before I actually have a use for it...

Thursday 22 November 2012

New CPU arrives...

and STILL no difference in behaviour. This (much less dusty) B2040 (EV5/300) CPU causes the system to behave like the non-SYSTEM RESET CPU i.e.: no OCP wakeup, no report of NO MEM INSTALLED when the memory is removed, etc. So: now two CPUs behave like this, the other shows SYSTEM RESET. Sigh.

I noticed that the replacement IO board was pretty beaten up, and finally noticed that it was actually missing a capacitor on the rear. This isn't affecting anything so far, but the board is now unlikely to be kept for anything except parts if a working system is ever forthcoming.

Wondering whether the original IO board's DALLAS DS1287 RTC chip had run out of juice, I scratched off enough of the epoxy covering to expose the pins and measured it: 2.80V


So, that's likely enough juice to get the system working. So WHERE is it failing?

Because the replacement IO board (B2110-AA) is so beaten up, I decided to order another IO board from Ebay. I am singlehandedly keeping the courier market alive...

If that one doesn't help things, time to look to the three remaining boards: the 54-23151-01 remote ports board, the 54-23180-03 OCP board and the 54-23260-01 fan control module.

Then there are the cables... To get the SYSTEM RESET / NO MEM INSTALLED OCP display, all that is needed is the OCP cable from the IO board to the OCP module. So that seems OK.

The disk cables can't be required to work to get the SRM running. The remote ports cable links the IO board to the serial ports, keyboard ports etc. block. This would obviously be required to view the console output (assuming it isn't being channelled to VGA, but that will come later), but I can't see it being required to be connected for SRM initialisation. Either way, I've tried it connected and disconnected, and no change.

Roll on the next IO board...

Tuesday 20 November 2012

DALLAS DS1287 chip examined

Following the directions on the excellent http://www.mcamafia.de/mcapage0/dsrework.htm page, I exposed the internal pins on the DALLAS DS1287 RTC chip on the spare IO board to see if the battery had died. It measured a healthy? 2.89V (should be 3.00V). So I suspect that's not the problem on this board. I used a craft knife, cutting away carefully at the pin positions until I exposed enough to cut them.
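For a rough sense of how far down the cells are, here is a small Python sketch that expresses each reading as a percentage of the 3.00V nominal figure. The 2.89V and 2.80V values are the two measurements recorded on this blog (spare and original IO board respectively); the function name is just for illustration, and no cut-off voltage is assumed because I don't have one to hand.

```python
NOMINAL_V = 3.00  # nominal cell voltage quoted in the post


def percent_of_nominal(measured_v, nominal_v=NOMINAL_V):
    """Return the measured voltage as a percentage of nominal, to one decimal place."""
    return round(measured_v / nominal_v * 100, 1)


readings = {"spare IO board": 2.89, "original IO board": 2.80}
for board, volts in readings.items():
    print(f"{board}: {volts:.2f}V is {percent_of_nominal(volts)}% of nominal")
```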



New backplane arrives...

...and no change in behaviour. Poop.

Meanwhile, I've had some interesting conversations with Peter from Ireland who also has an AlphaServer 2100 (among other items), and he's filled me in on a couple of things plus described some of his own issues:

  • PSU labels: the top LED is labelled AC OK and the bottom LED is labelled DC OK.
  • SYSTEM RESET only appears very briefly after the reset button is pressed
  • His machine previously had some power/fan issues, which would sometimes lead to nothing appearing on the OCP at power up, and no boot. This was normally fixed with another power on. Subsequently replacing the IO board appears to overcome this, but the CPUs and memory have also been changed since then.
  • Running an EV4 4/275 CPU module with an IO board running EV5 firmware leads to "FAIL I/O_00 0004" being displayed on the OCP. (I think this is also what you see when the FSL firmware boots, but I'll need to recheck the docs).
  • He has 2x 5/300 CPU modules, which both register OK in SRM, but his VMS currently only sees one. Both pass with P (rather than F for Fail) in the SRM tests, but whichever one is in slot CPU1 results in the ID of ?????-? being shown. 
  • Like a previous suggestion from a reader of this blog, he has suspicions that the IO board could cause problems if the battery/realtime clock had died. On the IO board, this is the DALLAS DS1287 and he wonders if perhaps surgery such as this could solve it? 
I have ordered 1x 5/300 CPU module from Ebay thinking that this is the next most likely component to have failed. If it turns out to be the battery on the IO boards, then hey, I'll have 3x CPUs for VMS to ignore.

Wednesday 14 November 2012

Rust-B-Gone

The computer with a water feature.
After 2 balls of wire wool, half a bottle of vinegar and most of the skin of my fingers, the rust is almost entirely gone from the rear and card cabinet:

Shiny!

Shiny! 
Shiny!
In other news, the new PSU works fine along with one of the old ones in dual mode. Huzzah! Also, a replacement backplane is on its way from Ireland. Let's see if that makes a difference. The backplane, that is, not coming from Ireland.

If that doesn't change the system behaviour, I think we're going to have to have a difficult conversation with the CPU modules.

Monday 12 November 2012

'New' IO board arrives...

But gives exactly the same results: CPU A shows SYSTEM RESET, CPU B shows nothing. Hmm.

So what other circuit boards do we have in play that can be replaced?

Card Cabinet:

  • 2x 5/300 CPU boards (known difference in behaviour, but both get warm)
  • 2x 512MB boards (system knows when both are missing)
  • backplane board
  • already swapped-out IO board (2 boards, exactly the same behaviour - I suspect the boards are OK)
  • Remote Ports board (serial, keyboard, etc. I missed this off the earlier list of components. However, I get the same OCP display regardless of whether this is connected or not.)

Front cabinet:

  • Fan Control Module board (fans spin OK and don't cause system shutdown)
  • OCP board (Display works, Power & Halt buttons have LEDs which work. Reset button measured for switching behaviour) 
  • StorageWorks bay (can likely be completely ignored until the SRM boots)

The visible rust damage is much more pronounced towards the rear of the machine. So is corrosion damage more likely to be on components towards the rear? The backplane board doesn't seem to do very much, other than provide IO and expansion slots, but there's at least one large chip on there. Each of the above components has been removed and had its connectors cleaned at least once. The exception is the backplane, which only has receiving slots that I can't work out how to clean... Perhaps it's time for a gentle compressed air dust blower for the backplane. Other than that, time for a replacement?

One other angle could be the cables. Ignoring the EISA cable to StorageWorks, and the IDE? cable to the floppy, the IO board has two main cables: the OCP connection, and the Remote Ports connection. We know the OCP cable is good enough to enable the OCP to display NO MEM INSTALLED and SYSTEM RESET, so the current assumption is that it's probably OK.





Sunday 11 November 2012

Original owner replies

The original owner has replied to my question about whether the box has successfully been used with the CPUs in place, and he said yes. The machine was left in a garage for a long time, and must have deteriorated.

So. A failed component somewhere. The IO board? Both CPUs failing seems less likely, somehow. Do I just need to clean things more? It's very positive to know that it was running OK at last use.

Another interesting discovery: the CPU which causes the SYSTEM RESET message also shows the OCP message: NO MEM INSTALLED when the memory is removed. The other CPU causes no OCP output at all with memory present or removed.

Hoff on #vms suggested checking the OCP board's reset switch. It seems OK according to a quick multimeter test: the pins have 5V across them when the button is pressed, otherwise they don't.

One of these CPU modules is not the same...

The two CPU boards I have don't behave the same. Let's call them A and B. When A is in slot CPU0, we consistently get SYSTEM RESET displayed on the Operator Control Panel (but no other signs of booting). When B is in slot CPU0, we get nothing shown on the OCP (and no other signs of booting). This is the same behaviour when A or B is in slot CPU1 and CPU0 is left empty. In both cases, the above behaviour is the same if slot CPU1 is also populated or empty.

Can we conclude that one of A or B is broken? I've cleaned the contacts of both boards (and the memory, and the IO board) with isopropyl alcohol, but haven't seen any changes in behaviour.

Both CPUs get warm when connected in either slot, so they're both alive to some degree.

I also started work on removing some of the rust from the cabinet.

A really rusty rear.

Rusty brown is this year's black, apparently.

I removed the divider tray from the main compartment and scrubbed it clean of rust. Here's the backplane board unobstructed.

Backplane: 54-23149-01

I'm also managing to power up successfully every time now, by switching off PSU #1 and leaving PSU #2 to do all the work (with jumper J3 enabled).

Saturday 10 November 2012

Overcoming Powerouts

Regarding the power cut-outs: on the main IO board inside the server are three jumpers which alter system behaviour. J3 controls how the system deals with the two PSUs (power supply units): they can be used in dual mode, or redundancy mode. Dual mode is required for driving more than two CPUs or more than one tower of disks. We only have a single tower, and we only have 2x CPUs inside, so the very presence of the second PSU is a bit puzzling. Perhaps they had more than 2x CPUs at one time?

Nevertheless, enabling the J3 jumper sets PSU redundancy mode, where the second PSU is used in the event of a failure of the first. I set this jumper and the machine powered up again. Hmm, does this indicate a problem with PSU #1? I swapped them over and... got exactly the same behaviour. Hmm x2.
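The dual-versus-redundancy rule above is simple enough to write down as code. This is a sketch of the rule as I understand it from the docs, not anything the firmware actually runs; the function names and the "dual"/"redundant" labels are my own.

```python
def required_psu_mode(cpu_count, disk_towers):
    """Dual mode is needed for more than two CPUs or more than one disk tower."""
    if cpu_count > 2 or disk_towers > 1:
        return "dual"
    return "redundant"


def j3_enabled(mode):
    """Enabling J3 selects redundancy mode; leaving it off selects dual mode."""
    return mode == "redundant"


# This machine: 2x CPUs and a single tower of disks.
mode = required_psu_mode(cpu_count=2, disk_towers=1)
print(mode, j3_enabled(mode))  # prints: redundant True
```

By this rule a two-CPU, single-tower box never needs dual mode, which is why the second PSU looked puzzling in the first place.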

There are two LEDs on each PSU visible at the rear of the machine. In the event of these auto power-offs, the top LED of each PSU remains lit but its lower LED goes out. I can't find a reference to this via Google, but it appears that the lower LED is extinguished when the PSU trips. Why they trip is another question (possibly answered below).

However, with J3 enabled, the system will now power up OK some of the time. The rest of the time, I have to first reset PSU #2, and then the system stays powered.

Some googling resulted in this nugget from the comp.sys.dec newsgroup (generally accessible via Google here):
Please don't assume that any given switched mode PSU (which this
surely will be?) will operate correctly without a realistic load. It
is entirely possible for SMPSUs to not start or otherwise misbehave if
they have no load.
So, given that we're not booting properly, it could just be that the PSUs are bored and decide to take a nap. That'd be a nice outcome. Then I wouldn't have to order any replacement PSUs once the boot up problem is fixed. What's that? We only found that comp.sys.dec reference after ordering a replacement PSU from Ebay? Ah well, it's always good to have a spare. Or two.

The other jumpers on the IO board in the system control the Fail-Safe Loader (FSL) mode for the SRM firmware boot, which allows a minimal console to boot in the event of corruption of the main SRM firmware. Jumper J6 on the IO board enables the FSL mode, and jumper J5 enables write access to the FSL when you want to update it to a later firmware version. I enabled J6 and powered on (twice, sigh) but didn't see any change to the OCP display or a console via the serial port.
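As a quick reference, the three IO board jumpers mentioned in this post can be kept in a small lookup table. The descriptions are my own summaries of what is written above, and the helper name is invented for the example.

```python
# Summaries of the three IO board jumpers as described in this post.
IO_BOARD_JUMPERS = {
    "J3": "PSU redundancy mode (second PSU used if the first fails)",
    "J5": "write access to the Fail-Safe Loader, for firmware updates",
    "J6": "Fail-Safe Loader (FSL) mode for the SRM firmware boot",
}


def describe(enabled):
    """Return one description line per enabled jumper, in jumper order."""
    return [f"{name}: {IO_BOARD_JUMPERS[name]}" for name in sorted(enabled)]


# The configuration tried in this post: J3 and J6 enabled.
for line in describe({"J6", "J3"}):
    print(line)
```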

One of the friendly souls on the #vms IRC channel tried to help out by uploading a CD ISO image of the 5.3-2 SRM firmware which is the latest that supports the AlphaServer 2100, in an attempt to get the SRM updated via the CD, but no joy: we need to boot to FSL or SRM to update the SRM, and we're not even getting past the OCP diagnostics yet.

Another random doc found via google said clearly:
The FSL for the 2100 and 2100A are not compatible.
The same is true for the FSL loader for the EV4 and EV5 CPUs.

So, it is possible this system has CPU modules which the firmware simply can't be executed on. I have sent a mail to the original owner to ask if he knows if this system has ever run OK with these CPUs inside.

Another possibility is a failure on one of the circuit boards. There are really only 4 in play:

  • the backplane motherboard which houses the CPU and memory boards, the IO board and other PCI/EISA boards
  • the IO board which contains the SRM firmware and provides floppy and SCSI interfaces
  • the OCP module board which contains on/off/halt/reset buttons and the diagnostic startup display
  • the Fan Module board which controls the fans and halts the system in the event of thermal problems

Of these, it seems to me that the most likely candidate for failure is the IO board (assuming the system previously ran OK with the current CPUs and firmware). I found a (quality unknown) IO board for cheap on Ebay, so bought it. Of course, it could be in the same position: containing firmware for EV4 CPUs. Is the only fix then to obtain a 4/275 CPU module just to get the SRM running?

So, now awaiting a 'new' PSU and IO board. While waiting, I stripped down all the cables from inside the machine and scrubbed the mould away.



Friday 9 November 2012

Not Quite DOA...

It might have only cost £10 on Ebay, but this thing weighs about 75kg, and organising a courier cost me £114 for the 377 mile trip from Ayr, Scotland to my home in Abergavenny, Wales. And that's after the first courier turned me down flat...

So, I paid up, and it soon arrived. It took two of us to wheel it from the driveway up to the house and struggle with it through the door. Once embedded in the study, I plugged in the two (!) standard PC power supply cables, attached a VGA cable and a PS/2 keyboard, and turned it on.

SYSTEM RESET said the operator control panel in bright green text.

Wow. It isn't dead. The Ebay listing had said:
The server hardware is complete with CPU, memory, networking, etc. but has not been booted for a while so it might not work.  
Operating system currently loaded is NT 4.
Consequently, I didn't know if this machine was going to power up or not. But you can (usually) replace failed components, from PSUs to CPUs. Finding out what's broken and sourcing a replacement was going to be the main source of fun for this project. (Remind me of that sentence when, in eight posts time, I'm knee deep in PCBs and multimeters and tufts of pulled hair.)

But, powering up is good: the PSUs are presumably OK. The Operator Control Panel works. The six hard disks are lighting up and spinning up. Good, good, good!

These servers can run a number of operating systems. This one supports Microsoft's NT, as well as OpenVMS and Tru64 Unix from DEC. We won't mention NT any more than strictly necessary, and I have equally little interest in Tru64 Unix. VMS is my thing, here. If I wanted Unix, I'd stick with one of the many Linux boxes dotted around the place. (I've never really understood the non-Linux Unix enthusiasts who will only run FreeBSD etc. Then again, none of those people would understand why I want to get OpenVMS working on 17-year-old hardware either, so hey-ho.)

I'd read through the owner's guide and firmware guide for the AlphaServer 2100, which still exist on HP's site. (HP acquired Compaq, who had previously acquired DEC. So now many historical web references to http.digital.com or ftp.compaq.com etc. require translation to an equivalent HP site. If you're lucky.) The firmware guide suggested that SYSTEM RESET occurs after the system reset button has been pressed. Seemed reasonable.

However, the firmware guide doesn't mention what happens if the SYSTEM RESET phrase stays on the operator control panel permanently. Which presumably means this isn't normal.

At power-up, the AlphaServer 2100, like most (all?) Alpha systems, normally enters SRM (system reference manual) prior to operating system boot so that you can do some high-level tinkering to reconfigure some of the hardware settings (e.g. change the location of the boot disk, etc.). The expected order of events here is:
Operator Control Panel powerup display -> System Startup Screen -> SRM
The OCP display should show a brief series of component test indicators. This is followed by a console (VGA or serial - as configured) display of the system startup screen describing CPU and memory and bus probe/test results, and finally the console prompt of SRM (or an alternative console called ARC, if you're booting to NT. We're *so* not booting to NT.)
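The expected order of events described above can be written as a tiny Python sketch, handy for stating exactly where a boot attempt stalls. The stage names and the helper function are my own labels, not anything from the DEC documentation.

```python
from enum import IntEnum


class BootStage(IntEnum):
    """Expected power-up stages, in order, as described in this post."""
    OCP_POWERUP_DISPLAY = 1
    SYSTEM_STARTUP_SCREEN = 2
    SRM_PROMPT = 3


def next_expected(stage):
    """Return the stage that should follow, or None once the SRM prompt is reached."""
    if stage == BootStage.SRM_PROMPT:
        return None
    return BootStage(stage + 1)


# A machine that never gets beyond the OCP is still waiting for this stage:
print(next_expected(BootStage.OCP_POWERUP_DISPLAY).name)  # prints SYSTEM_STARTUP_SCREEN
```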

But we're stuck on the OCP displaying 'SYSTEM RESET'. Hmm.

Time to power off and take a look inside.

Yes. That's mould. No. It shouldn't be there.
Scuzzy SCSI cables.

I took a look at the CPUs and memory. From the Ebay listing I was expecting 2x 4/275 CPUs (21064A = EV45 at 275 MHz) and 2x ??? memory (either 128MB, 256MB or 512MB boards). I was very pleased to learn that there are actually 2x 5/300 CPUs (21164 = EV5 at 291 MHz) and 2x 512MB RAM boards present. A dual EV5 with 1GB RAM? For £10? Awesome! (Except it doesn't work).


The two CPU boards have the silver heat sink on them.

I removed each removable board and reseated them, and retried the system, same behaviour. Hmm.

Previously, on the IRC channel for VMS (irc.2600.net - channel #vms - do pop in and say hi!) I had some help locating the existing documentation for the AlphaServer 2100, along with some descriptions of the process of upgrading a 2100 from the 4/275 CPUs to the 5/300 (because I'd seen such CPUs for sale on Ebay). The conversations went along these lines: upgrade the SRM before you upgrade the CPUs or the system won't start. The SRM for EV45s won't work with EV5 CPUs.

Ah.  Could this be what the problem here is?

Then the machine powered itself off. I started it again and it went off again after a couple of seconds.

Arse.



[000000]

Back in 1991, I entered University at Lampeter, Wales, UK.

For four years, while ostensibly studying Philosophy, I learned how to wash clothes for the first time in my life, and rediscovered a love of computing, dormant since I'd sold my 48K ZX Spectrum (+ microdrive!) in 1986 following the inevitable introduction of a teenage male to booze, girls and used record shops.

At Lampeter, I discovered Microsoft Windows 3.11, VAX/VMS 5.2 and the Internet. All three of these have evolved a lot since then, but while two of them went on to become household names, VMS stayed resolutely in the background, quietly running many of the world's mission critical applications until cheaper hardware and software made many such installations financially unviable. But: Oh, My! VMS was lovely.

Lampeter was an arts university, with no full time computing course, but even back in 1991 most students needed computers to produce and print documents, and the small but perfectly formed Computer Unit provided the required tools in the shape of PCs, served by a VAX 3400 (if memory serves). Electronic Mail was offered via (emulated) terminal connection to the VAX from the PCs, and users would have to learn and use VMS to read and send their mail. That was probably the only use of VMS for most users, with the Windows PC providing the majority of the tools. But some of us chose the red pill, and life was never the same again: NETHACK, FINGER, anonymous FTP, TELNET, MUDs, chatrooms, ADVENTURE, MORIA, EMPIRE, programming in DEC PASCAL and DEC C...

In 1992, DEC (or Digital as they were known to aficionados) introduced a new CPU and hardware platform to allow the venerable VAX hardware line to sleep the good sleep. They called this CPU the Alpha, and ported VMS to it (and changed the name to OpenVMS for both Alpha and VAX versions). They made several Alpha models, and these were some of the most desirable processing tools available at the time, with a pricetag to reflect the cutting edge capability, e.g. the AlphaServer 2100 with 3x 5/300 CPUs retailed at more than US$600 000.

In 2012, I bought one of these for £10 on Ebay.

It doesn't work. But that's where the fun starts. This blog will detail the trials and tribulations of my attempts at restoring this machine to the point where it can run OpenVMS, and ultimately join in a local cluster of DEC hardware, and even join a network of DEC machines around the world.

The AlphaServer 2100 5/300 (the CPUs have actually been upgraded from 4/275s to 5/300s). The box beside it is a 3.5" hard disk in a StorageWorks enclosure.

There's a CD-ROM at the top right and floppy drive at top centre. Six 4GB disks are in the central column. This is a big beast of a server. (But they do come much bigger...)