CEOs Blog

October 21, 2009

Fun with SGIs. (Long and extremely technical)

Filed under: Uncategorized — admin @ 6:42 pm

Background

I have always been an extreme fan of SGI, ever since I saw the first stereographics demo. I chose one course of postgraduate work over the others so I could work on the university's big 8-processor SGI Power Challenge, while my friends fought with the little Cray EL92.

Onyx2

When I had the chance to build a 4-CPU Onyx2 with a single InfiniteReality3 graphics pipe up into a 24-CPU machine with two graphics pipes, I went for it. I sourced parts from Italy, Sweden, France and Canada, but most of them came from the east and west coasts of the USA.

Here is a picture of the machine and a hardware inventory:

http://pymblesoftware.com/onyx2.html

I have collected the aforementioned 24-CPU, CrayLink-connected Onyx2, 8 SGI Indys, a pair of O2s, a pair of Octanes, an Origin 200 with GigaChannel, three Origin 300s, a pair of Personal IRISes, an Indigo2, an Indigo, and I am probably forgetting something. Like I said, I am a bit of an SGI fan.

My relationship with the company, however, has always been interesting, to say the least.

The problem

I wanted to link three of the Origin 300s into a single system image. Instead of three separate servers with 4 CPUs and 4 GB of RAM each, they would look like a single motherboard with 12 RISC CPUs and 12 GB of RAM. The linking is done with special cables as thick as your arm, with LEDs in the connectors.

Dealing with SGI

I called the sales rep at SGI, the same one I always seem to deal with. “I have three Origin 300s and I want to ccNUMA link them into a single system image,” I say. Several days later the answer comes back: if the parts are available, then maybe they can do it for about $50,000. “Interesting”, I thought. So I got my hands on an L2 controller and an Origin 3000 series ccNUMAlink router brick for less than a thousand dollars. Here is where things start to get interesting. I connected everything up, even though it is a 3000 series router brick and not an Origin 300 NUMAlink module.

Serial number mismatch

So I fire the machine up and immediately get a “Serial number mismatch” error on the L1 LCDs of every module. Fine. Let’s find a way around this.

A bit of investigation

L1001231-001-L2>ver

L2 version: 1.36.0
L1001231-001-L2>serial
L2 system serial number: L1001231.
L1001231-001-L2>

L1001231-001-L2>1.25 l1
entering L1 mode 001r25, <CTRL-T> to escape to L2

001r25-L1>ver
L1 1.40.4 (Image B), Built 09/29/2005 13:42:59 [Base 1MB image]
001r25-L1>serial all

Data Location Value
—————————— ———— ——–
Local System Serial Number NVRAM L1001241
Reference System Serial Number Attached L2 L1001231
Local Brick Serial Number EEPROM MHR829
Reference Brick Serial Number NVRAM MHR829

EEPROM Product Name Serial Part Number Rev T/W
———- ————– ————- ——————– — ——
POWER RPWR MHR829 030_1631_002 E 00
LOGIC ROUTER LGP402 030_1634_002 B 00

001r25-L1>

L1001231-001-L2>reboot_l2
will reboot in 5 seconds…
INIT: Switching to runlevel: 6
Sending processes the TERM signal
Restartinÿ

Validating L2 Controller Flash image….OK
Booting…

Ethernet address from Motorola VPD EEPROM is 08:00:69:11:B1:77

Linux/PPC load:
Uncompressing Linux…done.
Now booting the kernel
Linux version 2.4.7-sgil2 (dsd@tstorm) (gcc version 2.95.2 19991030 (2.95.3 prerelease/franzo)) #1 Mon Feb 28 14:51:03 CST 2005
On node 0 totalpages: 4096
zone(0): 4096 pages.
zone(1): 0 pages.
zone(2): 0 pages.
Kernel command line: root=/dev/ram panic=5
Decrementer Frequency = 187500000/60
Calibrating delay loop… 49.76 BogoMIPS
Memory: 11904k available (952k kernel code, 512k data, 180k init, 0k highmem)
Dentry-cache hash table entries: 2048 (order: 2, 16384 bytes)
Inode-cache hash table entries: 1024 (order: 1, 8192 bytes)
Mount-cache hash table entries: 512 (order: 0, 4096 bytes)
Buffer-cache hash table entries: 1024 (order: 0, 4096 bytes)
Page-cache hash table entries: 4096 (order: 2, 16384 bytes)
POSIX conformance testing by UNIFIX
PCI: Probing PCI hardware
I/O resource not set for host bridge 0
Memory resource not set for host bridge 0
PCI: Cannot allocate resource region 0 of PCI bridge 0
PCI: resource is 80000000..7fffffff (100), parent c011c314
PCI:00:04.0: Resource 0: c0000000-c0000fff (f=200)
PCI:00:05.0: Resource 0: c0001000-c0001fff (f=200)
Linux NET4.0 for Linux 2.4
Based upon Swansea University Computer Society NET3.039
Starting kswapd v1.8
i2c-core.o: i2c core module
i2c-dev.o: i2c /dev entries driver module
i2c-core.o: driver i2c-dev dummy driver registered.
i2c-algo-8xx.o: i2c mpc8xx algorithm module
i2c-rpx.o: i2c RPX Lite/MBX module
i2c-dev.o: Registered ‘rpx’ as minor 0
i2c-core.o: adapter rpx registered as adapter 0.
Console: switching to frame buffer device
fb0: SGI L2 (SED137x LCD controller) frame buffer device
fb0: Display panel [mono]: Hantronix HDM3224 (320×240, 4-bit Greyscale)
CPM UART driver version 0.03
ttyS00 at 0x0000 is a SCC
ttyS01 at 0x0100 is a SCC
ttyS02 at 0x0200 is a SCC
ttyS03 at 0x0300 is a SCC
WDT_8xx: Software Watchdog Timer version 0.3, 30 second timeout
block: queued sectors max/low 7810kB/2603kB, 64 slots per queue
RAMDISK driver initialized: 16 RAM disks of 4096K size 1024 blocksize
eth0: FEC ENET Version 0.1, 08:fec: Phy @ 0x0, type 0x78100003
fec: link down
00:fec: 10 Mbps, Half-Duplex
69:11:b1:77
PowerPC realtime clock driver, version 0.1.
usb.c: registered new driver usbdevfs
usb.c: registered new driver hub
PCI: Enabling device 00:04.0 (0000 -> 0002)
usb-ohci.c: USB OHCI at membase 0xc2002000, IRQ 8
usb-ohci.c: usb-00:04.0, PCI device 11c1:5802 (Lucent Microelectronics)
usb.c: new USB bus registered, assigned bus number 1
Product: USB OHCI Root Hub
SerialNumber: c2002000
hub.c: USB hub found
hub.c: 2 ports detected
PCI: Enabling device 00:05.0 (0000 -> 0002)
usb-ohci.c: USB OHCI at membase 0xc2004000, IRQ 10
usb-ohci.c: usb-00:05.0, PCI device 11c1:5802 (Lucent Microelectronics)
usb.c: new USB bus registered, assigned bus number 2
Product: USB OHCI Root Hub
SerialNumber: c2004000
hub.c: USB hub found
hub.c: 2 ports detected
usb-ohci.c: v5.2:USB OHCI Host Controller Driver
usb.c: registered new driver sgil1
usb.c: registered new driver sgil1
usb.c: registered new driver sgil1
sgil1.c: SGI L1 controller support registered
NET4: Linux TCP/IP 1.0 for NET4.0
IP Protocols: ICMP, UDP, TCP, IGMP
IP: routing cache hash table of 512 buckets, 4Kbytes
TCP: Hash tables configured (established 1024 bind 1024)
NET4: Unix domain sockets 1.0/SMP for Linux NET4.0.
RAMDISK: cramfs filesystem found at block 0
RAMDISK: overriding ramdisk block size to 4096 for cramfs filesystem
RAMDISK: Loading 2012 blocks [1 disk] into ram disk… done.
Freeing initrd memory: 2012k freed
VFS: Mounted root (cramfs filesystem).
Freeing unused kernel memory: 180k init
INIT: version 2.77 booting
cp: /rhosts.allow: No such file or directory
Starting DHCP client daemon….
hub.c: USB new device connect on bus1/1, assigned device number 2
Manufacturer: Silicon Graphics, Inc.
Product: SN1 L1 System Controller
SerialNumber: 00000000
sgil1.c: SGI L1 connected, minor: 64 device: 1.2
hub.c: USB new device connect on bus1/2, assigned device number 3
hub.c: USB hub found
hub.c: 7 ports detected
hub.c: USB new device connect on bus1/2/1, assigned device number 4
Manufacturer: Silicon Graphics, Inc.
Product: SN1 L1 System Controller
SerialNumber: 00000000
sgil1.c: SGI L1 connected, minor: 65 device: 1.4
hub.c: USB new device connect on bus1/2/2, assigned device number 5
usb.c: USB device not accepting new address=5 (error=-110)
hub.c: USB new device connect on bus1/2/2, assigned device number 6
usb.c: USB device not accepting new address=6 (error=-110)
hub.c: USB new device connect on bus1/2/4, assigned device number 7
Manufacturer: Silicon Graphics, Inc.
Product: SN1 L1 System Controller
SerialNumber: 00000000
sgil1.c: SGI L1 connected, minor: 66 device: 1.7
hub.c: USB new device connect on bus1/2/5, assigned device number 8
Manufacturer: Silicon Graphics, Inc.
Product: SN1 L1 System Controller
SerialNumber: 00000000
sgil1.c: SGI L1 connected, minor: 67 device: 1.8
dhcpcd[28]: timed out waiting for a valid DHCP server response

INFO: No DHCP server found, starting local DHCP server (to serve L3 clients).
INFO: DHCP: new IP address is 10.17.177.119
INIT: Entering runlevel: 5

SGI L2 Controller
Current L2 version: 1.36.0 (L2 emulator: 1.36.0)
Flashed L2 version: 1.36.0

INFO: opened USB control /dev/sgil1_cs
INFO: opened USB device at b1;p1;d2 (/dev/sgil1_0)
INFO: opened USB device at b1;p2/1;d4 (/dev/sgil1_1)
INFO: opened USB device at b1;p2/4;d7 (/dev/sgil1_2)
INFO: opened USB device at b1;p2/5;d8 (/dev/sgil1_3)
INFO: SMP listening on port: 8001
INFO: attempting connection to localhost:9002

INFO: auto power up appears enabled
INFO: attempting connection to localhost:9002
INFO: auto power up in 30 seconds…
L1001231-001-L2>INFO: Validating connection from ‘localhost’ (127.0.0.1)
INFO: connection to localhost:9002 established.
INFO: Validating connection from ‘localhost’ (127.0.0.1)
INFO: connection to localhost:9002 established.
INFO: Connection – sgi (sgi) @ localhost running ‘l2flash’.
INFO: Connection – sgi (sgi) @ localhost running ‘l2gui’.
INFO: attempting connection to localhost:9002
INFO: Validating connection from ‘localhost’ (127.0.0.1)
INFO: connection to localhost:9002 established.
INFO: Connection – sgi (sgi) @ localhost running ‘l2part’.
INFO: auto power up in 25 seconds…
INFO: auto power up in 20 seconds…
INFO: auto power up in 15 seconds…
INFO: auto power up in 10 seconds…
INFO: auto power up in 5 seconds…
INFO: initiating auto power up.
ERROR: auto power up error.
001r25ERROR: SerNum:System Serial Number mismatch. See log for details.
001c01ERROR: SerNum:System Serial Number mismatch. See log for details.
001c02ERROR: SerNum:System Serial Number mismatch. See log for details.
001c03ERROR: SerNum:System Serial Number mismatch. See log for details.
001c01ERROR: power appears off.
001c02ERROR: power appears off.
001c03ERROR: power appears off.
001r25ERROR: power appears off.

L1001231-001-L2>serial all
001c01:

Data Location Value
—————————— ———— ——–
Local System Serial Number NVRAM M2001411
Reference System Serial Number Attached L2 L1001231
Local Brick Serial Number EEPROM KJD687
Reference Brick Serial Number NVRAM KJD687

EEPROM Product Name Serial Part Number Rev T/W
———- ————– ————- ——————– — ——
NODE IP45_4CPU KJD687 030_1797_001 B 00
IO8 IO8 MHL579 030_1673_003 E 00

EEPROM JEDEC-SPD Info Part Number Rev Speed SGI
———- ———————— —————— —- —— ——–
DIMM 0 CE0000000000000026051400 M3 46L2820BT1-CA0 0B 8.0 N/A
DIMM 2 CE000000000000000C7D8000 M3 46L2820DT2-CA0 2D 10.0 N/A
DIMM 1 7F7FFE000000000012000092 CM2201B 2 8.0 N/A
DIMM 3 7F7FFE000000000012000151 CM2201B 2 8.0 N/A

001c02:

Data Location Value
—————————— ———— ——–
Local System Serial Number NVRAM M2001411
Reference System Serial Number Attached L2 L1001231
Local Brick Serial Number EEPROM MLJ194
Reference Brick Serial Number NVRAM MLJ194

EEPROM Product Name Serial Part Number Rev T/W
———- ————– ———- ——————– — ——
NODE IP45_4CPU MLJ194 030_1779_001 C 00
IO8 IO8 MLG164 030_1673_003 F 00

EEPROM JEDEC Info Part Number Rev Speed (ns)
———- ———————— —————— — ———-
DIMM 0 7F7FFE000000000012000145 CM2201B 2 08.0
DIMM 2 7F7FFE000000000012000141 CM2201B 2 08.0
DIMM 1 CE0000000000000026061400 M3 46L2820BT1-CA0 0B 08.0
DIMM 3 CE000000000000000C698000 M3 46L2820DT2-CA0 2D 10.0

001c03:

Data Location Value
—————————— ———— ——–
Local System Serial Number NVRAM M2100226
Reference System Serial Number Attached L2 L1001231
Local Brick Serial Number EEPROM MJN491
Reference Brick Serial Number NVRAM MJN491

EEPROM Product Name Serial Part Number Rev T/W
———- ————– ———- ——————– — ——
NODE IP45_4CPU MJN491 030_1728_002 D 00
IO8 IO8 MHR530 030_1673_003 F 00

EEPROM JEDEC Info Part Number Rev
———- ———————— —————— —
DIMM 0 10000000000000002E015A00 444BH 2
DIMM 2 100000000000000034015A00 444BH 2
DIMM 1 10000000000000006E005A00 444BH 2
DIMM 3 10000000000000000A015A00 444BH 2

001r25:

Data Location Value
—————————— ———— ——–
Local System Serial Number NVRAM L1001241
Reference System Serial Number Attached L2 L1001231
Local Brick Serial Number EEPROM MHR829
Reference Brick Serial Number NVRAM MHR829

EEPROM Product Name Serial Part Number Rev T/W
———- ————– ————- ——————– — ——
POWER RPWR MHR829 030_1631_002 E 00
LOGIC ROUTER LGP402 030_1634_002 B 00

L1001231-001-L2>
L1001231-001-L2> 1.1 l1
entering L1 mode 001c01, <CTRL-T> to escape to L2

001c01-L1>serial all

Data Location Value
—————————— ———— ——–
Local System Serial Number NVRAM M2001411
Reference System Serial Number Attached L2 L1001231
Local Brick Serial Number EEPROM KJD687
Reference Brick Serial Number NVRAM KJD687

EEPROM Product Name Serial Part Number Rev T/W
———- ————– ————- ——————– — ——
NODE IP45_4CPU KJD687 030_1797_001 B 00
IO8 IO8 MHL579 030_1673_003 E 00

EEPROM JEDEC-SPD Info Part Number Rev Speed SGI
———- ———————— —————— —- —— ——–
DIMM 0 CE0000000000000026051400 M3 46L2820BT1-CA0 0B 8.0 N/A
DIMM 2 CE000000000000000C7D8000 M3 46L2820DT2-CA0 2D 10.0 N/A
DIMM 1 7F7FFE000000000012000092 CM2201B 2 8.0 N/A
DIMM 3 7F7FFE000000000012000151 CM2201B 2 8.0 N/A

001c01-L1>ver
L1 1.30.14 (Image B), Built 08/05/2004 11:09:57 [Base 1MB image]
001c01-L1>
L1001231-001-L2>1.2 l1
entering L1 mode 001c02, <CTRL-T> to escape to L2

001c02-L1>serial all

Data Location Value
—————————— ———— ——–
Local System Serial Number NVRAM M2001411
Reference System Serial Number Attached L2 L1001231
Local Brick Serial Number EEPROM MLJ194
Reference Brick Serial Number NVRAM MLJ194

EEPROM Product Name Serial Part Number Rev T/W
———- ————– ———- ——————– — ——
NODE IP45_4CPU MLJ194 030_1779_001 C 00
IO8 IO8 MLG164 030_1673_003 F 00

EEPROM JEDEC Info Part Number Rev Speed (ns)
———- ———————— —————— — ———-
DIMM 0 7F7FFE000000000012000145 CM2201B 2 08.0
DIMM 2 7F7FFE000000000012000141 CM2201B 2 08.0
DIMM 1 CE0000000000000026061400 M3 46L2820BT1-CA0 0B 08.0
DIMM 3 CE000000000000000C698000 M3 46L2820DT2-CA0 2D 10.0

001c02-L1>ver
L1 1.12.6 (Image B), Built 04/22/2002 08:13:40 [1MB image]
001c02-L1>

L1001231-001-L2>1.3 l1
entering L1 mode 001c03, <CTRL-T> to escape to L2

001c03-L1>serial all

Data Location Value
—————————— ———— ——–
Local System Serial Number NVRAM M2100226
Reference System Serial Number Attached L2 L1001231
Local Brick Serial Number EEPROM MJN491
Reference Brick Serial Number NVRAM MJN491

EEPROM Product Name Serial Part Number Rev T/W
———- ————– ———- ——————– — ——
NODE IP45_4CPU MJN491 030_1728_002 D 00
IO8 IO8 MHR530 030_1673_003 F 00

EEPROM JEDEC Info Part Number Rev
———- ———————— —————— —
DIMM 0 10000000000000002E015A00 444BH 2
DIMM 2 100000000000000034015A00 444BH 2
DIMM 1 10000000000000006E005A00 444BH 2
DIMM 3 10000000000000000A015A00 444BH 2

001c03-L1>ver
L1 1.8.4 (Image B), Built 10/30/2001 11:47:34 [P1 support]
001c03-L1>
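
Eyeballing four of these dumps gets old quickly. Here is a minimal Python sketch (not an SGI tool) that takes pasted `serial all` output from the L2 and lists the bricks whose local system serial number differs from the one held by the L2. The function names are made up for illustration, and the parsing assumes the output looks exactly like the captures above.

```python
import re

def parse_serial_all(text):
    """Return {brick_id: {"local": ..., "reference": ...}} from L2 `serial all` output."""
    bricks = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        # A brick section starts with a header such as "001c01:" or "001r25:".
        header = re.fullmatch(r"(\d{3}[a-z]\d{2}):", line)
        if header:
            current = header.group(1)
            bricks[current] = {}
            continue
        if current is None:
            continue
        m = re.match(r"Local System Serial Number\s+NVRAM\s+(\S+)", line)
        if m:
            bricks[current]["local"] = m.group(1)
        m = re.match(r"Reference System Serial Number\s+Attached L2\s+(\S+)", line)
        if m:
            bricks[current]["reference"] = m.group(1)
    return bricks

def mismatched(bricks):
    """Brick IDs whose local system serial differs from the L2 reference."""
    return sorted(b for b, s in bricks.items() if s.get("local") != s.get("reference"))

# Trimmed copy of the capture above (only the lines the parser looks at).
sample = """\
001c01:
Local System Serial Number NVRAM M2001411
Reference System Serial Number Attached L2 L1001231
001c03:
Local System Serial Number NVRAM M2100226
Reference System Serial Number Attached L2 L1001231
001r25:
Local System Serial Number NVRAM L1001241
Reference System Serial Number Attached L2 L1001231
"""

bricks = parse_serial_all(sample)
print(mismatched(bricks))  # ['001c01', '001c03', '001r25']
```

Every brick shows up in the list, which matches the four “Serial Number mismatch” errors in the L2 log.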

Resolution

You can’t set a serial number with the prefix ‘L’ on an O300
L2 command processor engaged, <CTRL-D> for console mode.
L1001231-001-L2>1.1 l1
entering L1 mode 001c01, <CTRL-T> to escape to L2
serial clear
001c01-L1>serial
BSN: KJD687 SSN: L0000000 Time: 06/04/2009 08:14:15 CDT
001c01-L1>
001c01-L1>
001c01-L1>serial clear
001c01-L1>
001c01-L1>
001c01-L1>001c01 INFO: System serial number reassigned to Mxxxxxx from attached L2.

So I ask around..

You’ll want to get all of your L1s (and PROMs if they’re not) at the same version after you get everything talking, it’ll save you some headaches in the long run. I’m at 1.22.4 on everything except the L2 (which is 1.32.4) and it’s been working flawlessly (and no, it didn’t enable security on my O300s and I’m still able to do a “serial clear” from the L1 successfully). To be safe though, get them connected together first.
Oh, btw…the easiest way to get the serials on your O300s in sync is to set the L2 to one they can set (prefix ‘M’) then do a ‘serial clear’ on each O300…if everything is wired properly it should pick up the serial from the L2 Controller automagically.
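
To see how far apart the L1 firmware levels really are, the `ver` lines captured earlier can be compared programmatically. This is only a sketch: the helper name `l1_version` is hypothetical, and it assumes every `ver` line starts with `L1 ` followed by a dotted version number, as in the captures above.

```python
import re

def l1_version(ver_line):
    """Extract the dotted L1 version from a `ver` line as a tuple of ints."""
    m = re.search(r"L1 (\d+(?:\.\d+)+)", ver_line)
    if m is None:
        raise ValueError(f"no L1 version found in: {ver_line!r}")
    return tuple(int(part) for part in m.group(1).split("."))

# The four `ver` lines from the investigation above.
ver_lines = {
    "001r25": "L1 1.40.4 (Image B), Built 09/29/2005 13:42:59 [Base 1MB image]",
    "001c01": "L1 1.30.14 (Image B), Built 08/05/2004 11:09:57 [Base 1MB image]",
    "001c02": "L1 1.12.6 (Image B), Built 04/22/2002 08:13:40 [1MB image]",
    "001c03": "L1 1.8.4 (Image B), Built 10/30/2001 11:47:34 [P1 support]",
}

versions = {brick: l1_version(line) for brick, line in ver_lines.items()}

# Tuples compare numerically, so 1.8.4 sorts before 1.40.4
# (as plain strings, "1.8.4" would sort after "1.40.4").
distinct = sorted(set(versions.values()))
all_same = len(distinct) == 1

print(distinct)  # [(1, 8, 4), (1, 12, 6), (1, 30, 14), (1, 40, 4)]
print(all_same)  # False
```

Four bricks, four different L1 versions. Which version to standardise on is a separate question that this sketch does not try to answer.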

I respond

I used Mxxxxxx for L2 and cleared the Origin300s…
The router brick is a no go…
1.25 serial clear
001r25:
INFO: command not supported on bricks that enforce security.
Mxxxxxxx-001-L2>

Solution to bypass the security:

Unplug the USB cable on the router brick.
Power up the router brick at the back.
Push the button on the front.
Reconnect the USB cable at the back.
Connect to the L1 on the router brick.
flash default a
reboot_l1
Escape back to the L2.
1.1 reset
1.2 reset
1.3 reset
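
The L2 targets used in this post line up with the brick IDs on the L1 prompts: `1.25` reached `001r25`, and `1.1` to `1.3` reached `001c01` to `001c03`. Assuming that rack.slot pattern holds in general (an inference from this one system, not something taken from SGI documentation), a small Python sketch can turn brick IDs into the reset commands above. The function names are hypothetical.

```python
import re

def parse_brick(brick_id):
    """Split a brick ID such as '001c01' into (rack, kind, slot)."""
    m = re.fullmatch(r"(\d{3})([a-z])(\d{2})", brick_id)
    if m is None:
        raise ValueError(f"unexpected brick ID: {brick_id!r}")
    rack, kind, slot = m.groups()
    return int(rack), kind, int(slot)

def l2_target(brick_id):
    """L2 target in the rack.slot form seen above, e.g. '001r25' -> '1.25'."""
    rack, _kind, slot = parse_brick(brick_id)
    return f"{rack}.{slot}"

def reset_commands(brick_ids):
    """Reset commands for the compute bricks only (kind 'c'), router left alone."""
    return [f"{l2_target(b)} reset" for b in brick_ids if parse_brick(b)[1] == "c"]

bricks = ["001c01", "001c02", "001c03", "001r25"]
print(reset_commands(bricks))  # ['1.1 reset', '1.2 reset', '1.3 reset']
print(l2_target("001r25"))     # 1.25
```

The router brick is deliberately skipped, since only the compute modules need the reset in the procedure above.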

….and…
***Warning: Board in module 001c01 is missing or disabled
It previously contained a New Type board, barcode laser 0
***Warning: Found a new IP35 board in module 001c03, serial MJN491
Please use the ‘update’ command from the PROM Monitor to update the inventory
***Warning: Found a new IBRICK board in module 001c03, serial MHR530
Please use the ‘update’ command from the PROM Monitor to update the inventory
***Warning: Found a new IBRICK board in module 001c03, serial MHR530
Please use the ‘update’ command from the PROM Monitor to update the inventory
DONE

**** System Configuration and Diagnostics Summary ****
CONFIG:
No. of NODEs enabled = 3
No. of NODEs disabled = 0
No. of CPUs enabled = 12
No. of CPUs disabled = 0
Mem enabled = 12288 MB
Mem disabled = 0 MB
No. of RTRs enabled = 1
No. of RTRs disabled = 0

DIAG RESULTS:
ALL DIAGS PASSED.
**** End System Configuration and Diagnostics Summary ****

System Maintenance Menu

1) Start System
2) Install System Software
3) Run Diagnostics
4) Recover System
5) Enter Command Monitor

Option?

Some Explanation

With mismatched serial numbers, the compute modules and the router brick will not fire up together.
Disconnected from the L2 and USB, the router brick will start, because on its own it is not mismatched with anything.
Having started the router brick, it is then OK (with the older firmware) to reconnect it.
Because the L1s on the Origin 300 compute modules were started without the router present, they don’t know of each other’s existence.
Resetting all the compute modules causes them to “discover” the R-brick during initialisation, and thus each other, and become a single system image.
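
As a sanity check, the PROM's configuration summary can be compared against what three 4-CPU, 4 GB modules should add up to. A minimal Python sketch, assuming the summary lines keep the `name = number` shape shown above and taking 1 GB as 1024 MB; `parse_summary` is a made-up helper name.

```python
import re

def parse_summary(text):
    """Collect the 'name = number' lines of the PROM summary into a dict."""
    fields = {}
    for line in text.splitlines():
        m = re.match(r"\s*(.+?)\s*=\s*(\d+)", line)
        if m:
            fields[m.group(1)] = int(m.group(2))
    return fields

# CONFIG section copied from the PROM output above.
summary = """\
No. of NODEs enabled = 3
No. of NODEs disabled = 0
No. of CPUs enabled = 12
No. of CPUs disabled = 0
Mem enabled = 12288 MB
Mem disabled = 0 MB
No. of RTRs enabled = 1
No. of RTRs disabled = 0
"""

fields = parse_summary(summary)

modules = 3
expected_cpus = modules * 4           # 4 CPUs per Origin 300
expected_mem_mb = modules * 4 * 1024  # 4 GB per Origin 300

ok = (
    fields["No. of NODEs enabled"] == modules
    and fields["No. of CPUs enabled"] == expected_cpus
    and fields["Mem enabled"] == expected_mem_mb
    and fields["No. of RTRs enabled"] == 1
)
print(ok)  # True
```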

La vie en stereo.

Filed under: Stereographics — admin @ 1:22 pm

I have a pair of “CrystalEyes” 3D shutter glasses and a transmitter box hooked up to my dual-CPU SGI Octane with 8 GB of RAM. The shutter glasses produce the same kind of popping-out-of-the-screen images that are making a comeback at the cinema. I had someone over the other day. “Do you have any vision problems?” I asked. “No, just colour blindness,” he said. “OK, put these on and check this out…” “I can see it coming out of the screen,” he says. Then he takes the glasses off. Bang: instant dizziness, nausea, headaches. I am used to them. The mistake was taking the glasses off too quickly on a day that was too bright and sunny. It wasn’t a ruse to send people home. No, really.

Now for some Stereographics programming. Follow it under the Stereographics category.
