Revision as of 12:31, 7 July 2014
Bootstrapping the Lab
This guide outlines how to bootstrap the Computer Systems Lab infrastructure after a major outage, such as a complete loss of power or air conditioning. It covers the order in which to start the infrastructure needed for basic lab operations and the rough process for each step.
In order to bootstrap the lab, you will need the following items:
- This guide
- The guide to the SAN
- Either physical access to the lab or an account on Moon
- A fairly complete root passcard
- A copy of the tjhsst.edu zone (not required but very useful)
- Lots of patience
The first step in bootstrapping the lab is getting the network online. In most cases, this should happen more or less automatically when power is applied. If you are powering on devices manually, power on the core switch first and let it finish booting, then power on any sub-switches.
Remote Access (emergency)
Once networking is online, the next step is to get ssh access to the lab. Moon should automatically power on as soon as it receives power. It will take a bit to boot and then provide emergency access to the lab as well as a console line to the core switch if needed.
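Rather than retrying ssh by hand while Moon boots, you can poll its SSH port. The following is a minimal sketch, not part of the lab's tooling: the function name `wait_for_port` and the timeout values are assumptions chosen for illustration, and it only confirms that something accepts TCP connections on the port, not that logins work.

```python
import socket
import time


def wait_for_port(host, port, timeout=300.0, interval=5.0):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening (for example sshd).
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Refused, unreachable, or timed out: retry until the deadline passes.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For example, `wait_for_port("moon", 22)` would wait up to five minutes for SSH. Since DNS is not running yet at this stage, you would likely need to pass Moon's IP address instead of its name.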
The third step is DNS; very little works well without DNS and what does work is generally very slow. The easiest way to get DNS is to start the ns1 VM that runs from local storage on Galapagos. To get into Galapagos you will need to either ssh in as root or access the iLO and log into the virtual serial port as root (user accounts will not work; there is no LDAP and no Kerberos at this point). Once you have logged into Galapagos, start ns1 and wait for it to boot. You should then be able to ssh to systems by name instead of having to manually look up IPs yourself.
Kerberos / LDAP
Kerberos and LDAP are next since they will vastly simplify starting the remaining services. For Kerberos, you will need to start at least one of the KDC systems. In general, they power on cleanly after outages, but they may require a reboot before they start working properly. For LDAP, you will need one of the openldap VMs online; the easiest one to start is openldap1, which is also stored locally on Galapagos. Once you have a KDC and an LDAP server online, you should be able to log into VM servers as yourself and then ksu to gain root.
After Kerberos and LDAP are online, the SANs are your next priority. Boot at least one of the storage servers for each SAN and initialize them according to the documentation.
VM Servers / VMs
Now you can bring the rest of the VM servers online and connect them to the SAN. You can also now connect Galapagos to the SAN. As VM servers boot and connect to the SAN, you can begin booting VMs. A good order to start critical services in (taking into account dependencies) is:
- AFS (these should start first since they likely need to salvage)
- Mail / Lists
- DB Services (MySQL and LDAP)
- WWW (once AFS finishes salvaging)
- Remote Access Servers (once AFS finishes salvaging)
- Everything else
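The ordering above can be written down as a dependency graph and grouped into stages. This is a sketch only: the service names are shorthand invented here, and the graph encodes just the two dependencies the list states explicitly (WWW and the remote access servers wait for AFS). It uses `graphlib` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Each service maps to the set of services it must wait for.
# Only the AFS dependencies named in the list above are encoded; add more as needed.
DEPENDS_ON = {
    "afs": set(),
    "mail": set(),
    "db": set(),
    "www": {"afs"},
    "remote-access": {"afs"},
}


def boot_stages(depends_on):
    """Group services into stages; everything within a stage can start together."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        stages.append(ready)
        sorter.done(*ready)  # mark the stage finished so dependents become ready
    return stages
```

With the graph as written, `boot_stages(DEPENDS_ON)` yields two stages: AFS, DB, and mail first, then remote access and WWW. In practice the second stage should not begin until the AFS salvage has actually finished, which this sketch does not check.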
Congratulations! You now have a functional infrastructure. Be sure to check Nagios to see if there are systems that are still down or did not take well to the power outage. Also do a sanity check of critical services to ensure they are operating properly.