Warning: Livedoc is no longer being updated and will be deprecated shortly. Please refer to https://documentation.tjhsst.edu.

i7 Cluster Administration

Initial Setup

The cluster image includes a script that automatically sets up and configures a cluster node. The script is located at /usr/csl/bin/image.initialinstall.i7.

Running the Setup Script

Using Netroot:

  1. Enter BIOS and change the boot order so that the second NIC in the list is the primary boot device.
  2. Save and exit BIOS, then boot and enter "netroot" at the prompt.
  3. Make sure both disks are accessible by running lsblk.
  4. Unmount all volumes and clear the partition table on each disk. Running dd if=/dev/zero of=/dev/sdX bs=512M count=1 (where X is the letter of the disk being cleared, not a partition number) should remove all partitions.
  5. Reboot back into netroot.
  6. Clone the git repository at i7-imaging.git to /tmp/i7-imaging.
  7. Edit the autoinstall script with the appropriate variables; make sure the $myname, $image, and $imageserver variables are correct. The values should be the hostname of the client being imaged, "asm", and "haimageserver" respectively.
  8. Run the script ./autoinstall.
  9. When the script is finished, reboot the machine and check for any issues.
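
Step 4 is destructive, so it can help to build the command from a validated disk letter before running it. The following is a minimal sketch of that idea, not part of the image scripts: wipe_command is a hypothetical helper that only prints the dd command from step 4, and the disk letters a and b are assumptions that should be checked against the lsblk output.

```shell
#!/bin/sh
# Hypothetical helper, not part of the cluster image: prints the dd command
# from step 4 for one disk, without executing it.
wipe_command() {
    case "$1" in
        [a-z]) printf 'dd if=/dev/zero of=/dev/sd%s bs=512M count=1\n' "$1" ;;
        *)     echo "expected a single disk letter such as a or b, got: $1" >&2
               return 1 ;;
    esac
}

# Print the commands for two disks (the letters are assumptions; check lsblk).
for disk in a b; do
    wipe_command "$disk"
done
```

Review the printed commands and run them by hand once the target disks are confirmed.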

Using LiveUSB:

  1. Create or acquire a 64-bit Gentoo LiveUSB drive (the minimal image is fine). See the Gentoo guide.
  2. Enter BIOS and change the boot order so that the LiveUSB is the primary boot device.
  3. Boot to the drive. The nocache option is useful, so the device does not have to stay in the machine.
  4. Run the hostname command, setting the hostname to the final desired name.
  5. Make sure networking is running and both disks are accessible.
  6. Follow steps 6-9 described under "Using Netroot".

Imaging

Dedicated cluster nodes run a different image than workstations. This image is maintained through a different set of tools than the main workstation image.

Changelog

/usr/csl/etc/image.changelog is the changelog for the image. This file also instructs the release system whether clients should be auto-updated to a particular release. See the updating clients section for more details.

Sample file:

200909110100 - <CURRENT> - RELEASE,NOREBOOT
   username:
      * world update: foo bar baz
      * install: foobar

The header of this sample file indicates that this image version should automatically be released to clients and that the release does not require a reboot. NORELEASE indicates that the image should not be released, and REBOOT indicates that the image requires a reboot. The dates in the first two fields of the header are automatically updated by the image scripts.
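
To make the header format concrete, here is a rough sketch of how the flags in the third field could be read. It is an illustration only: parse_header is a hypothetical function, and the actual release scripts may parse the file differently.

```shell
#!/bin/sh
# Hypothetical parser for a changelog header such as:
#   200909110100 - <CURRENT> - RELEASE,NOREBOOT
# Prints "release=... reboot=..." with yes, no, or unknown for each flag.
parse_header() {
    flags=${1##* - }          # keep the text after the last " - "
    release=unknown
    reboot=unknown
    old_ifs=$IFS
    IFS=,                     # split the flag list on commas
    for flag in $flags; do
        case "$flag" in
            RELEASE)   release=yes ;;
            NORELEASE) release=no ;;
            REBOOT)    reboot=yes ;;
            NOREBOOT)  reboot=no ;;
        esac
    done
    IFS=$old_ifs
    echo "release=$release reboot=$reboot"
}

parse_header '200909110100 - <CURRENT> - RELEASE,NOREBOOT'
```

For the sample header above, this prints release=yes reboot=no.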

Updating the Image on the Server

In a nutshell:

  1. Make changes to the image.
  2. Edit the changelog file.
  3. Run ./scripts/updateimageserver asm

./scripts/updateimageserver asm uploads the new image to the server, replacing the current asm image at /var/imageserver/images/asm.

Updating Clients

Clients check for a new image release every fifteen minutes, starting on the hour. If a new image is available and the new image has been released, then the client will automatically update itself. This is performed by /usr/csl/bin/image.updaterelease.

/usr/csl/bin/image.updateclient is responsible for actually updating the image. The script will not allow multiple instances of the script to run at once, and will not run without forcing if the image version has not changed. If the image has changed, the script will rsync changes from the server and then run the post setup scripts (/usr/csl/bin/image.post_setup).
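
The two guards described above (a single running instance, and no update unless the version changed or the run is forced) could look roughly like the sketch below. This is not the contents of image.updateclient; the function names, the lock location, and the version strings are all assumptions made for illustration.

```shell
#!/bin/sh
# Sketch of the two guards described above. All names here are hypothetical;
# this is not the contents of image.updateclient.

# Succeeds when an update should run: the versions differ, or force is "yes".
should_update() {
    local_version=$1
    server_version=$2
    force=${3:-no}
    [ "$force" = "yes" ] || [ "$local_version" != "$server_version" ]
}

# mkdir fails if the directory already exists, so only one process can hold the lock.
acquire_lock() { mkdir "$1" 2>/dev/null; }
release_lock() { rmdir "$1"; }

# The PID suffix keeps this demo from colliding with anything else;
# a real lock would use one fixed path shared by every instance.
lock_dir="${TMPDIR:-/tmp}/image.updateclient.lock.$$"
if acquire_lock "$lock_dir"; then
    if should_update 200909110100 200909120100; then
        echo "would rsync from the server and run post setup"
    fi
    release_lock "$lock_dir"
else
    echo "another update is already running" >&2
fi
```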

IMPORTANT: There are two potential issues with this imaging system:

  1. Say a client is taken offline due to hardware issues. During this time period, the image is updated and released. The image is updated again, but not released. The client is then brought back online. The client will NOT be updated until another image is released, even though there has been a newer released image since the client was last updated. Because there is no version control system, the older released image is considered safer than the newer non-released image.
  2. Because clients check for a new release every fifteen minutes, starting on the hour, there is a possible race condition if an image is being uploaded to the server while a client is updating. The client will think it has the latest version of the image, but it may be missing files. Forcing another sync or releasing another image will fix this issue, provided the client is still able to run, but the problem will not be fixed automatically. Do not run the updateimageserver script just before clients are scheduled to check for a new release!
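
One way to reduce the risk of the race in the second issue is to check how close the next client check is before uploading. The sketch below assumes the quarter-hour schedule described above; both functions are hypothetical, and the five-minute margin is an arbitrary choice rather than a documented value.

```shell
#!/bin/sh
# Hypothetical helper: given the current minute of the hour (0-59), print how
# many minutes remain until the next quarter-hour client check.
minutes_until_next_check() {
    minute=${1#0}             # strip one leading zero so 08 is not read as octal
    minute=${minute:-0}
    echo $(( (15 - minute % 15) % 15 ))
}

# Succeeds when at least $2 minutes (default 5, an arbitrary margin) remain.
safe_to_upload() {
    remaining=$(minutes_until_next_check "$1")
    [ "$remaining" -ge "${2:-5}" ]
}

now=$(date +%M)
if safe_to_upload "$now"; then
    echo "ok to run ./scripts/updateimageserver asm"
else
    echo "wait: clients check again in $(minutes_until_next_check "$now") minute(s)"
fi
```

This only guards against uploading shortly before a check. It does not detect clients that are still syncing from the previous check, and an upload that takes longer than the remaining time can still overlap the next one.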

Exclude Files

/usr/csl/etc/image.exclude.download lists the files and directories that are not downloaded from the imageserver to the client. If the files and directories exist on the client, they will not be changed.

/usr/csl/etc/image.exclude.upload lists the files and directories that are not uploaded to the imageserver from the client. Note: if a file in this list exists on the server, then it will remain on the server, even if the same file is deleted on the system updating the server. This is desirable for some files.
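
As an illustration of how the two exclude lists might be applied, the sketch below prints rsync commands that use --exclude-from. The update scripts themselves are not shown on this page, so the rsync options and the host:path remote syntax are assumptions. The behavior described above does match how rsync treats excludes: with --delete, excluded files on the receiving side are left alone unless --delete-excluded is also given.

```shell
#!/bin/sh
# Sketch only: the real update scripts are not shown on this page, so the
# rsync options and the remote syntax below are assumptions.
image_server=haimageserver
image_path=/var/imageserver/images/asm

# Print (do not run) a download command that honors the download exclude list.
download_command() {
    printf 'rsync -a --delete --exclude-from=%s %s:%s/ /\n' \
        /usr/csl/etc/image.exclude.download "$image_server" "$image_path"
}

# Print (do not run) an upload command that honors the upload exclude list.
upload_command() {
    printf 'rsync -a --delete --exclude-from=%s / %s:%s/\n' \
        /usr/csl/etc/image.exclude.upload "$image_server" "$image_path"
}

download_command
upload_command
```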

CUDA

CUDA requires X to be running for peak performance. Ensure that X is running, even though the cluster nodes are headless.

CUDA Permissions

/etc/modprobe.d/nvidia changes the permissions of /dev/nvidia* device nodes to 666, to allow all users to use CUDA. Previously this was accomplished by a udev rule, but this is no longer necessary.
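
A quick way to confirm the result on a node is to check that every /dev/nvidia* device node is readable and writable by all users. The sketch below is a hypothetical check, not part of the image; world_rw is an illustrative name, and the loop simply does nothing on a machine without NVIDIA device nodes.

```shell
#!/bin/sh
# Hypothetical check: succeeds when the given path is readable and writable by
# user, group, and other (mode 666 or more permissive).
world_rw() {
    [ -n "$(find "$1" -prune -perm -666 2>/dev/null)" ]
}

for node in /dev/nvidia*; do
    [ -e "$node" ] || continue      # no NVIDIA device nodes on this machine
    if world_rw "$node"; then
        echo "$node: ok"
    else
        echo "$node: not world read/write"
    fi
done
```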

FindCUDA.cmake

/usr/local/share/cmake/ contains the FindCUDA.cmake script from https://gforge.sci.utah.edu/gf/project/findcuda/ for using CUDA with CMake.

See Also