I7 Cluster Administration
A script in the cluster image automatically sets up and configures a cluster node. This script is located at
Running the Setup Script
- Create or acquire a 64-bit Gentoo LiveUSB drive (the minimal image is fine). See the Gentoo guide
- Boot the drive. The nocache option is useful, so the device does not have to stay in the machine.
- Run the hostname command, setting the hostname to the final desired name
- Make sure networking is running and both disks are accessible
- Use rsync to copy the /usr/csl directory from the cluster image
- Run the script as ./csl/bin/image.initialinstall.i7 (the path is important)
- When the script is finished, reboot the machine, removing the USB drive
Dedicated cluster nodes run a different image than workstations. This image is maintained through a different set of tools than the main workstation image.
/usr/csl/etc/image.changelog is the changelog for the image. This file also instructs the release system whether clients should be auto-updated to a particular release. See the updating clients section for more details.
    200909110100 - <CURRENT> - RELEASE,NOREBOOT
    username:
    * world update: foo bar baz
    * install: foobar
The header of this sample file indicates that this image version should automatically be released to clients and that the release does not require a reboot. NORELEASE indicates that the image should not be released, and REBOOT indicates that the image requires a reboot. The dates in the first two fields of the header are automatically updated by the image scripts.
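As a rough illustration of the header layout described above, the following Python sketch splits a header line into its two date fields and its flags. It is not the image scripts' actual parser: the names `ChangelogHeader` and `parse_header` are hypothetical, and it assumes the three fields are separated by " - " and the flags by commas, as in the sample.

```python
from dataclasses import dataclass


@dataclass
class ChangelogHeader:
    first_date: str   # first header field, e.g. "200909110100"
    second_date: str  # second header field, a timestamp or the literal "<CURRENT>"
    release: bool     # True for RELEASE, False for NORELEASE
    reboot: bool      # True for REBOOT, False for NOREBOOT


def parse_header(line: str) -> ChangelogHeader:
    """Parse one changelog header line (assumed format: DATE - DATE - FLAGS)."""
    first, second, flags = [part.strip() for part in line.split(" - ", 2)]
    # Compare whole flag names so that "NORELEASE" is never mistaken for "RELEASE".
    names = {flag.strip() for flag in flags.split(",")}
    return ChangelogHeader(
        first_date=first,
        second_date=second,
        release="RELEASE" in names,
        reboot="REBOOT" in names,
    )


header = parse_header("200909110100 - <CURRENT> - RELEASE,NOREBOOT")
print(header.release, header.reboot)
```

For the sample header, this reports that the image should be released and that no reboot is required.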
Updating the Image on the Server
In a nutshell:
- Make changes to the image.
- Edit the changelog file.
/usr/csl/bin/image.updateserver will process the changelog flags and upload the new image to the server. The script will replace <CURRENT> with the current timestamp, the new image version. This timestamp is put in /usr/csl/etc/image.version. If the image is to be released, /usr/csl/etc/image.release is created. If the image is to be released and the image requires a reboot, /usr/csl/etc/image.release.reboot is created. These files are used by the update scripts to determine the proper update behavior.
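The flag handling described above can be sketched in Python as follows. This is only a model of the described behavior, not the contents of image.updateserver: the function names are hypothetical, and the timestamp format (YYYYMMDDHHMM) is an assumption inferred from the sample changelog entry.

```python
from datetime import datetime
from typing import List, Tuple

CURRENT_MARKER = "<CURRENT>"


def stamp_header(header: str, now: datetime) -> Tuple[str, str]:
    """Replace <CURRENT> with a timestamp; return (new header, image version)."""
    # Assumed format, inferred from the sample value 200909110100.
    version = now.strftime("%Y%m%d%H%M")
    return header.replace(CURRENT_MARKER, version, 1), version


def marker_files(flags: str) -> List[str]:
    """List the marker files that would be created for a set of changelog flags."""
    names = {flag.strip() for flag in flags.split(",")}
    files = []
    if "RELEASE" in names:
        files.append("/usr/csl/etc/image.release")
        # The reboot marker is only created for images that are also released.
        if "REBOOT" in names:
            files.append("/usr/csl/etc/image.release.reboot")
    return files


new_header, version = stamp_header(
    "200909110100 - <CURRENT> - RELEASE,NOREBOOT", datetime(2009, 9, 12, 14, 30)
)
print(new_header)
print(marker_files("RELEASE,NOREBOOT"))
```

In every case the version itself would be written to /usr/csl/etc/image.version; the sketch only covers the files whose presence depends on the flags.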
Clients check for a new image release every fifteen minutes, starting on the hour. If a new image is available and the new image has been released, then the client will automatically update itself. This is performed by
/usr/csl/bin/image.updateclient is responsible for actually updating the image. The script will not allow multiple instances of the script to run at once, and will not run without forcing if the image version has not changed. If the image has changed, the script will rsync changes from the server and then run the post setup scripts (
IMPORTANT: There are two potential issues with this imaging system:
- Say a client is taken offline due to hardware issues. During this time period, the image is updated and released. The image is updated again, but not released. The client is then brought back online. The client will NOT be updated until another image is released, even though there has been a newer released image since the client was last updated. Because there is no version control system, the older released image is considered safer than the newer non-released image.
- Because clients check for a new release every fifteen minutes, starting on the hour, there is a possible race condition if an image is being uploaded to the server while a client is updating. The client will think it has the latest version of the image, but it may be missing files. Forcing another sync or releasing another image will fix this, provided the client is still able to run, but the problem will not be fixed automatically. Do not run the updateserver script just before clients are scheduled to check for a new release!
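Both issues can be made concrete with a small Python model. This is a sketch of the behavior described in this section, not the logic of image.updateclient itself: the function names are hypothetical, and it assumes that a forced run skips both the version check and the release check.

```python
from datetime import datetime


def should_update(client_version: str, server_version: str,
                  server_released: bool, force: bool = False) -> bool:
    """Decide whether a client would update, following the rules described above."""
    if force:
        return True  # assumption: a forced run bypasses every check
    if server_version == client_version:
        return False  # the image version has not changed
    return server_released  # only released images are installed automatically


def seconds_until_next_check(now: datetime) -> int:
    """Seconds until the next quarter-hour client check (:00, :15, :30, :45)."""
    elapsed = (now.minute % 15) * 60 + now.second
    return 15 * 60 - elapsed


# Issue 1: the client missed the released image, and the newest image is not released.
print(should_update("200909010000", "200909120900", server_released=False))

# Issue 2: check how close the next client check is before running updateserver.
print(seconds_until_next_check(datetime(2009, 9, 11, 1, 14, 30)))
```

The first call returns False, which is the stale-client case from the first bullet. The second call shows that only 30 seconds remain before the next check, so an upload started at that moment would risk the race described in the second bullet.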
/usr/csl/etc/image.exclude.download lists the files and directories that are not downloaded from the imageserver to the client. If the files and directories exist on the client, they will not be changed.
/usr/csl/etc/image.exclude.upload lists the files and directories that are not uploaded to the imageserver from the client. Note: if a file in this list exists on the server, then it will remain on the server, even if the same file is deleted on the system updating the server. This is desirable for some files.
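The behavior of the two exclude lists matches how rsync treats exclude files, so the transfers might look roughly like the commands built below. This is a guess at the shape of the commands, not what the image scripts actually run: the options `-a` and `--delete`, the function names, and the server location `imageserver:/image/` are all assumptions made for illustration.

```python
from typing import List

DOWNLOAD_EXCLUDES = "/usr/csl/etc/image.exclude.download"
UPLOAD_EXCLUDES = "/usr/csl/etc/image.exclude.upload"


def download_command(server_path: str, dest: str = "/") -> List[str]:
    """Build a hypothetical rsync command that pulls the image onto a client."""
    return ["rsync", "-a", "--delete",
            "--exclude-from=" + DOWNLOAD_EXCLUDES, server_path, dest]


def upload_command(server_path: str, src: str = "/") -> List[str]:
    """Build a hypothetical rsync command that pushes the image to the server."""
    # Without --delete-excluded, rsync leaves excluded files on the receiving
    # side untouched, which matches the note about files remaining on the server.
    return ["rsync", "-a", "--delete",
            "--exclude-from=" + UPLOAD_EXCLUDES, src, server_path]


# "imageserver:/image/" is a placeholder, not the real server location.
print(" ".join(download_command("imageserver:/image/")))
print(" ".join(upload_command("imageserver:/image/")))
```

The sketch only builds and prints the commands; it does not execute them.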
CUDA requires X to be running for peak performance. Ensure that X is running, even though the cluster nodes are headless.
/etc/modprobe.d/nvidia changes the permissions of the /dev/nvidia* device nodes to 666, to allow all users to use CUDA. Previously this was accomplished by a udev rule, but this is no longer necessary.
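To confirm on a node that the device nodes really are readable and writable by all users, a short Python check along these lines could be used. It is a minimal sketch; the function names are hypothetical and not part of the cluster image.

```python
import glob
import os
import stat
from typing import Dict


def world_read_write(path: str) -> bool:
    """Return True if owner, group, and others can all read and write the path."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode & 0o666 == 0o666


def check_nodes(pattern: str = "/dev/nvidia*") -> Dict[str, bool]:
    """Map each matching device node to whether its mode includes 666."""
    return {path: world_read_write(path) for path in sorted(glob.glob(pattern))}


for node, ok in check_nodes().items():
    print(node, "ok" if ok else "wrong permissions")
```

On a machine without the NVIDIA driver loaded, the pattern matches nothing and the loop prints nothing.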
/usr/local/share/cmake/ contains the FindCUDA.cmake script from https://gforge.sci.utah.edu/gf/project/findcuda/ for using CUDA with CMake.