Warning: Livedoc is no longer being updated and will be deprecated shortly. Please refer to https://documentation.tjhsst.edu.

SAN/Installation

Summary

This article will cover the process of installing and initially configuring the hardware and software required to run the CSL SAN.

Prerequisites

This setup assumes that you have a multipath-capable SAS storage array, at least two servers, and the requisite HBAs and cables to connect the array to the servers. You will also need a third server to act as a dummy node in the cluster to break ties. See the main SAN article for details on the hardware currently used in the CSL.

Preparation

Install and update the servers that you will be using in the SAN. You will need a total of three systems, at least two of which should be as close to identical as reasonable. For detailed instructions on installing systems with the CSL Gentoo Server image, see the Gentoo_Server_Install guide.

Hardware Configuration

Install an SAS HBA into each of the two servers that will be connected to the storage array, then connect each server to the array using appropriate SAS cables. Check the array's documentation to ensure that both servers are connected to all of the drives in the array. For the SC847, this means that each server should have one connection to the bottom pair of ports and one connection to the top pair of ports. At this point you should also connect any network cables to the servers; a minimum of two NICs is recommended for both redundancy and performance.

Once everything is connected, boot both servers and make sure that all of the hardware shows up correctly. In particular, make sure that all of the drives in the array are listed in /dev/ on both servers. If not, double-check the cabling before continuing.
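
One quick way to compare what each server sees is to count the SAS entries under /dev/disk/by-path/ on each host; the numbers should match (exact path names will differ by controller):

snares ~ # ls /dev/disk/by-path/ | grep -c sas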

Software Installation

Install the following packages onto the two storage nodes:

  • iproute2
  • ZFS on Linux (ZoL)
  • pacemaker 1.1.8 or higher
  • corosync 2.3.0 or higher
  • nfs-utils

On Gentoo this can be done with the following steps. First, add the following lines to /etc/portage/package.keywords (replace version numbers where applicable):

<sys-kernel/spl-9999
<sys-fs/zfs-kmod-9999
<sys-fs/zfs-9999
=sys-cluster/corosync-2.3.0-r1 ~amd64
=sys-cluster/libqb-0.14.4 ~amd64
=sys-cluster/pacemaker-1.1.8-r2
=sys-cluster/crmsh-1.2.5-r3

Add the following line to /etc/portage/package.unmask (replace version numbers if needed):

=sys-cluster/corosync-2.3.0-r1

Add the following line to /etc/portage/package.use:

sys-cluster/pacemaker smtp snmp

Run the following command to emerge the necessary packages and dependencies onto the storage servers:

emerge -a sys-fs/zfs sys-apps/iproute2 net-fs/nfs-utils sys-cluster/pacemaker
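
After the build completes, it is worth confirming that the ZFS kernel modules load cleanly against the running kernel, for example:

snares ~ # modprobe zfs
snares ~ # lsmod | grep zfs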

Required Kernel Options

You will need the following options enabled in your storage server kernels for the LIO iSCSI Target:

Device Drivers  --->
   <M> Generic Target Core Mod (TCM) and ConfigFS Infrastructure  --->
      <M>   TCM/IBLOCK Subsystem Plugin for Linux/BLOCK
      <M>   TCM/FILEIO Subsystem Plugin for Linux/VFS
      <M>   Linux-iSCSI.org iSCSI Target Mode Stack

You will need the following options enabled in your storage server kernels for the NFS File Server:

File systems  --->
   [*] Network File Systems  --->
      <M>   NFS server support
      -*-     NFS server support for NFS version 3
      [*]     NFS server support for NFS version 4 (EXPERIMENTAL)
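
If your kernels are built with /proc/config.gz support (CONFIG_IKCONFIG_PROC), you can confirm these options on a running system with something like:

snares ~ # zgrep -E 'CONFIG_TARGET_CORE|CONFIG_ISCSI_TARGET|CONFIG_NFSD' /proc/config.gz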

Quorum Node

The quorum node only needs the basic cluster software; it does not need any of the resource-specific files. Repeat the above steps to install the required software on the quorum node; however, you may skip anything related to the following packages:

  • sys-kernel/spl
  • sys-fs/zfs-kmod
  • sys-fs/zfs
  • LIO kernel configuration
  • NFS kernel configuration

Software Configuration

With all of the necessary software installed, it is now time to configure and initialize the cluster. All of the below steps should be performed on all three cluster nodes (both storage servers and the quorum node) unless otherwise indicated.

ZFS on Linux

Storage Servers Only

For ease of disk management, as well as to simplify identifying and replacing failed disks, we will configure friendly names for our disks before creating the zpools. This is done through /etc/zfs/vdev_id.conf, which should be created with the following format:

# Disk alias             Disk by-path entry
alias apocalypse0       /dev/disk/by-path/pci-0000:43:00.0-sas-0x5000c50055df0f1e-lun-0
alias apocalypse1       /dev/disk/by-path/pci-0000:43:00.0-sas-0x5000c50055df9d46-lun-0
alias apocalypse2       /dev/disk/by-path/pci-0000:43:00.0-sas-0x5000c50042012e72-lun-0

After this, run the following command to create the drive aliases in /dev/disk/by-vdev/:

snares ~ # udevadm trigger
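
The new aliases should appear under /dev/disk/by-vdev/, which is an easy way to confirm the mapping before creating any pools:

snares ~ # ls -l /dev/disk/by-vdev/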

Next you need to create one or more zpools out of the drives in your array. Currently we use a single zpool consisting of a 10-disk RAID-Z2 vdev with a hot spare. Depending on your planned I/O and redundancy needs, you may benefit from a different drive configuration. When in doubt, the ZFS on Linux zfs-discuss mailing list is a good place to ask. Our zpool was created with the following command:

snares ~ # zpool create apocalypse raidz2 apocalypse0 \
apocalypse1 apocalypse2 apocalypse3 \
apocalypse4 apocalypse5 apocalypse6 \
apocalypse7 apocalypse8 apocalypse9 \
spare apocalypse10

IMPORTANT: Only create the zpool on one of the two servers. The same zpool will be migrated between servers using the zpool export and import commands.
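
For reference, a manual migration looks roughly like the following, assuming snares currently holds the pool and bottom is taking it over (once the cluster is configured, the resource agents perform these steps for you):

snares ~ # zpool export apocalypse
bottom ~ # zpool import apocalypse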

Once the zpool is created, there are a number of configuration options that can be changed. We recommend immediately creating a reserved 'safety' dataset as well as enabling transparent data compression. The safety dataset is necessary because ZFS is a copy-on-write filesystem and will lock up, unable to write or delete files, if it completely runs out of disk space; releasing the reservation gives you room to recover. These steps can be done with the following commands:

snares ~ # zfs create apocalypse/safety
snares ~ # zfs set reservation=1G apocalypse/safety
snares ~ # zfs set compression=on apocalypse
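
You can confirm that the settings took effect with zfs get:

snares ~ # zfs get reservation,compression apocalypse apocalypse/safety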

Networking

A stable, high-speed network is crucial to the performance and operation of the cluster. In addition, because iSCSI traffic is unencrypted and only lightly secured, it is highly recommended that it be run only on an isolated VLAN to prevent snooping. Because our storage servers each have two NICs, we use the following network configuration:

  • eth0 - tagged onto our server VLAN for management, tagged onto the SAN VLAN for storage operations
  • eth1 - untagged onto the SAN VLAN for storage operations.

This is done on Gentoo using a configuration similar to the following:

config_eth0="null"
vlans_eth0="16 1600"
vlan16_name="vlan16"
vlan1600_name="vlan1600"
config_vlan16="172.16.17.54/16"
config_vlan1600="198.38.17.54/23"
routes_vlan1600="default via 198.38.17.254"
dns_servers_vlan1600="198.38.16.40 198.38.16.41 151.188.14.2"
dns_search_vlan1600="csl.tjhsst.edu tjhsst.edu sun.tjhsst.edu"
config_eth1="172.16.170.54/16"

Each storage server is given two separate IPs on the SAN network which then allows the VM Servers to use multipath I/O to increase bandwidth between them and the storage servers.

Once all addresses are configured, make sure reverse DNS is working for all cluster IPs. All cluster nodes need to be able to resolve all cluster IPs in order for the cluster to function properly. You may need to add reverse zones to your site DNS servers for the SAN network if they do not already exist.
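
A simple way to spot-check this from each node is to resolve a few of the cluster addresses through the system resolver, for example using the SAN addresses configured above:

snares ~ # getent hosts 172.16.17.54
snares ~ # getent hosts 172.16.170.54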

Corosync

We need to configure a corosync 'ring' on each of our SAN interfaces for redundancy and communication. Corosync will use these rings to share cluster information among the various nodes.

Add or edit the following lines in /etc/corosync/corosync.conf. Note that the bindnetaddr values must match the addresses configured on each server, while the mcastaddr values must be the same on every server.

totem {
    crypto_cipher: aes256
    crypto_hash: sha512
    interface {
        ringnumber: 0
        bindnetaddr: 172.16.17.54
        mcastaddr: 239.255.1.1
        mcastport: 5405
    }
    interface {
        ringnumber: 1
        bindnetaddr: 172.16.170.54
        mcastaddr: 239.255.1.2
        mcastport: 5405
    }
}
quorum {
    provider: corosync_votequorum
    expected_votes: 2
}
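
Because the totem section enables crypto_cipher and crypto_hash, corosync also needs a shared authentication key in /etc/corosync/authkey on every node. If one has not been generated already, create it on one node and copy it to the others, for example:

snares ~ # corosync-keygen
snares ~ # scp -p /etc/corosync/authkey bottom:/etc/corosync/
snares ~ # scp -p /etc/corosync/authkey rockhopper:/etc/corosync/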

Now start corosync on each of the cluster nodes with the following command:

snares ~ # /etc/init.d/corosync start

Corosync should now be running on each of the cluster nodes. You can check the status of the cluster with the following commands:

snares ~ # corosync-quorumtool
Quorum information
------------------
Date:             Tue Jul 16 11:53:17 2013
Quorum provider:  corosync_votequorum
Nodes:            3
Node ID:          907088044
Ring ID:          1348
Quorate:          Yes

Votequorum information
----------------------
Expected votes:   3
Highest expected: 3
Total votes:      3
Quorum:           2  
Flags:            Quorate 

Membership information
----------------------
    Nodeid      Votes Name
 118558892          1 rockhopper
 907088044          1 snares (local)
 923865260          1 bottom

snares ~ # corosync-cfgtool -s
Printing ring status.
Local node ID 907088044
RING ID 0
        id      = 172.16.17.54
        status  = ring 0 active with no faults
RING ID 1
        id      = 172.16.170.54
        status  = ring 1 active with no faults

Pacemaker

Start pacemaker on each of the cluster nodes with the following command:

snares ~ # /etc/init.d/pacemaker start

Make sure that pacemaker has started successfully on all of the nodes with the following command:

snares ~ # crm_mon -1

Start the pacemaker configuration shell, change to configure mode, and display the current configuration:

snares ~ # crm
crm(live)# configure
crm(live)configure# show

Edit the configuration for the quorum-only node and add the standby attribute so that it does not attempt to start cluster resources:

crm(live)configure# edit rockhopper
node $id="118558892" rockhopper \
       attributes standby="on"
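
Depending on the crmsh version, the same result can usually be achieved without editing the node entry directly, for example:

snares ~ # crm node standby rockhopper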

STONITH

The first resource that we will configure on our new cluster is STONITH (Shoot The Other Node In The Head). STONITH is needed to ensure the safe failover of resources in the event of a failed node.

We use STONITH's iLO plugin to forcibly power off nodes via the Integrated Lights-Out (iLO) management built into our HP servers. To configure this, first create a stonith user on each storage server's iLO and assign it power privileges. Then use the following command in the configure shell to create a primitive for each iLO device. Note that we specify the iLO by IP address to ensure that STONITH will function even if DNS is offline.

crm(live)configure# primitive snares-ilo stonith:external/riloe \
params hostlist="snares" ilo_hostname="198.38.23.54" ilo_user="stonith" \
ilo_password="XXXXX" ilo_can_reset="1" ilo_protocol="2.0" \
ilo_powerdown_method="button"

We also set up a backup STONITH system called meatware. This is a manual STONITH method that requires a Systems Administrator to verify that the affected system is powered off and then acknowledge to the cluster that this has been done. It is ordinarily not needed, but it provides a fallback if one of the iLOs dies.

crm(live)configure# primitive meatware stonith:meatware \
params hostlist="snares bottom rockhopper"
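
A common additional safeguard, not shown in the configuration above, is a location constraint that keeps each iLO STONITH resource off the node it is meant to kill, along these lines:

crm(live)configure# location l-snares-ilo snares-ilo -inf: snares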

Resource Agents

Resource agents form the backbone of the cluster. Most of them manage cluster resources; a few, however, are essential to the proper operation of the cluster itself.

We set up a ping resource that monitors each storage node's connection to our central switching infrastructure. This is then combined with location rules (sketched after the clone definition below) that force a node that loses its connection to the network to remove itself from the cluster before it needs to be killed.

crm(live)configure# primitive ping ocf:pacemaker:ping \
params dampen="60s" name="ping" \
host_list="198.38.16.254 198.38.31.254 198.38.23.252 2001:468:cc0::2 2001:468:cc0::" \
attempts="4" timeout="2" \
op start interval="0" timeout="60s" \
op stop interval="0" timeout="20s" \
op monitor interval="10s" timeout="60s"
crm(live)configure# clone connected ping \
meta clone-max="2"
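
The location rules mentioned above depend on the resources you define later; a representative sketch, using a hypothetical group name (san-group) as a stand-in, would look something like:

crm(live)configure# location san-on-connected san-group \
rule -inf: not_defined ping or ping lte 0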