Sun Cluster Cheat Sheet

This cheat sheet contains common commands and information for both Sun Cluster 3.1 and 3.2. Some information is still missing (e.g. zones, NAS devices), and I hope to complete it over time.

Both versions of Sun Cluster also have a text-based menu utility (scsetup in 3.1, clsetup in 3.2), so don't be afraid to use it, especially if the task is a simple one.

All of the commands in version 3.1 are also available in version 3.2.

Daemons and Processes

At the bottom of the installation guide I listed the daemons and processes running after a fresh install; now is the time to explain what these processes do. I have managed to obtain information on most of them but am still looking for others. A quick way to check that they are running is shown after the tables below.

Versions 3.1 and 3.2
clexecd This is used by cluster kernel threads to execute userland commands (such as the run_reserve and dofsck
commands). It is also used to run cluster commands remotely (like the cluster shutdown command).
This daemon registers with failfastd so that a failfast device driver will panic the kernel if this daemon is killed and not restarted in 30 seconds.
cl_ccrad This daemon provides access from userland management applications to the CCR. It is automatically restarted if it is stopped.
cl_eventd The cluster event daemon registers and forwards cluster events (such as nodes entering and leaving the cluster). There is also a protocol whereby user applications can register themselves to receive cluster events.
The daemon is automatically respawned if it is killed.
cl_eventlogd The cluster event log daemon logs cluster events into a binary log file. At the time of writing, there is no published interface to this log. It is automatically restarted if it is stopped.
failfastd This daemon is the failfast proxy server. The failfast daemon allows the kernel to be panicked if certain essential daemons have failed.
rgmd The resource group management daemon which manages the state of all cluster-unaware applications. A failfast driver panics the kernel if this daemon is killed and not restarted in 30 seconds.
rpc.fed This is the fork-and-exec daemon, which handles requests from rgmd to spawn methods for specific data services. A failfast driver panics the kernel if this daemon is killed and not restarted in 30 seconds.
rpc.pmfd This is the process monitoring facility. It is used as a general mechanism to initiate restarts and failure action scripts for some cluster framework daemons (in Solaris 9 OS), and for most application daemons and application fault monitors (in Solaris 9 and 10 OS). A failfast driver panics the kernel if this daemon is stopped and not restarted in 30 seconds.
pnmd The public network management (PNM) service daemon manages network status information received from the local IPMP daemon running on each node and facilitates application failovers caused by complete public network failures on nodes. It is automatically restarted if it is stopped.
scdpmd The disk path monitoring (DPM) daemon monitors the status of disk paths so that they can be reported in the output of the cldev status command. It is a multi-threaded daemon that runs on each node and is started automatically by an rc script when a node boots. It monitors the availability of the logical paths that are visible through the various multipath drivers (MPxIO, HDLM, PowerPath, etc.). It is automatically restarted by rpc.pmfd if it dies.

Version 3.2 only
qd_userd This daemon serves as a proxy whenever any quorum device activity requires the execution of some command in userland (e.g. a NAS quorum device).
cl_execd  
ifconfig_proxy_serverd  
rtreg_proxy_serverd  
cl_pnmd is the daemon for the public network management (PNM) module. It is started at boot time and starts the PNM service. It keeps track of the local host's IPMP state and facilitates inter-node failover for all IPMP groups.
scprivipd This daemon provisions IP addresses on the clprivnet0 interface, on behalf of zones.
sc_zonesd This daemon monitors the state of Solaris 10 non-global zones so that applications designed to fail over between zones can react appropriately to zone boot failures.
cznetd This daemon is used for reconfiguring and plumbing the private IP addresses in a local zone after a virtual cluster is created; see also the cznetd.xml file.
rpc.fed This is the "fork and exec" daemon, which handles requests from rgmd to spawn methods for specific data services. A failfast driver panics the kernel if this daemon is killed and not restarted in 30 seconds.
scqdmd The quorum server daemon (possibly previously known as "scqsd").
pnm mod serverd  
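
A quick way to check that these daemons are actually present on a node is a simple ps filter; the names below are just a sample taken from the tables above, so adjust the list to your version:

## Check for the main cluster framework daemons
ps -ef | egrep 'clexecd|cl_eventd|rgmd|rpc.fed|rpc.pmfd|failfastd|scdpmd'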

File locations

Both Versions (3.1 and 3.2)
man pages /usr/cluster/man
log files /var/cluster/logs
/var/adm/messages
Configuration files (CCR, eventlog, etc) /etc/cluster/
Cluster and other commands /usr/cluster/lib/sc
Version 3.1 Only
sccheck logs /var/cluster/sccheck/report.<date>
Cluster infrastructure file /etc/cluster/ccr/infrastructure
Version 3.2 Only
sccheck logs /var/cluster/logs/cluster_check/remote.<date>
Cluster infrastructure file /etc/cluster/ccr/global/infrastructure
Command Log /var/cluster/logs/commandlog
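
A couple of quick examples using these locations (the release file and command log are the 3.2 paths from the table above):

## Show the installed cluster release (3.2)
cat /etc/cluster/release

## Review the commands recently run against the cluster (3.2)
tail /var/cluster/logs/commandlog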

SCSI Reservations

Display reservation keys

scsi2:
/usr/cluster/lib/sc/pgre -c pgre_inkeys -d /dev/did/rdsk/d4s2

scsi3:
/usr/cluster/lib/sc/scsi -c inkeys -d /dev/did/rdsk/d4s2

Determine the device owner

scsi2:
/usr/cluster/lib/sc/pgre -c pgre_inresv -d /dev/did/rdsk/d4s2

scsi3:
/usr/cluster/lib/sc/scsi -c inresv -d /dev/did/rdsk/d4s2
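
For example, to check the SCSI-3 reservations on a quorum disk you would first identify its DID device and then query it (d4 is just the example device used above):

## List the DID devices and their physical paths
scdidadm -L

## Display the SCSI-3 reservation keys on the device
/usr/cluster/lib/sc/scsi -c inkeys -d /dev/did/rdsk/d4s2

## Display the current reservation owner
/usr/cluster/lib/sc/scsi -c inresv -d /dev/did/rdsk/d4s2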

Command shortcuts

Version 3.2 provides a number of shortcut command names, which are detailed below. I have left the full command names in the rest of the document so it is obvious what is being performed. All of the commands are located in /usr/cluster/bin; a quick example follows the table.

 
full command shortcut
cldevice cldev
cldevicegroup cldg
clinterconnect clintr
clnasdevice clnas
clquorum clq
clresource clrs
clresourcegroup clrg
clreslogicalhostname clrslh
clresourcetype clrt
clressharedaddress clrssa
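
The shortcut and full command names are interchangeable; for example, the following two commands produce the same output:

cldevice status
cldev status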

Shutting down and Booting a Cluster

 
3.1
3.2
shutdown entire cluster (All Nodes will be brought down to init 0)

## Other nodes in cluster
scswitch -S -h <host>
shutdown -i5 -g0 -y

## Last remaining node
scshutdown -g0 -y

cluster shutdown -g0 -y
shutdown single node
scswitch -S -h <host>
shutdown -i5 -g0 -y
clnode evacuate <node>
shutdown -i5 -g0 -y
reboot a node into non-cluster mode
ok> boot -x
ok> boot -x
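
As a worked example, taking a single 3.2 node down for maintenance and booting it back outside the cluster would look something like this (node name is a placeholder):

## Move all resource groups and device groups off the node
clnode evacuate <node>

## Take the node down to the OK prompt
shutdown -i0 -g0 -y

## Boot into non-cluster mode
ok> boot -x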

Cluster information

 
3.1
3.2
Cluster scstat -pv cluster list -v
cluster show
cluster status
Nodes scstat -n clnode list -v
clnode show
clnode status
Devices scstat -D cldevice list
cldevice show
cldevice status
Quorum scstat -q clquorum list -v
clquorum show
clquorum status
Transport info scstat -W clinterconnect show
clinterconnect status
Resources scstat -g clresource list -v
clresource show
clresource status
Resource Groups scstat -g
scrgadm -pv

clresourcegroup list -v
clresourcegroup show
clresourcegroup status

Resource Types   clresourcetype list -v
clresourcetype list-props -v
clresourcetype show
IP Networking Multipathing scstat -i clnode status -m
Installation info (prints packages and version) scinstall -pv clnode show-rev -v
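
For a quick overall health check on a 3.2 cluster, the status forms of the above commands can be run in one go:

cluster status
clnode status
clquorum status
clresourcegroup status
clresource status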

Cluster Configuration

 
3.1
3.2
Release   cat /etc/cluster/release
Integrity check sccheck cluster check -v
Configure the cluster (add nodes, add data services, etc)
scinstall
scinstall
Cluster configuration utility (quorum, data services, resource groups, etc) scsetup clsetup
Rename   cluster rename -c <cluster_name>
Set a property   cluster set -p <name>=<value>
List   ## List cluster commands
cluster list-cmds

## Display the name of the cluster
cluster list

## List the checks
cluster list-checks

## Detailed configuration
cluster show -t global
Status   cluster status
Reset the cluster private network settings   cluster restore-netprops <cluster_name>
Place the cluster into install mode   cluster set -p installmode=enabled
Add a node scconf -a -T node=<host> clnode add -c <clustername> -n <nodename> -e endpoint1,endpoint2 -e endpoint3,endpoint4
Remove a node scconf -r -T node=<host> clnode remove
Prevent new nodes from entering scconf -a -T node=.  
Put a node into maintenance state

scconf -c -q node=<node>,maintstate

Note: use the scstat -q command to verify that the node is in maintenance mode, the vote count should be zero for that node.

 
Get a node out of maintenance state

scconf -c -q node=<node>,reset

Note: use the scstat -q command to verify that the node is out of maintenance mode; the vote count should be one for that node.
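
For example, the full 3.1 sequence for placing a node in maintenance state and later resetting it would be:

## Put the node into maintenance state (vote count drops to zero)
scconf -c -q node=<node>,maintstate
scstat -q

## Reset the node when it rejoins the cluster (vote count returns to one)
scconf -c -q node=<node>,reset
scstat -q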

 

Node Configuration

 
3.1
3.2
Add a node to the cluster   clnode add [-c <cluster>] [-n <sponsornode>] \
-e <endpoint> \
-e <endpoint>
<node>
Remove a node from the cluster   ## Make sure you are on the node you wish to remove
clnode remove
Evacuate a node from the cluster scswitch -S -h <node> clnode evacuate <node>
Cleanup the cluster configuration (used after removing nodes)   clnode clear <node>
List nodes  

## Standard list
clnode list [+|<node>]

## Detailed list
clnode show [+|<node>]

Change a node's property   clnode set -p <name>=<value> [+|<node>]
Status of nodes   clnode status [+|<node>]

Admin Quorum Device
 
Quorum votes come from both nodes and quorum devices, so the total quorum is all node and device votes added together. You can use the scsetup (3.1) / clsetup (3.2) interface to add or remove quorum devices, or use the commands below.
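
As a simple example of the arithmetic: in a two-node cluster with a single quorum disk, each node contributes one vote and the quorum disk contributes one, giving three votes in total. A majority of two is required, so the cluster can survive the loss of either node (or of the quorum disk), but not both.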

 
3.1
3.2
Adding a SCSI device to the quorum

scconf -a -q globaldev=d11

Note: if you get the error message "unable to scrub device" use scgdevs to add the device to the global device namespace.

clquorum add [-t <type>] [-p <name>=<value>] [+|<devicename>]
Adding a NAS device to the quorum n/a clquorum add -t netapp_nas -p filer=<filer>,lun_id=<lun_id> <nasdevice>
Adding a Quorum Server n/a clquorum add -t quorumserver -p qshost=<IPaddress>,port=<portnumber> <quorumservername>
Removing a device from the quorum scconf -r -q globaldev=d11 clquorum remove [-t <type>] [+|<devicename>]
Remove the last quorum device ## Evacuate all nodes

## Put cluster into maint mode
scconf -c -q installmode

## Remove the quorum device
scconf -r -q globaldev=d11

## Check the quorum devices
scstat -q
## Place the cluster in install mode
cluster set -p installmode=enabled

## Remove the quorum device
clquorum remove <device>

## Verify the device has been removed
clquorum list -v
List  

## Standard list
clquorum list -v [-t <type>] [-n <node>] [+|<devicename>]

## Detailed list
clquorum show [-t <type>] [-n <node>] [+|<devicename>]

## Status
clquorum status [-t <type>] [-n <node>] [+|<devicename>]

Resetting quorum info

scconf -c -q reset

Note: this will bring all offline quorum devices online

clquorum reset
Bring a quorum device into maintenance mode (known as disabled in 3.2) ## Obtain the device number
scdidadm -L
scconf -c -q globaldev=<device>,maintstate
clquorum disable [-t <type>] [+|<devicename>]
Bring a quorum device out of maintenance mode (known as enabled in 3.2) scconf -c -q globaldev=<device>,reset clquorum enable [-t <type>] [+|<devicename>]

Device Configuration  

 
3.1
3.2
Check device   cldevice check [-n <node>] [+]
Remove all devices from node   cldevice clear [-n <node>]
Monitoring  

## Turn on monitoring
cldevice monitor [-n <node>] [+|<device>]

## Turn off monitoring
cldevice unmonitor [-n <node>] [+|<device>]

Rename   cldevice rename -d <destination_device_name>
Replicate   cldevice replicate [-S <source-node>] -D <destination-node> [+]
Set properties of a device   cldevice set -p default_fencing={global|pathcount|scsi3} [-n <node>] <device>
Status  

## Standard display
cldevice status [-s <state>] [-n <node>] [+|<device>]

## Display failed disk paths
cldevice status -s fail

Lists all the configured devices including paths across all nodes. scdidadm -L ## Standard List
cldevice list [-n <node>] [+|<device>]

## Detailed list
cldevice show [-n <node>] [+|<device>]
List all the configured devices including paths on the local node only. scdidadm -l see above
Reconfigure the device database, creating new instance numbers if required. scdidadm -r cldevice populate
cldevice refresh [-n <node>] [+]
Perform the repair procedure for a particular path (use this when a disk gets replaced) scdidadm -R <c0t0d0s0> - device
scdidadm -R 2          - device id
cldevice repair [-n <node>] [+|<device>]

Device Groups

 
3.1
3.2
Create a device group n/a cldevicegroup create -t vxvm -n <node-list> -p failback=true <devgrp>
Remove a device group n/a cldevicegroup delete <devgrp>
Adding scconf -a -D type=vxvm,name=appdg,nodelist=<host>:<host>,preferenced=true cldevicegroup add-device -d <device> <devgrp>
Removing scconf -r -D name=<disk group> cldevicegroup remove-device -d <device> <devgrp>
Set a property   cldevicegroup set [-p <name>=<value>] [+|<devgrp>]
List scstat

## Standard list
cldevicegroup list [-n <node>] [-t <type>] [+|<devgrp>]

## Detailed configuration report
cldevicegroup show [-n <node>] [-t <type>] [+|<devgrp>]

status scstat cldevicegroup status [-n <node>] [-t <type>] [+|<devgrp>]
adding single node scconf -a -D type=vxvm,name=appdg,nodelist=<host> cldevicegroup add-node [-n <node>] [-t <type>] [+|<devgrp>]
Removing single node scconf -r -D name=<disk group>,nodelist=<host> cldevicegroup remove-node [-n <node>] [-t <type>] [+|<devgrp>]
Switch scswitch -z -D <disk group> -h <host> cldevicegroup switch -n <nodename> <devgrp>
Put into maintenance mode scswitch -m -D <disk group> n/a
Take out of maintenance mode scswitch -z -D <disk group> -h <host> n/a
onlining a disk group scswitch -z -D <disk group> -h <host> cldevicegroup online <devgrp>
offlining a disk group scswitch -F -D <disk group> cldevicegroup offline <devgrp>
Resync a disk group scconf -c -D name=appdg,sync cldevicegroup sync [-t <type>] [+|<devgrp>]

Transport Cable

 
3.1
3.2
Add   clinterconnect add <endpoint>,<endpoint>
Remove   clinterconnect remove <endpoint>,<endpoint>
Enable scconf -c -m endpoint=<host>:qfe1,state=enabled clinterconnect enable [-n <node>] [+|<endpoint>,<endpoint>]
Disable scconf -c -m endpoint=<host>:qfe1,state=disabled

Note: it gets deleted
clinterconnect disable [-n <node>] [+|<endpoint>,<endpoint>]
List scstat ## Standard and detailed list
clinterconnect show [-n <node>][+|<endpoint>,<endpoint>]
Status scstat clinterconnect status [-n <node>][+|<endpoint>,<endpoint>]

Resource Groups

 
3.1
3.2
Adding (failover)

scrgadm -a -g <res_group> -h <host>,<host>

clresourcegroup create <res_group>
Adding (scalable)   clresourcegroup create -S <res_group>
Adding a node to a resource group   clresourcegroup add-node -n <node> <res_group>
Removing scrgadm -r -g <group>

## Remove a resource group
clresourcegroup delete <res_group>

## Remove a resource group and all its resources
clresourcegroup delete -F <res_group>

Removing a node from a resource group   clresourcegroup remove-node -n <node> <res_group>
Changing properties scrgadm -c -g <resource group> -y <property=value> clresourcegroup set -p Failback=true + <name=value>
Status scstat -g clresourcegroup status [-n <node>][-r <resource>][-s <state>][-t <resourcetype>][+|<res_group>]
Listing scstat -g clresourcegroup list [-n <node>][-r <resource>][-s <state>][-t <resourcetype>][+|<res_group>]
Detailed List scrgadm -pv -g <res_group> clresourcegroup show [-n <node>][-r <resource>][-s <state>][-t <resourcetype>][+|<res_group>]
Display mode type (failover or scalable) scrgadm -pv -g <res_group> | grep 'Res Group mode'  
Offlining scswitch -F -g <res_group>

## All resource groups
clresourcegroup offline +

## Individual group
clresourcegroup offline [-n <node>] <res_group>

clresourcegroup evacuate [+|-n <node>]

Onlining scswitch -Z -g <res_group>

## All resource groups
clresourcegroup online +

## Individual groups
clresourcegroup online [-n <node>] <res_group>

Evacuate all resource groups from a node (used when shutting down a node)   clresourcegroup evacuate [+|-n <node>]
Unmanaging

scswitch -u -g <res_group>

Note: (all resources in group must be disabled)

clresourcegroup unmanage <res_group>
Managing scswitch -o -g <res_group> clresourcegroup manage <res_group>
Switching scswitch -z -g <res_group> -h <host> clresourcegroup switch -n <node> <res_group>
Suspend n/a clresourcegroup suspend [+|<res_group>]
Resume n/a clresourcegroup resume [+|<res_group>]
Remaster (move the resource group/s to their preferred node) n/a clresourcegroup remaster [+|<res_group>]
Restart a resource group (bring offline then online) n/a clresourcegroup restart [-n <node>] [+|<res_group>]
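
For example, a simple 3.2 failover test of a resource group is just a switch followed by a status check:

## Switch the resource group to another node
clresourcegroup switch -n <node> <res_group>

## Confirm where it is now online
clresourcegroup status <res_group>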
 
Resources
 
 
3.1
3.2
Adding failover network resource scrgadm -a -L -g <res_group> -l <logicalhost> clreslogicalhostname create -g <res_group> <lh-resource>
Adding shared network resource scrgadm -a -S -g <res_group> -l <logicalhost> clressharedaddress create -g <res_group> <sa-resource>
Adding a failover apache application and attaching the network resource scrgadm -a -j apache_res -g <res_group> \
-t SUNW.apache -y Network_resources_used=<logicalhost> \
-y Scalable=False -y Port_list=80/tcp \
-x Bin_dir=/usr/apache/bin
 
Adding a shared apache application and attaching the network resource scrgadm -a -j apache_res -g <res_group> \
-t SUNW.apache -y Network_resources_used=<logicalhost> \
-y Scalable=True -y Port_list=80/tcp \
-x Bin_dir=/usr/apache/bin
 
Create a HAStoragePlus failover resource scrgadm -a -g rg_oracle -j hasp_data01 -t SUNW.HAStoragePlus \
-x FileSystemMountPoints=/oracle/data01 \
-x AffinityOn=true
clresource create -t SUNW.HAStoragePlus -g <res_group> \
-p FileSystemMountPoints=<mount-point-list> \
-p AffinityOn=true <rs-hasp>
Removing

scrgadm -r -j res-ip

Note: must disable the resource first

clresource delete [-g <res_group>][-t <resourcetype>][+|<resource>]
Changing or adding properties scrgadm -c -j <resource> -y <property=value>

## Changing
clresource set -t <type> -p <name>=<value> +

## Adding
clresource set -p <name>+=<value> <resource>

List scstat -g clresource list [-g <res_group>][-t <resourcetype>][+|<resource>]

## List properties
clresource list-props [-g <res_group>][-t <resourcetype>][+|<resource>]
Detailed List scrgadm -pv -j res-ip
scrgadm -pvv -j res-ip
clresource show [-n <node>] [-g <res_group>][-t <resourcetype>][+|<resource>]
Status scstat -g clresource status [-s <state>][-n <node>] [-g <res_group>][-t <resourcetype>][+|<resource>]
Disable resource monitor scswitch -n -M -j res-ip clresource unmonitor [-n <node>] [-g <res_group>][-t <resourcetype>][+|<resource>]
Enable resource monitor scswitch -e -M -j res-ip clresource monitor [-n <node>] [-g <res_group>][-t <resourcetype>][+|<resource>]
Disabling scswitch -n -j res-ip clresource disable <resource>
Enabling scswitch -e -j res-ip clresource enable <resource>
Clearing a failed resource scswitch -c -h <host>,<host> -j <resource> -f STOP_FAILED clresource clear -f STOP_FAILED <resource>
Find the network of a resource scrgadm -pvv -j <resource> | grep -i network  
Removing a resource and resource group ## offline the group
scswitch -F -g rgroup-1

## remove the resource
scrgadm -r -j res-ip

## remove the resource group
scrgadm -r -g rgroup-1
## offline the group
clresourcegroup offline <res_group>

## remove the resource        
clresource delete [-g <res_group>][-t <resourcetype>][+|<resource>]

## remove the resource group             
clresourcegroup delete <res_group>
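
Putting the 3.2 resource commands together, a minimal sketch for building a failover group with a logical hostname and an HAStoragePlus resource would be (group and resource names are placeholders):

## Register the resource type
clresourcetype register SUNW.HAStoragePlus

## Create the failover resource group
clresourcegroup create <res_group>

## Add the logical hostname and storage resources
clreslogicalhostname create -g <res_group> <lh-resource>
clresource create -t SUNW.HAStoragePlus -g <res_group> \
-p FileSystemMountPoints=<mount-point-list> \
-p AffinityOn=true <rs-hasp>

## Enable the resources, then manage and online the group
clresource enable <lh-resource>
clresource enable <rs-hasp>
clresourcegroup manage <res_group>
clresourcegroup online <res_group>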

Resource Types
                     
 
3.1
3.2
Adding (register in 3.2) scrgadm -a -t <resource type>    e.g. SUNW.HAStoragePlus clresourcetype register <type>
Register a resource type to a node n/a clresourcetype add-node -n <node> <type>
Deleting (unregister in 3.2) scrgadm -r -t <resource type> clresourcetype unregister <type>
Deregistering a resource type from a node n/a clresourcetype remove-node -n <node> <type>
Listing scrgadm -pv | grep 'Res Type name' clresourcetype list [<type>]
Listing resource type properties   clresourcetype list-props [<type>]
Show resource types   clresourcetype show [<type>]
Set properties of a resource type   clresourcetype set [-p <name>=<value>] <type>