Sun Cluster Cheat Sheet
This cheat sheet contains common commands and information for both Sun Cluster 3.1 and 3.2. There is some missing information (zones, NAS devices, etc.) which I hope to complete over time.
Both versions of Cluster also have a text-based menu tool (scsetup in 3.1, clsetup in 3.2), so don't be afraid to use it, especially if the task is a simple one.
All of the commands in version 3.1 are also available in version 3.2.
Daemons and Processes
At the bottom of the installation guide I listed the daemons and processes running after a fresh install; now is the time to explain what these processes do. I have managed to obtain information on most of them but am still looking for the others.
Versions 3.1 and 3.2
clexecd | This is used by cluster kernel threads to execute userland commands (such as the run_reserve and dofsck commands). It is also used to run cluster commands remotely (like the cluster shutdown command). This daemon registers with failfastd so that a failfast device driver will panic the kernel if this daemon is killed and not restarted in 30 seconds. |
cl_ccrad | This daemon provides access from userland management applications to the CCR. It is automatically restarted if it is stopped. |
cl_eventd | The cluster event daemon registers and forwards cluster events (such as nodes entering and leaving the cluster). There is also a protocol whereby user applications can register themselves to receive cluster events. The daemon is automatically respawned if it is killed. |
cl_eventlogd | The cluster event log daemon logs cluster events into a binary log file. At the time of writing there is no published interface to this log. It is automatically restarted if it is stopped. |
failfastd | This daemon is the failfast proxy server. The failfast daemon allows the kernel to panic if certain essential daemons have failed. |
rgmd | The resource group management daemon which manages the state of all cluster-unaware applications. A failfast driver panics the kernel if this daemon is killed and not restarted in 30 seconds. |
rpc.fed | This is the fork-and-exec daemon, which handles requests from rgmd to spawn methods for specific data services. A failfast driver panics the kernel if this daemon is killed and not restarted in 30 seconds. |
rpc.pmfd | This is the process monitoring facility. It is used as a general mechanism to initiate restarts and failure action scripts for some cluster framework daemons (in Solaris 9 OS), and for most application daemons and application fault monitors (in Solaris 9 and 10 OS). A failfast driver panics the kernel if this daemon is stopped and not restarted in 30 seconds. |
pnmd | The public network management service daemon manages network status information received from the local IPMP daemon running on each node and facilitates application failovers caused by complete public network failures on nodes. It is automatically restarted if it is stopped. |
scdpmd | The disk path monitoring daemon monitors the status of disk paths, so that they can be reported in the output of the cldev status command. This multi-threaded daemon runs on each node and is automatically started by an rc script when a node boots. It monitors the availability of logical paths that are visible through the various multipath drivers (MPxIO, HDLM, PowerPath, etc.). It is automatically restarted by rpc.pmfd if it dies. |
Version 3.2 only
qd_userd | This daemon serves as a proxy whenever any quorum device activity requires execution of a userland command (e.g. a NAS quorum device). |
cl_execd | |
ifconfig_proxy_serverd | |
rtreg_proxy_serverd | |
cl_pnmd | This is the public network management (PNM) daemon. It is started at boot time and starts the PNM service. It keeps track of the local host's IPMP state and facilitates inter-node failover for all IPMP groups. |
scprivipd | This daemon provisions IP addresses on the clprivnet0 interface, on behalf of zones. |
sc_zonesd | This daemon monitors the state of Solaris 10 non-global zones so that applications designed to fail over between zones can react appropriately to zone boot failures. |
cznetd | It is used for reconfiguring and plumbing the private IP addresses in a local zone after a virtual cluster is created; see also the cznetd.xml file. |
rpc.fed | This is the "fork and exec" daemon which handles requests from rgmd to spawn methods for specific data services. A failfast driver panics the kernel if this daemon is killed and not restarted in 30 seconds. |
scqdmd | The quorum server daemon; this was possibly called "scqsd" in earlier releases. |
pnm mod serverd |
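A quick way to sanity-check that the core framework daemons above are actually present on a node (a minimal sketch; the egrep pattern only lists some of the daemons from the tables above, extend it as needed):
## list the core cluster framework daemons running on this node
ps -ef | egrep 'clexecd|cl_eventd|failfastd|rgmd|rpc.fed|rpc.pmfd|pnmd|scdpmd'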
File locations
Both Versions (3.1 and 3.2)
man pages | /usr/cluster/man |
log files | /var/cluster/logs /var/adm/messages |
Configuration files (CCR, eventlog, etc) | /etc/cluster/ |
Cluster and other commands | /usr/cluster/lib/sc |
Version 3.1 Only
sccheck logs | /var/cluster/sccheck/report.<date> |
Cluster infrastructure file | /etc/cluster/ccr/infrastructure |
Version 3.2 Only
sccheck logs | /var/cluster/logs/cluster_check/remote.<date> |
Cluster infrastructure file | /etc/cluster/ccr/global/infrastructure |
Command Log | /var/cluster/logs/commandlog |
SCSI Reservations
Display reservation keys | scsi2: scsi3: |
determine the device owner | scsi2: scsi3: |
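The two rows above are incomplete. As a sketch, I believe the low-level helpers shipped under /usr/cluster/lib/sc are what is normally used here, but treat the binary names, options and the example DID device d4 as assumptions and verify them on your own installation:
## display reservation keys - SCSI-2 (PGRE) and SCSI-3
/usr/cluster/lib/sc/pgre -c pgre_inkeys -d /dev/did/rdsk/d4s2
/usr/cluster/lib/sc/scsi -c inkeys -d /dev/did/rdsk/d4s2
## determine the device owner - SCSI-2 (PGRE) and SCSI-3
/usr/cluster/lib/sc/pgre -c pgre_inresv -d /dev/did/rdsk/d4s2
/usr/cluster/lib/sc/scsi -c inresv -d /dev/did/rdsk/d4s2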
Command shortcuts
In version 3.2 there are a number of shortcut command names, which I have detailed below. I have left the full command names in the rest of the document so it is obvious what we are performing. All the commands are located in /usr/cluster/bin.
Full command | Shortcut
cldevice | cldev |
cldevicegroup | cldg |
clinterconnect | clintr |
clnasdevice | clnas |
clquorum | clq |
clresource | clrs |
clresourcegroup | clrg |
clreslogicalhostname | clrslh |
clresourcetype | clrt |
clressharedaddress | clrssa |
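The shortcuts are just alternative names for the same commands, so (assuming /usr/cluster/bin is in your PATH) the following two commands are equivalent:
## long form
cldevice status
## shortcut form
cldev status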
Shutting down and Booting a Cluster
3.1 | 3.2
shutdown entire cluster (all nodes will be brought down to init 0) | scshutdown -y -g0 | ## run from any node, shuts down every node in the cluster cluster shutdown -g0 -y |
shutdown single node | scswitch -S -h <host> shutdown -i5 -g0 -y | clnode evacuate <node> shutdown -i5 -g0 -y |
reboot a node into non-cluster mode | ok> boot -x | ok> boot -x |
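Putting the rows above together, a typical 3.2 sequence for taking a single node down for maintenance and bringing it back up outside the cluster (the node name node1 is illustrative):
## move resource groups and device groups off the node
clnode evacuate node1
## shut the node down cleanly
shutdown -g0 -y -i0
## from the OBP, boot it into non-cluster mode for the maintenance work
ok> boot -x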
Cluster information
3.1 | 3.2
Cluster | scstat -pv | cluster list -v cluster show cluster status |
Nodes | scstat -n | clnode list -v clnode show clnode status |
Devices | scstat -D | cldevice list cldevice show cldevice status |
Quorum | scstat -q | clquorum list -v clquorum show clquorum status |
Transport info | scstat -W | clinterconnect show clinterconnect status |
Resources | scstat -g | clresource list -v clresource show clresource status |
Resource Groups | scstat -g scrgadm -pv | clresourcegroup list -v |
Resource Types | | clresourcetype list -v clresourcetype list-props -v clresourcetype show |
IP Networking Multipathing | scstat -i | clnode status -m |
Installation info (prints packages and version) | scinstall -pv | clnode show-rev -v |
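As a quick post-change health check on 3.2, the status commands from the table above can simply be run back to back:
## overall cluster, node, quorum, device and resource group health
cluster status
clnode status
clquorum status
cldevice status
clresourcegroup status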
Cluster Configuration
3.1 | 3.2
Release | cat /etc/cluster/release | |
Integrity check | sccheck | cluster check -v |
Configure the cluster (add nodes, add data services, etc) | scinstall | scinstall |
Cluster configuration utility (quorum, data services, resource groups, etc) | scsetup | clsetup |
Rename | cluster rename -c <cluster_name> | |
Set a property | cluster set -p <name>=<value> | |
List | ## List cluster commands cluster list-cmds ## Display the name of the cluster cluster list ## List the checks cluster list-checks ## Detailed configuration cluster show -t global |
|
Status | cluster status | |
Reset the cluster private network settings | cluster restore-netprops <cluster_name> | |
Place the cluster into install mode | cluster set -p installmode=enabled | |
Add a node | scconf -a -T node=<host> | clnode add -c <clustername> -n <nodename> -e endpoint1,endpoint2 -e endpoint3,endpoint4 |
Remove a node | scconf -r -T node=<host> | clnode remove |
Prevent new nodes from entering | scconf -a -T node=. | |
Put a node into maintenance state | scconf -c -q node=<node>,maintstate Note: use the scstat -q command to verify that the node is in maintenance mode; the vote count should be zero for that node. | |
Get a node out of maintenance state | scconf -c -q node=<node>,reset Note: use the scstat -q command to verify that the node is out of maintenance mode; the vote count should be one for that node. |
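A short worked example of the maintenance-state rows above on 3.1 (node2 is illustrative; the node itself normally has to be down or in non-cluster mode first):
## put the node into maintenance state and check its vote count drops to zero
scconf -c -q node=node2,maintstate
scstat -q
## when the node is ready to rejoin, reset it and check the vote count returns to one
scconf -c -q node=node2,reset
scstat -q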
Node Configuration
3.1 | 3.2
Add a node to the cluster | clnode add [-c <cluster>] [-n <sponsornode>] \ -e <endpoint> \ -e <endpoint> <node> |
|
Remove a node from the cluster | ## Make sure you are on the node you wish to remove clnode remove |
|
Evacuate a node from the cluster | scswitch -S -h <node> | clnode evacuate <node> |
Cleanup the cluster configuration (used after removing nodes) | clnode clear <node> | |
List nodes | ## Standard list clnode list ## Detailed list clnode show |
Change a node's property | clnode set -p <name>=<value> [+|<node>] | |
Status of nodes | clnode status [+|<node>] |
Admin Quorum Device
Quorum votes come from both nodes and quorum devices, so the total quorum vote count is all of the node votes and device votes added together. You can use the scsetup (3.1) / clsetup (3.2) interface to add and remove quorum devices, or use the commands below.
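A worked example of the vote arithmetic (assuming the usual rule that a quorum device carries one vote fewer than the number of nodes it is connected to):
## two-node cluster with one quorum disk
node votes   = 2  (one per node)
device votes = 1  (2 connected nodes - 1)
total votes  = 3, majority needed = 2
## so a single surviving node plus the quorum disk can keep the cluster up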
3.1 | 3.2
Adding a SCSI device to the quorum | scconf -a -q globaldev=d11 Note: if you get the error message "unable to scrub device" use scgdevs to add the device to the global device namespace. | clquorum add [-t <type>] [-p <name>=<value>] [+|<devicename>] |
Adding a NAS device to the quorum | n/a | clquorum add -t netapp_nas -p filer=<nasdevice>,lun_id=<IDnum> <nasdevice> |
Adding a Quorum Server | n/a | clquorum add -t quorumserver -p qshost=<IPaddress>,port=<portnumber> <quorumservername> |
Removing a device from the quorum | scconf -r -q globaldev=d11 | clquorum remove [-t <type>] [+|<devicename>] |
Remove the last quorum device | ## Evacuate all nodes ## Put cluster into maint mode scconf -c -q installmode ## Remove the quorum device scconf -r -q globaldev=d11 ## Check the quorum devices scstat -q | ## Place the cluster in install mode cluster set -p installmode=enabled ## Remove the quorum device clquorum remove <device> ## Verify the device has been removed clquorum list -v |
List | scstat -q | ## Standard list clquorum list -v ## Detailed list clquorum show ## Status clquorum status |
Resetting quorum info | scconf -c -q reset Note: this will bring all offline quorum devices online | clquorum reset |
Bring a quorum device into maintenance mode (known as disabled in 3.2) | ## Obtain the device number scdidadm -L ## Put the device into maintenance scconf -c -q globaldev=<device>,maintstate | clquorum disable [-t <type>] [+|<devicename>] |
Bring a quorum device out of maintenance mode (known as enabled in 3.2) | scconf -c -q globaldev=<device>,reset | clquorum enable [-t <type>] [+|<devicename>] |
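For example, adding a shared disk as a quorum device on 3.2 and checking the votes (the DID device d11 is carried over from the 3.1 examples above and is illustrative):
## add the DID device as a quorum device
clquorum add d11
## confirm the device and the vote counts
clquorum status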
Device Configuration
3.1 | 3.2
Check device | cldevice check [-n <node>] [+] | |
Remove all devices from node | cldevice clear [-n <node>] | |
Monitoring | ## Turn on monitoring cldevice monitor [-n <node>] [+|<device>] ## Turn off monitoring cldevice unmonitor [-n <node>] [+|<device>] |
Rename | cldevice rename -d <destination_device_name> <device> | |
Replicate | cldevice replicate [-S <source-node>] -D <destination-node> [+] | |
Set properties of a device | cldevice set -p default_fencing={global|pathcount|scsi3} [-n <node>] <device> | |
Status | ## Standard display cldevice status ## Display failed disk paths cldevice status -s fail |
Lists all the configured devices including paths across all nodes. | scdidadm -L | ## Standard list cldevice list [-n <node>] [+|<device>] ## Detailed list cldevice show [-n <node>] [+|<device>] |
List all the configured devices including paths on the local node only. | scdidadm -l | see above |
Reconfigure the device database, creating new instance numbers if required. | scdidadm -r | cldevice populate cldevice refresh [-n <node>] [+] |
Perform the repair procedure for a particular path (use this when a disk gets replaced) | ## by device name scdidadm -R c0t0d0s0 ## by device id scdidadm -R 2 | cldevice repair [-n <node>] [+|<device>] |
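After physically replacing a failed disk, the rows above are typically combined like this on 3.2 (the DID instance d2 is illustrative):
## rebuild the DID device database so the new disk is picked up
cldevice populate
## run the repair procedure against the replaced DID device
cldevice repair d2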
Disk groups
3.1 | 3.2
Create a device group | n/a | cldevicegroup create -t vxvm -n <node-list> -p failback=true <devgrp> |
Remove a device group | n/a | cldevicegroup delete <devgrp> |
Adding | scconf -a -D type=vxvm,name=appdg,nodelist=<host>:<host>,preferenced=true | cldevicegroup add-device -d <device> <devgrp> |
Removing | scconf -r -D name=<disk group> | cldevicegroup remove-device -d <device> <devgrp> |
Set a property | cldevicegroup set [-p <name>=<value>] [+|<devgrp>] | |
List | scstat | ## Standard list cldevicegroup list ## Detailed configuration report cldevicegroup show |
Status | scstat | cldevicegroup status [-n <node>] [-t <type>] [+|<devgrp>] |
adding single node | scconf -a -D type=vxvm,name=appdg,nodelist=<host> | cldevicegroup add-node [-n <node>] [-t <type>] [+|<devgrp>] |
Removing single node | scconf -r -D name=<disk group>,nodelist=<host> | cldevicegroup remove-node [-n <node>] [-t <type>] [+|<devgrp>] |
Switch | scswitch -z -D <disk group> -h <host> | cldevicegroup switch -n <nodename> <devgrp> |
Put into maintenance mode | scswitch -m -D <disk group> | n/a |
take out of maintenance mode | scswitch -z -D <disk group> -h <host> | n/a |
onlining a disk group | scswitch -z -D <disk group> -h <host> | cldevicegroup online <devgrp> |
offlining a disk group | scswitch -F -D <disk group> | cldevicegroup offline <devgrp> |
Resync a disk group | scconf -c -D name=appdg,sync | cldevicegroup sync [-t <type>] [+|<devgrp>] |
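Putting several of the rows above together, a typical 3.2 sequence for a new VxVM device group (the group name appdg and the node names are illustrative):
## register the VxVM disk group as a cluster device group
cldevicegroup create -t vxvm -n node1,node2 -p failback=true appdg
## bring it online and check which node is primary
cldevicegroup online appdg
cldevicegroup status appdg
## later, move it to the other node
cldevicegroup switch -n node2 appdg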
Transport Cable
3.1 | 3.2
Add | clinterconnect add <endpoint>,<endpoint> | |
Remove | clinterconnect remove <endpoint>,<endpoint> | |
Enable | scconf -c -m endpoint=<host>:qfe1,state=enabled | clinterconnect enable [-n <node>] [+|<endpoint>,<endpoint>] |
Disable | scconf -c -m endpoint=<host>:qfe1,state=disabled Note: it gets deleted | clinterconnect disable [-n <node>] [+|<endpoint>,<endpoint>] |
List | scstat | ## Standard and detailed list clinterconnect show [-n <node>][+|<endpoint>,<endpoint>] |
Status | scstat | clinterconnect status [-n <node>][+|<endpoint>,<endpoint>] |
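For example, after replacing a private interconnect cable on 3.2 (the endpoint names are illustrative):
## check which transport paths are up
clinterconnect status
## re-enable the repaired path
clinterconnect enable node1:qfe1,node2:qfe1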
Resource Groups
3.1 | 3.2
Adding (failover) | scrgadm -a -g <res_group> -h <host>,<host> | clresourcegroup create <res_group> |
Adding (scalable) | clresourcegroup create -S <res_group> | |
Adding a node to a resource group | clresourcegroup add-node -n <node> <res_group> | |
Removing | scrgadm -r -g <res_group> | ## Remove a resource group clresourcegroup delete <res_group> ## Remove a resource group and all its resources clresourcegroup delete -F <res_group> |
Removing a node from a resource group | clresourcegroup remove-node -n <node> <res_group> | |
changing properties | scrgadm -c -g <res_group> -y <property>=<value> | clresourcegroup set -p <name>=<value> [+|<res_group>] (e.g. -p Failback=true) |
Status | scstat -g | clresourcegroup status [-n <node>][-r <resource>][-s <state>][-t <resourcetype>][+|<res_group>] |
Listing | scstat -g | clresourcegroup list [-n <node>][-r <resource>][-s <state>][-t <resourcetype>][+|<res_group>] |
Detailed List | scrgadm -pv -g <res_group> | clresourcegroup show [-n <node>][-r <resource>][-s <state>][-t <resourcetype>][+|<res_group>] |
Display mode type (failover or scalable) | scrgadm -pv -g <res_group> | grep 'Res Group mode' | |
Offlining | scswitch -F -g <res_group> | ## All resource groups clresourcegroup offline + ## Individual group clresourcegroup offline <res_group> |
Onlining | scswitch -Z -g <res_group> | ## All resource groups clresourcegroup online + ## Individual group clresourcegroup online <res_group> |
Evacuate all resource groups from a node (used when shutting down a node) | clresourcegroup evacuate [+|-n <node>] | |
Unmanaging | scswitch -u -g <res_group> Note: all resources in the group must be disabled first | clresourcegroup unmanage <res_group> |
Managing | scswitch -o -g <res_group> | clresourcegroup manage <res_group> |
Switching | scswitch -z -g <res_group> -h <host> | clresourcegroup switch -n <node> <res_group> |
Suspend | n/a | clresourcegroup suspend [+|<res_group>] |
Resume | n/a | clresourcegroup resume [+|<res_group>] |
Remaster (move the resource group/s to their preferred node) | n/a | clresourcegroup remaster [+|<res_group>] |
Restart a resource group (bring offline then online) | n/a | clresourcegroup restart [-n <node>] [+|<res_group>] |
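As a worked example of the 3.2 rows above, creating a failover resource group and bringing it under cluster control (the group name oracle-rg and the node name are illustrative):
## create the resource group
clresourcegroup create oracle-rg
## put it under RGM control and bring it online
clresourcegroup manage oracle-rg
clresourcegroup online oracle-rg
## later, switch it to another node
clresourcegroup switch -n node2 oracle-rg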
Resources
3.1 | 3.2
Adding failover network resource | scrgadm -a -L -g <res_group> -l <logicalhost> | clreslogicalhostname create -g <res_group> <lh-resource> |
Adding shared network resource | scrgadm -a -S -g <res_group> -l <logicalhost> | clressharedaddress create -g <res_group> <sa-resource> |
Adding a failover apache application and attaching the network resource | scrgadm -a -j apache_res -g <res_group> -t SUNW.apache -y Network_resources_used=<logicalhost> -y Scalable=False -y Port_list=80/tcp -x Bin_dir=/usr/apache/bin | |
Adding a shared apache application and attaching the network resource | scrgadm -a -j apache_res -g <res_group> -t SUNW.apache -y Network_resources_used=<logicalhost> -y Scalable=True -y Port_list=80/tcp -x Bin_dir=/usr/apache/bin | |
Create a HAStoragePlus failover resource | scrgadm -a -g rg_oracle -j hasp_data01 -t SUNW.HAStoragePlus -x FileSystemMountPoints=/oracle/data01 -x AffinityOn=true | clresource create -t SUNW.HAStoragePlus -g <res_group> -p FileSystemMountPoints=<mount-point-list> -p AffinityOn=true <rs-hasp> |
Removing | scrgadm -r -j res-ip Note: must disable the resource first | clresource delete [-g <res_group>][-t <resourcetype>][+|<resource>] |
changing or adding properties | scrgadm -c -j <resource> -y <property>=<value> | ## Changing a property clresource set -p <name>=<value> <resource> ## Adding a value to a list property clresource set -p <name>+=<value> <resource> |
List | scstat -g | clresource list [-g <res_group>][-t <resourcetype>][+|<resource>] ## List properties clresource list-props [-g <res_group>][-t <resourcetype>][+|<resource>] |
Detailed List | scrgadm -pv -j res-ip scrgadm -pvv -j res-ip | clresource show [-n <node>] [-g <res_group>][-t <resourcetype>][+|<resource>] |
Status | scstat -g | clresource status [-s <state>][-n <node>] [-g <res_group>][-t <resourcetype>][+|<resource>] |
Disable resource monitor | scrgadm -n -M -j res-ip | clresource unmonitor [-n <node>] [-g <res_group>][-t <resourcetype>][+|<resource>] |
Enable resource monitor | scrgadm -e -M -j res-ip | clresource monitor [-n <node>] [-g <res_group>][-t <resourcetype>][+|<resource>] |
Disabling | scswitch -n -j res-ip | clresource disable <resource> |
Enabling | scswitch -e -j res-ip | clresource enable <resource> |
Clearing a failed resource | scswitch -c -h <host>,<host> -j <resource> -f STOP_FAILED | clresource clear -f STOP_FAILED <resource> |
Find the network of a resource | scrgadm -pvv -j <resource> | grep -i network | |
Removing a resource and resource group | ## offline the group scswitch -F -g rgroup-1 ## remove the resource scrgadm -r -j res-ip ## remove the resource group scrgadm -r -g rgroup-1 | ## offline the group clresourcegroup offline <res_group> ## remove the resource clresource delete [-g <res_group>][-t <resourcetype>][+|<resource>] ## remove the resource group clresourcegroup delete <res_group> |
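Pulling the 3.2 rows above together, a sketch of building a small failover service (the names app-rg, app-lh, app-hasp and the mount point are illustrative; SUNW.HAStoragePlus must already be registered, see the resource type commands below):
## create the group and its logical hostname resource
clresourcegroup create app-rg
clreslogicalhostname create -g app-rg app-lh
## add the storage resource
clresource create -t SUNW.HAStoragePlus -g app-rg -p FileSystemMountPoints=/app/data app-hasp
## make sure the resources are enabled, then bring the group online
clresource enable app-lh
clresource enable app-hasp
clresourcegroup manage app-rg
clresourcegroup online app-rg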
Resource Types
3.1 | 3.2
Adding (register in 3.2) | scrgadm -a -t <resource type> e.g. SUNW.HAStoragePlus | clresourcetype register <type> |
Register a resource type to a node | n/a | clresourcetype add-node -n <node> <type> |
Deleting (remove in 3.2) | scrgadm -r -t <resource type> | clresourcetype unregister <type> |
Deregistering a resource type from a node | n/a | clresourcetype remove-node -n <node> <type> |
Listing | scrgadm -pv | grep 'Res Type name' | clresourcetype list [<type>] |
Listing resource type properties | clresourcetype list-props [<type>] | |
Show resource types | clresourcetype show [<type>] | |
Set properties of a resource type | clresourcetype set [-p <name>=<value>] <type> |
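For example, registering the HAStoragePlus type used in the resource examples earlier and confirming it is known to the cluster:
## register the type cluster-wide
clresourcetype register SUNW.HAStoragePlus
## confirm it is registered
clresourcetype list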