What to Do When a Failover Cluster Fails

Featured image, and a great blog post introducing Windows clusters to SQL Server DBAs, from Brent Ozar’s site: Intro to SQL Server Clusters (love Kendra Little’s art).

Windows Server Failover Clustering (WSFC, often shortened to WFC), administered through Failover Cluster Manager (FCM), provides the failover cluster that a clustered SQL Server instance or SQL Server Always On runs on. If you need high availability through either of those, you will need a cluster. If you are a DBA, hopefully the sysadmin creates the cluster for you. If you have to create it yourself, there are many blogs that walk through the process. Here is a good starting link: Microsoft page on Creating Clusters

What do you do when the cluster fails?

When the cluster fails and you (the DBA) need to troubleshoot the issue, work through these steps:

  1. Open Failover Cluster Manager and note the node names and server names.
  2. Open PowerShell and generate the cluster log for each node (server). This log is not written automatically; you have to create it yourself. The command is “Get-ClusterLog -Node NodeName”.
  3. Review the generated log (C:\Windows\Cluster\Reports\Cluster.log) for the time of the failover and the reason the cluster failed over. The log is written on the server for the node you generated it for. Its timestamps are in UTC unless you add -UseLocalTime to the command, so allow for the offset when you compare it with Event Viewer.
  4. Go to Failover Cluster Manager and review the Cluster Events. These are the same no matter which node you view them from.
  5. Go to Administrative Tools\Event Viewer on each node and review the events for the time period of the cluster failover in these logs:
    • Windows Logs\Application
    • Windows Logs\System
    • Applications and Services Logs\Microsoft\Windows\Hyper-V-High-Availability\Admin
    • Applications and Services Logs\Microsoft\Windows\FailoverClustering-Manager
    • Applications and Services Logs\Microsoft\Windows\FailoverClustering\Operational
  6. Optional: create an Excel spreadsheet with one row for each meaningful event from the log files. This helps me see the events and put them in order. I also like to send it to management as documentation. The columns I use:
    • Cluster Node
    • Cluster Server
    • Date
    • Time
    • Where found (example: Event Viewer\Hyper-V-High-Availability\Admin)
    • Level (Error, Information, Warning, or Critical)
    • Source
    • Event source
    • Event ID
    • Task Category
    • Description
  7. On each node, open and review this file: C:\Program Files (x86)\Microsoft SQL Server\110\Tools\Policies\DatabaseEngine\1033\Windows Event Log Cluster Disk Resource Corruption Error.XML
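As a rough sketch of step 3: once Cluster.log has been generated, a short Python script can narrow it down to the failover window. This assumes each entry starts with a hex process and thread ID, followed by a YYYY/MM/DD-HH:MM:SS.mmm timestamp and a level such as INFO, WARN, or ERR, so check a few lines of your own log before relying on the pattern. The function name, the sample lines, and the time window are all made up for illustration.

```python
import re
from datetime import datetime

# Assumed entry layout: "<pid>.<tid>::YYYY/MM/DD-HH:MM:SS.mmm LEVEL  message"
LINE_RE = re.compile(
    r"^[0-9a-fA-F]+\.[0-9a-fA-F]+::"
    r"(?P<ts>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+(?P<msg>.*)$"
)


def filter_cluster_log(lines, start, end, levels=("ERR", "WARN")):
    """Return (timestamp, level, message) tuples that fall inside [start, end]."""
    hits = []
    for line in lines:
        match = LINE_RE.match(line.rstrip("\r\n"))
        if not match:
            continue  # skip headers and anything that is not a log entry
        ts = datetime.strptime(match.group("ts"), "%Y/%m/%d-%H:%M:%S.%f")
        if start <= ts <= end and match.group("level") in levels:
            hits.append((ts, match.group("level"), match.group("msg")))
    return hits


# Made-up sample lines, for illustration only (not from a real cluster)
sample = [
    "00000a1c.00001b2c::2016/03/14-17:31:59.100 INFO  [NM] Example heartbeat message",
    "00000a1c.00001b2c::2016/03/14-17:32:10.250 ERR   [RES] Example resource failure",
    "00000a1c.00001b2c::2016/03/14-17:32:11.000 WARN  [RCM] Example failover warning",
    "00000a1c.00001b2c::2016/03/14-18:45:00.000 ERR   [RES] Outside the window",
]
window_start = datetime(2016, 3, 14, 17, 30)
window_end = datetime(2016, 3, 14, 17, 35)

for ts, level, msg in filter_cluster_log(sample, window_start, window_end):
    print(ts.isoformat(sep=" ", timespec="milliseconds"), level, msg)
```

To run it against a real log, pass an open file object for Cluster.log instead of the sample list, and give the window in the same time zone as the log's timestamps.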

 

Gather all your data, review it, and send out an email explaining why the cluster failed. If you find that the error is related to Always On or SQL Server, you will need to dig further into the SQL Server logs.
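If you would rather not build the optional spreadsheet from step 6 by hand, here is a minimal sketch that sorts collected events by date and time and writes them out as CSV, which Excel opens directly. The column names follow the list in step 6. The function name, the node and server names, and the sample rows are hypothetical, and the dates are assumed to be typed as YYYY-MM-DD with 24-hour times.

```python
import csv
import io
from datetime import datetime

# Columns taken from the spreadsheet layout described in step 6
COLUMNS = [
    "Cluster Node", "Cluster Server", "Date", "Time", "Where found", "Level",
    "Source", "Event source", "Event ID", "Task Category", "Description",
]


def build_timeline(events):
    """Sort event dicts chronologically and return them as CSV text."""
    def when(event):
        return datetime.strptime(
            event["Date"] + " " + event["Time"], "%Y-%m-%d %H:%M:%S"
        )

    buffer = io.StringIO()
    # restval="" leaves a blank cell for any column an event does not fill in
    writer = csv.DictWriter(
        buffer, fieldnames=COLUMNS, restval="", lineterminator="\n"
    )
    writer.writeheader()
    for event in sorted(events, key=when):
        writer.writerow(event)
    return buffer.getvalue()


# Made-up rows, for illustration only
sample_events = [
    {"Cluster Node": "NODE2", "Cluster Server": "SQLSRV02",
     "Date": "2016-03-14", "Time": "17:32:10",
     "Where found": "Event Viewer\\Windows Logs\\System", "Level": "Error",
     "Source": "FailoverClustering", "Event ID": "1069",
     "Description": "Example: cluster resource failed"},
    {"Cluster Node": "NODE1", "Cluster Server": "SQLSRV01",
     "Date": "2016-03-14", "Time": "17:32:05",
     "Where found": "Event Viewer\\Windows Logs\\System", "Level": "Critical",
     "Source": "FailoverClustering", "Event ID": "1135",
     "Description": "Example: node removed from cluster membership"},
]

print(build_timeline(sample_events))
```

To get a file you can attach to the email, write the returned text to something like timeline.csv (opened with newline="") instead of printing it.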
