Xplat Xperts

OpsMgr - Cross Platform Discovery Errors

 The key to being able to monitor a server is being able to discover that server :). Until you can get the server into Operations Manager, you aren't going to be able to do much with it. While the discovery process for Unix and Linux servers seems simple enough, a lot goes on behind the scenes that the wizard hides. In a previous entry I went over a successful discovery path (OpsMg and Cross Plat-Getting Started); in this post I'm going to go over some of the errors that can occur and how to resolve them.

The first one I'll talk about is Not Enough Entropy, which required a little digging to figure out. The exact error is: Failed to allocate resource of type random data: Failed to get random data - not enough entropy.

Entropy 

I've had this issue when discovering both RHEL and SLES servers and it is related to certificate generation. 

There are two ways to solve this problem: you can recreate the /dev/random file or do a manual agent install.

For both fixes, first clean off the partially installed agent using the commands:

  1. rpm -e scx
  2. rm -rf /etc/opt/microsoft/scx

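Then, if you want discovery to work from the wizard, use the commands below. They recreate /dev/random with the device numbers of /dev/urandom (major 1, minor 9), so reads no longer block waiting for entropy: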

  1. rm /dev/random
  2. mknod -m 644 /dev/random c 1 9
  3. chown root:root /dev/random

A manual install requires copying the appropriate package from %ProgramFiles%\System Center Operations Manager 2007\AgentManagement\UnixAgents to the Unix/Linux machine and installing it directly.

After fixing the install issue, switch /dev/random back to the real random device (major 1, minor 8) using the commands:

  1. rm /dev/random
  2. mknod -m 644 /dev/random c 1 8
  3. chown root:root /dev/random
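
Before and after swapping the device node, it can help to confirm which device /dev/random currently is. The following is a minimal sketch, assuming a Linux host where major 1, minor 8 is /dev/random and minor 9 is /dev/urandom (matching the mknod commands above). The helper names random_kind and device_minor are my own, not part of any agent tooling.

```shell
# Map a minor number (for major 1) to the device it normally represents.
random_kind() {
    case "$1" in
        8) echo "/dev/random" ;;
        9) echo "/dev/urandom" ;;
        *) echo "unknown" ;;
    esac
}

# Print the minor number of a character device by parsing `ls -l`,
# e.g. "crw-r--r-- 1 root root 1, 8 ... /dev/random" gives 8.
device_minor() {
    ls -l "$1" | awk '{ print $6 }'
}

# Show the kernel's entropy estimate where the Linux proc interface exists.
if [ -r /proc/sys/kernel/random/entropy_avail ]; then
    echo "entropy available: $(cat /proc/sys/kernel/random/entropy_avail)"
fi

# Report what the /dev/random node currently behaves like.
if [ -c /dev/random ]; then
    echo "/dev/random acts as: $(random_kind "$(device_minor /dev/random)")"
fi
```

If the last line still reports /dev/urandom after the agent is installed, the node has not been switched back yet.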

Next, let's look at Unspecified Problem. I am sure there is a whole gamut of reasons why this one occurs. The text is: Starting Microsoft SCX CIM Server:  Unspecified Problem.

Unspecified 

The key here is the statement "Generating certificate with hostname...", which tells us the certificate was generated, so we need to look at what happens after certificate creation. The only cause I have found for this error is the firewall: after installation and certificate generation there is a validation step. If you watch the steps in the wizard, the error pops up almost immediately, which means the wizard is unable to verify the agent and suggests a communication issue. Ensure that port 1270 has been opened on the firewall and try to discover again.
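
A quick way to test reachability of port 1270 from another Unix/Linux box is sketched below. It assumes bash and the coreutils timeout command are installed; port_open is a hypothetical helper name, and the host name in the usage line is a placeholder.

```shell
# Return 0 if a TCP connection to host ($1) on port ($2) succeeds within
# 5 seconds. Uses bash's /dev/tcp pseudo-device, so bash must be present.
port_open() {
    timeout 5 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Example usage (placeholder host name):
#   port_open myserver.example.com 1270 && echo "port 1270 reachable"
```

If the check fails while the agent service is running, the firewall between the two machines is the first place to look.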

Some of the other errors I've run into over time are:

Access is Denied: this one pops up from time to time when an agent installation failed for some reason, you fixed the underlying cause, and tried again. The problem is that the partially installed agent blocks the re-install. The fix is to clean off the agent and do a fresh install, the same way we did for Not Enough Entropy.

Cannot connect to port 1270: this one typically occurs when there is a library path issue on the monitored server. If you go to the server, you'll likely see that the service failed to start. Trying to restart the service will give you the name of the library that cannot be found.

The typical resolution path for Linux is:

  1. scxadmin -restart all
  2. See what library is missing 
  3. find / -name <missing library>  
  4. vi /etc/ld.so.conf 
  5. add path to missing library  
  6. ldconfig to reload dynamic loader  
  7. scxadmin -restart all   
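
Steps 4 and 5 can be scripted so that the same path is not added twice. This is a minimal sketch: add_lib_path is a hypothetical helper, and the directory in the usage line is only an example.

```shell
# Append a directory to a loader config file unless it is already listed.
# The second argument defaults to /etc/ld.so.conf; pass another file to
# try the function without touching the system configuration.
add_lib_path() {
    dir=$1
    conf=${2:-/etc/ld.so.conf}
    if grep -qxF "$dir" "$conf" 2>/dev/null; then
        echo "already present: $dir"
    else
        echo "$dir" >> "$conf" && echo "added: $dir"
    fi
}

# Example usage as root (example directory):
#   add_lib_path /usr/local/mysql/lib && ldconfig && scxadmin -restart all
```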

The path for Solaris is the same for steps 1 - 3 but differs when it comes to setting the library path:

  1. crle to see the current path
  2. crle -l to update the path (include the old path plus the new path because the command is a replacement, not an append) 
  3. scxadmin -restart all  
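
Because crle -l replaces the path, it is easy to lose the existing entries. The sketch below builds the full replacement value from the current crle output. It assumes crle prints a line of the form "Default Library Path (ELF):   /lib:/usr/lib  (system default)" and that a POSIX shell such as ksh or bash is used. The function names and the example directory are hypothetical.

```shell
# Extract the default ELF library path from `crle` output read on stdin.
# Assumes the path is the fifth whitespace-separated field of the line
# containing "Default Library Path".
current_crle_path() {
    awk '/Default Library Path/ { print $5 }'
}

# Print the existing path with one new directory ($1) appended.
extended_crle_path() {
    old=$(current_crle_path)
    echo "${old}:$1"
}

# Example usage as root on Solaris (example directory):
#   crle -l "$(crle | extended_crle_path /opt/example/lib)"
```

Check the printed value by eye before passing it to crle -l, since the output format assumption may not hold on every Solaris release.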

Can not resign certificate, /etc/opt/microsoft/scx/ssl/scx-host-<hostname>.pem already exists: in this situation, re-creation of a certificate was attempted but failed because a previously generated certificate already exists on the target host. If you want to generate a new certificate, simply delete the contents of the /etc/opt/microsoft/scx/ssl directory. Alternatively, you can export the existing certificate and trust it on the management server.

winrm failed to connect in a timely manner: this can happen if the target server is overloaded. OpenPegasus will time out after 20 seconds or so, and this can result in a failure to validate that the agent was properly installed. The fix here is to confirm the agent was in fact installed by running scxcimcli ei -n root/scx CIM_ManagedElement on the target server, and then retry the discovery.
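
On a busy server the local check itself may need more than one attempt. A small retry wrapper, sketched here, covers that. retry is a hypothetical helper, and the attempt count and delay in the usage line are arbitrary.

```shell
# Run a command up to $1 times, sleeping $2 seconds between attempts.
# Returns 0 as soon as one attempt succeeds, 1 if every attempt fails.
retry() {
    attempts=$1
    delay=$2
    shift 2
    n=1
    while [ "$n" -le "$attempts" ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        n=$((n + 1))
        if [ "$n" -le "$attempts" ]; then
            sleep "$delay"
        fi
    done
    return 1
}

# Example usage on the target server:
#   retry 3 10 /opt/microsoft/scx/bin/tools/scxcimcli ei -n root/scx CIM_ManagedElement \
#       && echo "agent responding"
```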
 
There are many other things that could go wrong during discovery, but in most cases the error message you receive should help you determine how to fix the problem. One thing to watch is the phase in which the error occurred: Initial discovery (name resolution issues), Installation (user account issues), Signing (certificate issues), or Validation (configuration issues). Knowing where to start looking is half the battle in getting our servers successfully discovered.

Posted on August 10, 2009 at 08:00 AM in Debugging, Operations Manager

Validating Windows Providers

My name is Rob Doucette and I am the Software Development Manager at BridgeWays (a division of Xandros).  I am responsible for the development of all management packs at BridgeWays.  Before joining Xandros, I held a Senior Software Developer role at Quest Software.  I was involved with a number of projects at Quest including Spotlight on Active Directory, Spotlight on Exchange, as well as research into PowerShell extensions, user provisioning and identity management around cloud computing.  I have been working with Operations Manager since MOM 2000, and have been involved with the development of management packs and connectors for both MOM 2005 and SCOM 2007. 

For my first post, I'd like to extend Mike's post "Validating and Troubleshooting Unix/Linux Providers" to the Windows world.  The information Mike provided was specific to Unix/Linux.  I'm going to show you how to perform the same tasks with Windows.  This will primarily involve the use of PowerShell.

1) What classes are available in a namespace?

The first step is to enumerate the available classes in a given namespace.  The example I've included shows a Windows Server machine with the BridgeWays Management Pack for VMware installed.  This will show us all the classes that are modeled in this MP.

Get-WmiObject -Namespace root/bws -list

BLOG-Namespace

2) How do you query for instances of a class?

The next step is to specify a class and enumerate all the instances of that class.  The example I've chosen shows how to enumerate ESX Hosts (for brevity the screen shot only shows one ESX Host in the enumeration).

Get-WmiObject -Namespace root/bws VMwareHost

Query

3) How do you execute commands remotely?

The commands above all assume that you are on the machine where the WMI provider is installed; however, these commands can also be executed remotely. First, you'll need to make sure that Windows Firewall is configured to allow remote WMI queries. Second, you'll need to specify an additional parameter, the ComputerName parameter.

Get-WmiObject -Namespace root/bws -ComputerName <remoteComputer> VMwareHost

This should give you a good starting point on how to talk to WMI providers outside of SCOM. In my next post, I will expand on the Get-WmiObject cmdlet and other PowerShell commands to give you more control over the output from these queries.

Posted on August 05, 2009 at 10:44 PM in Debugging

Validating and Troubleshooting Unix/Linux Providers

After discovering a Unix/Linux server and pushing down the appropriate providers, it's pretty common to want to make sure the provider is working. It can take a while for initial application discoveries to complete once a provider has been installed, so for this post I'm going to talk about how you can verify that the providers are working outside of OpsMgr and the Management Packs. This can be done from either side of the fence: from the Management Server or directly on the discovered Unix/Linux machine.

Unix/Linux Server

Let's start on the server to be monitored because we need to go and see what kind of information is available to be queried in the first place.

Connect to the server and go to /var/opt/microsoft/scx/lib/repository

Repository
What this shows us is the set of namespaces registered on the server.
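
The directory listing can be turned into namespace names that are ready to pass to -n. This sketch assumes the repository stores each namespace as a directory with '#' in place of '/' (for example root#scx for root/scx), which is how I understand the OpenPegasus file repository to work. The function names are my own.

```shell
# Convert repository directory names (one per line on stdin) into
# namespace names by turning '#' back into '/'.
to_namespace() {
    tr '#' '/'
}

# List the namespaces found under a repository directory. The argument
# defaults to the agent's repository path mentioned above.
list_namespaces() {
    repo=${1:-/var/opt/microsoft/scx/lib/repository}
    ls "$repo" | to_namespace
}
```

Running list_namespaces with no arguments on the monitored server should print one namespace per line.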

Now we start to use scxcimcli. This is a command line tool that allows us to call the CIM server directly and query information. If the query succeeds, then Operations Manager should be able to access all the data, and we can be confident things are working (at least on the managed server side).

The first thing to do is a query to enumerate all available classes (this can be done for any namespace, I'm going to validate one of the BridgeWays management packs for the examples):
/opt/microsoft/scx/bin/tools/scxcimcli nc -n root/xsm

Enumeration

Now we know what is available, so we can actually query a class and see if relevant data comes back:
/opt/microsoft/scx/bin/tools/scxcimcli ei -n root/xsm XSM_MySQLServer

Failed

Uh oh, we have a problem here... the server doesn't have one of the dependencies installed or configured right.  We're unable to find the MySQL client libraries that are used to connect to the database and gather data.

Now what do we do?  We need to resolve the dependency issue. 

  1. Use ldd to see if there is a single dependency issue, or more than one.
  2. Use find to see if the library is on the system but not on the library load path.
  3. Create a symbolic link using ln -s in an existing library path, or update the path to include the location of the missing module. You update the path on Linux by editing /etc/ld.so.conf (using vim) to include the path to the libraries and then running ldconfig; on Solaris you use the command crle.
  4. If the library is not on the system at all, install the missing package (pkg for Solaris, rpm for Linux, etc.) to provide the missing dependency.
  5. Restart scx using scxadmin -restart all.
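
The first step is easier to read if the ldd output is filtered down to just the unresolved libraries. A minimal sketch follows. missing_libs is a hypothetical helper, and it assumes unresolved entries contain the text "not found", which covers both the Linux "=> not found" and the Solaris "(file not found)" forms as far as I know.

```shell
# Read `ldd` output on stdin and print the name of every library the
# loader could not resolve (any line containing "not found").
missing_libs() {
    awk '/not found/ { print $1 }'
}

# Example usage (replace the placeholder with the provider library):
#   ldd <path to provider library> | missing_libs
```

Each name printed can then be fed to find to see whether the library exists somewhere outside the load path.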

For my example I am running on Solaris, so I will update the path to include the MySQL client libraries which are part of the SUN WebStack install.

FixingMySQL

Now when we run our query, we should succeed.. let's see what happens:

Results

There we go, things are working.
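
Once a single query works, the same check can be looped over several classes in a namespace. This is a sketch only: check_classes and the SCXCIMCLI variable are my own names, and it assumes scxcimcli exits with a non-zero status when the enumeration fails.

```shell
# Path to the CIM command line client; override SCXCIMCLI to point elsewhere.
SCXCIMCLI=${SCXCIMCLI:-/opt/microsoft/scx/bin/tools/scxcimcli}

# Enumerate instances of each class in a namespace and print OK or FAIL
# based on the client's exit status. Returns non-zero if any class fails.
check_classes() {
    ns=$1
    shift
    failed=0
    for class in "$@"; do
        if "$SCXCIMCLI" ei -n "$ns" "$class" >/dev/null 2>&1; then
            echo "OK   $class"
        else
            echo "FAIL $class"
            failed=1
        fi
    done
    return "$failed"
}

# Example usage:
#   check_classes root/xsm XSM_MySQLServer
```

Any class reported as FAIL is worth re-running by hand so you can read the actual error text.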

The same series of steps are used when troubleshooting the agent or providers and once we have things working from the monitored server side, we can do a quick check on the management server to see if it is able to communicate with the CIM server.

Management Server

From the MS, we use winrm to talk to the Unix/Linux server.  From the cmd prompt run

winrm e "http://schemas.xandros.com/wbem/wscim/1/cim-schema/2/XSM_MySQLServer?__cimnamespace=root/xsm" -r:https://[server FQDN or IP]:1270 -u:[user name] -p:[password] -auth:basic -skipcacheck -skipcncheck -encoding:utf-8

This allows you to query the provider and confirm that the management server can get data. If this succeeds but nothing is showing up in Operations Manager, take a look at the Alert view. Chances are that either the Unix Action profile RunAs account has invalid credentials set (or none at all) or the certificate is invalid (perhaps you recently changed the hostname).

Posted on July 15, 2009 at 09:12 AM in Debugging
