Files are created in the /var/ct/IW/log/mc/Resource Manager directory to contain internal trace output that is useful to a software service organization for resolving problems. An internal trace utility tracks the activity of the resource manager daemon. Multiple levels of detail may be available for diagnosing problems. Some minimal level of tracing is on at all times. Full tracing can be activated with the command:
traceson -s IBM.HostRM
Tracing can be returned to the minimal level with the command:
tracesoff -s IBM.HostRM
where IBM.HostRM is used as an example of a resource manager.
Resource Manager Diagnostic Files
All trace files are written by the trace utility to the /var/ct/IW/log/mc/Resource Manager directory. Each file in this directory that is named trace<.n> corresponds to a separate run of the resource manager. The latest file, which corresponds to the current run of the resource manager, is called trace. Trace files from earlier runs have a suffix of .n, where n starts at 0 and increases for older runs.
Use the rpttr command to view these files. For an active process, records can be viewed as they are added by adding the -f option to the rpttr command.
Any core files that result from a program error are written by the trace utility to the /var/ct/IW/run/mc/Resource Manager directory. Like the trace files, older core files have a .n suffix that increases with age. Core files and trace files with the same suffix correspond to the same run instance.
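The suffix convention above can be sketched in a few lines of Python. This is only an illustration, not part of RSCT: the function name group_runs is hypothetical, and the sketch assumes that core files are named core and core.n in the same way that trace files are named trace and trace.n, which the text implies but does not state.

```python
import re

# Assumption: core files follow the same naming pattern as trace files.
NAME = re.compile(r"^(trace|core)(?:\.(\d+))?$")

def group_runs(log_files, run_files):
    """Pair trace and core file names by run suffix, newest run first."""
    runs = {}
    for name in list(log_files) + list(run_files):
        match = NAME.match(name)
        if not match:
            continue  # ignore files that are neither trace nor core files
        kind, n = match.group(1), match.group(2)
        age = -1 if n is None else int(n)  # -1 marks the current run (no suffix)
        runs.setdefault(age, {"trace": None, "core": None})[kind] = name
    # Lower suffixes are newer runs, so ascending order is newest first.
    return [runs[age] for age in sorted(runs)]

if __name__ == "__main__":
    # File names as they might appear in the log and run directories.
    for run in group_runs(["trace", "trace.0", "trace.1"], ["core.1"]):
        print(run)
```

A run whose core entry is None ended without a program error, so only its trace file exists.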
The log and run directories have a default limit of 10 MB. The resource managers ensure that the total amount of disk space used is less than this limit. Trace files without corresponding core files are removed first when the resource manager is over the limit. Then pairs of core and trace files are removed, starting with the oldest. At least one pair of core and trace files is always retained.
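The removal order described above can be modeled as follows. This is a minimal sketch, not the resource manager's actual implementation. It assumes that 10 MB means 10 * 1024 * 1024 bytes, that unpaired trace files are removed oldest first (the text does not give their order), and that the trace file of the currently running daemon is left out of the input. The name plan_removals is hypothetical.

```python
LIMIT_BYTES = 10 * 1024 * 1024  # assumption: "10 MB" read as 10 MiB

def plan_removals(runs, limit=LIMIT_BYTES):
    """Return the run suffixes to delete, in order, to get under the limit.

    runs: list of (suffix, trace_bytes, core_bytes) tuples for earlier runs;
    core_bytes is None when the run left no core file. A larger suffix
    means an older run.
    """
    total = sum(trace + (core or 0) for _, trace, core in runs)
    removed = []

    # Phase 1: trace files without a core file, oldest first (assumed order).
    unpaired = sorted((r for r in runs if r[2] is None), reverse=True)
    for suffix, trace, _ in unpaired:
        if total < limit:
            return removed
        total -= trace
        removed.append(suffix)

    # Phase 2: core/trace pairs, oldest first; the newest pair is always kept.
    paired = sorted((r for r in runs if r[2] is not None), reverse=True)
    for suffix, trace, core in paired[:-1]:
        if total < limit:
            break
        total -= trace + core
        removed.append(suffix)
    return removed
```

Because the last pair is never removed, the directories can stay over the limit when a single pair is larger than the limit by itself.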
Recovering from RMC and Resource Manager Problems
This section describes the tools that you can use to recover from infrastructure problems. It tells you how to determine whether the components of the monitoring system are running and what to do if the RMC subsystem or one of the resource managers stops abnormally. Common troubleshooting problems and solutions are also described.
The Audit Log, Event Response, File System, and Host resource managers recover from most errors because they have few dependencies. In some cases, the recovery consists of terminating and restarting the appropriate daemon. These resource managers can recover from at least the following errors:
- Losing the connection to the RMC daemon, probably caused by termination of the RMC daemon or by another system problem.
- Programming errors, such as invalid memory references and memory leaks, that cause the process to terminate abnormally. In this case, the SRC subsystem restarts the daemon.
- The /var or /tmp directories filling up. When this happens, core and trace files cannot be captured.
In addition, all parameters received from the RMC subsystem are verified to avoid impacting other clients that may be using the same resource manager.
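Because core and trace files cannot be captured once /var or /tmp fills up, it can help to check free space before reproducing a problem. The following is a minimal sketch that uses only the Python standard library. The function name low_space and the 5 percent threshold are assumptions chosen for illustration, not values taken from RSCT.

```python
import shutil

def low_space(paths=("/var", "/tmp"), min_free_fraction=0.05):
    """Return the paths whose file system has less free space than the threshold.

    min_free_fraction is an illustrative assumption (5 percent by default).
    """
    low = []
    for path in paths:
        usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
        if usage.total and usage.free / usage.total < min_free_fraction:
            low.append(path)
    return low

if __name__ == "__main__":
    for path in low_space():
        print(f"Low on space: {path}; core and trace files may not be captured")
```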
The following tools are described:
- ctsnap command
- SRC-controlled commands
- rmcctrl command for the RMC subsystem
- Audit log
ctsnap Command
For debugging purposes, the ctsnap command can be used to tar the RSCT and resource-manager programs and send them to the software service organization. The ctsnap command gathers system configuration information and compresses the information into a tar file, which can then be downloaded to disk or tape and transmitted to a remote system. The information gathered with the ctsnap command may be required to identify and resolve system problems. See the man page for the ctsnap command for more information.
SRC-Controlled Commands
The RMC subsystem and the resource managers are controlled by the System Resource Controller (SRC). They can be viewed and manipulated by SRC commands. For example:
To see the status of all resource managers, enter:
lssrc -g rsct_rm
To see the status of an individual resource manager, enter:
lssrc -s rmname
where rmname can be:
- IBM.AuditRM
- IBM.ERRM
- IBM.FSRM
- IBM.HostRM
To see the status of all SRC-controlled subsystems on the local machine, enter:
lssrc -a
To see the status of a particular subsystem, for example, the RMC subsystem, which is known to SRC as ctrmc, enter:
lssrc -s ctrmc
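When the status of several subsystems has to be checked from a script, the text printed by lssrc can be turned into records. The sketch below assumes the usual four-column layout (Subsystem, Group, PID, Status), with the PID column left empty for a subsystem that is not running. The sample text, including its PID, is made up for illustration and is not captured output. The name parse_lssrc is hypothetical.

```python
# Illustrative sample only: the layout is assumed and the PID is invented.
SAMPLE = """\
Subsystem         Group            PID          Status
 ctrmc            rsct             16566        active
 IBM.HostRM       rsct_rm                       inoperative
"""

def parse_lssrc(text):
    """Parse lssrc-style status text into a list of dictionaries."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2 or fields[0] == "Subsystem":
            continue  # skip blank lines and the header row
        middle = fields[1:-1]  # group and/or PID, either may be absent
        pid = next((int(f) for f in middle if f.isdigit()), None)
        group = next((f for f in middle if not f.isdigit()), None)
        rows.append({"subsystem": fields[0], "group": group,
                     "pid": pid, "status": fields[-1]})
    return rows

if __name__ == "__main__":
    for row in parse_lssrc(SAMPLE):
        if row["status"] != "active":
            print(f"{row['subsystem']} is {row['status']}")
```

On a real system the text would come from a command such as lssrc -g rsct_rm; the column layout should be confirmed against actual output before relying on this parser.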
The SRC has these commands:
- lssrc
- startsrc
- stopsrc
- traceson
- tracesoff
For more information, see the command man pages or AIX Commands and Technical References.
To find out more about SRC, see System Management Concepts: Operating System and Devices.
Recovery Support for RMC Using rmcctrl
The RMC command rmcctrl controls the operation of the RMC subsystem and the RSCT resource managers. It is not normally run from the command line, but it can be used in some diagnostic environments; for example, it can be used to add, start, stop, or delete an RMC subsystem. See the rmcctrl command in the AIX Commands Reference, which is available at http://www.ibm.com/servers/aix/library.
Tracking ERRM Events with the Audit Log
The audit log is a system-wide facility for recording information about the system's operation. It can include information about the normal operation of the system as well as system problems and errors. It is meant to augment error log functionality by showing how an error relates to other system activities. All detailed information about system problems is still written to the operating system error log.
Records are created in the audit log by subsystems that have been instrumented to do so. For example, the Event Response subsystem runs in the background to monitor conditions defined by the administrator and then invokes one or more actions when a condition becomes true. Because this subsystem runs in the background, it is difficult for the operator or administrator to understand the total set of events that occurred and the results of any actions that were taken in response to an event. Because the Event Response subsystem records its activity in the audit log, the administrator can easily view Event Response subsystem activity, as well as that of other subsystems, through the lsaudrec command.
Troubleshooting Problems and Solutions
See the Web-based System Manager online help for common RMC troubleshooting problems and solutions.