Topics: Monitoring, Red Hat / Linux

Monitoring a log file through Systemd

The following procedure describes how you can continuously monitor a log file through the use of SystemD on Red Hat Enterprise Linux (or similar operating systems).

Let's say you want to receive an email when a certain string occurs in a log file. For example, if the string "error" occurs in file /var/log/messages.

First, create a script that does a tail of /var/log/messages and searches for that string:

# cat /usr/local/bin/monitor.bash
#!/bin/bash

tail -fn0 /var/log/messages | while read line ; do
   echo "${line}" | grep -i "error" > /dev/null
   if [ $? = 0 ] ; then
      echo "${line}" | mailx -s "error in messages file" your@emailaddress.com
   fi
done
You can run that script, and be done with it. But if that script somehow gets cancelled or killed, for example when the system is rebooted, then the monitoring of the log file stops as well. That's where the use of Systemd comes in.

Create a file in folder /etc/systemd/system, such as "monitor.service", and add the following:
[Unit]
Description=My monitor script
After=network.target

[Service]
Type=simple
ExecStart=/bin/bash /usr/local/bin/monitor.bash
TimeoutStartSec=0
Restart=always
StartLimitInterval=0

[Install]
WantedBy=default.target
This file is a description of a service managed by Systemd, and it basically tells SystemD to run the script that we created earlier, and to restart it, in case it fails (that's what "Restart=always" is for).

Next, you'll have to tell Systemd that you made some changes:
# systemctl daemon-reload
Now you can start the newly defined service:
# systemctl start monitor.service
And after starting it, you can query the status:
# systemctl status monitor.service
monitor.service - My monitor script
   Loaded: loaded (/etc/systemd/system/monitor.service; enabled; vendor preset: disabled)
   Active: active (running) since Wed 2018-10-03 12:12:28 CDT; 2s ago
 Main PID: 1832 (bash)
   CGroup: /system.slice/monitor.service
           ??1832 /bin/bash /usr/local/bin/monitor.bash
           ??1833 tail -fn0 /var/log/messages
           ??1834 /bin/bash /usr/local/bin/monitor.bash

Oct 03 12:12:28 enemigo systemd[1]: Started My monitor script.
Oct 03 12:12:28 enemigo systemd[1]: Starting My monitor script...
As you can see in the output above, both the monitor.bash script and the tail command are running. To test if the service is actually restarted, you can try killing the tail process or the monitor.bash script, and then check for the status again. You'll see it is restarted.

You may also want to test that the new monitor script indeed is working. Considering that /var/log/messages is a file written to through rsyslog, you can log an entry with the string "error" to that file as follows:
# logger error
Next, you should receive an email saying that an "error" occurrence was found in the messages file.

Finally, you'll want to make sure this new monitor service is restarted also when the system boots:
# systemctl enable monitor.service

Topics: Monitoring, Networking, Red Hat / Linux

Securely enabling SNMP on Red Hat

Monitoring tools often use SNMP to query another system's information and status. For that to work on a Red Hat Enterprise Linux system, that system will have to have SNMP configured. And to allow a remote (monitoring) system to query SNMP information of a Red Hat Enterprise Linux system, one has to complete the following 3 items:

  • Set up SNMP.
  • Configure SNMP to use a non-public community name.
  • Allow access through the firewall, if configured.
For the configuration of SNMP, you'll need to install the following 2 packages:
# yum -y install net-snmp net-snmp-utils
Next, start and enable (at boot time) the SNMP daemon to run on the system:
# systemctl enable snmpd
# systemctl start snmpd
Now you can test if you can query SNMP infomation -locally- on the system, by using the snmpwalk command:
# snmpwalk -v2c -c public localhost | head -5
The community string used above ("public") is a well-known SNMP community string, and this can be (and probably "is") utilized by hackers or other unfriendly people to obtain information about the system remotely, and as such, it's best practice to change the public community name into something a littlebit different, preferably something that can't be guessed very easily. For the sake of this tutorial, we'll change it to "kermit".

Basically, you'll have to update this line in /etc/snmp/snmpd.conf from "public" to "kermit":

Before:
com2sec notConfigUser  default       public
After:
com2sec notConfigUser  default       kermit
Then, restart the SNMP daemon, so it picks up the changes to configuration file /etc/snmp/snmpd.conf:
# systemctl restart snmpd
Now test again with the snmpwalk command but this time by using the "kermit" community name:
# snmpwalk -v2c -c kermit localhost
That should give you quite a bit of output. If it doesn't, you've made a mistake, and you'll have to re-trace your steps.

The final step is to allow remote access. That will be needed if a remote system is being used to monitor the server, for example by a tool like Solarwinds. By default, remote access will be blocked by the firewall daemon on the system. To allow remote access, open up UDP port 161 on the client:
# firewall-cmd --zone=public --add-port=161/udp --permanent
# firewall-cmd --reload
Now log in to a remote system and run a similar snmpwalk command, but this time, specify the hostname of the server that you're querying (instead of "localhost"). For example, if the name of the host is "myserver", run:
# snmpwalk -v2c -c kermit myserver
And that's it. You can now remotely monitor a Linux server using SNMP, and you've secured it by changing the community name.

Topics: AIX, Monitoring, Networking, Red Hat / Linux, Security, System Admin

Determining type of system remotely

If you run into a system that you can't access, but is available on the network, and have no idea what type of system that is, then there are few tricks you can use to determine the type of system remotely.

The first one, is by looking at the TTL (Time To Live), when doing a ping to the system's IP address. For example, a ping to an AIX system may look like this:

# ping 10.11.12.82
PING 10.11.12.82 (10.11.12.82) 56(84) bytes of data.
64 bytes from 10.11.12.82 (10.11.12.82): icmp_seq=1 ttl=253 time=0.394 ms
...
TTL (Time To Live) is a timer value included in packets sent over networks that tells the recipient how long to hold or use the packet before discarding and expiring the data (packet). TTL values are different for different Operating Systems. So, you can determine the OS based on the TTL value. A detailed list of operating systems and their TTL values can be found here. Basically, a UNIX/Linux system has a TTL of 64. Windows uses 128, and AIX/Solaris uses 254.

Now, in the example above, you can see "ttl=253". It's still an AIX system, but there's most likely a router in between, decreasing the TTL with one.

Another good method is by using nmap. The nmap utility has a -O option that allows for OS detection:
# nmap -O -v 10.11.12.82 | grep OS
Initiating OS detection (try #1) against 10.11.12.82 (10.11.12.82)
OS details: IBM AIX 5.3
OS detection performed.
Okay, so it isn't a perfect method either. We ran the nmap command above against an AIX 7.1 system, and it came back as AIX 5.3 instead. And sometimes, you'll have to run nmap a couple of times, before it successfully discovers the OS type. But still, we now know it's an AIX system behind that IP.

Another option you may use, is to query SNMP information. If the device is SNMP enabled (it is running a SNMP daemon and it allows you to query SNMP information), then you may be able to run a command like this:
# snmpinfo -h 10.11.12.82 -m get -v sysDescr.0
sysDescr.0 = "IBM PowerPC CHRP Computer
Machine Type: 0x0800004c Processor id: 0000962CG400
Base Operating System Runtime AIX version: 06.01.0008.0015
TCP/IP Client Support  version: 06.01.0008.0015"
By the way, the example for SNMP above is exactly why UNIX Health Check generally recommends to disable SNMP, or at least to dis-allow providing such system information trough SNMP by updating the /etc/snmpdv3.conf file appropriately, because this information can be really useful to hackers. On the other hand, your organization may use monitoring that relies of SNMP, in which case it needs to be enabled. But then you stil have the opportunity of changing the SNMP community name to something else (the default is "public"), which also limits the remote information gathering possibilities.

Topics: Monitoring, PowerHA / HACMP

Cluster status webpage

How do you monitor multiple HACMP/PowerHA clusters? You're probably familiar with the clstat or the xclstat commands. These are nice, but not sufficient when you have more than 8 HACMP/PowerHA clusters to monitor, as it can't be configured to monitor more than 8 clusters. It's also difficult to get an overview of ALL clusters in a SINGLE look with clstat. IBM included a clstat.cgi in HACMP 5 to show the cluster status on a webpage. This still doesn't provide an overview in a single look, as the clstat.cgi shows a long listing of all clusters, and it is just like clstat limited to monitoring only 8 clusters.

The HACMP/PowerHA cluster status can be retrieved via SNMP (this is actually what clstat does too). Using the IP addresses of a cluster and the snmpinfo command, you can remotely retrieve cluster status information, and use that information to build a webpage. We've written a script for this purpose. By using colors for the status of the clusters and the nodes (green = ok, yellow = something is happening, red = error), you can get a quick overview of the status of all the HACMP/PowerHA clusters.


Per cluster you can see: the cluster name, the cluster ID, HACMP version and the status of the cluster and all its nodes. It will also show you where any resource groups are active.

You can download the script here. This is version 1.6. Untar the file that you download. There is a README in the package, that will tell you how you can configure the script. This script has been tested with HACMP version 4, 5, 6, and up to PowerHA version 7.1.3.4.

Topics: AIX, Monitoring, System Admin

NMON recordings

One can set up NMON recordings from smit via:

# smitty topas -> Start New Recording -> Start local recording -> nmon
However, the smit panel doesn't list the option needed to get disk IO service times. Specifically, the -d option to collect disk IO service and wait times. Thus, it's better to use the command line with the nmon command to collect and report these statistics. Here's one set of options for collecting the data:
# nmon -AdfKLMNOPVY^ -w 4 -s 300 -c 288 -m /var/adm/nmon
The key options here include:
  • -d Collect and report IO service time and wait time statistics.
  • -f Specifies that the output is in spreadsheet format. By default, the command takes 288 snapshots of system data with an interval of 300 seconds between each snapshot. The name of the output file is in the format of hostname_YYMMDD_HHMM.nmon.
  • -O Includes the Shared Ethernet adapter (SEA) VIOS sections in the recording file.
  • -V Includes the disk volume group section.
  • -^ Includes the FC adapter section (which also measures NPIV traffic on VIOS FC adapters).
  • -s Specifies the interval in seconds between 2 consecutive recording snapshots.
  • -c Specifies the number snapshots that must be taken by the command.
Running nmon using this command will ensure it runs for a full day. And it is therefore useful to start nmon daily using a crontab entry in the root crontab file. For example, using the following script:
# cat /usr/local/collect_nmon.ksh
#!/bin/ksh

LOGDIR="/var/adm/nmon"
PARAMS="-fTNAdKLMOPVY^ -w 4 -s 300 -c 288 -m $LOGDIR"

# LOGRET determines the number of days to retain nmon logs.
LOGRET=365

# Create the nmon folder.
if [ ! -d /var/adm/nmon ] ; then
        mkdir -p $LOGDIR
fi

# Compress previous daily log.
find $LOGDIR -name *.nmon -type f -mtime +1 -exec gzip '{}' \;

# Clean up old logs.
find $LOGDIR -name *nmon.gz -type f -mtime +$LOGRET -exec rm '{}' \;

# Start nmon.
/usr/bin/nmon $PARAMS
Then add the following crontab entry to the root crontab file:
0 0 * * * /usr/local/collect_nmon.ksh >/tmp/collect_nmon.ksh.log 2>&1
To get the recordings thru the NMON Analyser tool (a spreadsheet tool that runs on PCs and generates performance graphs, other output, and is available here), it's recommended to keep the number of intervals less than 300.

Topics: AIX, Monitoring, System Admin

Boxes and lines in NMON

Usually, with the default settings used with NMON, along with using PuTTY on a Windows system, you may notice that the boxes and lines in NMON are not displayed correctly. It may look something like this:



An easy fix for this issue is to change the character set translation within PuTTY. In the upper left corner of your PuTTY window, click the icon and select "Change Settings". Then navigate to Window -> Translation. In the "Remote character set" field, change "UTF-8" to "ISO-8859-1".



Once changed, restart PuTTY and it should something like this:



Another option is to stop using boxes and lines altogether. You can do this by starting nmon with the -B option:

# nmon -B
Or you can set the NMON environment variable to the same:
# export NMON=B
# nmon

Topics: AIX, Monitoring, Red Hat / Linux, Security, System Admin

Sudosh

Sudosh is designed specifically to be used in conjunction with sudo or by itself as a login shell. Sudosh allows the execution of a root or user shell with logging. Every command the user types within the root shell is logged as well as the output.

This is different from "sudo -s" or "sudo /bin/sh", because when you use one of these instead of sudosh to start a new shell, then this new shell does not log commands typed in the new shell to syslog; only the fact that a new shell started is logged.

If this newly started shell supports commandline history, then you can still find the commands called in the shell in a file such as .sh_history, but if you use a shell such as csh that does not support command-line logging you are out of luck.

Sudosh fills this gap. No matter what shell you use, all of the command lines are logged to syslog (including vi keystrokes). In fact, sudosh uses the script command to log all key strokes and output.

Setting up sudosh is fairly easy. For a Linux system, first download the RPM of sudosh, for example from rpm.pbone.net. Then install it on your Linux server:

# rpm -ihv sudosh-1.8.2-1.2.el4.rf.i386.rpm
Preparing...  ########################################### [100%]
   1:sudosh   ########################################### [100%]
Then, go to the /etc file system and open up /etc/sudosh.conf. Here you can adjust the default shell that is started, and the location of the log files. Default, the log directory is /var/log/sudosh. Make sure this directory exists on your server, or change it to another existing directory in the sudosh.conf file. This command will set the correct authorizations on the log directory:
# sudosh -i
[info]: chmod 0733 directory /var/log/sudosh
Then, if you want to assign a user sudosh access, edit the /etc/sudoers file by running visudo, and add the following line:
username ALL=PASSWD:/usr/bin/sudosh
Now, the user can login, and run the following command to gain root access:
$ sudo sudosh
Password:
# whoami
root
Now, as a sys admin, you can view the log files created in /var/log/sudosh, but it is much cooler to use the sudosh-replay command to replay (like a VCR) the actual session, as run by the user with the sudosh access.

First, run sudosh-replay without any paramaters, to get a list of sessions that took place using sudosh:
# sudosh-replay
Date       Duration From To   ID
====       ======== ==== ==   ==
09/16/2010 6s       root root root-root-1284653707-GCw26NSq

Usage: sudosh-replay ID [MULTIPLIER] [MAXWAIT]
See 'sudosh-replay -h' for more help.
Example: sudosh-replay root-root-1284653707-GCw26NSq 1 2
Now, you can actually replay the session, by (for example) running:
# sudosh-replay root-root-1284653707-GCw26NSq 1 5
The first paramtere is the session-ID, the second parameter is the multiplier. Use a higher value for multiplier to speed up the replay, while "1" is the actual speed. And the third parameter is the max-wait. Where there might have been wait times in the actual session, this parameter restricts to wait for a maximum max-wait seconds, in the example above, 5 seconds.

For AIX, you can find the necessary RPM here. It is slightly different, because it installs in /opt/freeware/bin, and also the sudosh.conf is located in this directory. Both Linux and AIX require of course sudo to be installed, before you can install and use sudosh.

Topics: AIX, Monitoring, System Admin

Cec Monitor

To monitor all lpars within 1 frame, use:

# topas -C

Topics: Monitoring, PowerHA / HACMP

HACMP auto-verification

HACMP automatically runs a verification every night, usually around mid-night. With a very simple command you can check the status of this verification run:

# tail -10 /var/hacmp/log/clutils.log 2>/dev/null|grep detected|tail -1
If this shows a returncode of 0, the cluster verification ran without any errors. Anything else, you'll have to investigate. You can use this command on all your HACMP clusters, allowing you to verify your HACMP cluster status every day.

With the following smitty menu you can change the time when the auto-verification runs and if it should produce debug output or not:
# smitty clautover.dialog
You can check with:
# odmget HACMPcluster
# odmget HACMPtimersvc
Be aware that if you change the runtime of the auto-verification that you have to synchronize the cluster afterwards to update the other nodes in the cluster.

Topics: Monitoring, PowerHA / HACMP

HACMP Event generation

HACMP provides events, which can be used to most accurately monitor the cluster status, for example via the Tivoli Enterprise Console. Each change in the cluster status is the result of an HACMP event. Each HACMP event has an accompanying notify method that can be used to handle the kind of notification we want.

Interesting Cluster Events to monitor are:

  • node_up
  • node_down
  • network_up
  • network_down
  • join_standby
  • fail_standby
  • swap_adapter
  • config_too_long
  • event_error
You can set the notify method via:
# smitty hacmp
Cluster Configuration
Cluster Resources
Cluster Events
Change/Show Cluster Events
You can also query the ODM:
# odmget HACMPevent

Number of results found for topic Monitoring: 15.
Displaying results: 1 - 10.