Tuesday 26 November 2013

Using tmpfs to improve Nagios performance

Nagios is an excellent monitoring tool; with it we can monitor servers and network devices.
Besides the many useful plugins at Nagios Exchange (http://exchange.nagios.org), we can also write our own plugins as shell scripts.
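A custom plugin only has to print one line of status text and exit with a code Nagios understands: 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN. A minimal sketch written as a shell function (the function name and messages are my own illustration, not a standard plugin):

```shell
# A Nagios-style check: does a file exist?
# Nagios reads the first line of output and maps the exit status:
# 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN
check_file() {
    if [ -e "$1" ]; then
        echo "OK - $1 exists"
        return 0
    else
        echo "CRITICAL - $1 missing"
        return 2
    fi
}

check_file /etc/hosts
```

A real plugin is a standalone script (typically under /usr/local/nagios/libexec) that exits with these codes instead of returning them.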

We can set up a Nagios monitoring server by following Setting up Nagios monitoring server; the default settings and configuration are sufficient if we are only monitoring a few servers. However, as the number of monitored hosts and services increases, we will notice check latencies.
This is because Nagios continuously updates several files on disk: the more items there are to monitor, the more disk I/O is required, and eventually I/O becomes the bottleneck that slows down the checks.
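The gap is easy to feel with a rough benchmark: many small synchronous writes, the pattern Nagios generates, are far cheaper when they never reach the disk. This sketch assumes /dev/shm is tmpfs, as it is on most Linux distributions; the throughput it prints will vary by machine.

```shell
# 200 small synchronous (dsync) writes to a disk-backed path vs. tmpfs;
# dd prints a throughput summary for each run
dd if=/dev/zero of=/var/tmp/disk.dat bs=4k count=200 oflag=dsync 2>&1 | tail -1
dd if=/dev/zero of=/dev/shm/ram.dat  bs=4k count=200 oflag=dsync 2>&1 | tail -1
```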

To solve this problem, we need to either improve I/O performance or reduce I/O requests. We could install Nagios on an SSD, but that is not cost effective.

In an earlier post, using tmpfs to improve PostgreSQL performance, we boosted PostgreSQL by pointing stats_temp_directory at tmpfs.
Similarly, if some files are only needed while Nagios is running, we can move them to tmpfs and thus reduce I/O requests.
In Nagios, a few key files account for most of the disk I/O:
1. /usr/local/nagios/var/status.dat: this status file stores the current status of all monitored services and hosts. It is rewritten at the interval defined by status_update_interval; in my default Nagios installation, status_file is updated every 10 seconds.
The contents of the status file are deleted every time Nagios restarts, so it is only useful while Nagios is running.
[root@centos /usr/local/nagios/etc]# grep '^status' nagios.cfg
status_file=/usr/local/nagios/var/status.dat
status_update_interval=10

2. /usr/local/nagios/var/objects.cache: this file is a cached copy of the object definitions, which the CGIs read to get the object definitions.
The file is recreated every time Nagios starts, so objects.cache doesn't need to be on non-volatile storage.
[root@centos /usr/local/nagios/etc]# grep objects.cache nagios.cfg
object_cache_file=/usr/local/nagios/var/objects.cache

3. /usr/local/nagios/var/spool/checkresults: all incoming check results are stored here. While Nagios is running, we will notice files being created and deleted constantly, so checkresults can also be moved to tmpfs.
[root@centos /usr/local/nagios/etc]# grep checkresults nagios.cfg
check_result_path=/usr/local/nagios/var/spool/checkresults
[root@centos /usr/local/nagios/etc]#

[root@centos /usr/local/nagios/var/spool/checkresults]# ls
checkP2D5bM  cn6i6Ld  cn6i6Ld.ok
[root@centos /usr/local/nagios/var/spool/checkresults]# head -4 cn6i6Ld
### Active Check Result File ###
file_time=1385437541

### Nagios Service Check Result ###
[root@centos /usr/local/nagios/var/spool/checkresults]#
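As the head output shows, a check result file is plain key=value text, so it is easy to inspect with standard tools. A small sketch against a sample file in the same format (the field values below are made up for the demo):

```shell
# build a sample file in the check-result format shown above
sample=$(mktemp)
cat > "$sample" <<'EOF'
### Active Check Result File ###
file_time=1385437541

### Nagios Service Check Result ###
host_name=localhost
service_description=PING
EOF
# pull out when the result file was created
awk -F= '/^file_time=/ {print $2}' "$sample"
```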

So we can move status.dat, objects.cache and checkresults to tmpfs, but first we need to mount the file system:
[root@centos ~]# mkdir -p /mnt/nagvar
[root@centos ~]# mount -t tmpfs tmpfs /mnt/nagvar -o size=50m
[root@centos ~]# df -h /mnt/nagvar
Filesystem            Size  Used Avail Use% Mounted on
tmpfs                  50M     0   50M   0% /mnt/nagvar
[root@centos ~]# mount | grep nagvar
tmpfs on /mnt/nagvar type tmpfs (rw,size=50m)

Create a directory for checkresults:
[root@centos ~]# mkdir -p /mnt/nagvar/spool/checkresults
[root@centos ~]# chown -R nagios:nagios /mnt/nagvar

Modify nagios.cfg:
status_file=/mnt/nagvar/status.dat
object_cache_file=/mnt/nagvar/objects.cache
check_result_path=/mnt/nagvar/spool/checkresults
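These three edits can also be scripted with sed; a sketch run here against a scratch copy rather than the live nagios.cfg:

```shell
cfg=$(mktemp)   # stand-in for /usr/local/nagios/etc/nagios.cfg
cat > "$cfg" <<'EOF'
status_file=/usr/local/nagios/var/status.dat
object_cache_file=/usr/local/nagios/var/objects.cache
check_result_path=/usr/local/nagios/var/spool/checkresults
EOF
# rewrite each path to point at the tmpfs mount
sed -i \
    -e 's|^status_file=.*|status_file=/mnt/nagvar/status.dat|' \
    -e 's|^object_cache_file=.*|object_cache_file=/mnt/nagvar/objects.cache|' \
    -e 's|^check_result_path=.*|check_result_path=/mnt/nagvar/spool/checkresults|' \
    "$cfg"
grep -E '^(status_file|object_cache_file|check_result_path)=' "$cfg"
```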

Restart Nagios so our changes take effect:
[root@centos ~]# service nagios restart
Running configuration check...done.
Stopping nagios: done.
Starting nagios: done.

We can see that Nagios is using /mnt/nagvar:
[root@centos ~]# tree /mnt/nagvar/
/mnt/nagvar/
├── objects.cache
├── spool
│   └── checkresults
│       ├── ca8JfZI
│       └── ca8JfZI.ok
└── status.dat

2 directories, 4 files

We can add an entry to /etc/fstab so that /mnt/nagvar is mounted every time the system reboots (note the here-document is fed to cat, not echo):
[root@centos ~]# cat <<EOF >> /etc/fstab
tmpfs      /mnt/nagvar    tmpfs   defaults,size=50m    0 0
EOF
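One caveat: running that append twice leaves a duplicate entry in /etc/fstab. A guarded version only appends when the mount point is not there yet, demonstrated on a scratch file:

```shell
fstab=$(mktemp)   # stand-in for /etc/fstab
entry='tmpfs      /mnt/nagvar    tmpfs   defaults,size=50m    0 0'
# append only if no line already mentions the mount point
grep -q '[[:space:]]/mnt/nagvar[[:space:]]' "$fstab" || echo "$entry" >> "$fstab"
grep -q '[[:space:]]/mnt/nagvar[[:space:]]' "$fstab" || echo "$entry" >> "$fstab"  # second run is a no-op
```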

But the directory /mnt/nagvar/spool/checkresults will be gone after /mnt/nagvar is re-mounted, so we need to recreate it before starting Nagios.
We can update /etc/init.d/nagios and add these lines after the first line:
mkdir -p /mnt/nagvar/spool/checkresults
chown -R nagios:nagios /mnt/nagvar

[root@centos ~]# sed -i '1a\
mkdir -p /mnt/nagvar/spool/checkresults\
chown -R nagios:nagios /mnt/nagvar' /etc/init.d/nagios

Since these files have been moved to tmpfs, they no longer generate any disk I/O, and we can see a great performance improvement in Nagios.

Reference:
http://assets.nagios.com/downloads/nagiosxi/docs/Utilizing_A_RAM_Disk_In_NagiosXI.pdf

Thursday 21 November 2013

Set up nginx web server with PHP

Nginx (engine x) is a high-performance, lightweight HTTP server, and more and more sites are using it: according to the Netcraft survey (http://news.netcraft.com/archives/2013/11/01/november-2013-web-server-survey.html), nginx powered 15% of the busiest sites in November 2013.

Nginx installation is very straightforward: we can download the latest source code from http://nginx.org/en/download.html, or point our yum source at http://nginx.org/packages/OS/OSRELEASE/$basearch/ and install using yum.
Replace “OS” with “rhel” or “centos”, depending on the distribution used, and “OSRELEASE” with “5” or “6”, for 5.x or 6.x versions, respectively.
So for CentOS 6.3, we can point our YUM source to: http://nginx.org/packages/centos/6/$basearch/
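For the yum route, the repository definition goes in a file under /etc/yum.repos.d/, e.g. /etc/yum.repos.d/nginx.repo for CentOS 6 (a minimal sketch; gpgcheck is disabled here for brevity):

```ini
[nginx]
name=nginx repo
baseurl=http://nginx.org/packages/centos/6/$basearch/
gpgcheck=0
enabled=1
```

After that, yum install nginx pulls the package straight from nginx.org.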

Tuesday 5 November 2013

How to recover deleted open files

In Linux, a file is completely deleted only when:
  1. No more hard links reference the file
  2. No running process still has the file open
From why du and df show different filesystem usage (http://linuxscripter.blogspot.com/2013/11/why-du-and-df-show-different-filesystem.html), we know that if we delete an open file, Linux won't release its space until the processes holding it open have stopped.

So if we delete an open file by mistake, is there a way to recover it?
Yes: we can find which process still has the file open, and recover the content through that process's file descriptor.

Again, let's assume our Apache error_log has been deleted. We can check which processes still have this file open:
[root@centos ~]# lsof | sed -n '1p;/error_log.*deleted/p'
COMMAND    PID      USER   FD      TYPE     DEVICE SIZE/OFF       NODE NAME
httpd     3155      root    2w      REG      253,0      370      15396 /var/log/httpd/error_log (deleted)
httpd     3157    apache    2w      REG      253,0      370      15396 /var/log/httpd/error_log (deleted)
httpd     3158    apache    2w      REG      253,0      370      15396 /var/log/httpd/error_log (deleted)
httpd     3159    apache    2w      REG      253,0      370      15396 /var/log/httpd/error_log (deleted)
httpd     3160    apache    2w      REG      253,0      370      15396 /var/log/httpd/error_log (deleted)
httpd     3161    apache    2w      REG      253,0      370      15396 /var/log/httpd/error_log (deleted)
httpd     3162    apache    2w      REG      253,0      370      15396 /var/log/httpd/error_log (deleted)
httpd     3163    apache    2w      REG      253,0      370      15396 /var/log/httpd/error_log (deleted)
httpd     3164    apache    2w      REG      253,0      370      15396 /var/log/httpd/error_log (deleted)

From the output we know that process 3155 (among others) still has this file open. The FD column value 2w means error_log is open for writing on file descriptor 2, so we need to check the symlink "2" in /proc/3155/fd.
[root@centos ~]# cd /proc/3155/fd/
[root@centos fd]# ls -l 2
l-wx------ 1 root root 64 Nov  5 09:46 2 -> /var/log/httpd/error_log (deleted)
[root@centos fd]# tail 2
[Tue Nov 05 09:46:35 2013] [notice] suEXEC mechanism enabled (wrapper: /usr/sbin/suexec)
[Tue Nov 05 09:46:35 2013] [notice] Digest: generating secret for digest authentication ...
[Tue Nov 05 09:46:35 2013] [notice] Digest: done
[Tue Nov 05 09:46:35 2013] [notice] Apache/2.2.15 (Unix) DAV/2 PHP/5.3.3 mod_wsgi/3.2 Python/2.6.6 configured -- resuming normal operations

To recover the content of error_log, we can simply copy descriptor 2 to a temporary location, stop Apache, and move the copy back to /var/log/httpd/error_log:

[root@centos fd]# cp 2 /tmp/error_log
[root@centos ~]# service httpd stop
Stopping httpd:                                            [  OK  ]
[root@centos ~]# mv /tmp/error_log /var/log/httpd/error_log

[root@centos ~]# service httpd start
Starting httpd:                                            [  OK  ]
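The whole procedure can be rehearsed end to end with a throwaway file. This sketch is Linux-only (it relies on /proc), and the file names are invented for the demo: tail -f stands in for httpd as the process holding the deleted file open.

```shell
tmpdir=$(mktemp -d)
printf 'important log line\n' > "$tmpdir/app.log"
tail -f "$tmpdir/app.log" > /dev/null 2>&1 &   # keep the file open, like httpd does
pid=$!
sleep 1                                        # give tail time to open it
rm "$tmpdir/app.log"                           # the name is gone, the inode is not
# locate the descriptor that still points at the deleted file and copy it out
for fd in /proc/$pid/fd/*; do
    case "$(readlink "$fd")" in
        *'app.log (deleted)') cp "$fd" "$tmpdir/recovered.log" ;;
    esac
done
kill "$pid"
cat "$tmpdir/recovered.log"                    # the content survives the rm
```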

Monday 4 November 2013

why du and df show different filesystem usage

Today I saw a forum post asking why, on the poster's system, df showed over 200G of space used while du only showed 50G used.
Most probably this is caused by deleted files that are still open.
When a file is open in a process, deleting it won't release the space it occupies; the process also has to be terminated, otherwise df and du will show different filesystem usage.
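The effect is easy to reproduce in miniature: as long as any descriptor is open, the deleted file's blocks stay allocated and readable. This sketch is Linux-only, since it peeks into /proc:

```shell
f=$(mktemp)
exec 3<> "$f"                       # fd 3 keeps the inode open
echo "still allocated" >&3
rm "$f"                             # unlinks the name; df still counts the blocks
recovered=$(cat /proc/self/fd/3)    # re-opens the live inode from the start
echo "$recovered"
exec 3>&-                           # closing the last descriptor frees the space
```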

Let's assume du and df show a huge difference for /var; we can check which open files have been deleted and which processes hold them open.

[root@linux ~]# lsof | sed -n '1p;/var.*deleted/p'
COMMAND    PID     USER   FD      TYPE     DEVICE SIZE/OFF       NODE NAME
httpd     1779     root   10w      REG      253,0    22034       8117 /var/log/httpd/access_log (deleted)
httpd     1810   apache   10w      REG      253,0    22034       8117 /var/log/httpd/access_log (deleted)
httpd     1811   apache   10w      REG      253,0    22034       8117 /var/log/httpd/access_log (deleted)
httpd     1812   apache   10w      REG      253,0    22034       8117 /var/log/httpd/access_log (deleted)
httpd     1813   apache   10w      REG      253,0    22034       8117 /var/log/httpd/access_log (deleted)
httpd     1814   apache   10w      REG      253,0    22034       8117 /var/log/httpd/access_log (deleted)
httpd     1815   apache   10w      REG      253,0    22034       8117 /var/log/httpd/access_log (deleted)
httpd     1816   apache   10w      REG      253,0    22034       8117 /var/log/httpd/access_log (deleted)
httpd     1817   apache   10w      REG      253,0    22034       8117 /var/log/httpd/access_log (deleted)

From the output we can see that access_log was deleted, but Apache was not restarted, so the httpd processes still have the file open.
To release the space, we can restart httpd:
[root@linux ~]# service httpd restart
Stopping httpd:                                            [  OK  ]
Starting httpd:                                            [  OK  ]

Now check again
[root@linux ~]# lsof | sed -n '1p;/var.*deleted/p'
COMMAND    PID     USER   FD      TYPE     DEVICE SIZE/OFF       NODE NAME

And du and df report the same filesystem usage.

Until httpd is restarted, Linux won't release the space used by access_log. But if access_log was deleted by mistake, is there a way to recover it?
Yes; I demonstrate how to recover deleted open files in http://linuxscripter.blogspot.com/2013/11/how-to-recover-deleted-open-files.html

Saturday 2 November 2013

Use Puppet to manage Linux servers

Puppet is a configuration management system; using Puppet we can easily manage thousands of Linux servers. If we have configured our system with the EPEL source, we can install Puppet directly using YUM. Alternatively, we can download the software from puppetlabs.org and follow the documentation to install it.

To install manually, our system must have Ruby installed. Ruby RPM files can be found on the Linux installation media; if we have a local yum repository, we can install Ruby using yum.

After Ruby is installed, we can download and install Puppet; facter is also required by Puppet. Here we download the stable versions, facter-1.7.2.tar.gz and puppet-3.2.2.tar.gz.


1. Install puppet on both puppet master and agent
# tar -zxpf facter-1.7.2.tar.gz
# cd facter-1.7.2
# ruby install.rb
# cd ..
# tar -zxpf puppet-3.2.2.tar.gz
# cd puppet-3.2.2
# ruby install.rb

2. start puppet master
# puppet master

3. on agent, edit /etc/puppet/puppet.conf
[main]
server = centos.local.vb
certificate_revocation = false
ssldir=/var/lib/puppet/ssl

4. connect to the puppet master for the first time; this will generate an SSL signing request
# puppet agent --no-daemonize --onetime --verbose
Info: Creating a new SSL certificate request for centos-1.local.vb
Info: Certificate Request fingerprint (SHA256): B8:67:94:4C:2A:23:2F:90:D8:4E:34:CC:AF:48:B0:04:BA:82:7F:D2:E3:7F:B7:9A:78:35:18:87:EB:05:D5:61
Exiting; no certificate found and waitforcert is disabled


5. On puppet master, sign the ssl request from puppet agent
[root@centos ~]# puppet cert list
"centos-1.local.vb" (SHA256) B8:67:94:4C:2A:23:2F:90:D8:4E:34:CC:AF:48:B0:04:BA:82:7F:D2:E3:7F:B7:9A:78:35:18:87:EB:05:D5:61
[root@centos ~]# puppet cert sign "centos-1.local.vb"
Notice: Signed certificate request for centos-1.local.vb
Notice: Removing file Puppet::SSL::CertificateRequest centos-1.local.vb at '/var/lib/puppet/ssl/ca/requests/centos-1.local.vb.pem'


6. Now we can manage our Linux servers from the puppet master. If we want to manage the httpd service, we can create an httpd module:

# mkdir -p /etc/puppet/modules/httpd

Every module stores its configuration in a manifests/init.pp file, so we need to create /etc/puppet/modules/httpd/manifests/init.pp:


class httpd {
  package { "httpd":
    ensure => installed,
  }

  service { "httpd":
    ensure => running,
    enable => true,
  }

  file { "/var/www/html/index.html":
    ensure => present,
    group  => "root",
    owner  => "root",
    mode   => "0644",
    source => "puppet:///modules/httpd/puppet.index.html",
  }
}
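One refinement worth considering: Puppet does not order resources unless told to, so it is safer to make the service wait for the package. A sketch using the standard require metaparameter (the rest of the class stays as above):

```puppet
service { "httpd":
  ensure  => running,
  enable  => true,
  require => Package["httpd"],   # start the service only after the package is installed
}
```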

source => "puppet:///modules/httpd/puppet.index.html" tells the puppet agent to fetch index.html from the puppet master; on the master, the file lives at /etc/puppet/modules/httpd/files/puppet.index.html

# echo 'i am from puppet index.html !' > /etc/puppet/modules/httpd/files/puppet.index.html
Now that we have an httpd module, to manage the agent we also need to define our nodes; we can do this in /etc/puppet/manifests/site.pp:
node centos-1 {
  include httpd
}

7. test our configuration on centos-1:
[root@centos-1 html]# puppet agent --no-daemonize --onetime --verbose
Info: Retrieving plugin
Info: Caching catalog for centos-1.local.vb
Info: Applying configuration version '1383379216'
Notice: /Stage[main]/Httpd/Service[httpd]/ensure: ensure changed 'stopped' to 'running'
Notice: /Stage[main]/Httpd/File[/var/www/html/index.html]/ensure: defined content as '{md5}33f97919a4e508801272b7889f34e332'
Notice: Finished catalog run in 0.70 seconds

Puppet supports regular expressions in its configuration files. If the servers centos-1, centos-2, ... centos-999 all share the same configuration, instead of repeating the node definition 999 times we can match them all with one regular expression:
node /^centos-\d+$/
Puppet also supports import in its configuration files. If our agents have different configurations, instead of building one big site.pp we can keep one configuration file per agent (centos-1.pp, centos-2.pp, ... centos-999.pp) and import them all from site.pp:
import "nodes/*"
As the environment grows, we will accumulate more and more configuration files in the nodes directory, and managing that many files is not very efficient. Puppet has a feature called External Node Classifier (ENC); with an ENC we can replace the text-based node definition files with LDAP, a database, or whatever data source suits our environment.