Snapshot recovery problems with MongoDB, mmap, XFS filesystem and rsync

I’m running a three-machine MongoDB 2.2 cluster in Amazon EC2. One of these machines is used to take hourly snapshots with LVM and EBS, and I noticed a rare bug which leads to silent data corruption in the restore phase. I’m using Rightscale to configure the machines with my own ServerTemplate, which I enhance with some Rightscale Chef recipes for automated snapshots and restores. Rightscale needs to support multiple platforms, of which AWS is just one, and they have carefully constructed these steps to perform the snapshot.

Each machine has two Provisioned IOPS EBS volumes attached. The Rightscale block_device::setup_block_device recipe creates an LVM volume group on top of these raw devices. Because EC2 can’t take an atomic snapshot across multiple EBS volumes simultaneously, an LVM snapshot is used for this. So the backup steps are (a rough shell sketch of these steps follows the list):

  1. Lock MongoDB against writes and flush the journal to disk to form a checkpoint with the db.fsyncLock() command.
  2. Lock the underlying XFS filesystem
  3. Do LVM snapshot
  4. Unlock XFS filesystem
  5. Unlock MongoDB with db.fsyncUnlock()
  6. Perform an EBS snapshot of each underlying EBS volume
  7. Delete the LVM snapshot, so that it doesn’t take up disk space.
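
For reference, here is a minimal shell sketch of steps 1–5 and 7 (the volume group, logical volume and mount point names are made up; the real work is done by the Rightscale recipes, not this script):

mongo --eval "db.fsyncLock()"                       # 1. lock MongoDB against writes and flush a checkpoint
xfs_freeze -f /mnt/storage                          # 2. freeze the XFS filesystem
lvcreate --snapshot --size 10G --name mongo_snap /dev/vg_data/lvol0   # 3. take the LVM snapshot
xfs_freeze -u /mnt/storage                          # 4. unfreeze XFS
mongo --eval "db.fsyncUnlock()"                     # 5. unlock MongoDB
# 6. take an EBS snapshot of each underlying volume via the EC2 API (e.g. ec2-create-snapshot vol-xxxxxxxx)
lvremove -f /dev/vg_data/mongo_snap                 # 7. drop the LVM snapshot once the EBS snapshots are done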

Notice that the main volume will start getting writes after step 5, before the EBS volumes have been snapshotted by Amazon EC2. This point is crucial for understanding the bug later. The restore procedure does the following steps in the block_device::do_primary_restore Chef recipe:

  1. Order EC2 to create new EBS volumes from each EBS snapshot and wait until the API says that the volumes have attached correctly
  2. Spin up the LVM
  3. Mount the snapshot first read-write so that XFS can replay its journal, then remount it read-only
  4. Mount the main data volume
  5. Use rsync to sync from the snapshot into the main data volume: rsync -av --delete --inplace --no-whole-file /mnt/snapshot /mnt/storage
  6. Delete the LVM snapshot

Actual bug

MongoDB uses the mmap() system call to memory-map its data files from disk into memory. This makes the file layer implementation easier, but it creates other problems. The biggest issue is that the MongoDB daemon can’t know when the kernel flushes the writes to disk. This is also crucial for understanding this bug.

Rsync is very smart about optimizing the comparison and the sync. By default, rsync first looks at each file’s last modification time and size to determine whether the file has changed. Only then does it start a more sophisticated comparison which sends the changed data over the network. This makes it very fast.

Now the devil of this bug comes from the kernel itself. It turns out that the kernel has a long-standing bug (dating all the way back to 2004!) where in some cases the mtime (last modification time) of an mmap()’ed file is not updated when the kernel flushes writes to disk on an XFS filesystem. Because of this, some of the data which MongoDB writes to the memory-mapped files after the LVM snapshot and before the EBS snapshot is flushed to disk, but the file mtime is not updated.

Because the Rightscale restoration procedure uses rsync to sync the inconsistent main data volume from the consistent snapshot, rsync does not notice that some of these files have changed. Because of this, rsync does not properly reconstruct the data volume, and this results in corrupted data.
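
One generic way to defend against this class of problem is to force rsync to compare file contents instead of trusting mtime and size, at the cost of reading every file on both sides. This is a standard rsync option, not necessarily the fix that was eventually adopted here:

rsync -av --checksum --delete --inplace --no-whole-file /mnt/snapshot /mnt/storage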

When you try to start MongoDB from this kind of corrupted data, you will encounter some weird assertion errors like these:

Assertion: 10334:Invalid BSONObj size: -286331154 (0xEEEEEEEE) first element

and

ERROR: writer worker caught exception: BSONElement: bad type 72 on:

I first thought that this was a bug in the MongoDB journal playback. The guys at 10gen were happy to assist me, and after more careful digging I started to become more suspicious of the snapshot method itself. This required quite a lot of detective work and trial and error until I finally started to suspect the rsync phase of the restoration.

The kernel bug thread had a link to a program which reproduces the actual bug on Linux, and running it confirmed that my system was still suffering from this very same bug:

[root@ip-10-78-10-22 tmp]# uname -a

Linux ip-10-78-10-22 2.6.18-308.16.1.el5.centos.plusxen #1 SMP Tue Oct 2
23:25:27 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux

[root@ip-10-78-10-22 tmp]# dd if=/dev/urandom of=/mnt/storage/foo.txt
20595+0 records in
20594+0 records out
10544128 bytes (11 MB) copied, 1.75957 seconds, 6.0 MB/s

[root@ip-10-78-10-22 tmp]# /tmp/mmap-test /mnt/storage/foo.txt
Modifying /mnt/storage/foo.txt...
Flushing data using sync()...
Failure: time not changed.
Not modifying /mnt/storage/foo.txt...
Flushing data using msync()...
Success: time not changed.
Not modifying /mnt/storage/foo.txt...
Flushing data using fsync()...
Success: time not changed.
Modifying /mnt/storage/foo.txt...
Flushing data using msync()...
Failure: time not changed.
Modifying /mnt/storage/foo.txt...
Flushing data using fsync()...
Failure: time not changed.

Conclusions

I’m now working with both Rightscale and 10gen so that others won’t suffer from this bug. Mainly this means a few documentation tweaks on 10gen’s side and maybe a change to the backup procedure on Rightscale’s side. Note that this bug does not occur unless you are snapshotting a database which uses mmap(). This means that at least MySQL and PostgreSQL are not affected.

This issue is a reminder that you should periodically test your backup methods by doing a restore and verifying that your data is intact. Debugging this strange bug took me about a week. It contained all the classic pieces of a good detective story: false leads, dead ends, a sudden plot twist and a happy ending :)

Setting backup solution for my entire digital legacy (part 2 of 2)

As part of my LifeArchive project, I had to verify that I have sufficient methods to back up all my valuable assets so well that they will last for decades. Sadly, there isn’t currently any mass storage media that is known to function for such a long time, and in any case you must prepare for losing a site due to floods, fire and other disasters. This post explains how I solved the backup needs for my entire digital legacy. Be sure to read the first part: LifeArchive – store all your digital assets safely (part 1 of 2)

The cheapest way to store data currently is to use hard disks. Google, Facebook, The Internet Archive, Dropbox etc. are all known to host big data centers with a lot of machines with a lot of disks. At least Google is also known to use tapes for additional backups, but those are way too expensive for this kind of small-scale usage.

Disks also have their own problems. The biggest one is that they tend to break. Another is that they might silently corrupt your data, which is a problem with traditional RAID systems. As I’m a big fan of ZFS, my choice was to build a NAS on top of it. You can read more about this process in this blog post: Cheap NAS with ZFS in HP MicroServer N40L

Backups

As keeping all your eggs in one basket is just stupid, having a good and redundant backup solution is the key to success. As in my previous post, I concluded that using cloud providers to solely host your data isn’t wise, but they are a viable choice for backups. I’ve chosen to use CrashPlan, which is a really nice cloud-based software for doing incremental backups. Here are the pros and cons of CrashPlan:

Pros:

  • Nice GUI for both backing up and restoring files
  • Robust. The backups will eventually complete and the service will notify you by email if something is broken
  • Supports Windows, OS X, Linux and Solaris / OpenIndiana
  • Unlimited storage on some of the plans
  • Does incremental backups, so you can recover a lost file from history.
  • Allows you to back up both to the CrashPlan cloud and to your own storage if you run the client on multiple machines.
  • Allows you to back up to a friend’s machine (this doesn’t even cost you anything), so you can establish a backup ring with a few of your friends.

Cons:

  • It’s still a black-box service, which might break down when you least expect it
  • The CrashPlan cloud is not very fast: upload rate is around 1Mbps and download (restore) around 5Mbps
  • You have to fully trust and rely on the CrashPlan client to work – there’s no other way to access the archive except through the client.

I set up the CrashPlan client to back up into its cloud and, in addition, to Kapsi Ry’s server where I’m running a copy of the CrashPlan client. Running your own client is easy and it gives me a much faster way to recover the data when I need to. As the data is encrypted, I don’t need to worry that there are also a few thousand other users on the same server.

Another parallel backup solution

Even though CrashPlan feels like a really good service, I still didn’t want to rely solely on it. I could always somehow forget to enter my new credit card number and let the data there expire, only to have a simultaneous fatal accident with my NAS. So that’s why I wanted to have a redundant backup method. I happened to get another used HP MicroServer for a good bargain, so I set it up similarly with three 1TB disks which I also happened to have lying around unused from my previous old NAS. Used gear, used disks, but they’re good enough to act as my secondary backup method. I will of course still receive email notifications on disk failures and broken backups, so I’m well within my planned safety limits.

This secondary NAS lives at another site and is connected with an OpenVPN tunnel to the primary server in my home. It also doesn’t allow any incoming connections from anywhere outside, so it’s quite safe. I set up a simple rsync script on my main NAS to sync all data to this secondary NAS. The rsync script uses the --delete option, so it will remove files which have been deleted from the primary NAS. Because of this I also use a crontab entry to snapshot the backup each night. This protects me if I accidentally delete files from the main archive. I keep a week’s worth of daily snapshots and a few months of weekly snapshots.
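
A minimal sketch of how this can look (the hostnames, dataset names and schedules are made up for illustration):

# on the primary NAS: push the archive to the secondary NAS over the VPN every night
0 3 * * * rsync -az --delete -e ssh /tank/lifearchive/ backup-nas:/tank/lifearchive/

# on the secondary NAS: take a dated ZFS snapshot of the received data after the sync window
30 5 * * * /usr/sbin/zfs snapshot tank/lifearchive@daily-$(date +\%Y-\%m-\%d)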

One of the biggest advantages over CrashPlan is that the files sit directly on your filesystem. There’s no encryption nor any proprietary client or tools you need to rely on, so you can safely assume that you can always get access to your files.

There’s also another option: get a few USB disks and set up a scheme where you automatically copy your entire archive to one disk. Then every once in a while unplug one of them, put it somewhere safe and replace it with another. I might do something like this once a year.

Verifying backups and monitoring

“You don’t have backups unless you have proven you can restore from them.” – a well-known truth that many people tend to forget. Rsync backups are easy to verify: just run the entire archive through sha1sum on both sides and verify that the checksums match. CrashPlan is a different beast, because you need to restore the entire archive to another place and verify it from there. It’s doable, but currently it can’t be automated.
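
For example, a rough way to verify the rsync mirror (assuming GNU sha1sum is available on both ends; the paths are illustrative):

cd /tank/lifearchive && find . -type f -exec sha1sum {} + | sort -k 2 > /tmp/local.sha1
ssh backup-nas 'cd /tank/lifearchive && find . -type f -exec sha1sum {} + | sort -k 2' > /tmp/remote.sha1
diff /tmp/local.sha1 /tmp/remote.sha1 && echo "backup verified"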

Monitoring is another thing. I’ve built all my scripts so that they email me if there’s a problem, so I can react immediately to errors. I’m planning to set up a Zabbix instance to keep track, but I haven’t yet bothered.

Conclusions

Currently most of our digital assets aren’t stored safely enough to count on them all being available in the future. With this setup I’m confident that I can keep my entire digital legacy safe from hardware failures, cracking and human mistakes. I admit that the overall solution isn’t simple, but it’s very much doable for an IT-savvy person. The problem is that currently you can’t buy this kind of setup anywhere as a service, because you can’t be 100% sure that the service will still be around in the coming decades.

This setup also works as a personal cloud, assuming that your internet link is fast enough. With the VPN connections, I can also let my family members connect to this archive and store their valuable data. This is very handy, because that way I know that I can access the digital legacy of my parents, who probably couldn’t do all this by themselves.

LifeArchive – store all your digital assets safely (part 1 of 2)

Remember the moments when you, or your parents, found some really old pictures buried deep in some closet and you instantly got a warm and fuzzy feeling of memories? In this modern era of cloud services, we’re producing even more personal data which we want to retain. Pictures from your phone and your DSLR camera, documents you’ve made, non-DRM games, movies and music you’ve bought, and so on. The list goes on and on. Today’s problem is that you have so many different devices that it’s really hard to keep track of where all your data is.

One solution is to rely on a number of cloud services: Dropbox, Google, Facebook, Flickr etc. can all host your images and files, but can you really count on your data still being there after ten years? What about 30 years? What if you use a paid service and for some reason you forget to update your credit card details and the service deletes all your data? What about the files lying in a corner of your old laptop after you bought a shiny new computer? You can’t count on the cloud providers you use still being in business decades from now.

The solution? Implement your own strategy for backing up all your valuable digital assets. After thinking about this for a few years I finally came up with my current solution to this problem: gather all your digital assets into one place, so that they’re easy to back up. You can still use all your cloud services like Dropbox and Facebook, but make sure that you do automatic backups from all these services into this central storage. This way there’s only one place which you need to back up, and you can easily do backups to multiple different places just for extra precaution.

First identify what’s worth saving

  • I do photography, so that’s a bunch of .DNG and .JPG images in my Lightroom archive. I don’t photograph that much, so I can easily store them all, assuming that I remove the images which have failed so badly that there’s no imaginable situation where I would want those.
  • I also like making movies. The raw footage takes up too much space to be worth saving, so I only archive the project files and the final version. I store the raw footage on external drives which I don’t back up into this archive.
  • Pictures from my cell phone. There’s a ton of lovely images there which I want to save.
  • Emails, text messages from my phone, comments and messages from Facebook.
  • Misc project files. Be it a 3D model, a source code file related to an experiment, drawings for my new home layout etc. I produce this kind of small stuff on a weekly basis and I want to keep it for future reference and for the nostalgic value.
  • This blog and the backups related to it.

I calculated that I currently have about 250GB of this personal data, spanning over a decade. I plan to just keep adding data to this central repository during my entire life and to always transfer it to new hardware when the old breaks. In other words, this will be my entire digital legacy.

Action plan:

  1. Buy a good NAS for home
  2. Build a bunch of scripts and automation to fetch all data from the different cloud services to this NAS
  3. Implement good backup procedures for this NAS.

The first step was quite easy. I bought an HP MicroServer, which acts as a NAS in my home. You can read more about this project in this other blog post. The second step is the most complex: I have multiple computers, an Android cell phone and a few cloud services holding data that I want to save. I had to find existing solutions for each of these and build my own when I couldn’t find anything. The third step is easy, but it’s worth another blog post next week.

Archive pictures, edited videos and other projects

I can access the NAS directly from my workstations via Samba/CIFS mounts over the network, so I use it directly to host my Lightroom archive, edited video projects (not including raw video assets), and other project files which I tend to produce. I also use it to store DRM-free music, videos and ebooks which I’ve bought from the internet.

Backing up Android phones

This includes pictures which I take with my phone, but also raw data and settings for applications. I found this nice program called Rsync for Android. It uses rsync over ssh keys to sync into a backup destination, which runs inside an OpenIndiana zone on my NAS. The data destination directory is mounted into the zone via lofs, so that only that specific data directory is exposed to the zone. Then I use Llama to run the backup periodically.
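
For reference, a lofs mount into a zone is configured roughly like this (the zone name and paths are made up; the mount takes effect when the zone boots):

zonecfg -z backupzone 'add fs; set dir=/backup/android; set special=/tank/lifearchive/android; set type=lofs; end; commit'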

In addition I use SMS Backup + to sync all SMS and MMS messages to GMail under a special “SMS” label. Works like a charm!

Backing up GMail

gmvault does the trick. It uses IMAP to download new emails from GMail and it’s simple to set up.

I actually run two different instances of gmvault. They both sync my GMail account, but one deletes emails from the backup database when they have been deleted from GMail, and the other does not. The idea is that I can still restore my emails if somebody gains access to my GMail and deletes all my emails. I have one script in my cron that syncs the backup databases every night with the “-t quick” option.
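
As a sketch, the two nightly cron entries might look like this (the database paths are made up, and the exact flag for disabling database cleaning depends on the gmvault version, so treat that part as an assumption):

# mirror that follows GMail deletions
0 2 * * * gmvault sync -t quick -d /tank/lifearchive/gmail-mirror me@gmail.com
# archive that keeps everything (database cleaning disabled; flag name is an assumption)
0 3 * * * gmvault sync -t quick -d /tank/lifearchive/gmail-archive --db-cleaning no me@gmail.com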

Backup other servers

I have a few other servers, like the one where I host this blog, that need to be backed up. I use simple rsync over ssh keys from cron, which backs them up every night. The rsync uses --backup and --backup-dir to retain some basic protection for deleted files.
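
A hedged example of what such a cron entry can look like (the host and paths are made up):

# nightly pull of the blog server, keeping a dated copy of anything that changed or was deleted
0 4 * * * rsync -az -e ssh --delete --backup --backup-dir=/tank/servers/blog/changed-$(date +\%Y\%m\%d) blogserver:/var/www/ /tank/servers/blog/current/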

Conclusions

This kind of central storage for all your digital assets needs some careful planning, but it’s an idea worth exploring. After you have established this kind of archive, you need to implement the backups, which I will talk about in the next post.

This kind of archive solves the problem of not being sure where all your files are – they either have a copy in the archive, or they don’t exist. Besides that, you can make some really entertaining discoveries when you crawl the depths of the archive and find files you thought you had lost a decade ago.

Read the second part of this blog series at Setting backup solution for my entire digital legacy (part 2 of 2)

Cheap NAS with ZFS in HP MicroServer N40L

I’ve run Solaris / OpenIndiana machines in my home for years and I just recently had to get a new one because my old server with 12 disk slots made way too much noise and consumed way too much power for my taste. Luckily hard disk sizes have grown a lot and I could replace the old server with a smaller one with fewer disks.

I only recently learned about the HP MicroServer product family. HP has been making these small servers for a few years and they are really handy and really cheap for their features. I bought an HP MicroServer N40L from amazon.de, which shipped from Germany to Finland in just a week for just 242 euros. Here’s a quick summary of the server itself:

  • It’s small, quiet and doesn’t use a lot of power. According to some measurements it uses on average around 40 to 60W.
  • It has four 3.5″ non-hot-swap disk slots inside. In addition, it has eSATA on the back and a 5.25″ slot with a SATA cable. These can give you two extra SATA slots for a total of six disks.
  • A dual-core AMD Turion II Neo N40L (1.5 GHz) processor. More than enough for a storage server.
  • It can fit two ECC DIMMs, for a maximum of 8 GB of memory. According to some rumors, you can fit it with two 8GB DIMMs for a total of 16 GB.
  • Seven USB ports: two on the back, four on the front and one inside the chassis, soldered right onto the motherboard.
  • Depending on disks and configuration, it can stretch up to 12 TB of usable disk space (assuming you install five 4TB disks) while still providing enough redundancy to handle two simultaneous disk failures!

The machine doesn’t have any hardware RAID features, which doesn’t bother me, because my weapon of choice was to install OpenIndiana, which comes with the great ZFS filesystem, among other interesting features. If you are familiar with Linux, you can easily learn how to manage an OpenIndiana installation; it’s really easy. ZFS itself is a powerful modern filesystem which combines the best features of software RAID, security, data integrity and easy administration. OpenIndiana allows you to share the space over Samba/CIFS to Windows and over AFP to your Macintosh. It also supports exporting volumes with iSCSI, it provides filesystem backups with ZFS snapshots and it can even be used as storage for your Macintosh Time Machine backups. You can read more about ZFS from the Wikipedia entry.

There are a few different product packages available: I picked the one which comes with one 4GB ECC DIMM, a DVD drive and no disks. There’s at least one other package which comes with one 2GB ECC DIMM and one 250GB disk. I personally suggest the following setup, where you install the operating system onto a small SSD and use the spinning disks only for data storage.

Shopping list:

  • HP MicroServer N40L with one 4GB ECC DIMM and no disks.
  • A 60GB SSD for the operating system. I bought a Kingston 60GB for 60 euros.
  • Three large disks. I used 2TB Hitachi drives which I had on my shelf.
  • One USB memory stick (at least 1GB) and a USB keyboard. You need these only during installation.

Quick installation instructions

– First, an optional step: you can flash a modified BIOS firmware which allows getting better performance out of the SSD. The instructions are available in this forum thread and here. Please read the instructions carefully: flashing BIOS firmware is always a bit dangerous and it might brick your MicroServer, so consider yourself warned.

– Replace the DVD drive with the SSD disk. You can use good tape to attach the SSD disk to the chassis.

– Download the newest OpenIndiana server distribution from http://openindiana.org/download/#text and install it into the USB stick with these instructions.

– Insert the USB stick with OpenIndiana into the MicroServer and boot the machine. During the setup you can change how the SSD is partitioned for the OS: I changed the default settings so that the OS partition was just 30GB, which is more than enough for the basic OS. My plan is to later experiment with a 25GB slice as an L2ARC cache partition for the data disks and to leave 5GB unprovisioned. This extra 5GB should give the SSD controller even more room to manage the SSD efficiently, giving it a longer life without wearing out. This might sound like a bit of an exaggeration, but I’m aiming to minimize the required maintenance by sacrificing some disk space.

– After installation, shut down the server, remove the USB stick and install your data disks. My choice was to use three 2TB disks in a mirrored pool. This means that I can lose two disks at the same time, giving me a good time margin to replace a failed drive. You can read some reasoning for why I wanted three-way mirroring in this article. If you populate the drive slots from left to right, the leftmost drive will be c4t0d0, the second from the left c4t1d0 and so on. The exact command should be:

zpool create tank mirror c4t0d0 c4t1d0 c4t2d0

After this the system should look like this:

  
# zpool status
pool: tank
state: ONLINE
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            c4t0d0  ONLINE       0     0     0
            c4t1d0  ONLINE       0     0     0
            c4t2d0  ONLINE       0     0     0

Now you can create child filesystems under the /tank storage pool to suit your needs:

# zfs create tank/lifearchive
# zfs create tank/crashplan_backups
# zfs list
NAME                     USED  AVAIL  REFER  MOUNTPOINT
tank                     393G  1.40T    35K  /tank
tank/crashplan_backups  18.8G  1.40T  18.8G  /tank/crashplan_backups
tank/lifearchive         133G  1.40T   133G  /tank/lifearchive

Here I created two filesystems: one for CrashPlan backups from other machines and the other for storing all my important digital heritage. Next I wanted to export this filesystem to Windows via Samba sharing:

# zfs set "sharesmb=name=lifearchive" tank/lifearchive
echo "other password required pam_smb_passwd.so.1 nowarn" >> /etc/pam.conf

The echo command above adds the required line to /etc/pam.conf; after adding it, change your password with the “passwd” command. Now you should be able to browse the lifearchive filesystem from Windows.

A few additional steps you probably want to do:

Now that you’ve got your basic system working, you can do a few tricks to make it work even better. Before you continue, snapshot your current state as explained in the next chapter:

Snapshot your root filesystem

After you’ve set up everything the way you want, you should snapshot the situation as a new boot environment. This way, if you do something very stupid and render your system unusable, you can always boot the machine back to the snapshotted state. The command “beadm create -e openindiana openindiana-baseline” will do the trick. Read here what it actually does! Note that this does not cover your data pool, so you might also want to create some snapshot backup system for that. Googling for “zfs snapshot backup script” should get you started.

Email alerts

Verify that you can send email out from the box, so that you can get alerts if one of your disks breaks. You can test this with, for example, the command “mail your.email@gmail.com”. Type your email and press Ctrl-D to send it. At least in GMail the email might end up in the spam folder. My choice was to install Postfix and configure it to relay emails via the Google SMTP gateway. The exact steps are beyond the scope of this article, but here are a few tips (a sketch of the relevant Postfix settings follows the list):

  1. First configure the system to use the SFE repositories: http://wiki.openindiana.org/oi/Spec+Files+Extra+Repository
  2. Stop sendmail: “svcadm disable sendmail”
  3. Remove sendmail: “pkg uninstall sendmail”
  4. Install postfix with “pkg install postfix”
  5. Follow these steps http://carlton.oriley.net/blog/?p=31 – at least the PKI certificate part is confusing. You need to read some documentation around the net to get this part right.
  6. Configure the FMA system to send email alerts with these instructions. Here’s also some other instructions which I used.
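
As a rough sketch, relaying through GMail boils down to something like the following in /etc/postfix/main.cf, plus a matching /etc/postfix/sasl_passwd with your credentials (run postmap on it and restart Postfix afterwards):

relayhost = [smtp.gmail.com]:587
smtp_sasl_auth_enable = yes
smtp_sasl_password_maps = hash:/etc/postfix/sasl_passwd
smtp_sasl_security_options = noanonymous
smtp_tls_security_level = encrypt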

Backup your data

I’ve chosen to use Crashplan to back up my data to the Crashplan cloud and to another server over the internet. In addition, I use Crashplan to back up my workstation to this NAS – that’s what the /tank/crashplan_backups filesystem is for.

Periodic scrubbing for the zpool

ZFS has a nice feature called scrubbing: this operation scans over all stored data and verifies that each and every byte on each and every disk is stored correctly. This can alert you to an impending disk failure before there’s any permanent damage. The command is “zpool scrub tank”, where “tank” is the name of the zpool. You should set up a crontab entry to do this every week. Here’s one guide on how to do it: http://www.nickebo.net/periodic-zpool-scrubbing/
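
For example, a crontab entry like this runs a scrub every Sunday night; any errors it finds will then be reported through the email alerts set up earlier:

0 2 * * 0 /usr/sbin/zpool scrub tank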

Optimize nginx ssl termination cpu usage with cipher selection

I have a fairly typical setup with nginx in front of haproxy, where nginx terminates the SSL connections from client browsers. As our product grew, my load balancer machines didn’t have enough CPU to do all the required SSL processing.

As this Zabbix screenshot shows, nginx takes more and more CPU until it hits the limit of our AWS c1.xlarge instance. This causes delays for our users and some requests might even time out.

Luckily it turns out that there was a fairly easy way to solve this. nginx defaults, at least in our environment, to a cipher called DHE-RSA-AES256-SHA. This cipher uses the Diffie-Hellman Ephemeral key exchange protocol, which uses a lot of CPU. With help from this and this blog post I ended up with the following solution:

First check if your server uses the slow DHE-RSA-AES256-SHA cipher:

openssl s_client -host your.host.com -port 443

Look for the following line:

Cipher    : DHE-RSA-AES256-SHA

This tells us that we can reduce CPU usage by selecting a faster cipher. Because I’m using AWS instances and these instances don’t support AES-NI (hardware-accelerated processor instructions for calculating AES), I ended up with the following cipher list (read more about what this means here):

RC4-SHA:AES128-SHA:AES:!ADH:!aNULL:!DH:!EDH:!eNULL

If your box supports AES-NI you might want to prefer AES over RC4. RC4 is not the safest cipher choice out there, but it’s more than good enough for our use. Check out this blog post for more information.

So I added these two lines to my nginx.conf:

ssl_ciphers RC4-SHA:AES128-SHA:AES:!ADH:!aNULL:!DH:!EDH:!eNULL;
ssl_prefer_server_ciphers on;

After restarting nginx you should verify that the correct cipher is now selected by running the openssl s_client command again. In my case it now says:

Cipher    : RC4-SHA

All done! My CPU load graph also shows a clear performance boost. A nice and easy victory.


Change terminal title in os x

If you’re like me, you’ll have a terminal with a dozen or so tabs open and you can’t remember which tab is which. The problem is even more annoying when you have programs running in each tab and you can’t tell them apart.

By adding this one-liner to your ~/.bash_profile you can set the title of your terminal:

function title() { echo -n -e "\033]0;$@\007"; }

Just type “title something” and the title will change. Note that you need to apply the file by typing “. ~/.bash_profile” or by reopening the tab.
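
If you want the title to follow your current directory automatically, a common variation (assuming bash) is to set it from PROMPT_COMMAND:

export PROMPT_COMMAND='echo -n -e "\033]0;${PWD##*/}\007"'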

Why Open-sourcing Components Increases Company Productivity and Product Quality

We’re big fans of the open source community here at Applifier. So much so that we believe that open-sourcing software components and tools developed in-house results in better quality, cost savings and increased productivity. Here’s why:

We encourage our programmers to design and implement components which aren’t part of our core business as reusable packages, which are open-sourced once the package is ready. The software is distributed on our GitHub site, with credit to each individual who contributed to the software.

Because the programmers know that their full name will be printed all over the source code, and that they can later be Googled by it, they take better care to ensure that the quality is high enough to stand up to a closer look. This means:

  • Better overall code quality. Good function/parameter names, good packages, no unused functions etc.
  • Better modularization. The component doesn’t have as many dependencies on other systems, which is generally considered good coding practice.
  • Better tests and test coverage. Tests are considered an essential part of modern software development, and you’ll want to show everybody that you know your business, right?
  • Better documentation. The component is published so that anybody can use it, so it must have good documentation and usage instructions.
  • Better backwards compatibility. Coders take better care when they design APIs and interfaces because they know that somebody might be using the component somewhere out there.
  • Better security. The coder knows that anybody will be able to read the code and find security holes, so they take better care not to introduce any.

In practice, we have found that all of our open source components have higher code and documentation quality than any of our non-published software components. This also ensures that the components are well documented and can be easily maintained if the original coders leave the company, which gives good cost savings in the long run. Open-sourcing components also gives your company good PR value and makes you more attractive to future employers.

For example, one of our new guys was asked to build a small monitoring component to pull some data from RightScale and feed it into Zabbix, which is our monitoring system. Once he said that the component was complete, I told him: “Good, now polish it so that you dare to publish it under your own name on GitHub.”

Adding a new storage tank with diskmap.py

We recently added a bunch of Western Digital 3.0TB Green drives to the enclosure, so that we can run a bunch of tests with this brand. Here’s a quick recap of what I had to do to bring these new disks online.

First run diskmap.py. The commands I typed are shown below: first I run discover to detect the new drives (diskmap might otherwise show an old cache without the new drives) and then I ask it to list all disks.

root@openindiana:/export/home/garo# diskmap.py
Diskmap - openindiana> discover
Diskmap - openindiana> disks
0:02:00 c2t5000C5003EF23025d0 ST33000651AS 3.0T Ready (RDY) tank: raidz3-0
0:02:01 c2t5000C5003EEE6655d0 ST33000651AS 3.0T Ready (RDY) tank: raidz3-0
0:02:02 c2t5000C5003EE17259d0 ST33000651AS 3.0T Ready (RDY) tank: raidz3-0
0:02:03 c2t5000C5003EE16F53d0 ST33000651AS 3.0T Ready (RDY) tank: raidz3-0
0:02:04 c2t5000C5003EE5D5DCd0 ST33000651AS 3.0T Ready (RDY) tank: raidz3-0
0:02:05 c2t5000C5003EE6F70Bd0 ST33000651AS 3.0T Ready (RDY) tank: raidz3-0
0:02:06 c2t5000C5003EEF8E58d0 ST33000651AS 3.0T Ready (RDY) tank: raidz3-0
0:02:07 c2t5000C5003EF0EBB8d0 ST33000651AS 3.0T Ready (RDY) tank: raidz3-0
0:02:08 c2t5000C5003EF0F507d0 ST33000651AS 3.0T Ready (RDY) tank: raidz3-0
0:02:09 c2t5000C5003EECE68Ad0 ST33000651AS 3.0T Ready (RDY) tank: raidz3-0
0:02:10 c2t5000C5003EF2D1D0d0 ST33000651AS 3.0T Ready (RDY) tank: spares
0:02:12 c2t5000C5003EEEBC8Cd0 ST33000651AS 3.0T Ready (RDY) tank: spares
0:02:13 c2t5000C5003EE49672d0 ST33000651AS 3.0T Ready (RDY) tank: spares
0:02:14 c2t5000C5003EEE7F2Ad0 ST33000651AS 3.0T Ready (RDY) tank: raidz3-0
0:02:15 c2t5000C5003EED65BBd0 ST33000651AS 3.0T Ready (RDY)
0:03:10 c2t50014EE2B1158E58d0 WDC WD30EZRX-00M 3.0T Ready (RDY)
0:03:11 c2t50014EE2B11052C3d0 WDC WD30EZRX-00M 3.0T Ready (RDY)
0:03:12 c2t50014EE25B963A7Ed0 WDC WD30EZRX-00M 3.0T Ready (RDY)
0:03:13 c2t50014EE2B1101488d0 WDC WD30EZRX-00M 3.0T Ready (RDY)
0:03:14 c2t50014EE2B0EBFF8Ad0 WDC WD30EZRX-00M 3.0T Ready (RDY)
0:03:15 c2t50014EE25BBB4F91d0 WDC WD30EZRX-00M 3.0T Ready (RDY)
0:03:16 c2t50014EE2066AB19Fd0 WDC WD30EZRX-00M 3.0T Ready (RDY)
0:03:17 c2t50014EE25BBFCAB0d0 WDC WD30EZRX-00M 3.0T Ready (RDY)
0:03:18 c2t50014EE2066686C6d0 WDC WD30EZRX-00M 3.0T Ready (RDY)
0:03:19 c2t50014EE2B1158F6Fd0 WDC WD30EZRX-00M 3.0T Ready (RDY)
0:03:20 c2t50014EE2B0E99EA1d0 WDC WD30EZRX-00M 3.0T Ready (RDY)
1:01:00 c2t50015179596901EBd0 INTEL SSDSA2CT04 40.0G Ready (RDY) rpool: mirror-0
1:01:01 c2t50015179596A488Cd0 INTEL SSDSA2CT04 40.0G Ready (RDY) rpool: mirror-0
Drives : 28 Total Capacity : 78.1T
Diskmap - openindiana>

The new disk IDs are the eleven WDC WD30EZRX drives in the listing above. I copied those 11 IDs and formed the following command, which creates a new tank. I also reused the three old spares (the disks listed as “tank: spares” above).

zpool create tank2 raidz3 c2t50014EE2B1158E58d0 ... c2t50014EE2B0E99EA1d0 spare c2t5000C5003EF2D1D0d0 c2t5000C5003EEEBC8Cd0 c2t5000C5003EE49672d0

All set! I could now access the new 30TB volume under /tank2.

Building a 85TB cheap storage server with Solaris OpenIndiana

I recently built a storage server based on Solaris OpenIndiana, a 2U SuperMicro server and a SuperMicro 45-disk JBOD rack enclosure. The current configuration can host 84 TB of usable disk space, but we plan to extend this to at least 200TB in the following months. This blog entry describes the configuration and the steps to implement such a beast yourself.

Goal:

  • Build a cheap storage system capable of hosting 200TB of disk space.
  • System will be used to archive data (around 25 GB per item) which is written once and then accessed infrequently (once every month or so).
  • System must be tolerant to disk failures, hence I preferred raidz3 which can handle a failure of three disks simultaneously.
  • The capacity can be extended incrementally by buying more disks and more enclosures.
  • Each volume must be at least 20TB, but doesn’t have to be necessarily bigger than that.
  • Option to add a 10GbE network card in the future.
  • Broken disks must be easy to identify with a blinking LED.

I chose to deploy an OpenIndiana-based system which uses SuperMicro enclosures to host cheap 3TB disks. The total cost of the hardware with one enclosure was around 6600 EUR (December 2011 prices) without disks. Storing 85TB would cost an additional 14000 EUR with the current very expensive (after the Thailand floods) disk prices; half a year ago the same disks would have cost about half of that. Disks are deployed in 11-disk raidz3 sets, one or two per zpool. This gives us about 21.5TB per 11-disk set. New storage is added as a new zpool instead of attaching it to an old zpool.

Parts used:

  • The host is based on a Supermicro X8DTH-6F server motherboard with two Intel Xeon E5620 4-core 2.4 Ghz CPUs and 48 GB of memory. Our workload didn’t need more memory, but one could easily add more.
  • Currently one SC847E16-RJBOD1 enclosure. This densely packed 4U chassis can fit a whopping 45 disks.
  • Each chassis is connected to an LSI Logic SAS 9205-8e HBA controller with two SAS cables. Each enclosure has two backplanes, so each backplane is connected to the HBA with its own cable.
  • Two 40GB Intel 320-series SSDs for the operating system.
  • Drives from two different vendors so that we can run some benchmarks and tests before we commit to the 200TB upgrade:
    • 3TB Seagate Barracuda XT 7200.12 ST33000651AS disks
    • Western Digital Caviar Green 3TB disks

It’s worth noting that this system is built for storing large amounts of data which are not frequently accessed. We could easily add SSDs as L2ARC caches and even a separate ZIL log device (for example the 8 GB STEC ZeusRAM DRAM, which costs around 2200 EUR) if we needed faster performance, for example for database usage. We selected disks from two different vendors for additional testing; one zpool will use only disks of a single type. At least the WD Green drives need a firmware modification so that they don’t park their heads all the time.
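
For reference, adding such devices later is a one-liner per device (the device names below are placeholders):

zpool add tank cache c2t5000XXXXXXXXXXXXd0   # SSD as an L2ARC read cache
zpool add tank log c2t5000YYYYYYYYYYYYd0     # separate log device (ZIL) for synchronous writes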

Installation:

OpenIndiana installation is easy: create a bootable CD or a bootable USB stick and boot the machine with just the root devices attached. The installation is very simple and takes around 10 minutes. Just select that you install the system onto one of your SSDs with the standard disk layout. After the installation is complete and you have booted the system, follow these steps to make the other SSD bootable as well.
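
The gist of those steps is to attach the second SSD to the root pool and install the boot loader onto it; roughly like this (the device names are illustrative and the second disk needs a matching partition table first):

zpool attach rpool c3t0d0s0 c3t1d0s0                                  # mirror the root pool onto the second SSD
installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c3t1d0s0    # make the second SSD bootable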

Then install some disk utilities under /usr/sbin. You will need these, for example, to identify the physical location of a broken disk in the enclosure (read more here):

Now it’s time to connect your enclosure to the system with the SAS cables and boot it. OpenIndiana should recognize the new storage disks automatically. Use diskmap.py to get a list of the disk identifiers for the zpool create command later:

garo@openindiana:/tank$ diskmap.py
Diskmap - openindiana> disks
0:02:00 c2t5000C5003EF23025d0 ST33000651AS 3.0T Ready (RDY)
0:02:01 c2t5000C5003EEE6655d0 ST33000651AS 3.0T Ready (RDY)
0:02:02 c2t5000C5003EE17259d0 ST33000651AS 3.0T Ready (RDY)
0:02:03 c2t5000C5003EE16F53d0 ST33000651AS 3.0T Ready (RDY)
0:02:04 c2t5000C5003EE5D5DCd0 ST33000651AS 3.0T Ready (RDY)
0:02:05 c2t5000C5003EE6F70Bd0 ST33000651AS 3.0T Ready (RDY)
0:02:06 c2t5000C5003EEF8E58d0 ST33000651AS 3.0T Ready (RDY)
0:02:07 c2t5000C5003EF0EBB8d0 ST33000651AS 3.0T Ready (RDY)
0:02:08 c2t5000C5003EF0F507d0 ST33000651AS 3.0T Ready (RDY)
0:02:09 c2t5000C5003EECE68Ad0 ST33000651AS 3.0T Ready (RDY)
0:02:11 c2t5000C5003EF2D1D0d0 ST33000651AS 3.0T Ready (RDY)
0:02:12 c2t5000C5003EEEBC8Cd0 ST33000651AS 3.0T Ready (RDY)
0:02:13 c2t5000C5003EE49672d0 ST33000651AS 3.0T Ready (RDY)
0:02:14 c2t5000C5003EEE7F2Ad0 ST33000651AS 3.0T Ready (RDY)
0:03:20 c2t5000C5003EED65BBd0 ST33000651AS 3.0T Ready (RDY)
1:01:00 c2t50015179596901EBd0 INTEL SSDSA2CT04 40.0G Ready (RDY) rpool: mirror-0
1:01:01 c2t50015179596A488Cd0 INTEL SSDSA2CT04 40.0G Ready (RDY) rpool: mirror-0
Drives : 17 Total Capacity : 45.1T

Here we have a total of 15 data disks. We’ll use 11 of them to form a raidz3 stripe. It’s important to have the right number of drives in your raidz configurations to get optimal performance with 4K-sector disks. I simply selected the first 11 disks (c2t5000C5003EF23025d0, c2t5000C5003EEE6655d0, … , c2t5000C5003EF2D1D0d0) and created a new zpool with them, and also added three spares to the zpool:

zpool create tank raidz3 c2t5000C5003EF23025d0 c2t5000C5003EEE6655d0 ... c2t5000C5003EF2D1D0d0
zpool add tank spare c2t5000C5003EF2D1D0d0 c2t5000C5003EEEBC8Cd0 c2t5000C5003EE49672d0

This resulted in a nice big tank:

        NAME                       STATE     READ WRITE CKSUM
        tank                       ONLINE       0     0     0
          raidz3-0                 ONLINE       0     0     0
            c2t5000C5003EF23025d0  ONLINE       0     0     0
            c2t5000C5003EEE6655d0  ONLINE       0     0     0
            c2t5000C5003EE17259d0  ONLINE       0     0     0
            c2t5000C5003EE16F53d0  ONLINE       0     0     0
            c2t5000C5003EE5D5DCd0  ONLINE       0     0     0
            c2t5000C5003EE6F70Bd0  ONLINE       0     0     0
            c2t5000C5003EEF8E58d0  ONLINE       0     0     0
            c2t5000C5003EF0EBB8d0  ONLINE       0     0     0
            c2t5000C5003EF0F507d0  ONLINE       0     0     0
            c2t5000C5003EECE68Ad0  ONLINE       0     0     0
            c2t5000C5003EEE7F2Ad0  ONLINE       0     0     0
        spares
          c2t5000C5003EF2D1D0d0    AVAIL
          c2t5000C5003EEEBC8Cd0    AVAIL
          c2t5000C5003EE49672d0    AVAIL

Set up email alerts:

OpenIndiana comes with a default sendmail configuration which can send email to the internet by connecting directly to the destination mail port. Edit your /etc/aliases to add a meaningful destination for your root account and run newaliases after you have done your editing. Then follow this guide to set up email alerts so you get notified when you lose a disk.
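
For example:

echo "root: your.email@gmail.com" >> /etc/aliases   # or edit the existing root: line
newaliases                                          # rebuild the alias database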

Snapshot current setup as a boot environment:

OpenIndiana boot environments allow you to snapshot your current system as a backup, so that you can always reboot your system into a known working state. This is really handy when you do system upgrades or experiment with something new. beadm list shows the default boot environment:

root@openindiana:/home/garo# beadm list
BE          Active Mountpoint Space Policy Created
openindiana NR     /          1.59G static 2012-01-02 11:57
Here we can see our default openindiana boot environment, which is both active (N) and will be activated upon the next reboot (R). The command beadm create -e openindiana openindiana-baseline will snapshot the current environment into a new boot environment called openindiana-baseline, which acts as a backup. This blog post at c0t0d0s0 has a lot of additional information on how to use the beadm tool.

What to do when a disk fails?

The fault management system will email you a message when ZFS detects a problem with the system. Here’s an example of the result when we removed a disk on the fly:

Subject: Fault Management Event: openindiana:ZFS-8000-D3
SUNW-MSG-ID: ZFS-8000-D3, TYPE: Fault, VER: 1, SEVERITY: Major
EVENT-TIME: Mon Jan 2 14:52:48 EET 2012
PLATFORM: X8DTH-i-6-iF-6F, CSN: 1234567890, HOSTNAME: openindiana
SOURCE: zfs-diagnosis, REV: 1.0
EVENT-ID: 475fe76a-9410-e3e5-8caa-dfdb3ec83b3b
DESC: A ZFS device failed. Refer to http://sun.com/msg/ZFS-8000-D3 for more information.
AUTO-RESPONSE: No automated response will occur.
IMPACT: Fault tolerance of the pool may be compromised.
REC-ACTION: Run ‘zpool status -x’ and replace the bad device.

Log into the machine and execute zpool status to get a detailed explanation of which disk is broken. You should also see that a spare disk has been activated. Look up the disk ID (c2t5000C5003EEE7F2Ad0 in this case) from the output.

        NAME                       STATE     READ WRITE CKSUM
        tank                       ONLINE       0     0     0
          raidz3-0                 ONLINE       0     0     0
            c2t5000C5003EF23025d0  ONLINE       0     0     0
            c2t5000C5003EEE6655d0  ONLINE       0     0     0
            c2t5000C5003EE17259d0  ONLINE       0     0     0
            c2t5000C5003EE16F53d0  ONLINE       0     0     0
            c2t5000C5003EE5D5DCd0  ONLINE       0     0     0
            c2t5000C5003EE6F70Bd0  ONLINE       0     0     0
            c2t5000C5003EEF8E58d0  ONLINE       0     0     0
            c2t5000C5003EF0EBB8d0  ONLINE       0     0     0
            c2t5000C5003EF0F507d0  ONLINE       0     0     0
            c2t5000C5003EECE68Ad0  ONLINE       0     0     0
            spare-10
              c2t5000C5003EEE7F2Ad0  UNAVAIL       0     0     0  cannot open
              c2t5000C5003EF2D1D0d0 ONLINE       0     0     0 132GB resilvered
        spares
          c2t5000C5003EF2D1D0d0    INUSE     currently in use
          c2t5000C5003EEEBC8Cd0    AVAIL
          c2t5000C5003EE49672d0    AVAIL

Start diskmap.py and execute the command “ledon c2t5000C5003EEE7F2Ad0”. You should now see a blinking red LED on the broken disk. You should also unconfigure the disk first via cfgadm: type cfgadm -al to get a list of your configurable devices. You should find your faulted disk on a line like this:

c8::w5000c5003eee7f2a,0        disk-path    connected    configured   unknown

Notice that our disk ID in zpool status was c2t5000C5003EEE7F2Ad0, so it shows up in the cfgadm output as “c8::w5000c5003eee7f2a,0”. Now type cfgadm -c unconfigure c8::w5000c5003eee7f2a,0. I’m not really sure if this part is needed, but our friends on the #openindiana IRC channel recommended doing it.

Now remove the physical disk with the blinking red LED and plug in a new drive. OpenIndiana should recognize the disk automatically. You can verify this by running dmesg:

genunix: [ID 936769 kern.info] sd17 is /scsi_vhci/disk@g5000c5003eed65bb
genunix: [ID 408114 kern.info] /scsi_vhci/disk@g5000c5003eed65bb (sd17) online

Now start diskmap.py, run discover and then disks, and you should see your new disk c2t5000C5003EED65BBd0. Next you need to replace the faulted device with the new one: zpool replace tank c2t5000C5003EEE7F2Ad0 c2t5000C5003EED65BBd0. The zpool should now start resilvering onto the new replacement disk. The spare disk is still attached and must be manually removed after the resilvering is completed: zpool detach tank c2t5000C5003EF2D1D0d0. There’s more info and examples in the Oracle manuals, which you should read.

As you noticed, there are a lot of manual operations that need to be done. Some of these can be automated and the rest can be scripted. Consult at least the zpool man page to learn more.
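
For quick reference, the whole replacement sequence from the example above boils down to:

# in diskmap.py: ledon c2t5000C5003EEE7F2Ad0 to light up the failed disk
cfgadm -c unconfigure c8::w5000c5003eee7f2a,0                     # detach the failed disk from the OS
# ...swap the physical disk, then:
zpool replace tank c2t5000C5003EEE7F2Ad0 c2t5000C5003EED65BBd0    # resilver onto the new disk
zpool detach tank c2t5000C5003EF2D1D0d0                           # release the spare once resilvering is done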

Benchmarks:

Simple sequential read and write benchmarks against an 11-disk raidz3 in a single stripe were done with dd (for example dd if=/dev/zero of=/tank/test bs=4k for writes), monitoring the performance with zpool iostat -v 5

Read performance with bs=4k: 500MB/s
Write performance with bs=4k: 450MB/s
Read performance with bs=128k: 900MB/s
Write performance with bs=128k: 600MB/s

I have not done any IOPS benchmarks, but knowing how raidz3 works, the IOPS performance should be about the same as what a single disk can do. The 3TB Seagate Barracuda XT 7200.12 ST33000651AS can do (depending on threads) 50 to 100 IOPS. CPU usage tops out at about 20% during the sequential operations.

Future:

We’ll be running more tests and benchmarks and watching general stability in the upcoming months. We’ll probably fill the first enclosure gradually over the next few months with a total of 44 disks, resulting in around 85TB of usable storage. Once this space runs out we’ll just buy another enclosure and another 9205-8e HBA controller and start filling that.

Update 2012-12-11:

It’s been almost a year since I built this machine for one of my clients and I have to say that I’m quite pleased with it. The system now has three tanks of storage, each a raidz3 spanning 11 disks. Nearly every disk has worked just fine so far; I think we’ve had just one disk crash during the year. The disk types reported by diskmap.py are “ST33000651AS” and “WDC WD30EZRX-00M”, all 3TB disks. The Linux client side has had a few kernel panics, but I have no idea whether those are related to the NFS network stack or not.

One of my readers also posted a nice article at http://www.networkmonkey.de/emulex-fibrechannel-target-unter-solaris-11/ – be sure to check that out as well.


SuperMicro JBOD SC847E16-RJBOD1 enclosure with Solaris OpenIndiana

I’ve just deployed an OpenIndiana storage system which uses a SuperMicro JBOD SC847E16-RJBOD1 45-disk enclosure and an LSI Logic SAS 9205-8e HBA controller with OpenIndiana (build 151a). This enclosure allows you to fit a huge amount of storage into just 4U of rack space.

There’s a great utility called sas2ircu. Together with diskmap.py, these allow you to:

  • Identify where your disks are in your enclosure
  • Toggle the disk locator/identify LED on and off.
  • Run a smartctl test

So I can now locate a faulted disk with a clearly blinking red LED and replace it safely.

Installation: copy both binaries to /usr/sbin and you’re ready to go. First try running diskmap.py and execute the discover command. Then you can list your disks and their addresses in the enclosure with the disks command.
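
For example, sas2ircu can blink the locator LED directly if you know the enclosure and slot number (controller 0 and slot 2:5 below are just examples):

sas2ircu 0 display           # list the controller, enclosures and disks
sas2ircu 0 locate 2:5 ON     # turn on the locator LED of the disk in enclosure 2, slot 5
sas2ircu 0 locate 2:5 OFF    # turn it off again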

There’s also the great lsiutil tool available. You need to find version 1.63, which supports the 9205-8e controller. Unfortunately, for some reason LSI has not yet made this version available on their site. In the meantime you can download it from here: http://www.juhonkoti.net/media/LSIUTIL-1.63.zip

Using the Kindle in Finland

The Kindle is Amazon’s excellent e-book reader, which can be ordered to Finland from Amazon’s online store for a little over a hundred euros. On a trip to the US you can pick one up for around 70 euros. The Kindle uses so-called electronic ink (eInk), which reads very much like paper. Because the display has no backlight, reading it doesn’t strain your eyes and it uses very little power; in normal use the Kindle’s battery easily lasts a month. It’s worth buying some kind of protective cover for the Kindle so you can carry it around without worrying. The Kindle is charged by connecting it with a micro-USB cable, for example to a computer.

The Kindle has built-in dictionaries for English, French, German, Italian, Portuguese and Spanish. You can very easily check the meaning of almost any word while reading any book. This is very handy if, for example, a word in an English novel isn’t familiar. Even though I consider my English vocabulary decent, this feature has seen heavy use. You look up a word by moving the cursor in front of it, and the word’s definition appears at the edge of the screen.

The display is black and white and is best suited for reading novels. Other uses include non-fiction, textbooks and device manuals. Publications with lots of graphics, and comics, are not the Kindle’s strong suit. Several models are available: with or without a keyboard, with or without a touch screen, with or without 3G. I’ve been very happy with the cheapest model, which has none of the above.

  • Keyboard: With the Kindle you can take notes, make annotations “in the page margins”, search text within a book and even browse Wikipedia. All of this also works with the on-screen keyboard if needed. Since I only use the Kindle for reading, I haven’t missed a physical keyboard.
  • Touch screen: Makes using the on-screen keyboard easier, for example, but on the other hand covers the display in fingerprints. I haven’t missed it.
  • 3G connection: The Kindle is available with a (nearly) free 3G connection that works around the world. You can buy books directly from Amazon’s store on the Kindle, and downloading them incurs no data transfer fee; only loading your own documents via email costs a little. If you can use WLAN, for example via your phone, you don’t need this.
  • Ads: When buying from the US you can choose a version that is a few dozen dollars cheaper and shows a random advertisement while the device is switched off. The ads don’t interfere with reading and don’t reduce battery life. I haven’t found them annoying, and they occasionally contain quite good offers.
  • DX version: There is also a considerably larger DX version of the Kindle. I haven’t missed it; reading novels works fine on the normal Kindle’s screen. I could imagine the bigger screen being useful for reading manuals in PDF format, for example.

The Kindle natively supports Amazon’s own .AZW format, the widely used .MOBI format, and PDF and TXT files. There are plenty of .EPUB files available online, which must be converted into a format the Kindle understands, for example with the Calibre software.
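
Calibre also ships with a command-line converter, so a single conversion is roughly:

ebook-convert book.epub book.mobi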

You can get reading material onto the Kindle in the following ways, ordered from best to worst:

  • Buying from Amazon’s online store: This is without a doubt the Kindle’s best feature. After logging into Amazon, you can download almost any book to the Kindle by pressing “Buy now with 1-Click” on the right-hand side. After that you only need to turn on the Kindle’s WLAN. In the best case the book is readable five seconds after clicking the button. The buying experience is in a class of its own and almost bafflingly easy.
  • Loading free books from the internet from your own computer using the Calibre software. With Calibre you can convert almost any book into the .mobi format the Kindle understands. You just connect the Kindle to your computer over USB and choose which books you want to load onto it.
  • Sending a book as an email attachment to your <your username>@kindle.com address. You can attach a .MOBI file, for example, and it will arrive on your Kindle ready to read in a few minutes.
  • Loading PDF files over USB. You can copy PDF files to the Kindle as-is by connecting it to a computer over USB, without Calibre. Reading PDFs isn’t as pleasant as reading a book laid out for the Kindle, but it still works reasonably well for manuals, for example.
  • Buying books from Finnish online stores. Finnish online stores aren’t very smart, because all the books they sell are DRM protected. Because of this you have to break the purchased book’s DRM protection using Calibre. This is entirely possible and very easy, but it still reflects a complete lack of understanding of the eBook market. No wonder Finnish bookstores then complain about poor eBook sales.

Free classic books can be found on the Project Gutenberg site and by searching Google with keywords like “ebook”, “.mobi” and “.epub”. Unfortunately some online stores sell these freely available books for a few dollars, so take care not to pay for nothing.

Implementing Multi Level Security in Windows 7 with VirtualBox and VMLite

I’ve been experimenting with a Multi Level Security implementation on Windows 7 using VirtualBox and VMLite to run Chrome and other browsers inside a virtual machine (guest system) and to use this browser as the default browser for the entire computer (host system) for additional security. This setup allows you to click any HTTP link in pretty much any running program and have that URL load in a browser running inside the virtual machine.

This gives us an extra layer of security on top of the normal Chrome sandboxing. All the other usual VM features like snapshotting, reverting to a snapshot, clipboard sharing between the host and guest operating systems, seamless mode, networking etc. are also available. In practice the software running inside the VM can’t easily be told apart from non-virtualized programs.

VMLite Workstation is a software package built on VirtualBox which allows you to run a Windows XP instance in seamless mode on top of a host operating system (Windows 7 in this case). You need a Windows XP license, which is included at least with the Windows 7 Professional edition. This guide shows how to install the Windows XP Mode that comes with Windows 7 Professional into a virtual machine and configure a Chrome browser inside the VM to act as the default browser for the host operating system.

Installation instructions for VMLite and the Windows XP virtual image:

  1. Download Virtual XP Mode from http://www.microsoft.com/windows/virtual-pc/download.aspx and install it with the default settings.
  2. Download VMLite Workstation from http://www.vmlite.com/index.php/products/vmlite-workstation
  3. Create a new virtual image inside VMLite Workstation and point it at the installation location of the Virtual XP Mode.
  4. Now you should be able to boot the Virtual XP Mode within VMLite and install Chrome and any other software you feel you might need. Here are a few things you should do:
    • Change the Chrome theme to something else so you can tell the Chrome running inside the guest VM apart from the Chrome running on your host system.
    • Edit the VM settings to disable full read/write access to shared folders and drives and instead expose just one predefined directory which you use to transfer files between the guest and host operating systems.
  5. Remember to take a snapshot of the VM after you have set up your environment. This acts as a restore point in time so you can always reset the VM to this state if you do something stupid or suspect that the VM has been compromised.
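Since VMLite is built on top of VirtualBox, the snapshot can probably also be taken from the command line, assuming the standard VirtualBox tools can see the VM; a rough sketch, using the VM name from later in this guide and a made-up snapshot name:

VBoxManage snapshot "VMLite XP Mode" take "clean-install"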

Making the Chrome inside the VM the default browser for everything

VMLite comes with an "Internet Explorer (secure)" shortcut with green borders which is installed onto your desktop. This shortcut starts Internet Explorer inside the VM. We'll use the same trick to pass chrome.exe calls from the host system into the guest (VM) system with a .bat file, and then make this .bat file the default browser program for the host system.

  1. First create a multilevel-security-browser.bat file in a good location by adapting the following contents (I've placed mine in F:\Users\Garo\VMLites\multilevel-security-browser.bat):
    @echo off
    pushd "C:\Program Files\VMLite\VMLite Workstation\"
    set path="C:\Program Files\VMLite\VMLite Workstation\";%path%
    vmlitectl run "VMLite XP Mode" "C:\Documents and Settings\Administrator\Local Settings\Application Data\Google\Chrome\Application\chrome.exe" "%*"
    popd

    Note a few things: the path line should point to the VMLite installation directory on the host system, "VMLite XP Mode" should be the name of your VM, and the chrome.exe path is the browser path inside the guest VM.

  2. Then create a multilevel-security-browser.reg file with the following contents:
    Windows Registry Editor Version 5.00
    
    [HKEY_CLASSES_ROOT\MultilevelSecurityBrowser]
    @="MultilevelSecurityBrowser"
    "URL Protocol"=""
    
    [HKEY_CLASSES_ROOT\MultilevelSecurityBrowser\DefaultIcon]
    @="C:\\Users\\garo\\AppData\\Local\\Google\\Chrome\\Application\\chrome.exe,0"
    
    [HKEY_CLASSES_ROOT\MultilevelSecurityBrowser\shell]
    
    [HKEY_CLASSES_ROOT\MultilevelSecurityBrowser\shell\open]
    
    [HKEY_CLASSES_ROOT\MultilevelSecurityBrowser\shell\open\command]
    @="\"f:\\Users\\garo\\VMLites\\multilevel-security-browser.bat\" -- \"%1\""
    
    [HKEY_CURRENT_USER\Software\Microsoft\Windows\Shell\Associations\UrlAssociations\http\UserChoice]
    "Progid"="MultilevelSecurityBrowser"

    and set the path in the shell\open\command value to wherever you created your multilevel-security-browser.bat. Notice that we use chrome.exe as the source of our DefaultIcon, which assumes that you also have Chrome installed on your host operating system.

  3. Save the multilevel-security-browser.reg file and click it to merge its contents into the Windows 7 registry. UAC will ask for a confirmation which you need to allow.
  4. We're pretty much done here. You can now try clicking some http URL; if everything went correctly a black shell window will appear for a moment, the VM will start (if it isn't already running) and the URL should open inside Chrome in the guest VM.
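You can also trigger the same test from a command prompt instead of clicking a link; the URL here is just an example:

start http://example.com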

I've used this setup for only a day now and so far it has worked nicely. VMLite can be switched into seamless mode, and the Windows XP taskbar can be moved to the top of the screen and set to auto-hide.

Installing the Saunalahti mobile broadband stick on OS X, error 5370

I ran into installation problems with the Saunalahti mobile stick software while trying to install it on OS X (10.6.5). The installer reported the error "An internal error has occured during configuration (5370)".

The problem can be solved as follows:

  1. Open the Terminal application.
  2. Become the superuser with the sudo command, entering your own password.
  3. Change to the right directory with the command: cd "/Applications/Elisa/Mobiililaajakaista opastettu asennus.app/Contents/MacOS" (note the quotation marks and the spaces)
  4. Start the installer with the command: ./MobileManager\ Setup\ Assistant
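For reference, here are the same steps as a single Terminal session; using sudo -s is just one way to become the superuser in step 2:

sudo -s
cd "/Applications/Elisa/Mobiililaajakaista opastettu asennus.app/Contents/MacOS"
./MobileManager\ Setup\ Assistant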

After this the installer should complete its job without any problems.

Crash course on Java JVM memory issues for sysadmins

Are you a sysadmin who is new to Java? Then you might find this post to be helpful.

Java has its own memory management system with garbage collection, which is really nice most of the time, but you need to know some details about how it works so you can administer your JVM instances effectively.

How does Java manage memory?

At startup the JVM allocates a block of memory from the OS for its heap, which it then hands out to the program running inside the JVM. The amount is controlled with two command line arguments: -Xms tells how much memory the JVM allocates at startup and -Xmx sets the maximum amount of memory the JVM may allocate from the OS. For example -Xms512m -Xmx1G tells Java to start with half a gigabyte and allows the heap to grow up to one gigabyte.
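On the command line that could look like this (the jar name is only a placeholder):

java -Xms512m -Xmx1G -jar myapp.jar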

As the Java program runs it allocates memory for its objects from the JVM heap. This causes heap usage to grow until a GC (garbage collection) threshold is reached, which triggers the JVM to find the objects that are no longer used (objects not referenced by any live object) and free that memory back to the heap. There are numerous ways this can work in different GC implementations (Java has many of them) and they're out of the scope of this article. The main point is that Java heap usage grows until roughly 80% usage, when a GC occurs and usage drops back to a much lower level. If you use jconsole to watch the free memory you will see something like this:

The sawtooth-like pattern is just everyday Java garbage collection and nothing to worry about. It does, however, make it difficult to know how much memory the program actually needs.

What happens when Java runs out of memory?

If the JVM can't free enough memory with a normal GC it will run a Full GC, which is a stop-the-world collection. This suspends JVM execution until the collection is done. A Full GC can be seen as a sudden drop in the amount of used memory, for example as seen in this image. The Full GC in this case took 0.8 seconds. That's not much, but it did suspend program execution for that time, so keep it in mind when designing your Java software and its real-time requirements.

Usually a Full GC frees enough memory for the program to continue, but if the JVM simply does not have enough memory it will need to trigger another Full GC shortly afterwards. This can result in a GC storm where the program spends more and more time doing ever longer Full GCs and finally runs out of memory. It's not uncommon to see a Full GC take over two minutes in these situations, and remember, the program is suspended during a Full GC! No need to say that this is bad, right?

However, giving the JVM too much memory is also bad. It keeps the JVM happy because it doesn't need to do Full GCs, but the small GCs can then take longer, and if you eventually do run into a Full GC it will take long. Very long. Thus you need to think about how much memory your program actually needs and set the JVM -Xms and -Xmx so that it has enough, plus some additional "GC breathing room" on top of that.

How does the OS see all this?

When the JVM starts and allocates the amount of memory specified in -Xms, the OS does not immediately hand over all of that memory; thanks to modern virtual memory management it just reserves it to be used later. You can see this in the VIRT column in top. Once the Java program starts to actually use the memory the OS has to provide it, and you can see the program's RES column value grow. VIRT is how much virtual memory has been allocated and mapped to the process (this includes the JVM heap plus the JVM code and other libraries) and RES is how much of that VIRT memory is actually resident in RAM.

The Java process in the image above has been given too much memory. The program was started with a 384MB heap (-Xms384m) and allowed to grow up to one gigabyte (-Xmx1G), but it is actually using only 161 megabytes of it.

However, when the JVM GC runs and frees memory back to the application, that memory is not given back to the OS. Thus you will see the RES value grow towards VIRT but never actually decrease, unless the OS chooses to swap some of the JVM memory out to disk. This can easily happen if you give the JVM a heap that is too big and never gets used, and you should try to avoid it.

Top tip for top: you can press f to add and hide columns like SWAP. Notice that SWAP isn't actually the amount of memory which has been swapped to disk. According to the top manual, VIRT = SWAP + RES, so SWAP contains both the pages which have been swapped to disk and the pages which haven't actually been used yet. You can see more very useful top commands by pressing ?.

How can I monitor all this?

The best way is to use JMX with some handy tool like JConsole. JConsole is a GUI utility which comes with all JDK distributions and can be found under the bin/ directory (jconsole.exe on Windows). You can use JConsole to connect to a running JVM, extract a lot of different metrics out of it and even tweak some settings on the fly.

JMX needs to be enabled, which can be done by adding these arguments to the JVM command line:

-Dcom.sun.management.jmxremote=true -Dcom.sun.management.jmxremote.port=8892 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false

Notice that these settings disable authentication and SSL, so you should not do this unless your network is secured from the outside. You can also feed this data into monitoring systems like Zabbix (my favourite), Cacti or Nagios, which I have found very helpful when debugging JVM performance.
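With the port opened as above you can point JConsole at the process over the network, for example like this (the hostname is a placeholder):

jconsole myserver.example.com:8892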

Another way is to enable GC logging, which can be done in the Sun JVM with these command line parameters (these are reported to also work with OpenJDK but I haven't tested it):

-XX:+PrintGCTimeStamps -XX:+PrintGC -Xloggc:/some/dir/cassandra.gc.log

These will print GC statistics to the log file. Here's an actual example:

17500.125: [GC 876226K->710063K(4193024K), 0.0195470 secs]
17569.086: [GC 877871K->711547K(4193024K), 0.0200440 secs]
17641.289: [GC 879355K->713210K(4193024K), 0.0201440 secs]
17712.079: [GC 881018K->714931K(4193024K), 0.0212350 secs]
17736.576: [GC 881557K->882170K(4193024K), 0.0419590 secs]
17736.620: [Full GC 882170K->231044K(4193024K), 0.8055450 secs]
17786.560: [GC 398852K->287047K(4193024K), 0.0244280 secs]

The first number is seconds since JVM startup; the rest of the line tells the GC type (normal vs. Full GC), how the used heap changed (before -> after, with the total heap size in parentheses) and how long the collection took.
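If you just want to pull out the Full GC pauses and when they happened, a quick one-liner along these lines should do (using the log path configured above):

grep "Full GC" /some/dir/cassandra.gc.log | awk '{print $1, $(NF-1), "secs"}'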

Conclusion

  • The Java JVM will eat all the memory you give it (this is normal).
  • You need to tune the JVM -Xms and -Xmx parameters to give it enough, but not too much, memory so that your application works well.
  • The memory won't be released back to the OS until the JVM exits, but the OS can swap the JVM memory out. Usually this is bad and means you need to decrease the memory you give to the JVM.
  • Use JMX to monitor the JVM memory usage to find suitable values.

Script and template to export data from haproxy to zabbix

I've just created a Zabbix template and a script which can be used to feed performance data from haproxy to Zabbix. The script first fetches the /haproxy?stats;csv page over HTTP, parses the CSV and uses the zabbix_sender command line tool to send each attribute to the Zabbix server. The script can be executed on any machine which can access both the Zabbix server and the haproxy stats page (I use the machine which runs zabbix_server). The script and template work on both Zabbix 1.6.x and 1.8.x.
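Under the hood each attribute effectively turns into one zabbix_sender call, roughly like this (the host name, item key and value here are illustrative, not the exact keys used by the template):

zabbix_sender -z [ip of my zabbix server] -s samba.web.pri -k haproxy.scur -o 42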

As the haproxy server names might differ from the Zabbix host names, the script reads annotations hidden in comments inside haproxy.cfg. The annotations tell the script which frontend and server node statistics should be sent to the Zabbix server. This keeps the configuration in one central place, which helps keep the haproxy and Zabbix configurations in sync. The template includes two graphs, example below:

I've chosen to export the following attributes from haproxy, but more could easily be added (I accept patches via github.com):

  • Current session count
  • Maximum session count
  • Sessions per second
  • HTTP responses per minute, grouped by 1xx, 2xx, 3xx, 4xx and 5xx.
  • Mbps in (network traffic)
  • Mbps out (network traffic)
  • Request errors per minute
  • Connection errors per minute
  • Response errors per minute
  • Retries (warning) per minute
  • Rate (sessions per second)
  • HTTP Rate (requests per second)
  • Proxy name in haproxy config
  • Server name in haproxy config

The code is available on GitHub: https://github.com/garo/zabbix_haproxy. The script supports HTTP Basic Authentication and masking the HTTP Host header.

Usage:

  1. Import the template_haproxyitems.xml into Zabbix.
  2. Add all your webservers to Zabbix as hosts and link them with the Template_HAProxyItem.
  3. Add all your frontends to Zabbix as hosts and link them with the Template_HAProxyItem. The frontend hosts don't need to be mapped to any actual IP or server; I use the zabbix_server IP as the host IP for these.
  4. Edit your haproxy.cfg file and add annotations for the zabbix_haproxy script. These annotations mark which frontends and which servers you map to Zabbix hosts. Notice that the annotations are just comments after #, so haproxy ignores them.
    frontend irc-galleria # @zabbix_frontend(irc-galleria)
            bind            212.226.93.89:80
            default_backend lighttpd
    
    backend lighttpd
            mode            http
            server  samba           10.0.0.1:80    check weight 16 maxconn 200   # @zabbix_server(samba.web.pri)
            server  bossanova       10.0.0.2:80    check weight 16 maxconn 200   # @zabbix_server(bossanova.web.pri)
            server  fuusio          10.0.0.3:80     check weight 4 maxconn 200   # @zabbix_server(fuusio.web.pri)
  5. Set up a crontab entry to execute the zabbix_haproxy script every minute. I use the following entry in /etc/crontab:
    */1 * * * * nobody zabbix_haproxy -c /etc/haproxy.cfg -u "http://irc-galleria.net/haproxy?stats;csv" -C foo:bar -s [ip of my zabbix server]
  6. All set! Go and check the latest data in Zabbix to see if the values arrived. If you have problems you can use the -v and -d command line arguments to print debugging information.

One-liner: erase incorrect memcached keys on demand

We had a situation where our image thumbnail memcached cluster somehow ended up with empty thumbnails. The thumbnails are generated on the fly by image proxy servers and stored into memcached. For some reason some of the thumbnails were truncated.

As I didn't have time to start debugging the real issue, I quickly wrote this one-liner which detects corrupted thumbnails as they are fetched from memcached and issues a delete operation for them. This keeps the situation under control until I can start the actual debugging. We could also have restarted the entire memcached cluster, but that would have caused a big performance penalty for several hours. Fortunately all the corrupted thumbnails are just one byte long, so detecting them was simple enough to do with a one-liner:

tcpdump -i lo -A -v -s 1400 src port  11213 |grep VALUE | perl -ne 'if (/VALUE (cach[^ ]+) [-]?\d+ (.+)/) { if ($2 == 1) { `echo "delete $1 noreply\n" | nc localhost 11213`; print "deleted $1\n"; } }'

Here’s how this works:

  1. tcpdump prints in ASCII (-A) all packets which come from port 11213 (src port 11213), our memcached node, on the loopback interface (-i lo).
  2. grep passes through only those lines which contain the response header, which has the form "VALUE <key> <flags> <length>".
  3. For each line (-n) perl executes the given script (-e '<script>'), which first uses a regexp to capture the key "(cach[^ ]+)" and then the length.
  4. It then checks whether the length is 1 (if ($2 == 1)) and, if so, executes a shell command which sends a "delete <key> noreply" message to the memcached server using netcat (nc). This erases the corrupted value from the memcached server.
  5. Finally it prints a debug message.
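If you need to delete a single known-bad key by hand, the same delete command can be sent directly with netcat, just like the one-liner does internally (the key name here is purely illustrative):

printf 'delete cach_thumb_12345 noreply\r\n' | nc localhost 11213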