Test Driven Development on an SQL Database Schema

TDD on an SQL schema. Why not? A normal Test Driven Development approach means that you start by writing a test case, which creates the requirement to write the appropriate code to make that test pass. This results in a very quick iteration cycle: you add new unit tests, verify that they fail, implement the code to make them pass and verify that they now pass. A single iteration can take just a few minutes or less and the test set usually executes in just a few seconds. The end result is great test coverage, which helps with refactoring and in itself documents the user stories in the code.

Applying TDD to SQL

At first, writing CREATE TABLE declarations doesn't sound like something worth testing, but modern SQL database engines offer a lot of tools to enforce proper and fully valid data: constraints, foreign keys, checks and triggers are commonly used to ensure that invalid or meaningless data is never stored in the database. You can certainly write a simple CREATE TABLE declaration and run with it, but if you want to verify that invalid data cannot be sent to a table, you need to test for it. If you end up writing triggers and stored procedures, it is even more important to write proper tests.

I picked Ruby with its excellent rspec testing tool for a proof-of-concept implementation, testing a new schema containing around a dozen tables and stored procedures. Ruby has a well-working PostgreSQL driver and writing unit test cases with rspec is efficient in terms of lines of code. Also, as Ruby is interpreted, there is no compilation step and the unit test suite executes really fast: in my case a set of 40 test cases takes less than half a second.

Example

Take this simple Twitter-like example. I have placed a complete source code example on GitHub at https://github.com/garo/sql-tdd

CREATE TABLE users (
    id SERIAL PRIMARY KEY,
    login VARCHAR(20) NOT NULL UNIQUE,
    full_name VARCHAR(40) NOT NULL
);

CREATE TABLE tweets (
    id SERIAL PRIMARY KEY,
    user_id INTEGER REFERENCES users(id) NOT NULL,
    tweet VARCHAR(140) NOT NULL
);

The test suite will first drop the previous database, import the schema from schema.sql followed by any optional and non-essential data from data.sql, and then run each unit test case. Our first test might verify that the required tables exist:

it "has required tables" do
  rs = @all_conn.exec "SELECT table_name FROM information_schema.tables WHERE table_schema = 'public'"
  tables = rs.values.flatten
  expect(tables.include?("users")).to eq(true)
  expect(tables.include?("tweets")).to eq(true)
end
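The drop-and-import step described above can live in an rspec hook. Here's a minimal sketch of the idea, assuming a local PostgreSQL, the pg gem and a database called sqltdd (the GitHub repository linked above contains the actual setup, so treat this only as an illustration):

require 'pg'

RSpec.configure do |config|
  config.before(:suite) do
    admin = PG.connect(dbname: 'postgres')
    admin.exec "DROP DATABASE IF EXISTS sqltdd"
    admin.exec "CREATE DATABASE sqltdd"
    admin.close

    conn = PG.connect(dbname: 'sqltdd')
    conn.exec File.read('schema.sql')                          # mandatory schema
    conn.exec File.read('data.sql') if File.exist?('data.sql') # optional seed data
    conn.close
  end
end

The @all_conn used in the test cases can then simply be a PG connection opened against the freshly created database, for example in a before(:all) block.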

Maybe test that we can insert users into the database?

it "can have user entries" do
  ret = @all_conn.exec "INSERT INTO users(login, full_name) VALUES('unique-user', 'first') RETURNING id"
  expect(ret[0]["id"].to_i).to be > 0

  ret = @all_conn.exec "SELECT * FROM users WHERE id = #{ret[0]["id"]}"
  expect(ret[0]["login"]).to eq("unique-user")
  expect(ret[0]["full_name"]).to eq("first")
end

Verify that we can’t insert duplicated login names:

it "requires login names to be unique" do
  expect {
    ret = @all_conn.exec "INSERT INTO users(login, full_name) VALUES('unique-user', 'second') RETURNING id"
  }.to raise_error(PG::UniqueViolation)
end

What about tweets? They need to belong to a user, so we want a foreign key. In particular we want to verify that the foreign key constraint cannot be violated:

describe "tweets" do
  it "has foreign key on user_id to users(id)" do
    expect { # the database doesn't have a user with id=0
      ret = @all_conn.exec "INSERT INTO tweets(user_id, tweet) VALUES(0, 'test')"
    }.to raise_error(PG::ForeignKeyViolation)
  end
end

If you want to test a trigger validation implemented with a stored procedure, that violation would raise a PG::RaiseException. Using an invalid value for an ENUM field would raise a PG::InvalidTextRepresentation. You can also easily test views, DEFAULT values, CASCADE updates and deletes on foreign keys and even user privileges.
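For example, if the schema also had a trigger which rejects tweets from banned users and an ENUM typed role column (neither exists in the example schema above, so this is purely a hypothetical illustration), the tests would follow the exact same pattern:

# Hypothetical examples: the schema above has no such trigger or ENUM column.
it "rejects tweets from banned users via a trigger" do
  expect {
    @all_conn.exec "INSERT INTO tweets(user_id, tweet) VALUES(2, 'should not pass')"
  }.to raise_error(PG::RaiseException)
end

it "rejects invalid values for the role ENUM" do
  expect {
    @all_conn.exec "UPDATE users SET role = 'superhero' WHERE login = 'unique-user'"
  }.to raise_error(PG::InvalidTextRepresentation)
end

Happy developing!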

Having Fun With IoT

With the blazing fast technology progress it's now easier than ever to build all kinds of interconnected gadgets, something which the corporate world might refer to as IoT – the Internet of Things. For me, it's just an excuse to spend time playing around with electronics. I've been installing all kinds of features into our summer cottage (or mökki, as it's called in Finnish), so this blog post walks through some of the things I've done.

All the things where a Raspberry Pi is useful!

I’ve lost count how many Raspberry Pi’s I’ve installed. Our cottage has two of them. My home has couple. My office has at least 20 of them. My dog would probably carry one as well, but that’s another story. As Pi runs standard Linux, all the standard Linux knowledge applies, so we can run databases, GUI applications and do things with your favourite programming language.

So far I’ve found it useful to do:

  • Connect our ground heating pump (maalämpöpumppu) to a Raspberry Pi with a USB-serial cable. This gives me full telemetry and remote configuration capabilities, allowing me to save energy by keeping the cottage temperature down when I'm not there and to warm it up before I arrive.
  • Work as a WiFi-to-3G bridge. With a simple USB 3G dongle, a USB WiFi dongle and a bit of standard Linux scripting you can make it work as an access point for the Internet.
  • Display dashboards. Just hook the Pi up to a TV with HDMI, run Chrome or Firefox in full screen mode and let it display whatever information best floats your boat.
  • Connect DS18B20 temperature sensors. These are the legendary tiny Dallas 1-wire sensors. They look like transistors, but instead they offer digital temperature measurements from -55'C to +125'C with about 0.5'C accuracy. I have several of them around, including in the sauna and in the lake. You can buy them pre-packaged at the end of a wire, or you can solder one directly to your board.
  • Run full blown home automation with Home Assistant and hook it up to a wireless Z-Wave network to control your pluggable lighting, in-wall light switches or heating.

All the things where a Raspberry Pi is too big

Enter Arduino and the ESP8266. Since its introduction in 2005, the Arduino embedded programming ecosystem has revolutionized DIY electronics, opening the door to building all kinds of embedded hobby systems easily. More recently a Chinese company built a chip containing a full WiFi and TCP/IP stack, the ESP8266, which is perfectly suited to be paired with Arduino. So today you can buy a WiFi-capable Arduino-compatible board (NodeMCU) for less than three euros a piece. With a bit of care you can build remote sensors capable of operating on battery power for an impressive amount of time.

Using Raspberry Pi to log temperatures

The DS18B20 sensors are great. They can operate with just two wires, but it's best to use a three-wire cable: one wire for ground, another for operating power and the third for data. You can place them comfortably over 30 meters away from the master (your Raspberry Pi) and you can have dozens of sensors on the same bus, as each one has a unique identifier. Reading temperature values from them is also easy, as most Raspberry Pi distributions ship easy-to-use drivers for them by default. The sensors are attached to the Raspberry Pi extension bus with a simple pull-up resistor on the data line. See this blog post for more info. Here's my code to read sensor values and write the results into MQTT topics.
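The linked program is the real thing; as an illustration, reading the sensors through the standard w1-gpio/w1-therm sysfs interface boils down to something like this minimal Ruby sketch:

# Each DS18B20 shows up as a directory under /sys/bus/w1/devices/28-*
# once the w1-gpio and w1-therm kernel modules are loaded.
Dir.glob('/sys/bus/w1/devices/28-*/w1_slave').each do |path|
  sensor_id = File.basename(File.dirname(path))
  data = File.read(path)
  next unless data.include?('YES')  # the first line ends with YES when the CRC is valid
  raw = data[/t=(-?\d+)/, 1]        # the second line ends with t=<temperature in millidegrees>
  next unless raw
  puts "#{sensor_id}: #{raw.to_i / 1000.0}"
end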

Using MQTT publish-subscribe messaging for connecting everything together

MQTT is a simple publish-subscribe messaging system widely used for IoT applications. In this example we publish the sensor readings to different MQTT topics (for example I have nest/sauna/sauna for the temperature of the sauna: I decided that every topic in my cottage begins with "nest/", "nest/sauna" means values read by the Raspberry Pi in the sauna building, and the last part is the sensor name).

On the other end you can have programs and devices reading values from an MQTT broker and reacting to those values. The usual model is that each sensor publishes its current value to the MQTT bus whenever the value is read. If the MQTT server is down, or a device listening for the value is down, then the value is simply lost and the situation is expected to recover when the devices are back up. If you care about getting a complete history even during downtime, you need to build some kind of acknowledgment model, which is beyond the scope of this article.

To do this you can use mosquitto, which can be installed on a recent Raspbian with a simple "apt-get install mosquitto". The mosquitto_sub and mosquitto_pub command line programs are in the "mosquitto-clients" package.
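As an illustration of the model, here's roughly what the publishing and subscribing sides look like in Ruby with the mqtt gem (assuming the mosquitto broker runs on localhost and using the nest/sauna/sauna topic from above):

require 'mqtt' # gem install mqtt

# Publisher side: the Raspberry Pi in the sauna building pushes a fresh reading.
MQTT::Client.connect('localhost') do |client|
  client.publish('nest/sauna/sauna', '78.5')
end

# Subscriber side: any program interested in the values can react to them.
# The get block loops forever, yielding each incoming message.
MQTT::Client.connect('localhost') do |client|
  client.get('nest/#') do |topic, message|
    puts "#{topic}: #{message}"
  end
end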

Building a WiFi connected LCD display

My latest project was to build a simple WiFi connected LCD display to show the temperature of the sauna and the nearby lake, and to beep a buzzer when more wood needs to be added to the stove while the sauna is warming up.

Here’s the quick part list for the display. You can get all these from Aliexpress for around 10 euros total (be sure to filter for free shipping):

  • A NodeMCU ESP8266 Arduino board from Aliexpress.
  • An LCD module. I bought both 4*20 (Google for LCD 2004) and 2*16 (LCD 1602), but the boxes I ordered were big enough only for the smaller display.
  • An I2C driver module for the LCD (Google for IIC/I2C / Interface LCD 1602 2004 LCD). This is used to make connecting the display to Arduino a lot easier.
  • Standard USB power source and a micro-usb cable.
  • A 3.3V buzzer for making the thing beep when needed.
  • A resistor to limit the current for the buzzer. The value is around 200 – 800 Ohm depending on the volume you want.

Soldering the parts together is easy. The I2C module is soldered directly to the LCD board and then four wires connect the LCD to the NodeMCU board. The buzzer is connected in series with the resistor between a ground pin and a GPIO pin on the NodeMCU (the resistor is needed to limit the current drawn by the buzzer; otherwise it could fry the GPIO pin). The firmware I made is available here. Instructions on how to configure Arduino for the NodeMCU are here.

When do I need to add more wood to the stove?

Once you have sensors measuring things and devices capable of acting on those measurements, you can build intelligent logic to react to different situations. In my case I wanted the LCD device to buzz when I need to go outside and add more wood to the sauna's stove while I'm heating the sauna. Handling this needs some state to track the temperature history and some logic to determine when to buzz. All this could be programmed into the Arduino microcontroller running the display, but modifying it would then require reprogramming the device with a laptop attached over USB.

I instead opted for another way: I programmed the LCD to be dumb. It simply listens to MQTT topics for orders on what to display and when to buzz. Then I placed a Ruby program on my Raspberry Pi which listens for the incoming sauna temperature measurements and handles all the business logic. This script then orders the LCD to display the current temperature, or any other message for that matter (for example the "Please add more wood" message). The source code for this is available here.

The program listens for the temperature measurements and stores them in a small ring buffer. On each received measurement it calculates the temperature change over the last five minutes. If the sauna is heating up but the temperature has risen less than two 'C during those five minutes, we know that the wood has almost burned up and we need to signal the LCD to buzz. The program also has a simple state machine to determine when to do this tracking and when to stay quiet. The same program also formats the messages which the LCD displays.
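Here's a rough Ruby sketch of that logic (the linked source is the real implementation; the thresholds are just the ones described above and the buzzer order is stubbed out):

WINDOW = 5 # minutes, assuming roughly one measurement per minute

def order_buzz
  # In the real program this publishes an order to the LCD's MQTT topic.
  puts "Buzz: add more wood to the stove"
end

def handle_measurement(buffer, temperature)
  buffer << temperature
  buffer.shift while buffer.length > WINDOW + 1
  return unless buffer.length > WINDOW

  change = buffer.last - buffer.first
  heating = buffer.last > 40.0 # crude check that the sauna is actually being heated
  order_buzz if heating && change < 2.0
end

buffer = []
# Fed from the MQTT subscription, for example:
# client.get('nest/sauna/sauna') { |_topic, value| handle_measurement(buffer, value.to_f) }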

Conclusions

You can easily build intelligent and low-cost sensors to measure pretty much any imaginable metric in a modern environment. Aliexpress is full of modules for measuring temperature, humidity, CO2 levels, flammable gases, distance, weight, magnetic fields, light, motion, vibrations and so on. Hooking them together is easy using either a Raspberry Pi or an ESP8266/Arduino, and you can use pretty much any language to make them act together intelligently.

Any individual part here should be simple to build and there are a lot of other blog posts, tutorials and guides all around the net (I tried to link to some of them in this article). Programming an Arduino is not hard and the ecosystem has very good libraries for attaching all kinds of sensors to the platform. Managing a Raspberry Pi is just like managing any other Linux box. Once you know that something can be done, you just need some patience to learn the details and make things work the way you want.

Problems with Terraform due to the restrictive HCL

Terraform is a great tool for defining cloud environments, virtual machines and other resources, but sadly its default configuration language, HCL (Hashicorp Configuration Language), is very restrictive and makes IaC (Infrastructure-as-Code) look more like infrastructure-as-copy-paste. Here are my findings on the different issues which make writing Terraform .tf files with HCL a pain (as of Terraform v0.7.2). HCL is usually fine for simple scenarios, but if you think like a programmer you will find it really restrictive once you hit any of its limitations.

Make no mistake: nothing said here is a showstopper, but these issues make you write more code than should be required and force you to repeat yourself a lot through copy-paste, which is always a recipe for errors.

Template evaluation

Template evaluation is done with template_file, and the variables passed to it must all be primitives. So you can't pass a list to a template and then iterate over it when rendering the template, and you can't use any control structures inside a template either. You can still use all the built-in functions, but you need to escape the template file wherever its syntax collides with the Terraform interpolation syntax.

Module variables

Modules can’t inherit variables without explicitly declaring them each time a module is used, or there are no global variables a module could access. This leads to the need to pass every possible required variable in every possible moment when a module is called. Consider rather static variables like “domain name”, “region” or the amazon ssh “key_name”. This leads to manual copy-paste repetition. Issue #5480 tries to address this.

Also, when you use a module you declare a name for it (a standard Terraform feature), but you can't access that name as a variable inside the module (#8706).

module "terminal-machine" {
  source = "./some-module"
  hostname = "terminal-machine" # there is no way to avoid writing the "terminal-machine" twice as you can't access the module name.
}

Variables are not really variables

Terraform supports variables which can be used to store data and later pass it to a resource or a module, or to evaluate expressions when defining a module input. But there are caveats:

  • You can’t have intermediate variables. This for example prevents you for setting a map which values evaluate from variables and then later merge that map with a module input. You can kinda work around with this with a null_resource, but it’s a hack.
  • You can’t use a variable when defining a resource name: “interpolated resource names are considered an anti-pattern and thus won’t be supported.”
  • You can’t evaluate a variable which name contains a variable. So you can’t do something like this “${aws_sns_topic.${topic.$env}.arn}”.
  • If you want to pass a list to a resource variable which requires a list, you need to encapsulate it again into a list: security_groups = [${var.mylist}]. This looks weird to a programmer.

Control structures and iteration

HCL has no control structures, iteration or loops; it is just syntactic sugar on top of JSON. This means that (pretty much) all current iteration features are implemented inside Terraform itself. You can have a list of items and use it to spawn a resource multiple times, each copy customised from the list entries with the interpolation syntax:

variable "names" {
  type = "map"
  default = {
    "0" = "Garo"
    "1" = "John"
    "2" = "Eve"
  }
}

resource "aws_instance" "web" {
  count = "${length(var.names)}"

  tags {
    Name = "${element(var.names, count.index)}'s personal instance"
  }
}

This works so that Terraform evaluates the length() function to set count to the number of items in the map names and then instantiates the aws_instance resource that many times. Each instantiation evaluates the lookup() function, so each instance can be customised. This doesn't however extend any deeper. Say you want each user to have several instances, one for each environment such as testing and staging. You can't define another list environments and expect to use both names and environments to declare the resources in a nice way. There are a couple of workarounds [1] [2], but they are usually really complex and error prone. Also you can't easily reference a property (such as arn or id) of a created resource from another resource if the first resource uses this kind of interpolation.

A programmer's approach would be something like this:

variable "topics" {
  default = ["new_users", "deleted_users"]
}

variable "environments" {
  default = ["prod", "staging", "testing", "development]
}

for $topic in topics {
  # Define SNS topics which are shared between all environments
  resource "aws_sns_topic" "$topic.$env" { ... }

  for $env in environments {
    # Then for each topic define a queue for each env
    resource "aws_sqs_queue" "$topic.$env-processors" { ... }

    # And bind the created queue to its sns topic
    resource "aws_sns_topic_subscription" "$topic.$env-to-$topic.$env-processors" {
      topic_arn = "${aws_sns_topic.${topic.$env}.arn}"
      endpoint = "${aws_sqs_queue.{$topic.$env-processors}.arn}"
    }
  }
}

But that’s just not possible, at least currently. Hashicorp argues that control structures remove the declarative nature of HCL and Terraform, but I would argue that you can still have a language with declarative nature which declares resources and constant variables, but still have control structures like in pure functionally programming languages.

Workarounds?

There have been a few projects which have created, for example, a Ruby DSL that outputs the JSON Terraform consumes, but they aren't really maintained. Other options would be to use something like the C preprocessor, or to just bite the bullet, accept that you need to do a lot of copy-pasting and try to minimize the amount of infrastructure provisioned with Terraform. It's still a great tool once you have your .tf files ready: it can query current state, it helps with importing existing resources and the dependency graph works well. Hopefully Hashicorp realises that HCL could be extended much further while still maintaining its declarative nature.

Problems With Node.JS Event Loop

Asynchronous programming with Node.JS is easy once you get used to it, but the single-threaded model of how Node.JS works internally hides some critical issues. This post explains what everybody should understand about how the Node.JS event loop works and what kind of issues it can create, especially in applications with high transaction volumes.

What Is The Event Loop?

At the core of the Node.JS process sits a relatively simple concept: an event loop. It can be thought of as a never-ending loop which waits for external events to happen and then dispatches those events to application code by calling the functions which the Node.JS programmer created. Consider this really simple Node.JS program:

setTimeout(function printHello () {
    console.log("Hello, World!");
}, 1000);

When this program is executed the setTimeout function is called immediately. It registers a timer in the event loop which will fire after one second. After the setTimeout call the program execution pauses: the Node.JS event loop will patiently wait until one second has elapsed and then executes the printHello() function, which prints "Hello, World!". After this the event loop notices that there's nothing else to be done (no timers scheduled and no I/O operations underway) and it exits the program.

We can draw this sequence like this, with time flowing from left to right: the red boxes are user-programmed functions. In between is the one-second delay where the event loop sits waiting until it eventually executes the printHello function.

Let’s have another example: a simple program which does a database lookup:

var redis = require('redis'), client = redis.createClient();
client.get("mykey", function printResponse(err, reply) {
    console.log(reply);
});

If we look closely at what happens during the client.get call:

  1. client.get() is called by programmer
  2. the Redis client constructs a TCP message which asks the Redis server for the requested value.
  3. A TCP packet containing the request is sent to the Redis server. At this point the user program's execution yields and the Node.JS event loop places a reminder that it needs to wait for the network packet containing the Redis server's response.
  4. When the response is received from the network stack, the event loop calls a function inside the Redis client with the message as the argument.
  5. The Redis client does some internal processing and then calls our callback, the printResponse() function.

We can draw this sequence at a slightly higher level like this: the red boxes are again user code and the green box is a pending network operation. As time flows from left to right, the image represents the delay during which the Node.JS process needs to wait for the database to respond over the network.

Handling Concurrent Requests

Now that we have a bit of theory behind us, let's discuss a more practical example: a simple Node.JS server which receives HTTP requests, does a simple call to Redis and then returns the answer from Redis to the HTTP client.

var redis = require("redis"), client = redis.createClient();
var http = require('http');

function handler(req, res) {
  client.get('mykey', function redisReply(err, reply) {
    res.end("Redis value: " + reply);
  });
}

var server = http.createServer(handler);
server.listen(8080);

Let’s look again an image displaying this sequence. The white boxes represent incoming HTTP request from a client and the response back to that client.

So pretty simple. The event loop receives the incoming HTTP request and calls the HTTP handler immediately. Later it also receives the network response from Redis and immediately calls the redisReply() function. Now let's examine a situation where the server receives little traffic, say a few requests every few seconds:

In this sequence diagram we have the first request on the "first row" and the second request later, drawn on the "second row", to represent the fact that these are two different HTTP requests coming from two different clients. They are both executed by the same Node.JS process and thus inside the same event loop. This is the key thing: each individual request can be imagined as its own flow of JavaScript callbacks executed one after another, but they all actually execute in the same process. And because Node.JS is single threaded, only one JavaScript function can be executing at any given time.

Now The Problems Start To Accumulate

Now that we are familiar with the sequence diagrams and the notation I've used to draw how different requests are handled on a single timeline, we can start going over the different ways the single-threaded event loop creates problems:

What happens if your server gets a lot of requests, say 1000 requests per second? If each request to the Redis server takes 2 milliseconds and all other processing is minimal (we are just sending the Redis reply straight to the client wrapped in a simple text message), the event loop timeline can look something like this:

Now you can see that the 3rd request can't start its handler() function right away because the handler() of the 2nd request is still executing. Later, the Redis response for the 3rd request arrives at the event loop after 2 ms, but the event loop is still busy executing the redisReply() function of the 2nd request. All this means that the total time from start to finish for the 3rd request will be longer and the overall performance of the server starts to degrade.

To understand the implications we need to measure the duration of each request from start to finish with code like this:

function handler(req, res) {
  var start = new Date().getTime();
  client.get('mykey', function redisReply(err, reply) {
    res.end("Redis value: " + reply);
    var end = new Date().getTime();
    console.log(end - start);
  });
}

If we analyse all the durations and calculate how long an average request takes, we might get something like 3 ms. However, an average is a really bad metric because it hides the worst user experience. A percentile is a measure used in statistics indicating the value below which a given percentage of the observations in a group fall; for example, the 20th percentile is the value below which 20 percent of the observations may be found. If we instead calculate the median (the 50th percentile), the 95th and the 99th percentile values, we get a much better understanding:

  • Average: 3ms
  • Median: 2ms
  • 95th percentile: 10ms
  • 99th percentile: 50ms

This shows the scary truth much better: for 1% of our users the request latency is 50 milliseconds, over 16 times longer than the average! If we draw this as a graph we can see why this is also called a long tail: on the X axis we have the latency and on the Y axis how many requests completed in that particular time.
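The percentiles themselves are easy to compute offline from logged durations, for example with a few lines of Ruby (shown here only as a log-analysis sketch, assuming one duration per line in a file called durations.log):

# Nearest-rank percentile over a sorted list of request durations.
def percentile(sorted, p)
  sorted[(sorted.length * p / 100.0).ceil - 1]
end

durations = File.readlines('durations.log').map(&:to_f).sort
puts "median: #{percentile(durations, 50)} ms"
puts "95th:   #{percentile(durations, 95)} ms"
puts "99th:   #{percentile(durations, 99)} ms"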

So the more requests per second our server serves, the higher the probability that a new request arrives before the previous one has completed, and thus the higher the probability that a request gets blocked behind other requests. In practice with Node.JS, when the server CPU usage grows over 60% the 95th and 99th percentile latencies start to increase quickly, and thus we are forced to run the servers way below their maximum capacity if we want to keep our SLAs under control.

Problems With Different End Users

Let’s play with another scenario: Your website servers different users. Some users visit the site just few times a month, most visit once per day and then there are a small group of power users, visiting the site several times per hour. Lets say that when you process a request for a single user you need to fetch the history when the user visited your site during the past two weeks. You then iterate over the results to calculate whatever you wanted, so the code might look something like this:

function handleRequest(req, res) {
  db.query("SELECT timestamp, action from HISTORY where uid = ?", [req.uid], function reply(err, rows) {
    for (var i = 0; i < rows.length; i++) {
      processAction(rows[i]);
    }
    res.end("...");
  });
}

A sequence diagram would look pretty simple:

For the majority of the site's users this works really fast, as an average user might have visited your site 20 times within the past two weeks. Now what happens when a power user who has visited the site 2000 times in the past two weeks hits your site? The for loop needs to go over 2000 results instead of just a handful, and this might take a while:

As we can see, this immediately causes delays not only for the power user but also for all the other users who had the bad luck of browsing the site while the power user's request was underway. We can mitigate this by using process.nextTick to process a few rows at a time and then yield back to the event loop. The code could look something like this:

var rows = ['a', 'b', 'c', 'd', 'e', 'f'];

function end() {
    console.log("All done");
}

// Processes one row per invocation, starting from the last index, and yields
// back to the event loop between items with process.nextTick.
function next(rows, i) {
    var row = rows[i];
    console.log("item at " + i + " is " + row);
    // do some processing
    if (i > 0) {
        process.nextTick(function() {
            next(rows, i - 1);
        });
    } else {
        end();
    }
}

next(rows, rows.length - 1);

The callback-style code adds complexity, but the timeline now looks much more favourable for long processing:

It’s also worth noting that if you try to fetch the entire history instead of just two weeks, you will end up with a system which performs quite well at the start but will gradually getting slower and slower.

Problems Measuring Internal Latencies

Let's say that during a single request your program needs to query both a Redis database and a MongoDB database, and that we want to measure how long each database call takes so we can tell if one of our databases is acting slowly. The code might look something like this (note that we are using the handy async package; you should check it out if you haven't already):

function handler(req, res) {
	var start = new Date().getTime();
	async.series([
		function queryRedis(cb) {
			var rstart = new Date().getTime();
			redis.get("somekey", function endRedis(err, reply) {
				var rend = new Date().getTime();
				console.log("redis took:", (rend - rstart));
				cb(null, reply);
			});
		},
		function queryMongodb(cb) {
			var mstart = new Date().getTime();
			mongo.query({_id: req.id}, function endMongo(err, doc) {
				var mend = new Date().getTime();
				console.log("mongodb took:", (mend - mstart));
				cb(null, doc);
			});
		}
	], function(err, results) {
		var end = new Date().getTime();
		res.end("entire request took: " + (end - start));
	});
}

So now we track three different timers: one for each database call and a third for the entire request. The problem with this kind of measurement is that it depends on the event loop not being busy, so that it can execute the endRedis and endMongo functions as soon as the network responses have been received. If the process is busy instead, we can no longer measure how long the database query actually took, because the end-time measurement is delayed:

As we can see, the time between the start and end measurements was inflated by other processing happening at the same time. In practice, when you are suffering from a busy event loop, all measurements like this will show highly elevated latencies and you can't trust them to reflect the actual external databases.

Measuring Event Loop Latency

Unfortunately there isn’t much visibility into the Node.JS event loop and we have to resort into some clever tricks. One pretty good way is to use a regularly scheduled timer: If we log start time, schedule a timer 200ms into the future (setTimeout()), log the time and then compare if the timer was fired more than 200 ms, we know that the event loop was somewhat busy around the 200ms mark when our own timer should have been executed:

var previous = null;
var profileEventLoop = function() {
    var ts = new Date().getTime();
    if (previous) {
        console.log(ts - previous);
    }
    previous = ts;

    setTimeout(profileEventLoop, 200);
}

setImmediate(profileEventLoop);

On an idle process the profileEventLoop function should print times like 200-203 milliseconds. If the delays start to get much longer than the setTimeout interval, say over 20% longer, you know that the event loop is starting to get too full.

Use Statsd / Graphite To Collect Measurements

I’ve used console.log to print out the measurement for the sake of simplicity but in reality you should use for example statsd + graphite combination. The idea is that you can send a single measurement with a simple function call in your code to statsd, which calculates multiple metrics on the received data every 10 seconds (default) and it then forwards the results to Graphite. Graphite can then used to draw different graphs and further analyse the collected metrics. For example actual source code could look something like this:

var SDC = require('statsd-client'), sdc = new SDC({host: 'statsd.example.com'});

function handler(req, res) {
    sdc.increment("handler.requests");
    var start = new Date();
    client.get('mykey', function redisReply(err, reply) {
        res.end("Redis value: " + reply);
        sdc.timing("handler.timer", start);
    });
}

Here we increment the counter handler.requests each time we get a new incoming request. This can then be used to see how many requests per second the server handles during the day. In addition we measure how long each request took to process. Here's an example of what the results might look like when increasing load starts to slow the system down and the latency starts to spike up: the blue (mean) and green (median) latencies are pretty tolerable, but the 95th percentile increases a lot, so 5% of our users get much slower responses.

If we add the 99th percentile to the picture we see how bad the situation can really be:

Conclusion

Node.JS is not an optimal platform for complex request processing where different requests might contain different amounts of data, especially if we want to guarantee some kind of Service Level Agreement (SLA) stating that the service must respond fast enough. A lot of care must be taken so that a single asynchronous callback never processes for too long, and it might be worth exploring other languages which are not completely single threaded.

Using HAProxy to do health checks to gRPC services

Haproxy is a great tool for load balancing between microservices, but it currently doesn't support HTTP/2.0 or gRPC directly. The only option for now is to use tcp mode to load balance gRPC backend servers. It is however possible to implement intelligent health checks for gRPC-enabled backends using the "tcp-check send-binary" and "tcp-check expect binary" features. Here's how:

First create a .proto service to represent a common way to obtain health check data from all of your servers. This should be shared with all your servers and projects, as each gRPC endpoint can implement multiple different services. Here's my servicestatus.proto as an example, and it's worth noting that we should be able to add more fields to the StatusRequest and HealthCheckResult messages later if we want to extend the functionality without breaking the haproxy health check feature:

syntax = "proto3";

package servicestatus;

service HealthCheck {
  rpc Status (StatusRequest) returns (HealthCheckResult) {}
}

message StatusRequest {

}

message HealthCheckResult {
  string Status = 1;
}

The idea is that each service implements the servicestatus.HealthCheck service so that we can use the same monitoring tools to monitor every gRPC-based service in our entire software ecosystem. In the HAProxy case I want haproxy to call the HealthCheck.Status() function every few seconds and the server to respond whether everything is OK and it is capable of accepting new requests. The server should set the HealthCheckResult.Status field to the string "MagicResponseCodeOK" when everything is good, so that we can look for this magic string in the response from within haproxy.

Then I extended the service_greeter example (in node.js in this case) to implement this:

var PROTO_PATH = __dirname + '/helloworld.proto';

var grpc = require('../../');
var hello_proto = grpc.load(PROTO_PATH).helloworld;

var servicestatus_proto = grpc.load(__dirname + "/servicestatus.proto").servicestatus;
function sayHello(call, callback) {
  callback(null, {message: 'Hello ' + call.request.name});
}

function statusRPC(call, callback) {
  console.log("statusRPC", call);
  callback(null, {Status: 'MagicResponseCodeOK'});
}

/**
 * Starts an RPC server that receives requests for the Greeter service at the
 * sample server port
 */
function main() {
  var server = new grpc.Server();
  server.addProtoService(hello_proto.Greeter.service, {sayHello: sayHello});
  server.addProtoService(servicestatus_proto.HealthCheck.service, { status: statusRPC });
  server.bind('0.0.0.0:50051', grpc.ServerCredentials.createInsecure());
  server.start();
}

main();

Then I also wrote a simple client to do a single RPC request to the HealthCheck.Status function:

var PROTO_PATH = __dirname + '/servicestatus.proto';

var grpc = require('../../');
var servicestatus_proto = grpc.load(PROTO_PATH).servicestatus;

function main() {
  var client = new servicestatus_proto.HealthCheck('localhost:50051', grpc.credentials.createInsecure());
  client.status({}, function(err, response) {
    console.log('Greeting:', response);
  });
}

main();

What followed was a brief and interesting exploration into how the HTTP/2.0 protocol works and how gRPC uses it. After a short while with Wireshark I was able to explore the different frames inside an HTTP/2.0 request:


We can see here how the HTTP/2 connection starts with the Magic (connection preface) followed by a SETTINGS frame. It seems that in this case we don't need the WINDOW_UPDATE frame when we later construct our own request. If we look closer at packet #5 with Wireshark we can see this:


The Magic preface and SETTINGS are required at the start of every HTTP/2 connection. After these, gRPC sends a HEADERS frame which contains the interesting parts:


There’s also a DATA which in this case contains the protocolbuffers encoded payload of the function arguments. The DATA frame is analogous to the POST data payload in the HTTP/1 version, if that helps you to understand what’s going on.

What I did next was simply copy the Magic, SETTINGS, HEADERS and DATA frames as raw hex strings and write a simple node.js program to test my work:


var net = require('net');

var client = new net.Socket();
client.connect(50051, '127.0.0.1', function() {
	console.log('Connected');

	var magic = new Buffer("505249202a20485454502f322e300d0a0d0a534d0d0a0d0a", "hex");
	client.write(magic);

	var settings = new Buffer("00001204000000000000020000000000030000000000040000ffff", "hex");
	client.write(settings);

	var headers = new Buffer("0000fb01040000000140073a736368656d65046874747040073a6d6574686f6404504f535440053a70617468212f736572766963657374617475732e4865616c7468436865636b2f537461747573400a3a617574686f726974790f6c6f63616c686f73743a3530303531400d677270632d656e636f64696e67086964656e746974794014677270632d6163636570742d656e636f64696e670c6465666c6174652c677a69704002746508747261696c657273400c636f6e74656e742d747970651061
	client.write(headers);

	var data = new Buffer("0000050001000000010000000000", "hex");
	client.write(data);

});

client.on('data', function(data) {
	console.log('Received: ' + data);
});

client.on('close', function() {
	console.log('Connection closed');
});

When I ran this node.js client code I managed to create a correct gRPC request to the server and I could see the response, most importantly the MagicResponseCodeOK string. So how can we use this with HAProxy? We simply define a backend with "mode tcp", concatenate the different HTTP/2 frames into one "tcp-check send-binary" blob and ask haproxy to look for the MagicResponseCodeOK string in the response. I'm not 100% sure yet that this works across all different gRPC implementations, but it's a great start as a technology demonstration so that we don't need to wait for HTTP/2 support in haproxy.

listen grpc-test
	mode tcp
	bind *:50051
	option tcp-check
	tcp-check send-binary 505249202a20485454502f322e300d0a0d0a534d0d0a0d0a00001204000000000000020000000000030000000000040000ffff0000fb01040000000140073a736368656d65046874747040073a6d6574686f6404504f535440053a70617468212f736572766963657374617475732e4865616c7468436865636b2f537461747573400a3a617574686f726974790f6c6f63616c686f73743a3530303531400d677270632d656e636f64696e67086964656e746974794014677270632d6163636570742d656e636f64696e670c6465666c6174652c677a69704002746508747261696c657273400c636f6e74656e742d74797065106170706c69636174696f6e2f67727063400a757365722d6167656e7426677270632d6e6f64652f302e31312e3120677270632d632f302e31322e302e3020286f73782900000500010000000100000000000000050001000000010000000000
	tcp-check expect binary 4d61676963526573706f6e7365436f64654f4b
	server 10.0.0.1 10.0.0.1:50051 check inter 2000 rise 2 fall 3 maxconn 100

There you go. =)

Windows script to convert video into jpeg sequence

I do a lot of Linux scripting but Windows .BAT files are something which I haven’t touched since the old MS DOS times.

Here’s a simple .BAT file which you can use to easily convert video into a jpeg sequence using ffmpeg:

echo Converting %1 to jpeg sequence
mkdir "%~d1%~p1%~n1"
c:\work\ffmpeg\bin\ffmpeg.exe -i %1 -q:v 1 %~d1%~p1%~n1\%~n1-%%05d.jpg

You can copy this into your SendTo folder (%APPDATA%\Microsoft\Windows\SendTo) so that you can run it by selecting a file and right-clicking. It creates a subdirectory inside the source file's directory and writes the sequence there. Most of the magic is in the weird %~d1 style variables, which I found out about from this StackExchange answer. I use this to convert my GoPro footage into a more suitable jpeg sequence which I then use with DaVinci Resolve.

Continuous Integration pipeline with Docker

I’ll describe our Continuous Integration pipeline which is used by several teams to develop software which is later deployed as Docker Containers. Our programmers use git to develop new features into feature branches which are then usually merged into master branch. The master branch represents the current development of the software. Most projects also have a production branch which always contains code which is ready to be deployed into production at any given moment. Some software packages use version release model so each major version has its own branch.

As developers work on their code they always run at least the unit tests locally on their development machines. New code is committed into a feature branch, which is merged by another developer into master and pushed to the git repository. This triggers a build on a Jenkins server. We have several Jenkins environments; the most important are testing and staging. Testing provides a CI environment for the developers to verify that their code is production compatible, and staging is for the testing team so that they have time to test a release candidate before it's actually deployed into production.

High level anatomy of a CI server

The CI server runs Linux with Docker support. A Jenkins instance is currently installed directly on the host system instead of in a container (we had some issues with that, as it needs to launch containers). Two sets of backend services are also started on the server. The first is a minimal set required for a development environment: these containers run in --net=host mode and bind to their default ports, and they are used for the unit tests.

Then there’s a separated set of services inside containers which form a complete set that looks just like production environment. The services also obtain fresh backups from production databases so that the developers can test the new code against a copy of live data. These services run with the traditional docker mode, so they each have their own IP from the 178.18.x.x address space. More on this later.


Services for the development environment and for running unit tests

Developers can run a subset of the required services on their development laptops. This means that for each database type (Redis, MongoDB etc.) only a minimal number of processes is started (say one MongoDB, one Redis and no more), so that the environment doesn't consume too many resources. The tests can then be written to assume that actual databases are available. When the tests are executed on a CI machine, a similar set of services is found on the default ports of localhost. So the CI machine has both a set of services bound to the localhost default ports and a separate set of services which represents the production environment (see the next paragraph).

We also use a single Vagrant image which contains this minimal set of backend services inside containers, so that the developers can easily spin up the development environment on their laptops.

Build sequence

When a build is triggered the following sequence is executed. A failure in any step breaks the chain and the build is marked as a failure:

  1. Code is pulled from the git repository.
  2. The code includes a Dockerfile which is used to build a container image, tagged with the git revision id. The result is that each commit id maps to exactly one container image.
  3. The container is started so that it executes the unit test suite inside the container. This accesses the set of empty databases which is reserved for unit and integration testing.
  4. The integration test suite is executed: first a special network container is started with --name=networkholder. It acts as the base for the container network and runs redir, which redirects certain ports from inside the container to the host system so that some dependent services (like Redis, MongoDB etc.) can be accessed as if they were on the container's localhost. Then the application container is started with --net=container:networkholder so that it reuses the network container's network stack, and the application starts listening for incoming requests. Finally a third container is started in the same network space (--net=container:networkholder) and it executes the integration test suite.
  5. A new application container is started (usually replacing the container from the previous build) so that the developers can access the running service across the network. This application container has access to a production-like set of backend services (like databases) containing a fresh copy of the production data.
  6. A set of live tests is executed against the application container launched in the previous step. These tests are written so that they could also be executed continuously in production. This step verifies that the build works with a deployment similar to the production one.
  7. The build is now considered successful. If this was a staging build, the container is uploaded to a private Docker registry so that it can be deployed into production. Some services run an additional container build where all build tools and other unnecessary binaries are stripped from the container for security reasons.

Service naming in testing, staging and production

Each of our services has a unique service name; for example a set of MongoDB services would have the names "mongodb-cluster-a-node-1" to "mongodb-cluster-a-node-3". This name is used to create a DNS record, "mongodb-cluster-a-node-1.us-east-1.domain.com", so that each production region has its own domain (here "us-east-1.domain.com"). All our servers use /etc/resolv.conf to add the region domain to their search path. As a result, the applications can use the plain service name without the domain to find the services. This has the additional benefit that we can run the same backend services on our CI servers, where a local DNS resolver resolves the host names to the local Docker containers.

Consider this setup:

  • The application config has a setting like mongodb.host = "mongodb-cluster-a-node-1:27017"
  • In the production environment the service mongodb-cluster-a-node-1 is deployed onto some arbitrary machine and a DNS record is created: mongodb-cluster-a-node-1.us-east-1.domain.com A 10.2.2.1
  • The testing and staging environments both run the mongodb-cluster-a-node-1 service locally inside one container. This container has its own IP address, for example 172.18.2.1.

When the application is run in testing or staging: the application resolves mongodb-cluster-a-node-1. The request goes to a local dnsmasq on the CI machine, which resolves the name "mongodb-cluster-a-node-1" to a container at IP 172.18.2.1. The application connects to this IP, which is local to the same machine.

When the application is run in production: the application resolves mongodb-cluster-a-node-1. The request goes into the libc DNS lookup code, which uses the search property from /etc/resolv.conf. As a result a DNS query is eventually made for mongodb-cluster-a-node-1.us-east-1.domain.com, which returns an IP on some arbitrary machine.

This setup allows us to use the same configuration in the testing, staging and production environments, and to verify that all high-availability client libraries can connect to all required backends and that the software will work in the production environment.

Conclusion

This setup suits our needs quite well. It leverages the sandboxing which Docker gives us and enables us to do new deployments with great speed: the CI server needs around three minutes to finish a single build, plus two minutes for deployment with our Orbitctl tool, which deserves its own blog post. The developers can use the same containers in a compact Vagrant environment as we use to run our actual production instances, reducing the overhead of maintaining separate environments.

Comparing Kubernetes with Orbit Control

I’ve been programming Orbit Control as a tool to deploy Docker containers for around half a year which we have been running in production without any issues. Recently (Nov 2014) Google released Kubernetes, its cluster container manager, which slipped under my radar until now. Kubernetes seem to contain several nice design features which I had already adopted into Orbitctl, so it looks like a nice product after a quick glance. Here’s a quick summary on the differences and similarities between Kubernetes and Orbitctl.

  • Both use etcd to store central state.
  • Both deploy agents which use the central state from etcd to converge the machine into the desired state.
  • Kubernetes relies on SaltStack for bootstrapping the machines. Currently we use Chef to bootstrap our machines but for Orbitctl it’s just one static binary which needs to be shipped into the machine, so no big difference here.
  • Orbitctl has just “services” without any deeper grouping. Kubernetes adds to this by defining that a Service is a set of Pods. Each pod contains containers which must be running in the same machine.
  • Orbit doesn’t provice any mechanisms for networking. The containers within a Kubernetes Pod share a single network entity (ie. and IP address) and the IP address is routable and accessible between machines running the pods. This seems to help preventing port conflicts in a Kubernetes deployment.
  • Orbit provides a direct access for Docker api which doesn’t hide anything where Kubernetes encapsulates several Docker details (like networking, volume mounts etc) into its own manifest format.
  • Orbitctl has “tags”, Kubernetes has “labels” which have more use cases within Kubernetes than what Orbit currently has for its tags.
  • Orbitctl relies on operators to specify which machines (according to tag) run which service. Kubernetes has some kind of automatic scheduler which can take cpu and memory requirements into account when it distributes the pods.
  • Both use json to define services with pretty similar syntax which is then loaded using a command line tool into etcd.
  • Orbitctl can automatically configure haproxies for a specific set of services within a deployment. Kubernetes has a similar software router, but it doesn't support haproxy yet. There are open issues on this, so it is coming in the future.
  • Kubernetes has several networking enhancements coming up later on its own feature roadmap. Read more here.
  • Both have support for health checks.
  • Orbit supports deployments across multiple availability zones but not across multiple regions. Kubernetes says it's not supposed to be distributed across availability zones, probably because it's lacking some HA features, as it has a central server.

Kubernetes looks really promising, at least once it reaches version 1.0, which has a nice list of planned features. Currently it's lacking some features critical for us, like haproxy configuration and support for deployments across availability zones, so it's not production ready for us, but it's definitely something to keep an eye on.

Quick way to analyze MongoDB frequent queries with tcpdump

MongoDB has an internal profiler, but it's often too cumbersome when you just want quick statistics on what kind of queries the database is getting. Luckily there's an easy way to get some quick statistics with tcpdump. Granted, these examples are pretty naive in terms of accuracy, but they are really fast to do and they give out a lot of useful information.

Get top collections which are getting queries:

tcpdump dst port 27017 -A -s 1400 |grep query | perl -ne '/^.{28}([^.]+\.[^.]+)(.+)/; print "$1\n";' > /tmp/queries.txt
sort /tmp/queries.txt | uniq -c | sort -n -k 1 | tail

The first command dumps the beginning of each packet going to MongoDB as a string and then filters out everything except queries. The perl regexp picks out the target collection name and prints it to stdout. You should run this for around 10 seconds and then stop it with Ctrl+C. The second command sorts this log and prints the top collections to stdout. You can run these commands either on your MongoDB machine or on your frontend machine.

You can also get more detailed statistics about the queries by looking at the tcpdump output. For example you can spot keywords like $and, $or, $readPreference etc. which help you determine what kinds of queries there are. Then you can pick out the queries you might want to cache with memcached or Redis, or maybe move some queries to the secondary instances.

Also check out this nice tool called MongoMem, which can tell you how much of each of your collections is stored in physical memory (RSS). This is also known as the "hot data".

Open terrain data – could it reveal new places to climb?

As part of the Finnish government's open data initiative, the National Land Survey of Finland (Maanmittauslaitos) finally opened its archives for everyone to see. Perhaps the most interesting part is the Topographic Database (Maastotietokanta), one huge database itemizing every terrain detail the National Land Survey has mapped. Climbers have long searched for potential new climbing spots by browsing maps, but could the job be made easier?

Maanmittauslaitos defines a boulder (kivi) as follows:

A boulder that is over 2.5 m high, or generally well known, or clearly stands out from its surroundings in an area with few rocks. In areas with plenty of boulders over 2.5 m high, only those that stand out most clearly from their surroundings are recorded.

To my ear that sounds like a pretty good definition of a potential bouldering rock! As an experiment I analyzed all 483713 boulders found in Finland and divided them on the map into blocks of roughly 1000 * 500 meters. I then plotted the blocks containing the most boulders on top of Google Maps. On that map you can click an individual point to view the topographic map of that spot.

Of course this analysis doesn't directly find boulders suitable for climbing; rather it helps steer the search towards potentially promising areas. If an area looks interesting, just grab a GPS and head out into the spring countryside! Please keep access issues in mind, though! This is just one example of how the data can be used for the benefit of climbing; many of you will surely come up with better ways.

Go to the overview map here.

Photo: Juha Immonen

The technically oriented can take a look at the source code on GitHub, written mainly for my own use. The data itself I conveniently copied from the Kapsi Ry mirror server.

Why Open-sourcing Components Increases Company Productivity and Product Quality

We’re big fans of open source community here at Applifier. So much, that we believe that open-sourcing software components and tools developed in-house will result in better quality, increased cost savings and increased productivity. Here’s why:

We encourage our programmers to design and implement components which aren't part of our core business as reusable packages that will be open-sourced once they are ready. The software is distributed on our GitHub site, with credit to each individual who contributed to it.

Because the programmers know that their full name will be printed all over the source code, and that they can later be Googled by it, they take better care to ensure that the quality is high enough to stand up to a closer look. This means:

  • Better overall code quality. Good function and parameter names, good packaging, no unused functions and so on.
  • Better modularization. The component doesn't have as many dependencies on other systems, which is generally considered good coding practice.
  • Better tests and test coverage. Tests are considered an essential part of modern software development, so you'll want to show everybody that you know your business, right?
  • Better documentation. The component is published so that anybody can use it, so it must have good documentation and usage instructions.
  • Better backwards compatibility. Coders take better care when they design APIs and interfaces because they know that somebody might be using the component somewhere out there.
  • Better security. The coder knows that anybody will be able to read the code and look for security holes, and thus takes better care not to create any.

In practice, we have found that all of our open source components have higher code and documentation quality than any of our non-published software components. This also ensures that the components are well documented and can be easily maintained if the original coders leave the company, which brings good cost savings in the long run. Open-sourcing components also gives your company good PR value and makes you more attractive to future employees.

For example, one of our new guys was asked to write a small component to pull some monitoring data from RightScale and feed it into Zabbix, our monitoring system. Once he said the component was complete, I told him: “Good, now polish it until you dare to publish it under your own name on GitHub.”

Crash course on Java JVM memory issues for sysadmins

Are you a sysadmin who is new to Java? Then you might find this post to be helpful.

Java has its own memory management system with garbage collection, which is really nice most of the time, but you need to know some details about how it works so you can administer your JVM instances effectively.

How does Java manage memory?

At startup the JVM allocates a block of memory from the OS for its heap, which it hands out to the program running inside the JVM. The amount is controlled with two command line arguments: -Xms tells how much memory the JVM allocates at start and -Xmx sets the maximum amount of memory the JVM can allocate from the OS. For example -Xms512m -Xmx1G tells Java to start with half a gigabyte and allows the heap to grow to one gigabyte.

As the Java program runs it allocates memory for its objects from the JVM heap. This makes the heap grow until a GC (garbage collection) threshold is reached, which triggers the JVM to find the objects that are no longer used (objects not referenced by any live object) and free their memory back to the heap. There are numerous ways this can work in different GC implementations (Java has many of them) and they are out of the scope of this article. The main point is that Java heap usage grows until roughly 80% usage, the GC runs, and usage drops back to a much lower level. If you use jconsole to watch the free memory you will see something like this:

The sawtooth-like pattern is just normal Java garbage collection at work and nothing to worry about. It does, however, make it difficult to know how much memory the program actually needs.

What happens when Java runs out of memory?

If the JVM can’t free enough memory with a simple GC it will run a Full GC, which is a stop-the-world collection. This suspends JVM execution until the collection is done. A Full GC can be seen as a sudden drop in the amount of used memory, for example as seen in this image. The Full GC in this case took 0.8 seconds. That’s not much, but it did suspend program execution for that time, so keep this in mind when designing your Java software and its real-time requirements.

Usually a Full GC frees enough memory for the program to continue, but if the JVM simply does not have enough memory it will need to trigger another Full GC shortly after. This can escalate into a GC storm, where the program spends more and more time doing ever longer Full GCs and finally runs out of memory. It’s not uncommon to see a Full GC take over two minutes in these situations, and remember, the program is suspended during a Full GC! Needless to say this is bad, right?

However, giving the JVM too much memory is also bad. It makes the JVM happy because it doesn’t need to do Full GCs, but the minor GCs can take longer, and if you eventually do run into a Full GC it will take long. Very long. Thus you need to think about how much memory your program needs and set the JVM -Xms and -Xmx so that it has enough, plus some additional “GC breathing room” on top of that.

How does the OS see all this?

When the JVM starts and allocates the amount of memory specified in -Xms, the OS does not immediately hand over all of it; thanks to modern virtual memory management the OS merely reserves it to be used later. You can see this in the VIRT column in top. Once the Java program starts to actually use this memory the OS has to provide it, and you can see the program’s RES column value grow. VIRT is how much virtual memory has been allocated and mapped to the process (this includes the JVM heap plus the JVM code and other libraries) and RES is how much of all that VIRT memory is actually in RAM.

The Java process in the image above has been given too much memory. The program was started with a 384MB heap (-Xms384m) and allowed to grow up to one gigabyte (-Xmx1G), but it is actually using just 161 megabytes of it.

However, when the GC runs and frees memory back to the application, the memory is not given back to the OS. Thus you will see the RES value grow towards VIRT, but never actually decrease unless the OS chooses to swap some of the JVM memory out to disk. This can easily happen if you give the JVM a heap that is too big and doesn’t get used, and you should try to avoid it.

Top tip for top: you can press f to add and hide columns like SWAP. Notice that SWAP isn’t actually the amount of memory that has been swapped to disk. According to the top manual, VIRT = SWAP + RES, so SWAP contains both the pages that have been swapped to disk and pages that haven’t actually been used yet. See more very useful top commands by pressing ?.

How can I monitor all this?

The best way is to use JMX with some handy tool like jconsole. JConsole is a GUI utility which comes with all JDK distributions and can be found in the bin/ directory (jconsole.exe on Windows). You can use jconsole to connect to a running JVM, extract a lot of different metrics out of it and even tweak some settings on the fly.

JMX needs to be enabled, which can be done by adding these arguments to the JVM command line:

-Dcom.sun.management.jmxremote=true -Dcom.sun.management.jmxremote.port=8892 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false

Notice that these settings disable authentication and SSL, so you should not do this unless your network is secured from the outside. You can also feed this data into monitoring systems like Zabbix (my favourite), Cacti or Nagios, which I have found very helpful when debugging JVM performance.

Another way is to enable GC logging, which can be done in the Sun JVM with these command line parameters (they are reported to work with OpenJDK as well, but I haven’t tested this):

-XX:+PrintGCTimeStamps -XX:+PrintGC -Xloggc:/some/dir/cassandra.gc.log

These will print GC statistics to the log file; here’s an actual example:

17500.125: [GC 876226K->710063K(4193024K), 0.0195470 secs]
17569.086: [GC 877871K->711547K(4193024K), 0.0200440 secs]
17641.289: [GC 879355K->713210K(4193024K), 0.0201440 secs]
17712.079: [GC 881018K->714931K(4193024K), 0.0212350 secs]
17736.576: [GC 881557K->882170K(4193024K), 0.0419590 secs]
17736.620: [Full GC 882170K->231044K(4193024K), 0.8055450 secs]
17786.560: [GC 398852K->287047K(4193024K), 0.0244280 secs]

The first number is seconds since JVM startup; the rest of the line shows the GC type (minor GC vs. Full GC), the heap usage before and after the collection (with the total heap size in parentheses) and how long the collection took.
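
If you want a quick summary of such a log, a small script along these lines (my own sketch, not part of the JDK) prints each collection and totals the pause times. Run it for example with python gcstats.py < /some/dir/cassandra.gc.log:

import re
import sys

# Matches lines like: 17736.620: [Full GC 882170K->231044K(4193024K), 0.8055450 secs]
line_re = re.compile(r'([\d.]+): \[(Full GC|GC) (\d+)K->(\d+)K\((\d+)K\), ([\d.]+) secs\]')

minor_pauses, full_pauses = [], []
for line in sys.stdin:
    m = line_re.search(line)
    if not m:
        continue
    ts, kind, before, after, heap, secs = m.groups()
    (full_pauses if kind == 'Full GC' else minor_pauses).append(float(secs))
    print("%10ss %-7s freed %7d K, paused %s s" % (ts, kind, int(before) - int(after), secs))

print("minor GCs: %d (%.2f s total), Full GCs: %d (worst pause %.2f s)" %
      (len(minor_pauses), sum(minor_pauses),
       len(full_pauses), max(full_pauses) if full_pauses else 0.0))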

Conclusion

  • Java JVM will eat all the memory which you give to it (this is normal)
  • You need to tune the JVM -Xms and -Xmx parameters to give it enough, but not too much, memory so that your application works.
  • The memory won’t be released back to the OS until the JVM exits, but the OS can swap JVM memory out. Usually this is bad and means you should decrease the amount of memory you give to the JVM.
  • Use JMX to monitor the JVM memory usage to find suitable values.

Script and template to export data from haproxy to zabbix

I’ve just created a zabbix template with a script which can be used to feed performance data from haproxy to zabbix. The script first uses HTTP to get the /haproxy?stats;csv page, parses the CSV and uses the zabbix_sender command line tool to send each attribute to the zabbix server. The script can be executed on any machine which can access both the zabbix server and the haproxy stats page (I use the machine which runs zabbix_server). The script and template work on both zabbix 1.6.x and 1.8.x.
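
For the curious, the core idea is roughly the following. This is just a simplified sketch (Python 2 style), not the actual zabbix_haproxy code; the stats URL, zabbix server address and key names below are made up:

import csv
import subprocess
import urllib2

STATS_URL = 'http://your.frontend/haproxy?stats;csv'   # made-up URL, use your own
ZABBIX_SERVER = '10.0.0.100'                           # made-up zabbix server address

# haproxy prefixes the CSV header with "# ", so strip that before parsing.
lines = urllib2.urlopen(STATS_URL).read().splitlines()
lines[0] = lines[0].lstrip('# ')
for row in csv.DictReader(lines):
    host = '%s.%s' % (row['pxname'], row['svname'])    # map this to your zabbix host names
    for column in ('scur', 'smax', 'rate'):            # current sessions, max sessions, rate
        if row.get(column):
            subprocess.call(['zabbix_sender', '-z', ZABBIX_SERVER,
                             '-s', host, '-k', 'haproxy.' + column, '-o', row[column]])

The real script additionally reads the @zabbix_frontend and @zabbix_server annotations from haproxy.cfg (see the usage section below) to map the haproxy names to zabbix host names.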

As the haproxy server names might differ from the zabbix server names, the script uses annotations hidden in comments inside haproxy.cfg. The annotations tell the script which frontend and server node statistics should be sent to the zabbix server. This keeps the configuration in a central place, which helps keep the haproxy and zabbix configurations in sync. The template includes two graphs; an example is below:

I’ve chosen to export the following attributes from haproxy, but more could easily be added (I accept patches via github.com):

  • Current session count
  • Maximum session count
  • Sessions per second
  • HTTP responses per minute, grouped by 1xx, 2xx, 3xx, 4xx and 5xx.
  • Mbps in (network traffic)
  • Mbps out (network traffic)
  • Request errors per minute
  • Connection errors per minute
  • Response errors per minute
  • Retries (warning) per minute
  • Rate (sessions per second)
  • HTTP Rate (requests per second)
  • Proxy name in haproxy config
  • Server name in haproxy config

The code is available at github (https://github.com/garo/zabbix_haproxy). The script supports HTTP Basic Authentication and masking the HTTP Host header.

Usage:

  1. Import the template_haproxyitems.xml into Zabbix.
  2. Add all your webservers to zabbix as hosts and link them with the Template_HAProxyItem.
  3. Add all your frontends to zabbix as hosts and link them with the Template_HAProxyItem. The frontend hosts don’t need to be mapped to any actual IP or server; I use the zabbix_server IP as the host IP for these.
  4. Edit your haproxy.cfg file and add annotations for the zabbix_haproxy script. These annotations mark which frontends and which servers you map into zabbix hosts. Notice that the annotations are just comments after #, so haproxy ignores them.
    frontend irc-galleria # @zabbix_frontend(irc-galleria)
            bind            212.226.93.89:80
            default_backend lighttpd
    
    backend lighttpd
            mode            http
            server  samba           10.0.0.1:80    check weight 16 maxconn 200   # @zabbix_server(samba.web.pri)
            server  bossanova       10.0.0.2:80    check weight 16 maxconn 200   # @zabbix_server(bossanova.web.pri)
            server  fuusio          10.0.0.3:80     check weight 4 maxconn 200   # @zabbix_server(fuusio.web.pri)
  5. Set up a crontab entry to execute the zabbix_haproxy script every minute. I use the following entry in /etc/crontab:
    */1 * * * * nobody zabbix_haproxy -c /etc/haproxy.cfg -u "http://irc-galleria.net/haproxy?stats;csv" -C foo:bar -s [ip of my zabbix server]
  6. All set! Go and check the latest data in zabbix to see if you are getting the values. If you have problems you can use the -v and -d command line arguments to print debugging information.

Oneliner: erase incorrect memcached keys on demand

We had a situation where our image thumbnail memcached cluster somehow ended up with empty thumbnails. The thumbnails are generated on the fly by image proxy servers and stored in memcached. For some reason some of the thumbnails were truncated.

As I didn’t have time to start debugging the real issue, I quickly wrote this oneliner which detects corrupted thumbnails as they are fetched from memcached and issues a delete operation for them. This keeps the situation under control until I can start the actual debugging. We could also have restarted the entire memcached cluster, but that would have resulted in a big performance penalty for several hours. Fortunately all corrupted thumbnails are just one byte long, so detecting them was simple enough to do with a oneliner:

tcpdump -i lo -A -v -s 1400 src port  11213 |grep VALUE | perl -ne 'if (/VALUE (cach[^ ]+) [-]?\d+ (.+)/) { if ($2 == 1) { `echo "delete $1 noreply\n" | nc localhost 11213`; print "deleted $1\n"; } }'

Here’s how this works:

  1. tcpdump prints all packets in ASCII (-A) which come from port 11213 (src port 11213), our memcached node, on the loopback interface (-i lo).
  2. grep passes through only the lines which contain the response header, which has the following form: “VALUE <key> <flags> <length>”.
  3. For each line (-n) perl executes the given script (-e ‘<script>’), which first uses a regexp to capture the key “(cach[^ ]+)” and then the length.
  4. It then checks whether the length is 1 (if ($2 == 1)) and, if so, executes a shell command which sends a “delete <key> noreply” message to the memcached server using netcat (nc). This erases the corrupted value from the memcached server.
  5. Finally it prints a debug message.

Open BigPipe javascript implementation

We have released our open BigPipe implementation, written for IRC-Galleria and loosely following this facebook blog post. The sources are located at github: https://github.com/garo/bigpipe and there’s an example demonstrating the library in action at http://www.juhonkoti.net/bigpipe.

BigPipe speeds up page rendering by loading the page in small parts called pagelets. This lets the browser start rendering the page while the PHP server is still generating the rest. It transforms the traditional page rendering cycle into a streaming pipeline with the following steps (a small sketch of the idea follows the list):

  1. The browser requests the page from the server.
  2. The server quickly renders a page skeleton containing the <head> tags and a body with empty div elements which act as containers for the pagelets. The HTTP connection to the browser stays open because the page is not yet finished.
  3. The browser starts downloading the bigpipe.js file and after that it starts rendering the page.
  4. The PHP server process is still executing, building one pagelet at a time. Once a pagelet has been completed its results are sent to the browser inside a <script>BigPipe.onArrive(…)</script> tag.
  5. The browser injects the received HTML code into the correct place. If the pagelet needs any CSS resources, those are also downloaded.
  6. After all pagelets have been received the browser starts to load all external javascript files needed by those pagelets.
  7. After the javascript files are downloaded the browser executes all inline javascripts.
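
To illustrate the flow, here is a minimal sketch of the server side in Python/WSGI. Our actual implementation is the PHP + bigpipe.js code linked above; the pagelet contents and the payload fields passed to BigPipe.onArrive here are made up:

import json
import time

PAGELETS = {'profile': '<b>profile box html</b>',
            'feed': '<ul><li>feed item</li></ul>'}

def render_pagelet(name):
    time.sleep(0.5)                 # simulate slow server side work for one pagelet
    return PAGELETS[name]

def stream_page():
    # Steps 1-2: send the page skeleton right away; the connection stays open.
    yield (b'<html><head><script src="bigpipe.js"></script></head><body>'
           b'<div id="profile"></div><div id="feed"></div>')
    # Step 4: push each pagelet to the browser as soon as it has been rendered.
    for name in PAGELETS:
        payload = json.dumps({'id': name, 'content': render_pagelet(name)})
        yield ('<script>BigPipe.onArrive(%s);</script>' % payload).encode('utf-8')
    yield b'</body></html>'

def application(environ, start_response):
    start_response('200 OK', [('Content-Type', 'text/html; charset=utf-8')])
    return stream_page()

Any WSGI server that flushes between chunks will show the pagelets arriving one by one.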

There’s a usage example in example.php; take a good look at it. The example uses a lot of whitespace padding to saturate web server and browser caches so that the BigPipe loading effect is clearly visible. Of course this padding is not required in real usage. There are still some optimizations to be done and the implementation is far from perfect, but that hasn’t stopped us from using it in production.

Files included:

  • bigpipe.js Main javascript file
  • h_bigpipe.inc BigPipe class php file
  • h_pagelet.inc Pagelet class php file
  • example.php Example showing how to use bigpipe
  • test.js Support file for example
  • test2.js Support file for example
  • README
  • Browser.php Browser detection library by Chris Schuld (http://chrisschuld.com/)
  • prototype.js Prototypejs.org library
  • prototypepatch.js Patches for prototype

How NoSQL will meet RDBMS in the future

The NoSQL versus RDBMS war started a few years ago, and as the new technologies mature it seems that the two camps are moving towards each other. The latest example can be found at http://blog.tapoueh.org/blog.dim.html#%20Synchronous%20Replication where the author talks about an upcoming PostgreSQL feature which lets the application developer choose the service level and consistency of each call, hinting to the database cluster what it should do in case of a database node failure.

The exact same technique is widely used in Cassandra, where each operation has a consistency level attribute with which the programmer can decide whether they want full consistency across the entire cluster, or whether it is acceptable that the result might not contain the most up-to-date data in case of a node failure (and also gain extra speed for read operations). This is also called eventual consistency.
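
Here is a small example of what the per-operation consistency level looks like with the pycassa Python client; the keyspace, column family and key names are made up:

import pycassa

pool = pycassa.ConnectionPool('MyKeyspace')
users = pycassa.ColumnFamily(pool, 'Users')

# Strong read: a majority of replicas must answer, so the latest write is always seen.
profile = users.get('2249', read_consistency_level=pycassa.ConsistencyLevel.QUORUM)

# Relaxed read: any single replica may answer; faster and stays available when a node
# is down, but the result might be slightly stale (eventual consistency).
profile = users.get('2249', read_consistency_level=pycassa.ConsistencyLevel.ONE)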

The CAP theorem says that you can only have two out of three features in a distributed application: Consistency, Availability and Partition Tolerance (hence the acronym CAP). To give an example: if you choose Consistency and Availability, your application cannot handle the loss of a node from your cluster. If you choose Availability and Partition Tolerance, your application might not get the most up-to-date data if some of your nodes are down. The third option is to choose Consistency and Partition Tolerance, but then your entire cluster will be down if you lose just one node.

Traditional relational databases are designed around the ACID principles, which loosely map to Consistency and Partition Tolerance in the CAP theorem. This makes it hard to scale an ACID database onto multiple hosts, because ACID needs Consistency. Cassandra, on the other hand, can swim around the CAP theorem just fine because it allows the programmer to choose between Availability + Partition Tolerance and Consistency + Availability.

On the other hand, as NoSQL technologies mature they will start to gain features from traditional relational databases. Things like sequences, secondary indexes, views and triggers can already be found in some NoSQL products, and many more are on roadmaps. There’s also the ever-growing need to mine the data storage to extract business data out of it. Such features can be seen in Cassandra’s Hadoop integration and in MongoDB, which has an internal map-reduce implementation.

Definition of NoSQL: Scavenging the wreckage of alien civilizations, misunderstanding it, and trying to build new technologies on it.

As long as NoSQL is used wisely it will grow and mature, but using it over an RDBMS without good reasons is a very easy way to shoot yourself in the foot. After all, it’s much easier to just get a single powerful machine, like an EC2 x-large instance, run PostgreSQL on it and maybe throw in a few asynchronous replicas to boost read queries. It will work just fine as long as the master node keeps up, and it will be easier to program against.


An example of how to model your data in NoSQL with Cassandra

We have built a Facebook-style “messenger” into our web site which uses Cassandra as its storage backend. I’m describing the data schema here to serve as a simple example of how Cassandra (and NoSQL in general) can be used in practice.

Here’s a diagram of the two column families and what kind of data they contain. The data is modelled into two column families: TalkMessages and TalkLastMessages. Read on for a deeper explanation of what the fields are.

TalkMessages contains each message between two participants. The key is a string built from the two users’ uids, “$smaller_uid:$bigger_uid”. Each column inside this CF contains a single message. The column name is the message timestamp in microseconds since epoch, stored as LongType. The column value is a JSON encoded string containing the following fields: sender_uid, target_uid, msg.

This results in the following structure inside the column family:

"2249:9111" => [
  12345678 : { sender_uid : 2249, target_uid : 9111, msg : "Hello, how are you?" },
  12345679 : { sender_uid : 9111, target_uid : 2249, msg : "I'm fine, thanks" }
]

TalkLastMessages is used to quickly fetch a user’s talk partners, the last message sent between the peers and other similar data. This allows us to fetch all the data needed to display the “main view” for all online friends with just one query to Cassandra. This column family uses the user uid as its key. Each column represents a talk partner the user has been talking to and uses that partner’s uid as the column name. The column value is a JSON-packed structure containing the following fields:

  • last message timestamp: microseconds since epoch when a message was last sent between these two users.
  • unread timestamp : microseconds since epoch when the first unread message was sent between these two users.
  • unread : counter how many unread messages there are.
  • last message : last message between these two users.

This results in the following structure inside the column family for the two example users, 2249 and 9111:

"2249" => [
  9111 : { last_message_timestamp : 12345679, unread_timestamp : 12345679, unread : 1, last_message: "I'm fine, thanks" }

],
"9111" => [
  2249 : { last_message_timestamp :  12345679, unread_timestamp : 12345679, unread : 0, last_message: "I'm fine, thanks" }
]

Displaying the chat (this happens on every page load, so it needs to be fast):

  1. Fetch all columns from TalkLastMessages for the user

Displaying the message history between two participants:

  1. Fetch last n columns from TalkMessages for the relevant “$smaller_uid:$bigger_uid” row.

Marking all messages sent by the other participant as read (when you read the messages):

  1. Get column $sender_uid from row $reader_uid from TalkLastMessages
  2. Update the JSON payload and insert the column back

Sending a message involves the following operations (a rough sketch in code follows the list):

  1. Insert new column to TalkMessages
  2. Fetch relevant column from TalkLastMessages from $target_uid row with $sender_uid column
  3. Update the column json payload and insert it back to TalkLastMessages
  4. Fetch relevant column from TalkLastMessages from $sender_uid row with $target_uid column
  5. Update the column json payload and insert it back to TalkLastMessages
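
Here is a rough pycassa-style sketch of those five steps. This is not our production code; the keyspace name is made up, the column names are treated as plain strings and error handling is simplified:

import json
import time
import pycassa

pool = pycassa.ConnectionPool('Messenger')                 # keyspace name is made up
talk_messages = pycassa.ColumnFamily(pool, 'TalkMessages')
talk_last_messages = pycassa.ColumnFamily(pool, 'TalkLastMessages')

def send_message(sender_uid, target_uid, msg):
    ts = int(time.time() * 1000000)                        # microseconds since epoch
    row_key = '%d:%d' % (min(sender_uid, target_uid), max(sender_uid, target_uid))

    # 1. Insert the new message column (column name is the timestamp, LongType).
    talk_messages.insert(row_key, {ts: json.dumps(
        {'sender_uid': sender_uid, 'target_uid': target_uid, 'msg': msg})})

    # 2-5. Update TalkLastMessages for both participants.
    for row, col, is_receiver in ((str(target_uid), str(sender_uid), True),
                                  (str(sender_uid), str(target_uid), False)):
        try:
            data = json.loads(talk_last_messages.get(row, columns=[col])[col])
        except pycassa.NotFoundException:
            data = {'unread': 0, 'unread_timestamp': ts}
        data['last_message_timestamp'] = ts
        data['last_message'] = msg
        if is_receiver:                                    # the receiver gains one unread message
            if data.get('unread', 0) == 0:
                data['unread_timestamp'] = ts
            data['unread'] = data.get('unread', 0) + 1
        talk_last_messages.insert(row, {col: json.dumps(data)})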

There are also other operations and the actual payload is a bit more complex.

I’m happy to answer questions if somebody is interested :)

Message Truncation UTF-8 Problems With PHP and Sajax library

I recently spent a long night debugging this: my application using PHP, Sajax and MySQL truncated messages from the first non-ASCII character to the end of the message. For example, if I typed “yö tulee” (a Finnish sentence meaning “night is coming”), only “y” was inserted into the database.

The problem was with the Sajax library. Sajax does not handle UTF-8, but there’s an easy fix: open Sajax.php, locate the code in the following snippet and add the highlighted line (the Content-type header, second to last line below) to your Sajax.php:

// Bust cache in the head
header ("Expires: Mon, 26 Jul 1997 05:00:00 GMT"); // Date in the past
header ("Last-Modified: " . gmdate("D, d M Y H:i:s") . " GMT");
// always modified
header ("Cache-Control: no-cache, must-revalidate"); // HTTP/1.1
header ("Pragma: no-cache"); // HTTP/1.0
header ("Content-type: charset=UTF-8");
$func_name = $_GET["rs"];

This forces the Sajax library to use the UTF-8 charset when it transmits data between the browser and the server. If you are using MySQL with UTF-8 you also need to read this. Encoding the SQL query string with $sql = utf8_encode($sql); might help as well.