Test Driven Development on an SQL Database Schema

TDD on an SQL schema. Why not? The normal Test Driven Development approach means that you start by writing a test case, which creates the requirement to write the appropriate code for the test to pass. This results in a very quick iteration: you add a new unit test, verify that it fails, implement the code that makes it pass, and verify that it passes. A single iteration can take just a few minutes or less, and the test set usually executes in just a few seconds. The end result is great test coverage, which helps with refactoring and also documents the user stories in the code.

Applying TDD to SQL

At first, writing CREATE TABLE declarations doesn't sound like something worth testing, but modern SQL database engines offer a lot of tools for enforcing proper, fully valid data. Constraints, foreign keys, checks and triggers are commonly used to ensure that invalid or meaningless data is never stored in the database. This means you can certainly write a simple CREATE TABLE declaration and run with it, but if you want to verify that invalid data cannot be sent to a table, then you need to test for it. If you end up writing triggers and stored procedures, proper tests are even more important.

I picked Ruby, with its excellent RSpec testing tool, for a proof-of-concept implementation testing a new schema containing around a dozen tables and stored procedures. Ruby has a well-working PostgreSQL driver, and writing unit test cases with RSpec is efficient in terms of lines of code. Also, as Ruby is interpreted, there is no compilation step and executing a unit test suite is really fast: in my case a set of 40 test cases takes less than half a second to execute.
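The harness for such a suite can be bootstrapped roughly like this. This is a minimal sketch of a spec_helper.rb, assuming a local PostgreSQL, the pg gem and the schema in schema.sql; the database name twitter_test is made up for illustration and is not from the repository:

```ruby
# spec/spec_helper.rb -- minimal bootstrap sketch; the database name
# "twitter_test" is hypothetical, adjust to your environment.
require 'pg'

RSpec.configure do |config|
  config.before(:all) do
    # Drop and recreate the database so every run starts from scratch.
    admin = PG.connect(dbname: 'postgres')
    admin.exec "DROP DATABASE IF EXISTS twitter_test"
    admin.exec "CREATE DATABASE twitter_test"
    admin.close

    # Load the schema (and optional seed data) into the fresh database.
    @all_conn = PG.connect(dbname: 'twitter_test')
    @all_conn.exec File.read('schema.sql')
    @all_conn.exec File.read('data.sql') if File.exist?('data.sql')
  end

  config.after(:all) do
    @all_conn.close if @all_conn
  end
end
```

The specs then talk to the fresh schema through the shared @all_conn connection.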


Take this simple Twitter example. I placed a complete source code example on GitHub at https://github.com/garo/sql-tdd

    CREATE TABLE users (
        id SERIAL PRIMARY KEY,
        login VARCHAR(40) UNIQUE NOT NULL,
        full_name VARCHAR(40) NOT NULL
    );

    CREATE TABLE tweets (
        id SERIAL PRIMARY KEY,
        user_id INTEGER REFERENCES users(id) NOT NULL,
        tweet VARCHAR(140) NOT NULL
    );

The test suite will first drop the previous database, import the schema from schema.sql followed by any optional, non-essential data from data.sql, and then run each unit test case. Our first test might be to verify that the tables exist:

it "has required tables" do
  rs = @all_conn.exec "SELECT table_name FROM information_schema.tables WHERE table_schema = 'public'"
  tables = rs.values.flatten
  expect(tables.include?("users")).to eq(true)
  expect(tables.include?("tweets")).to eq(true)
end

Maybe test that we can insert users into the database?

it "can have user entries" do
  ret = @all_conn.exec "INSERT INTO users(login, full_name) VALUES('unique-user', 'first') RETURNING id"
  expect(ret[0]["id"].to_i).to be > 0

  ret = @all_conn.exec "SELECT * FROM users WHERE id = #{ret[0]["id"]}"
  expect(ret[0]["login"]).to eq("unique-user")
  expect(ret[0]["full_name"]).to eq("first")
end

Verify that we can’t insert duplicated login names:

it "requires login names to be unique" do
  expect {
    ret = @all_conn.exec "INSERT INTO users(login, full_name) VALUES('unique-user', 'second') RETURNING id"
  }.to raise_error(PG::UniqueViolation)
end

What about tweets? They need to belong to a user, so we want a foreign key. In particular, we want to make sure that the foreign key constraint cannot be violated:

describe "tweets" do
  it "has foreign key on user_id to users(id)" do
    expect { # the database doesn't have a user with id=0
      ret = @all_conn.exec "INSERT INTO tweets(user_id, tweet) VALUES(0, 'test')"
    }.to raise_error(PG::ForeignKeyViolation)
  end
end

If you want to test a trigger validation implemented with a stored procedure, such a violation would raise a PG::RaiseException. Using an invalid value for an ENUM field would raise a PG::InvalidTextRepresentation. You can also easily test views, DEFAULT values, CASCADE updates and deletes on foreign keys, and even user privileges. Happy developing!
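As an illustration of that kind of trigger validation, here is a hypothetical sketch; the function name, trigger name and the rule itself are made up and not part of the example repository:

```sql
-- Reject empty tweets with a trigger; a violation surfaces in the
-- Ruby test suite as PG::RaiseException.
CREATE FUNCTION check_tweet_not_empty() RETURNS trigger AS $$
BEGIN
  IF length(trim(NEW.tweet)) = 0 THEN
    RAISE EXCEPTION 'tweet must not be empty';
  END IF;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER tweets_not_empty
  BEFORE INSERT OR UPDATE ON tweets
  FOR EACH ROW EXECUTE PROCEDURE check_tweet_not_empty();
```

A matching spec would then expect PG::RaiseException when inserting an empty tweet.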

How NoSQL will meet RDBMS in the future

The NoSQL versus RDBMS war started a few years ago, and as the new technologies mature it seems that the two camps are moving towards each other. The latest example can be found at http://blog.tapoueh.org/blog.dim.html#%20Synchronous%20Replication where the author talks about an upcoming PostgreSQL feature that lets the application developer choose the service level and consistency of each call, giving the database cluster a hint about what it should do in case of a database node failure.

The exact same technique is widely adopted in Cassandra, where each operation has a consistency level attribute: the programmer can decide whether they want full consistency across the entire cluster, or whether it is acceptable for the result to miss the most up-to-date data in case of a node failure (and also gain extra speed for read operations). This is also called Eventual Consistency.

The CAP theorem says that a distributed application can have only two out of three properties: Consistency, Availability and Partition Tolerance (hence the acronym CAP). For example: if you choose Consistency and Availability, your application cannot handle losing a node from the cluster. If you choose Availability and Partition Tolerance, your application might not get the most up-to-date data when some of your nodes are down. The third option is to choose Consistency and Partition Tolerance, but then your entire cluster can become unavailable if you lose just one node.

Traditional relational databases are designed around the ACID principle, which loosely maps to Consistency and Partition Tolerance in CAP terms. This makes it hard to scale an ACID database across multiple hosts, because ACID requires Consistency. Cassandra, on the other hand, can swim around the CAP theorem just fine because it allows the programmer to choose per operation between Availability + Partition Tolerance and Consistency + Partition Tolerance.

On the other hand, as NoSQL technology matures it will start to pick up features from traditional relational databases. Things like sequences, secondary indexes, views and triggers can already be found in some NoSQL products, and many more are on roadmaps. There is also the ever-growing need to mine the data storage to extract business data out of it. Such features can be seen in Cassandra's Hadoop integration and in MongoDB, which has a built-in map-reduce implementation.

Definition of NoSQL: Scavenging the wreckage of alien civilizations, misunderstanding it, and trying to build new technologies on it.

As long as NoSQL is used wisely it will grow and mature, but using it without good reasons over an RDBMS is a very easy way to shoot yourself in the foot. After all, it is much easier to get a single powerful machine, like an EC2 extra-large instance, run PostgreSQL on it, and maybe throw in a few asynchronous replicas to boost read queries. It will work just fine as long as the master node keeps up, and it will be easier to program.