Applications come and go, data lasts for ever

14 05 2009

Data lasts longer than applications, but will your next system upgrade reuse existing databases? If not, do you have exit strategies? 

For the last few years I have worked on rebuilding existing mainframe systems on a Java platform with an Oracle backend. The migration often involved developing a mainframe program to export all the data to files, and Java programs to import them into our data model in Oracle. Luckily we had a lot of mainframe competence, and I think the export programs usually took less time to build than our import programs. Anyway, what happens in 10-15 years, when new upgrades and migrations are necessary?

Relational databases are considered safe, and that is often the argument when these questions come up. Competence on relational databases and Oracle will still be around in 10-15 years, but is that enough? Often your data model needs to be understood at a higher level than the database; some data models are easy to understand from the db schema, some are not. Although it is possible to use this as a strategy, I would choose a better one:

Implement a data exit strategy from the start of the project. Take the time to make a dump feature that writes the data to file in a well-defined format, and an import feature that can load data back in. Of course the next system's developers have to understand the format, and one can argue that this is no better than the database, but I think it is easier.

This will also give you a few other positive effects:

  • It could be your backup/restore solution
  • It could help you with testing, since it will be easy to set up a copy of your system with production data
  • It could help you with partitioning for availability 
  • It could help you avoid database lock-in

A popular pattern among developers today is to use a repository abstraction over the database. I think a good idea is to let the repository own the export and import features.
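
To make that concrete, here is a minimal sketch of such a repository in Ruby. The class name and the each_customer/save_customer calls are invented for the example; the point is that the dump is one JSON document per line, a format the next system can parse without knowing our schema.

require 'rubygems'
require 'json'

# Sketch only: CustomerRepository wraps whatever database access you already have.
# export_to writes one JSON document per line; import_from loads that format back in.
class CustomerRepository
  def initialize(db)
    @db = db
  end

  def export_to(path)
    File.open(path, "w") do |file|
      @db.each_customer do |customer|        # each_customer is an assumed method on the db wrapper
        file.puts(customer.to_json)          # customer is assumed to be a hash of attributes
      end
    end
  end

  def import_from(path)
    File.open(path, "r") do |file|
      file.each_line do |line|
        @db.save_customer(JSON.parse(line))  # save_customer is also assumed
      end
    end
  end
end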





Testing Tokyo Cabinet

12 05 2009

Installed Tokyo Cabinet on my Ubuntu server today.

Tokyo Cabinet is a library for managing a database. The database is a key/value store kept in a single file. You can define the database to be a hash, a B+ tree, a table or a fixed-length array.
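
A quick sketch of how the four flavours look with the Ruby binding (file names are arbitrary, and this assumes a Tokyo Cabinet build recent enough to include the table database):

require 'rubygems'
require 'tokyocabinet'
include TokyoCabinet

# hash database: unordered key/value
hdb = HDB::new
hdb.open("sample.tch", HDB::OWRITER | HDB::OCREAT)
hdb.put("key", "value")
hdb.close

# B+ tree database: keeps keys in sorted order
bdb = BDB::new
bdb.open("sample.tcb", BDB::OWRITER | BDB::OCREAT)
bdb.put("key", "value")
bdb.close

# fixed-length array database: keys are numeric ids
fdb = FDB::new
fdb.open("sample.tcf", FDB::OWRITER | FDB::OCREAT)
fdb.put(1, "value")
fdb.close

# table database: the value is a hash of columns
tdb = TDB::new
tdb.open("sample.tct", TDB::OWRITER | TDB::OCREAT)
tdb.put("pk1", { "name" => "tokyo", "kind" => "cabinet" })
tdb.close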

Tokyo Cabinet is close to my dream of what data storage should be:

  •  unstructured data
  •  “unlimited” space
  •  “unlimited” speed
  • totally controlled by programmers

To get a database server you need to install Tokyo Tyrant, which is a network interface that lets multiple distributed clients access the database. It supports the memcached protocol out of the box, HTTP (REST), hot backup and replication.

Packages exist for most clients, including Java and Ruby, and there is a full-text search engine (Tokyo Dystopia).

I did some benchmarking.

1 000 000 inserts on local db in 1.8 secs!

Then I installed Tokyo Tyrant, the network interface to Tokyo Cabinet.

1 000 000 inserts over network in 38 secs.
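
A rough sketch of how the network run could look, assuming the official tokyotyrant Ruby binding and a Tyrant server on the default port 1978:

require 'rubygems'
require 'tokyotyrant'
require 'benchmark'
include TokyoTyrant

# connect to the Tyrant server
rdb = RDB::new
if !rdb.open("localhost", 1978)
  STDERR.printf("open error: %s\n", rdb.errmsg(rdb.ecode))
end

n = 1000000
Benchmark.bm do |x|
  # time n puts over the network
  x.report { n.times { |i| rdb.put(i.to_s, "benchmarked tokyo tyrant with ruby client") } }
end
rdb.close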

It also supports REST, so I benchmarked it with a Ruby REST client. The overhead was of course big.

10 000 inserts varied between 12 and 14 secs.

Gets gave approximately the same results. Here is the script I used against the local database:

require 'rubygems'
require 'tokyocabinet'
require 'benchmark'
include Benchmark
include TokyoCabinet

# open (or create) a local hash database
hdb = HDB::new
if !hdb.open("casket.tch", HDB::OWRITER | HDB::OCREAT)
  ecode = hdb.ecode
  STDERR.printf("open error: %s\n", hdb.errmsg(ecode))
end

n = 10000000
bm do |x|
  # time n puts against the local hash database (the same key is overwritten each time)
  x.report { n.times do ; hdb.put(n, "benchmarked tokyo cabinet with ruby client") ; end }
#  x.report { n.times do ; hdb.get(n) ; end }
end

if !hdb.close
  ecode = hdb.ecode
  STDERR.printf("close error: %s\n", hdb.errmsg(ecode))
end
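
For completeness, a minimal sketch of the REST round-trip, assuming the rest-client gem and Tyrant answering plain HTTP on port 1978 (the key becomes the URL path and the value is the request body):

require 'rubygems'
require 'rest_client'
require 'benchmark'

base = "http://localhost:1978"
n = 10000

Benchmark.bm do |x|
  # one HTTP request per operation, which is where the overhead comes from
  x.report { n.times { |i| RestClient.put("#{base}/key#{i}", "benchmarked tokyo tyrant over REST") } }
  x.report { n.times { |i| RestClient.get("#{base}/key#{i}") } }
end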




Databases databases databases

6 05 2009

Relational Databases are used for everything.

In my current project we use it for:

  • Transaction processing
  • Integration platform
  • Data warehouse
  • Asynchronicity and parallelization
  • Online queries
  • Long-term storage

We use it for all of this because it supports it: not necessarily well, but it's pretty simple, understood by many, and already a part of our infrastructure.

The problem is that the database does not scale, it creates lock-in, it is resistant to change, and we do not have enough control over it. We have to compensate for all of these things, and that gives us less freedom in our architecture.

What I want is:

  • unlimited speed 
  • unlimited scaling
  • store whatever we want (no schema, unstructured data)
  • total control

The database should not be the limiting factor in our architecture.

Is this possible? Well, of course it's a dream, but we do have much better alternatives today.

CouchDB: unstructured, scales well, easy integration with REST and JSON. On the other hand, the speed is pretty bad.
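
As an illustration of the REST/JSON integration, a minimal sketch assuming a local CouchDB on the default port 5984 and an existing database named articles (the names are made up):

require 'rubygems'
require 'json'
require 'net/http'

# store a document: PUT /database/document_id with a JSON body
doc = { "title" => "Databases databases databases", "tags" => ["couchdb", "rest"] }
http = Net::HTTP.new("localhost", 5984)
response = http.send_request("PUT", "/articles/post-1", doc.to_json,
                             { "Content-Type" => "application/json" })
puts response.body   # CouchDB answers with JSON, e.g. {"ok":true,"id":"post-1","rev":"..."}

# read it back: GET /database/document_id
puts http.get("/articles/post-1").body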

BigTable: looks good, but the setup could be very complex.

Tokyo Cabinet: promising, gives the developer a lot of control over the internal structure of the database and the server setup. Very fast.