Saturday 19 April 2008

Databases for data warehousing - part I

In the Telco industry there's loads of data - tens, hundreds, even thousands of millions of rows per day. The problems we face managing our data are no different from those faced by anyone handling thousands to millions of rows per day; they're just exacerbated by the sheer volume.

So what exactly is the problem? What are the issues?
Computers are more powerful than they have ever been.
Surely they can cope with a few billion rows of data?

Disk vs Memory

We know that RAM access times are measured in nanoseconds while hard disk access times are measured in milliseconds – i.e. in theory it's around 100,000x faster to process data in memory than from disk.
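
A quick back-of-the-envelope check of that figure (the latency numbers below are typical assumptions for illustration, not measurements from any particular system):

```python
# Rough latency figures assumed for illustration: ~100ns per RAM access,
# ~10ms per random seek on a spinning disk.
ram_access_s = 100e-9
disk_access_s = 10e-3

print(f"disk / RAM latency ratio: {disk_access_s / ram_access_s:,.0f}x")
# -> disk / RAM latency ratio: 100,000x
```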

So can't we run our databases in memory?

We can - and that's one of the things super computers do.
They are full of processors and memory and are designed
to get round I/O and compute bottlenecks.

However, memory-based systems are very expensive: disk costs around £0.10-£0.15 per GB, whereas RAM is roughly 300x more expensive at around £30 per GB (I've based these figures on PC technology, but the price ratio is indicative). So:
  • Disk is slow but inexpensive and non-volatile
  • RAM is fast but expensive and volatile
RAM-intensive supercomputers: 30-50TB for around $999M.
High-spec disk-based database servers: 50TB for around $20M.
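
The raw storage prices alone tell the story - a rough sketch using the per-GB figures quoted above (this prices the raw storage only, not a whole system):

```python
# Back-of-the-envelope storage cost for 50TB at the per-GB prices quoted above.
# Raw storage only - the supercomputer/server figures include far more than this.
disk_low, disk_high = 0.10, 0.15   # £ per GB of disk
ram_per_gb = 30.0                  # £ per GB of RAM

capacity_gb = 50 * 1000            # 50TB, decimal TB for simplicity

print(f"50TB on disk: £{capacity_gb * disk_low:,.0f} - £{capacity_gb * disk_high:,.0f}")
print(f"50TB in RAM:  £{capacity_gb * ram_per_gb:,.0f}")
print(f"RAM/disk price ratio: {ram_per_gb / disk_high:.0f}x - {ram_per_gb / disk_low:.0f}x")
```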

For now most companies are forced to use disk-based storage and rely on innovative techniques to get round the I/O and compute bottlenecks.

Bottlenecks facing Databases

Watch the video demo on the Kickfire website. It captures the
essence of the Von Neumann compute bottleneck and the disk
I/O bottleneck facing database systems.

To summarise:

1. The processor's registers have to wait for data to be pushed to/from memory - that's the Von Neumann bottleneck.
2. Data has to be moved between disk and memory/processor - that's the disk I/O bottleneck.
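
A crude way to feel the disk I/O bottleneck is to sum the same numbers once from an in-memory list and once by re-reading them from a file. This is only a toy sketch (the OS file cache and text parsing blur the raw disk cost), but the gap is still visible:

```python
import os
import time

N = 1_000_000
numbers = list(range(N))

# Write the same data out once so it can be read back through the I/O path.
with open("numbers.txt", "w") as f:
    f.write("\n".join(str(n) for n in numbers))

start = time.perf_counter()
total_in_memory = sum(numbers)                       # data already in RAM
mem_secs = time.perf_counter() - start

start = time.perf_counter()
with open("numbers.txt") as f:
    total_from_disk = sum(int(line) for line in f)   # data pulled back off disk
disk_secs = time.perf_counter() - start

print(f"in-memory: {mem_secs:.4f}s   from disk: {disk_secs:.4f}s")
os.remove("numbers.txt")
```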

Solutions

There are many ways to solve these problems both in hardware and
software. ETL tools, database technologies and hardware solutions
are full of ways round these bottlenecks.

Here are a few solutions:
  • Filter out unnecessary data as close to the source and/or disk as possible (e.g. using partition pruning or disk-side column/row projection filtering) => less data to deal with through the system
  • Read/write as few times from/to the disk as possible => minimise slow disk I/O
  • Get as many disks involved in a single I/O request as possible => increase overall data throughput
  • Compress the data where possible/practical => less data to read from disk
  • Parallelise read/write operations => improved overall throughput
  • Pass the output of one function to the next as soon as it is ready (pipelining – see the generator sketch after this list) => faster end-to-end throughput
  • Materialise aggregations – store data summaries, e.g. aggregations with 1/10th the data => 10x faster to read
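
As a minimal sketch of the pipelining idea, Python generators chain stages so each row flows to the next function as soon as it's ready (the file name, columns and filter below are made up for illustration):

```python
import csv

# Tiny made-up sample file so the sketch runs end to end.
with open("calls.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["status", "duration_secs"])
    writer.writerows([["COMPLETED", "120"], ["DROPPED", "5"], ["COMPLETED", "30"]])

def read_rows(path):
    """Stream rows off disk one at a time."""
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            yield row

def keep_completed(rows):
    """Filter as early as possible => less data for the later stages."""
    for row in rows:
        if row["status"] == "COMPLETED":
            yield row

def total_duration(rows):
    """Aggregate rows as they arrive from the upstream stages."""
    return sum(int(row["duration_secs"]) for row in rows)

# Stages are chained; rows stream straight through without being
# materialised into intermediate files or tables.
print(total_duration(keep_completed(read_rows("calls.csv"))))  # -> 150
```
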
We currently use Ab Initio and Oracle. Both these technologies utilise some of these concepts. And there are a host of new technologies rethinking the bottlenecks and coming up with creative and sometimes eye-catching results.

More about these coming up.