So what exactly is the problem? What are the issues?
Computers are more powerful than they have ever been.
Surely they can cope with a few billion rows of data?
Disk vs Memory
We know that RAM access time is in nanoseconds; hard disk access time is in milliseconds – i.e.
it's theoretically 100,000x faster processing data in memory than from disk.
So can't we run our databases in memory?
We can - and that's one of the things super computers do.
They are full of processors and memory and are designed
to get round I/O and compute bottlenecks.
However memory based systems are very expensive since the
cost of memory is £0.10-£0.15 per GB whereas RAM is 300x
more expensive around £30 per Gb (I've based these calculations
on PC technology but it's indicative of the price ratio). So,
- Disk is slow but inexpensive and non-volatile
- RAM is fast but expensive and volatile
•
•RAM intensive Super Computers are 30-50Tb for $999M
•High spec’d disk-based Database Servers 50Tb for $20M.
For now most companies are forced to use disk based technology
and use innovative technology to get round I/O and compute
bottlenecks.
Bottlenecks facing Databases
Watch the video demo on the Kickfire website. It captures the
essence of the Von Neumann compute bottleneck and the disk
I/O bottleneck facing database systems.
To summarise:
1. the registers on the processor needing to wait for data pushed
to/from memory - that's the Von Neumann bottleneck.
2. the data to/from the disk to memory/processor - that's the disk
I/O bottleneck.
Solutions
There are many ways to solve these problems both in hardware and
software. ETL tools, database technologies and hardware solutions
are full of ways round these bottlenecks.
Here are a few solutions:
More about these coming up.
For now most companies are forced to use disk based technology
and use innovative technology to get round I/O and compute
bottlenecks.
Bottlenecks facing Databases
Watch the video demo on the Kickfire website. It captures the
essence of the Von Neumann compute bottleneck and the disk
I/O bottleneck facing database systems.
To summarise:
1. the registers on the processor needing to wait for data pushed
to/from memory - that's the Von Neumann bottleneck.
2. the data to/from the disk to memory/processor - that's the disk
I/O bottleneck.
Solutions
There are many ways to solve these problems both in hardware and
software. ETL tools, database technologies and hardware solutions
are full of ways round these bottlenecks.
Here are a few solutions:
- Filter out unnecessary data as close to the source and/or disk as possible (e.g. using partition pruning or disk-side column/row projection filtering) => less data to deal with through the system
- Read/write as few times from/to the disk as possible => minimise slow disk I/O
- Get as many disk involved in a single I/O request as possible => increase the overall data throughput
- Compress the data where possible/practical => less disk to read
- Parallelise read/write operations => improved overall throughput
- Passing the data output from one function to the next as soon as it is ready (pipelining) => faster end-to-end throughput
- Materialise aggregations – store data summaries e.g. aggregations with 1/10th the data => 10X faster to read
More about these coming up.