Concurrent Processing On The Cheap
Breaking the list apart is the right route: a single geocode() call consumes a core, but only one core. Here are the basics of how I did it:
- Added a bigint id primary key to my table and populated it
- Created a query to pull array_agg(id) on distinct addresses
- Sent each array_agg(id) as a message to a Kafka topic with 4 partitions
- Created a process to consume from a single topic partition
- The process pulled the address for each id in the list
- Executed geocode()
- Saved the longitude-latitude in a table
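The steps above can be sketched in a few lines of Python. This is a simplified stand-in, not the actual code: `fake_geocode` substitutes for the PostGIS geocoder, partition routing mimics Kafka's key hashing, and all names here are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_PARTITIONS = 4  # matches the Kafka topic's partition count


def fake_geocode(address):
    """Stand-in for PostGIS geocode(); returns a dummy (lon, lat)."""
    return (hash(address) % 360 - 180.0, hash(address) % 180 - 90.0)


def partition_for(key):
    """Route a message to 1 of NUM_PARTITIONS, like Kafka key hashing."""
    return hash(key) % NUM_PARTITIONS


def worker(batches, addresses_by_id):
    """One consumer per partition: for each id in each array_agg(id)
    message, pull the address, geocode it, and collect rows to save."""
    rows = []
    for id_list in batches:
        for row_id in id_list:
            lon, lat = fake_geocode(addresses_by_id[row_id])
            rows.append((row_id, lon, lat))
    return rows


def run(address_batches, addresses_by_id):
    # Route each array_agg(id) message to a partition...
    partitions = [[] for _ in range(NUM_PARTITIONS)]
    for key, id_list in enumerate(address_batches):
        partitions[partition_for(key)].append(id_list)
    # ...then run one consumer per partition concurrently.
    with ThreadPoolExecutor(max_workers=NUM_PARTITIONS) as pool:
        futures = [pool.submit(worker, p, addresses_by_id) for p in partitions]
        return [row for f in futures for row in f.result()]
```

Because no row depends on another, the consumers never need to coordinate beyond the partition assignment.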
Simple, right? Here's a diagram to help illustrate:
Running one process at a time, whether by script or SQL, would have taken an estimated 5 days. Because no row has anything to do with any other, breaking the work up and executing 4 concurrent processes took roughly 1 day to finish. Neat, eh?
Before Kafka and similar technologies, I would have sat down and written a program that did much the same, except it would have needed threads and some form of IPC to hand out the work. Such a tedious and antiquated process, reminiscent of the 1990s.
But let's say we don't want to use Kafka. Let's also say we don't want to script anything. How can we keep the data in the database and break the work up in such a way as to accomplish the end goal in the least amount of time? My answer... better technology selection.
Introducing Greenplum (for those who don't know it)
Greenplum was derived from PostgreSQL 8.2 and was one of the first MPP (Massively Parallel Processing) RDBMSs. Pivotal, the creator of Greenplum, has since open-sourced the project as "gpdb"; it can be found on GitHub here if you're interested: https://github.com/greenplum-db/gpdb. I opted to sign up on Pivotal's web site and download the official Greenplum for trial purposes.
The entire premise of Greenplum is that many hands make light work. No data is stored on the master; rather, each row is hashed and sent to 1 of many segments (PostgreSQL instances) running on backend nodes. This effectively shards the data and takes advantage of all available I/O, cores, and memory on every system. Greenplum is your ultimate horizontal-scaling RDBMS solution, allowing for up to 1,024 backend nodes. Given the need (and budget), a single cluster can have petabytes of memory and disk, hundreds or perhaps thousands of cores, and IOPS in the millions.
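The hash-and-shard idea can be illustrated with a toy sketch. This is a simplification of my own (the real system hashes the distribution key into a bucket map, not a bare modulo), but it shows why every segment ends up with its own slice of the data:

```python
NUM_SEGMENTS = 8  # e.g. two backend nodes x four primary segments


def segment_for(distribution_key):
    """Toy Greenplum-style placement: hash the row's distribution
    key and map it onto one of the segments."""
    return hash(distribution_key) % NUM_SEGMENTS


def distribute(rows, key_index=0):
    """Shard rows across segments by their distribution key, so each
    segment can scan, filter, and geocode its slice independently."""
    segments = [[] for _ in range(NUM_SEGMENTS)]
    for row in rows:
        segments[segment_for(row[key_index])].append(row)
    return segments
```

The same key always lands on the same segment, which is what lets joins and aggregates on the distribution key run without shuffling data between nodes.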
If you're a systems and data geek, you're drooling right now.
Basic architecture looks like this:
Here are the highlights:
- This illustration shows only two backend nodes, but as mentioned there can be up to 1,024
- Each backend has two primary segments (pseg) and two secondary segments (sseg)
- Ideally every primary segment runs on a LUN and disk separate from all others
- Primary segments replicate to a secondary on a different backend node
- Secondary segments are optional
- The master database has a standby or master-mirror
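The topology in those bullets can be sketched as a small placement function. This is my own illustrative model, not Greenplum's actual algorithm, but it captures the key property: a primary's mirror always lives on a different backend node, so losing one node never loses both copies of a segment.

```python
def place_mirrors(num_nodes, primaries_per_node):
    """Toy mirror placement: each primary segment on node n gets its
    secondary (mirror) on the next node, wrapping around at the end."""
    placement = {}
    seg = 0
    for node in range(num_nodes):
        for _ in range(primaries_per_node):
            placement[seg] = {"primary": node, "mirror": (node + 1) % num_nodes}
            seg += 1
    return placement
```

With two nodes and two primaries per node, as in the diagram, every mirror lands on the opposite node from its primary.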
Having been forked from PostgreSQL 8.2, Greenplum uses WAL file shipping to replicate. Yuck, right? Well, the open source project is quickly catching up, and as of this blog the stable branch is up to speed with PostgreSQL 9.4. I'm not fully familiar with what replication capabilities have changed, but I anticipate streaming replication will become the de facto standard soon enough.
My GIS Configuration
I won't get into setting it up now but will describe the cluster.
- A single master
- Two nodes
- Four primary segments on each node
After creating the database, I downloaded and loaded the Greenplum-safe version of PostGIS. There are a couple of other notable add-ons, including GPText for full-text search capabilities and Apache MADlib for all your math-geeky, statistical needs.
This configuration has effectively 8 PostgreSQL instances. So how does it run?
After loading the TIGER data (what a bother that was) and loading my data, a single SQL statement that built the address, ran geocode(), and inserted into a table took close to 14 hours to complete. I've gone from 5 days, to 1 day, to 14 hours. Not too shabby.
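As a rough sanity check, those runtimes track ideal linear speedup fairly closely, which is what you'd hope for when no row depends on any other:

```python
serial_hours = 5 * 24  # estimated single-process runtime: 5 days

# Ideal linear speedup for an embarrassingly parallel workload:
kafka_hours = serial_hours / 4  # 4 Kafka consumers -> 30.0 h (observed: ~1 day)
gp_hours = serial_hours / 8     # 8 Greenplum segments -> 15.0 h (observed: ~14 h)
```

The original 5-day figure was only an estimate, so the observed numbers landing near (or even under) the ideal is about as good as this comparison gets.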
I don't have a huge budget, in fact I haven't a budget at all. This cluster is running on a single host within virtual machines. I've done what I could to have each node on its own disk. Perhaps given some $$ and time, I could run the same test with Greenplum in AWS. I've done that in the past with great results. Certainly it would reduce the time even further but I reached my objective.


