Wednesday, April 24, 2019

Topics, Partitions & Streaming ... Oh My!

It landed on my shoulders to derive the [longitude, latitude] from a list of addresses.  Naturally, PostGIS is the tool to use; however, this list is over 11M addresses, and anyone who has used geocode() knows it's not exactly, shall we say, quick.

Concurrent Processing On The Cheap

Breaking the list apart is the right route, as a single geocode() consumes a core, but only one core.  Here are the basics of how I did it.

  1. Added an id bigint primary key to my table and populated it
  2. Created a query to pull array_agg(id) on distinct addresses (see the SQL sketch after this list)
  3. Sent each array_agg(id) as a message to a Kafka topic with 4 partitions
  4. Created a process to consume from a single topic partition
  5. This process pulled the address for each id in the list
  6. Executed geocode()
  7. Saved the longitude and latitude in a table
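Steps 1 and 2 are plain SQL.  Here's a minimal sketch of that part, assuming a table named addresses with an address column (both names are mine):

-- Step 1: add and populate an id primary key
ALTER TABLE addresses ADD COLUMN id bigserial PRIMARY KEY;

-- Step 2: one message-worth of work per distinct address
SELECT address, array_agg(id) AS ids
  FROM addresses
 GROUP BY address;

Each resulting ids array became the body of one message sent to the Kafka topic.
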
Simple, right?  Here's a diagram to help illustrate:


Before, running one process at a time, whether by script or SQL, would have taken an estimated 5 days.  Because no one row has anything to do with another, breaking the work up and executing 4 concurrent processes took roughly 1 day to finish.  Neat, eh?

Before Kafka and similar technologies, I would have sat down and created a program that did much the same, except it would need to use threads and some form of IPC to hand work out.  Such a tedious and antiquated process, reminiscent of the 1990's.

But let's say we don't want to use Kafka.  Let's also say we don't want to script anything.  How can we keep the data in the database and break the work up in such a way as to accomplish the end goal in the least amount of time?  My answer... better technology selection.

Introducing Greenplum (for those who don't know it)

Greenplum was derived from PostgreSQL 8.2 and was one of the first MPP (Massively Parallel Processing) RDBMSs.  Pivotal, the creator of Greenplum, has since open sourced the project as "gpdb", which can be found on GitHub here if you're interested: https://github.com/greenplum-db/gpdb.  I've opted to sign up on Pivotal's web site and download the official Greenplum for trial purposes.

The entire premise of Greenplum is that many hands make light work.  No data is stored on the master; rather, each row is hashed and sent to 1 of many segments (PostgreSQL instances) running on backend nodes.  This effectively shards the data and takes advantage of all available I/O, cores, and memory on each system.  Greenplum is your ultimate horizontal-scaling RDBMS solution, allowing for up to 1,024 backend nodes.  Given the need (and budget), a single cluster can have petabytes of memory and disk, hundreds or perhaps thousands of cores, and IOPS in the millions.
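
You pick the distribution key when you create the table.  A quick sketch (the addresses table is mine; the DISTRIBUTED clause is Greenplum's):

CREATE TABLE addresses (
    id       bigint,
    address  text
) DISTRIBUTED BY (id);  -- rows are hashed on id and spread across all segments

-- Or skip the hash key entirely and spread rows round-robin:
-- CREATE TABLE addresses_rr (LIKE addresses) DISTRIBUTED RANDOMLY;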

If you're a systems and data geek, you're drooling right now.

Basic architecture looks like this:


Here are the highlights:
  • This illustration shows only two backend nodes, but as mentioned it can be up to 1,024
  • Each backend has two primary segments (pseg) and two secondary segments (sseg)
  • Ideally, every primary segment runs on a LUN and disk separate from all others
  • Primary segments replicate to a secondary segment on a different backend node
  • Secondary (mirror) segments are optional
  • The master database has a standby, or master-mirror
Being originally from PostgreSQL 8.2, Greenplum utilizes WAL file shipping to replicate.  Yuck, right?  Well, the open source project is quickly catching up, and as of this blog the stable branch is up to speed with PostgreSQL 9.4.  I am not fully familiar with what replication capabilities have changed but anticipate streaming replication will become the de facto standard soon enough.

My GIS Configuration

I won't get into setting it up now but will describe the cluster.
  1. A single master
  2. Two nodes
  3. Four primary segments on each node
After creating the database, I downloaded and loaded the Greenplum-safe version of PostGIS.  There are a couple of other notable add-ons, including GPText for full text search capabilities and Apache MADlib for all your math-geeky, statistical needs.

This configuration has effectively 8 PostgreSQL instances.  So how does it run?

After loading the TIGER data (what a bother that was) and loading my data, a single SQL statement that built the address, ran geocode(), and inserted the results into a table completed in close to 14 hours.  I've gone from 5 days, to 1 day, to 14 hours.  Not too shabby.
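
For the curious, the statement looked roughly like this.  It's a sketch only: my real table and column names differ, and geocode() here is the PostGIS TIGER geocoder, which returns a composite with geomout and rating fields:

-- One pass over the whole table; each segment geocodes its own rows
INSERT INTO address_geo (id, lon, lat, rating)
SELECT id,
       ST_X((g).geomout) AS lon,
       ST_Y((g).geomout) AS lat,
       (g).rating
  FROM (SELECT id, geocode(address, 1) AS g
          FROM addresses) AS sub;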

I don't have a huge budget; in fact, I haven't a budget at all.  This cluster is running on a single host within virtual machines.  I've done what I could to give each node its own disk.  Perhaps, given some $$ and time, I could run the same test with Greenplum in AWS.  I've done that in the past with great results.  Certainly it would reduce the time even further, but I reached my objective.

Tuesday, April 23, 2019

PostgreSQL Progress Bar


We've all been there, right?  A large statement, insert / update / delete / select, taking for... ev... er...  We understand it is hard to predict when these things will complete, but wouldn't it be neat if there were a way to provide a progress bar?

Obviously, there is a solution.  :)

Let's not get ahead of ourselves.  The solution I am presenting will not include parallelism.  Could it be done?  Certainly, with additional scripting external to the database; however, the problem I want to address is simply providing feedback for a single query.  Keep that objective in mind.


Enter PostgreSQL Version 11 and Transaction Management

Available in PostgreSQL version 11 is the ability to COMMIT and ROLLBACK within a procedure or anonymous code block (DO).  There are limitations to this control, including:

  1. CALL or DO only (sorry, FUNCTION)
  2. Loops on a CURSOR
  3. UPDATE RETURNING

If your statement can be batched, we can break things up in such a way as to provide the ever-elusive SQL progress bar.


What is postgres_fdw doing here?

I'm glad this question occurred to you.  PostgreSQL is an ACID-compliant MVCC RDBMS.  A single query executes in one transaction, and no process outside of that transaction can see what it is doing or, more to the point, what has been done.

What I thought postgres_fdw brought to the table was the ability for a single controlling function to execute DML on a foreign table and thus make data available to other sessions.  That was my thought when I started down this post, but it is simply not the case.  When DML is performed on a foreign table, the remote session is also in a transaction, and not until the local session is COMMIT'd is the remote session also COMMIT'd.

Well, buggers.  At least I tried and learned a little.


The Basics

We need something to operate on:

-- uuid_generate_v1() below comes from the uuid-ossp extension
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";

CREATE TABLE output(
  id        serial primary key,
  created   timestamp default now(),
  uu_id     uuid not null
);

Demo

About time I got to it.  Now, let's say we have that huge statement that takes forever.

INSERT INTO output (id, created, uu_id)
SELECT nextval('output_id_seq'), now(), uuid_generate_v1()
  FROM generate_series(1, 1000, 1);

Okay, it's not that big, but what if it were, and it took hours to complete, and the boss wants feedback as it progresses?  Could we use an anonymous code block?

DO $$
DECLARE
  i  int;
BEGIN
  FOR i IN 1..10
  LOOP
    INSERT INTO output (id, created, uu_id)
    SELECT nextval('output_id_seq'), now(), uuid_generate_v1()
      FROM generate_series(1, 100, 1);

    RAISE NOTICE '% pct done, xact id %', i*10, txid_current();
  END LOOP;
END $$;
NOTICE:  10 pct done, xact id 1551874
NOTICE:  20 pct done, xact id 1551874
NOTICE:  30 pct done, xact id 1551874
NOTICE:  40 pct done, xact id 1551874
NOTICE:  50 pct done, xact id 1551874
NOTICE:  60 pct done, xact id 1551874
NOTICE:  70 pct done, xact id 1551874
NOTICE:  80 pct done, xact id 1551874
NOTICE:  90 pct done, xact id 1551874
NOTICE:  100 pct done, xact id 1551874
DO

No.  It doesn't meet the objective: other processes want access to the data, and everything is in a single transaction.  Note xact id 1551874 at every notice.  We need to COMMIT inside the loop.

DO $$
DECLARE
  i  int;
BEGIN
  FOR i IN 1..10
  LOOP
    INSERT INTO output (id, created, uu_id)
    SELECT nextval('output_id_seq'), now(), uuid_generate_v1()
      FROM generate_series(1, 100, 1);

    RAISE NOTICE '% pct done, xact id %', i*10, txid_current();

    COMMIT;
  END LOOP;
END $$;
NOTICE:  10 pct done, xact id 1551876
NOTICE:  20 pct done, xact id 1551877
NOTICE:  30 pct done, xact id 1551878
NOTICE:  40 pct done, xact id 1551880
NOTICE:  50 pct done, xact id 1551884
NOTICE:  60 pct done, xact id 1551885
NOTICE:  70 pct done, xact id 1551886
NOTICE:  80 pct done, xact id 1551887
NOTICE:  90 pct done, xact id 1551888
NOTICE:  100 pct done, xact id 1551890
DO

What's going on?

  1. Obviously, we're looping
  2. Our batch sizes are 100
  3. Output is provided showing progress and the transaction id
  4. Each batch's transaction is COMMIT'd
To wrap up the illustration, here is a list of timestamps for records created:

SELECT created,count(*) FROM output GROUP BY 1;
          created           | count
----------------------------+-------
 2019-04-23 13:07:41.476109 |   100
 2019-04-23 13:07:41.25541  |   100
 2019-04-23 13:07:41.30827  |   100
 2019-04-23 13:07:41.49809  |   100
 2019-04-23 13:07:41.392188 |   100
 2019-04-23 13:07:41.521092 |   100
 2019-04-23 13:07:41.447075 |   100
 2019-04-23 13:07:41.567168 |   100
 2019-04-23 13:07:41.542118 |   100
 2019-04-23 13:07:41.277067 |   100
(10 rows)

It really is that simple.


Caveats

In addition to the limitations mentioned above, two things are worth noting.

First, a child stored procedure called by a parent stored procedure or anonymous code block may COMMIT or ROLLBACK as long as CALL is used.  Okay, that's not too hard to remember.

Second, a child stored procedure called by a FUNCTION may not perform transaction control.
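
A minimal sketch of the first caveat, reusing the output table from the demo (the procedure name is mine):

CREATE OR REPLACE PROCEDURE insert_batch(batch_size int)
LANGUAGE plpgsql
AS $$
BEGIN
  INSERT INTO output (id, created, uu_id)
  SELECT nextval('output_id_seq'), now(), uuid_generate_v1()
    FROM generate_series(1, batch_size, 1);

  -- Allowed: the call chain is DO -> CALL with no FUNCTION in between
  COMMIT;
END;
$$;

DO $$
BEGIN
  CALL insert_batch(100);
END $$;

Invoke that same procedure from within a FUNCTION and the COMMIT fails with "invalid transaction termination".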


I hope you learned something here.  I certainly did.

Tuesday, May 1, 2012

Nonviable NoSQL

I've done a ton of reading lately on NoSQL, including articles dating back to c. 2000 up to some just recently published, and I have to tell you I was impressed.  Most claimed at least 1 or 2 magnitudes better performance, all scaled linearly or near linearly, and all have some wonderful features.

Then I got to thinking: if these NoSQL solutions are so great, then why do we still use Oracle, Microsoft SQL, DB2, and the like?  Something else must be missing.  Let's think about it...

Going with a NoSQL solution, you're trading everything that ACID provides for eventual consistency, but darn if it ain't quick!  You get your results faster than ever, and who cares if it's not accurate all the time?  It's accurate enough.  Sorry, I know people working in finance and logistics, and what they want is the answer.

NoSQL means having a solution that doesn't require a fork-lift to upgrade.  That's the argument, anyway.  Scale by adding commodity hardware anyone can pick up.  In reality, you're spinning up many servers as virtual machines that reside on big hardware, as physical servers inside some blade unit, or you're using that commodity hardware but need so much of it a fork-lift would come in handy.  Go ahead and argue, but scaling without requiring the fork-lift isn't much of an argument.

What else is sacrificed in NoSQL?  Table joins, sorting, oh yeah, and a standardized language.  All that for the sake of getting possibly inconsistent query results really, really fast from a scalable platform.  Since when did the platform dictate whether it was okay to have inconsistent results?  It may be okay for Google, Facebook, or Twitter, but I doubt application owners in the non-search and non-social space would care for it.

I think what the NoSQL crowd is really griping about is an affordable, scalable database.

Monday, April 30, 2012

Advantages of PostgreSQL Asynchronous Events

Communication is key among processes and threads.  We have IPCs and signals for local system communications, but what to do for processes running on different hosts?  There's a host of answers for this, ranging from JMS to Jabberd, but I was looking for something with the following parameters:

  • uses a service already in my environment,
  • keeps the message temporary, and
  • automatically sends it to everyone listening.
We have a LAPP stack (Linux, Apache, ActiveMQ, PostgreSQL & PHP).  One might argue ActiveMQ would work, and yes, it would; however, it's abused already and the messages aren't always as timely as they ought to be.  Also, the purpose of this task was to find a fallback means of communication should ActiveMQ die (which has been known to happen).

I decided on the PostgreSQL NOTIFY - LISTEN pair to send simple asynchronous event notifications.  All that's required is a database connection (already there) to a shared database (already done).  The PHP code is pretty straightforward.  Here's a snippet...


#!/usr/bin/php
<?php

// Open some connections to our database
$dbh1 = pg_connect("dbname=postgres");
$dbh2 = pg_connect("dbname=postgres");
$dbh3 = pg_connect("dbname=postgres");

// Prime the connection
Database::asyncListen($dbh2, 'listen_test');
Database::asyncListen($dbh3, 'listen_test');

// Send notification
Database::asyncSend($dbh1, 'listen_test');

// Listen for notification on current connections
Database::asyncListen($dbh2, 'listen_test');
Database::asyncListen($dbh3, 'listen_test');

// Test connection signing on after the event
$dbh4 = pg_connect("dbname=postgres");
Database::asyncListen($dbh4, 'listen_test');

class Database {
  // Send an asynchronous event to every connection LISTENing on $event
  public static function asyncSend($conn, $event) {
     pg_query($conn, "NOTIFY {$event};");
     echo "event created\n";
  }

  // Register interest in $event and poll once for a pending notification
  public static function asyncListen($conn, $event) {
     pg_query($conn, "LISTEN {$event};");
     if ( pg_get_notify($conn) !== false ) {
       echo "event '{$event}' registered\n";
     } else {
       echo "nothing\n";
     }
  }
}
?>

Here's the output:
nothing
nothing
event created
event 'listen_test' registered
event 'listen_test' registered
nothing

What's this good for? As we race towards distributed computing, there has to be a means to communicate effectively. Such communication within PostgreSQL is handy in that LISTEN is incredibly light, thus a body to the message / event is not necessary. Multiple LISTENs can be executed, looking for many events for many things. The name is specific to the message. Rather than constant SELECTs on a table looking for a record, a NOTIFY can tell the code when to check the table.
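
That last point is worth a sketch.  Assuming a hypothetical work_queue table, a statement-level trigger can fire the NOTIFY so writers don't even have to think about it:

-- Have the table announce its own inserts (table and function names are mine)
CREATE OR REPLACE FUNCTION announce_insert() RETURNS trigger AS $$
BEGIN
  NOTIFY listen_test;  -- wakes up every connection that ran LISTEN listen_test
  RETURN NULL;         -- return value is ignored for AFTER ... STATEMENT triggers
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER work_queue_notify
  AFTER INSERT ON work_queue
  FOR EACH STATEMENT
  EXECUTE PROCEDURE announce_insert();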

Just one of those little discoveries I felt the need to scream from the rooftops.

Monday, March 21, 2011

Home Solar Project - Capacity

Made up your mind yet? How big is it going to get? I'm shooting for 50%, and here's my reasoning... money.

It's all about capacity. Capacity to supply electricity to your home. Capacity to charge the batteries. Capacity to pay for it all. Another capacity comes in the form of patience and in this case it costs nothing to be patient. My advice for a DIY-er is to target half of your needs. You can add capacity later with the right planning up front.

Half of my current needs (37,000 Wh) is 18,500 Wh. Right, what's that mean?
  • 13 60W light bulbs turned on for 24 hours
  • 9 60W light bulbs and a 50" Plasma TV on for 24 hours
  • 11 60W light bulbs and a 50" LED TV on for 24 hours
  • 59 14W CFL bulbs turned on for 24 hours
Whoa. 59 14W CFLs? Where'd that come from?

Capacity isn't just about supply but what you can (or want to) do with that supply. Before thinking more about your project, it may be helpful to make a list of everything that's plugged in: the oven and microwave, lamps, clocks, computers, DVRs, toothbrush and other chargers, light timers, and whatever else uses a wall socket. Chances are it's on and you don't realize it.

Back to the 59 CFLs v. 13 60W bulbs... replace those 13 60W bulbs with 14W CFLs and you have "created" 14,352 Wh of capacity. It's a 76% reduction in what was previously consumed!
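
Spelled out, the arithmetic is:

  13 bulbs × (60 W − 14 W) × 24 h = 13 × 46 × 24 = 14,352 Wh/day
  46 W ÷ 60 W ≈ 76% reduction per bulb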

Okay, reality check. Your electric bill isn't for 13 bulbs you leave on all day. It's about the total of everything plugged in and why you need it. It's worth looking at it and asking whether those things can be replaced with a more efficient technology. Replacing 13 60W bulbs with 13 14W bulbs is relatively cheap ($2.85 for 4 at my local Home Depot). Replacing a circa-1970's refrigerator may be worth it. Replacing your grandmother's first 1000W microwave, complete with speed dial, may not. Why? It doesn't run unless you're standing there heating your coffee or soup.

I'll give this some time to ferment.

Home Solar Project

I have been reading about solar technology for some time now. Lately, while watching our energy bills increase, I have been researching what it would cost to go solar at home. In a word, simple.

Major requirements for the typical home:
  • Solar panels
  • Charge controller
  • Grid-tie inverter
  • Batteries
  • Permit
  • Miscellaneous doodads
  • Current electricity bills
  • Calculator
  • Patience
This list is not in order of necessity or importance. Honestly, the calculator and patience should have been at the top, but ya know.

It really needs to start with your current electricity bills. How much do you use in a month? My house averages 1,100 kWh/mo (kilowatt-hours per month). To get an average daily usage in Wh (watt-hours), multiply by 1,000 then divide by 30. For me, that's roughly 37,000 Wh/day.
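
Worked through with my numbers: 1,100 kWh/mo × 1,000 = 1,100,000 Wh per month, and 1,100,000 ÷ 30 ≈ 36,700 Wh/day, which I round up to 37,000.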

How much of that do you want to offset? 50%, 80%, 100%? This is the number one factor in sizing your project. It is the difference between a hobby and a project and can quickly double or triple your costs.

Ponder that and I'll work on the next steps in planning.

Tuesday, September 7, 2010

Multiple String Comparison Found?

I cobbled together an edit distance algorithm back in 2006 that worked quite well and was considering publishing it. I haven't yet, and it's a good thing. I recently found a very similar algorithm in Dan Gusfield's Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. The burden of proof that mine is different is on me.

While comparing the algorithms, I noticed that by taking a slightly different view of the data I could add a third, fourth, and perhaps an Nth string and produce a meaningful comparison string. It is certainly a completely different sequence, and I haven't found anything like it yet. I still have a few things to work out, but it looks very promising. Promising in that there would seem to be many applications.