SQL Change Management Sans Duplication

In the previous episode in this series, I had one more issue with regard to SQL change management that I wanted to resolve:

  1. There is still more duplication of code than I would like, in that a procedure defined in one change script would have to be copied whole to a new script for any changes, even simple single-line changes.

So let’s see what we can do about that. Loading it into Git, our first example looks like this:

> alias sqlhist="git log -p —format=’[%H%d]’ —name-only —reverse sql/deploy \
| awk ‘/^\[/ {print \"\"} /./’"
> sqlhist

[3852b378aa029cc610a03806e8268ed452dce8a6 (alpha)]
sql/deploy/users_table.sql

[32883d5a08691351b07928fa4e4fb7e68c500973 (beta)]
sql/deploy/add_widget.sql
sql/deploy/widgets_table.sql

[b8b9f5c152675305c6b2d3e105d55a25019e0828 (HEAD, gamma, master)]
sql/deploy/add_user.sql

(Aside: I’ve created an alias, sqlhist, on the first line, so that all the Git and Awk magic doesn’t clutter the remaining examples.)

So, we’ve got the creation of the users table under the alpha tag, the addition of the widgets table and an accompanying add_widget() function under the beta tag, and the creation of the add_user() function under the gamma tag. So far so good. Now, let’s say that gamma has been deployed to production, and now we’re ready to add a feature for the next release.

Modify This

It turns out that our users really want a timestamp for the time a widget was created. So let’s add a new change script that adds a created_at column to the widgets table. First we add sql/deploy/widgets_created_at.sql with:

— requires: widgets_table
ALTER TABLE widgets ADD created_at TIMESTAMPTZ;

And then the accompanying revert script, sql/revert/widgets_created_at.sql:

ALTER TABLE widgets DROP COLUMN IF EXISTS created_at;

Commit them and now our deployment configuration looks like this:

> sqlhist

[3852b378aa029cc610a03806e8268ed452dce8a6 (alpha)]
sql/deploy/users_table.sql

[32883d5a08691351b07928fa4e4fb7e68c500973 (beta)]
sql/deploy/add_widget.sql
sql/deploy/widgets_table.sql

[b8b9f5c152675305c6b2d3e105d55a25019e0828 (gamma)]
sql/deploy/add_user.sql

[44ba615b7813531f0acb6810cbf679791fe57bf2 (HEAD, master)]
sql/deploy/widgets_created_at.sql

So far so good. We have a simple delta script that modifies the existing table, and there is no code duplication. Time to modify the add_widget() function to insert the timestamp. Recall that, in the first article in this series, I created a separate sql/deploy/add_widgets_v2.sql file, copied the existing function in its entirety into the new file, and modified it there. If we were to do that here, the resulting deployment configuration would look something like this:

> sqlhist

[3852b378aa029cc610a03806e8268ed452dce8a6 (alpha)]
sql/deploy/users_table.sql

[32883d5a08691351b07928fa4e4fb7e68c500973 (beta)]
sql/deploy/add_widget.sql
sql/deploy/widgets_table.sql

[b8b9f5c152675305c6b2d3e105d55a25019e0828 (gamma)]
sql/deploy/add_user.sql

[44ba615b7813531f0acb6810cbf679791fe57bf2]
sql/deploy/widgets_created_at.sql

[dfba488cfd9145928a25d8d48de3231da84s4bd2 (HEAD, master)]
sql/deploy/add_widget_v2.sql

Which would be fine, except that if someone else wanted to see what had changed, here’s what git diff would output:

> git diff HEAD^ sql/deploy 
diff —git a/sql/deploy/add_widget_v2.sql b/sql/deploy/add_widget_v2.sql
new file mode 100644
index 0000000..9132195
—- /dev/null
+++ b/sql/deploy/add_widget_v2.sql
@@ -0,0 +1,8 @@
+— requires widgets_created_at
+CREATE OR REPLACE FUNCTION add_widget(
+    username   TEXT,
+    widgetname TEXT
+) RETURNS VOID LANGUAGE SQL AS $$
+    INSERT INTO widgets (created_by, name, created_at)
+    VALUES ($1, $2, NOW());
+$$;

So, what changed in the add_widget() function between gamma and now? One cannot tell from this diff: it looks like a brand new function. And no web-based VCS interface will show you, either; it’s just not inherent in the commit. We have to actually know that it was just an update to an existing function, and what files to manually diff, like so:

 > diff -u sql/deploy/add_widget.sql sql/deploy/add_widget_v2.sql 
—- sql/deploy/add_widget.sql   2012-01-28 13:06:24.000000000 -0800
+++ sql/deploy/add_widget_v2.sql    2012-01-28 13:26:59.000000000 -0800
@@ -1,8 +1,8 @@
—- requires: widgets_table
-
+— requires: widgets_created_at
 CREATE OR REPLACE FUNCTION add_widget(
     username   TEXT,
     widgetname TEXT
 ) RETURNS VOID LANGUAGE SQL AS $$
-    INSERT INTO widgets (created_by, name) VALUES ($1, $2);
+    INSERT INTO widgets (created_by, name, created_at)
+    VALUES ($1, $2, NOW());
 $$;

Much better, but how annoying is that? It doesn’t allow us to really take advantage of the VCS, all because we need SQL changes to run in a very specific order.

But let’s ignore that for the moment. Let’s just throw out the commit with add_widgets_v2.sql and go ahead and change the add_widget change script directly. So the history now looks like this:

> sqlhist

[3852b378aa029cc610a03806e8268ed452dce8a6 (alpha)]
sql/deploy/users_table.sql

[32883d5a08691351b07928fa4e4fb7e68c500973 (beta)]
sql/deploy/add_widget.sql
sql/deploy/widgets_table.sql

[b8b9f5c152675305c6b2d3e105d55a25019e0828 (gamma)]
sql/deploy/add_user.sql

[44ba615b7813531f0acb6810cbf679791fe57bf2]
sql/deploy/widgets_created_at.sql

[e4b970aa36f27451fe377791eab040a73c6eb47a (HEAD, epsilon, master)]
sql/deploy/add_widget.sql

Naturally, the add_widget script appears twice now, once under the beta tag and once under epsilon (which I’ve just tagged). What are the consequences for our migration? Well, if we were to build a new database from the beginning, running these migrations as listed here, we would get an error while applying the beta changes:

ERROR:  column "created_at" of relation "widgets" does not exist
LINE 5:     INSERT INTO widgets (created_by, name, created_at)

This is because the created_at column won’t exist until the widgets_created_at change is applied. That won’t do, will it? Fortunately, Git knows exactly what the add_widget deploy script looked like under the beta tag, and we can ask it:

> git show beta:sql/deploy/add_widget.sql
— requires: widgets_table

CREATE OR REPLACE FUNCTION add_widget(
    username   TEXT,
    widgetname TEXT
) RETURNS VOID LANGUAGE SQL AS $$
    INSERT INTO widgets (created_by, name) VALUES ($1, $2);
$$;

Boom, there it is, with no reference to created_at. Using this technique, our SQL deployment app can successfully apply all of our database changes by iterating over the list of changes and applying the contents of each script at the time of the appropriate commit or tag. In other words, it could apply the output from each of these commands:

git show alpha:sql/deploy/users_table.sql
git show beta:sql/deploy/widgets_table.sql
git show beta:sql/deploy/add_widget.sql
git show gamma:sql/deploy/add_user.sql
git show 44ba615b7813531f0acb6810cbf679791fe57bf2:sql/deploy/widget_created_at.sql
git show epsilon:sql/deploy/add_widget.sql

And everything will work exactly as it should: the original version of the add_widget change script will be for the beta tag, and the next version will be applied for the epsilon tag. Not bad, right? We get a nice, clean Git history and can exploit it to manage the changes.

Reversion to the Mean

But what about reversion? What if the deploy to epsilon failed, and we need to revert back to gamma? Recall that in the first article, I eliminated duplication by having the add_widget_v2 revert script simply call the add_widget deploy script. But such is not possible now that we’ve changed add_widget in place. What to do?

The key is for the change management script to know the difference between a new change script and a modified one. Fortunately, Git knows that, too, and we can get it to cough up that information with a simple change to the sqlhist alias: instead of passing —name-only, pass —name-status:

% alias sqlhist="git log -p —format=’[%H%d]’ —name-status —reverse sql/deploy \
| awk ‘/^\[/ {print \"\"} /./’"

Using this new alias, our history looks like:

> sqlhist

[3852b378aa029cc610a03806e8268ed452dce8a6 (alpha)]
A   sql/deploy/users_table.sql

[32883d5a08691351b07928fa4e4fb7e68c500973 (beta)]
A   sql/deploy/add_widget.sql
A   sql/deploy/widgets_table.sql

[b8b9f5c152675305c6b2d3e105d55a25019e0828 (gamma)]
A   sql/deploy/add_user.sql

[44ba615b7813531f0acb6810cbf679791fe57bf2]
A   sql/deploy/widgets_created_at.sql

[e4b970aa36f27451fe377791eab040a73c6eb47a (HEAD, epsilon, master)]
M   sql/deploy/add_widget.sql

Now we have a letter defining the status of each file. An “A” means the file was added in that commit; an “M” means it was modified. But the upshot is that, to revert to gamma, our change management can see that add_widget was modified in epsilon, and, rather than apply a revert change script, it can just apply the version of the script as it existed under gamma:

> git show gamma:sql/deploy/add_widget.sql
— requires: widgets_table

CREATE OR REPLACE FUNCTION add_widget(
    username   TEXT,
    widgetname TEXT
) RETURNS VOID LANGUAGE SQL AS $$
    INSERT INTO widgets (created_by, name) VALUES ($1, $2);
$$;

And there we are, right back to where we should be. Of course, the remaining epsilon deploy script, widget_created_at, was added in its commit, so we just apply the revert script and we’re set, back to gamma.

Still Configurable

To get back to the original idea of a migration configuration file, I still think it’s entirely do-able. All we need to is to have the change management app generate it, just as before. When it comes to modified — rather than added — deploy scripts, it can automatically insert new scripts with the full copies of previous versions, much as before. The resulting configuration would look something like this:

[3852b378aa029cc610a03806e8268ed452dce8a6 (alpha)]
sql/deploy/users_table.sql

[32883d5a08691351b07928fa4e4fb7e68c500973 (beta)]
sql/deploy/add_widget.sql
sql/deploy/widgets_table.sql

[b8b9f5c152675305c6b2d3e105d55a25019e0828 (gamma)]
sql/deploy/add_user.sql

[44ba615b7813531f0acb6810cbf679791fe57bf2]
sql/deploy/widgets_created_at.sql

[e4b970aa36f27451fe377791eab040a73c6eb47a (HEAD, epsilon, master)]
sql/deploy/add_widget_v2.sql

Note that last line, where we now have add_widget_v2. The change management script would simply generate this file, and create an additional revert script with the same name that just contains the contents of the deploy script as it was under the gamma tag.

Too Baroque?

Having written down these ideas that have plagued by brain for the last week, along with some examples using Git to confirm them, I’m convinced more than ever that this is entirely workable. But it also leads me to wonder if it’s too baroque. I intend these posts as a rough spec for how this thing should work, and I plan to implement it in the coming weeks. But I’m wondering how difficult it will be to explain it all to people?

So let me see if I can break it down to a few simple rules.

  • In general, you should create independent deploy and revert scripts for your SQL. Put a CREATE TABLE statement into its own script. If it requires some some other table, require declare the dependency. If you need to change it later, create a new script that uses ALTER TABLE.
  • In special cases where a simple change cannot be made without copying something wholesale, and where the deploy script is idempotent, you may simply modify the deploy script in-place.

That’s about it. The idempotence of the deploy script is important for ensuring consistency, and applies very well to features such as user-defined functions. For other objects, there are generally ALTER statements that allow changes to be made without wholesale copying of existing code.

So what am I missing? What have I overlooked? What mistakes in my logic have I made? Do you think this will be too tricky to implement, or to use? Is it hard to understand? Your comments would be greatly appreciated, because I am going to write an app to do this stuff, and want to get it right.

Thanks for sticking with me through all the thought experiments. For my next post on this topic, I expect to have an interface spec for the new app.

  • E-mail this story to a friend!
  • Sphinn
  • StumbleUpon
  • Facebook
  • del.icio.us
  • LinkedIn
  • TwitThis
  • Digg
  • Google
  • MySpace
  • Reddit
  • StumbleUpon
  • Technorati
  • Yahoo! Buzz

VCS-Enabled SQL Change Management

In my previous post, I outlined the basics of a configuration-file and dependency-tracking SQL deployment architecture, but left a couple of additional challenges unresolved. They were:

  1. I would rather not have to hand-edit a configuration file, as it it’s finicky and error-prone.

  2. There is still more duplication of code than I would like, in that a procedure defined in one change script would have to be copied whole to a new script for any changes, even single-line simple changes.

I believe I can solve both of these issues by simple use of a VCS. Since all of my current projects currently use Git, I will use it for the examples here.

Git it On

First, recall the structure of the configuration file, which was something like this:

[alpha]
users_table

[beta]
add_widget
widgets_table

[gamma]
add_user

[delta]
widgets_created_at
add_widget_v2

Basically, we have bracketed tags identifying changes that should be deployed. Now have a look at this:

> git log -p —format=’[%H]’ —name-only —reverse sql/deploy
[8920aaf7947a56f6777e69a21b70fd877c8fd6dc]

sql/deploy/users_table.sql
[f7da5fd4b7391747f75d85db6fa82de47b9e4c00]

sql/deploy/add_widget.sql
sql/deploy/widgets_table.sql
[ea10b9e566934ef256debe8752504189436e162a]

sql/deploy/add_user.sql
[89e85f98d891a2984ad4e3c42d8ca8cf31f3b2b4]

sql/deploy/add_widget_v2.sql
sql/deploy/widgets_created_at.sql

Look familiar? Let’s use a bit of awk magic to neaten things a bit (Thanks helwig!):

> git log -p —format=’[%H]’ —name-only —reverse sql/deploy \
| awk ‘/^\[/ {print ""} /./’

[8920aaf7947a56f6777e69a21b70fd877c8fd6dc]
sql/deploy/users_table.sql

[f7da5fd4b7391747f75d85db6fa82de47b9e4c00]
sql/deploy/add_widget.sql
sql/deploy/widgets_table.sql

[ea10b9e566934ef256debe8752504189436e162a]
sql/deploy/add_user.sql

[89e85f98d891a2984ad4e3c42d8ca8cf31f3b2b4]
sql/deploy/add_widget_v2.sql
sql/deploy/widgets_created_at.sql

Ah, that’s better. We have commit SHA1s for tags, followed by the appropriate lists of deployment scripts. But wait, we can decorate it, too:

> git log -p —format=’[%H%d]’ —name-only —reverse sql/deploy \
| awk ‘/^\[/ {print ""} /./’

[8920aaf7947a56f6777e69a21b70fd877c8fd6dc (alpha)]
sql/deploy/users_table.sql

[f7da5fd4b7391747f75d85db6fa82de47b9e4c00 (beta)]
sql/deploy/add_widget.sql
sql/deploy/widgets_table.sql

[ea10b9e566934ef256debe8752504189436e162a (gamma)]
sql/deploy/add_user.sql
[89e85f98d891a2984ad4e3c42d8ca8cf31f3b2b4 (HEAD, delta, master)]

Look at that! Actual VCS tags built right in to the output. So, assuming our deployment app can parse this output, we can deploy or revert to any commit or tag. Better yet, we don’t have to maintain a configuration file, because the VCS is already tracking all that stuff for us! Our change management app can automatically detect if we’re in a Git repository (or Mercurial or CVS or Subversion or whatever) and fetch the necessary information for us. It’s all there in the history. We can name revision identifiers (SHA1s here) to deploy or revert to, or use tags (alpha, beta, gamma, delta, HEAD, or master in this example).

And with careful repository maintenance, this approach will work for branches, as well. For example, say you have developers working in two branches, feature_foo and feature_bar. In feature_foo, a foo_table change script gets added in one commit, and an add_foo script in a second commit. Merge it into master and the history now looks like this:

> git log -p —format=’[%H%d]’ —name-only —reverse sql/deploy \
| awk ‘/^\[/ {print ""} /./’

[8920aaf7947a56f6777e69a21b70fd877c8fd6dc (alpha)]
sql/deploy/users_table.sql

[f7da5fd4b7391747f75d85db6fa82de47b9e4c00 (beta)]
sql/deploy/add_widget.sql
sql/deploy/widgets_table.sql

[ea10b9e566934ef256debe8752504189436e162a (gamma)]
sql/deploy/add_user.sql

[89e85f98d891a2984ad4e3c42d8ca8cf31f3b2b4 (delta)]
sql/deploy/add_widget_v2.sql
sql/deploy/widgets_created_at.sql

[cbb48144065dd345c5248e5f1e42c1c7391a88ed]
sql/deploy/foo_table.sql

[7f89e23c9f1e7fc298c69400f6869d701f76759e (HEAD, master, feature_foo)]
sql/deploy/add_foo.sql

So far so good.

Meanwhile, development in the feature_bar branch has added a bar_table change script in one commit and add_bar in another. Because development in this branch was going on concurrently with the feature_foo branch, if we just merged it into master, we might get a history like this:

> git log -p —format=’[%H%d]’ —name-only —reverse sql/deploy \
| awk ‘/^\[/ {print ""} /./’
[8920aaf7947a56f6777e69a21b70fd877c8fd6dc (alpha)]
sql/deploy/users_table.sql

[f7da5fd4b7391747f75d85db6fa82de47b9e4c00 (beta)]
sql/deploy/add_widget.sql
sql/deploy/widgets_table.sql

[ea10b9e566934ef256debe8752504189436e162a (gamma)]
sql/deploy/add_user.sql

[89e85f98d891a2984ad4e3c42d8ca8cf31f3b2b4 (delta)]
sql/deploy/add_widget_v2.sql
sql/deploy/widgets_created_at.sql

[cbb48144065dd345c5248e5f1e42c1c7391a88ed]
sql/deploy/foo_table.sql

[d1882d7b4cfcf5c57030bd5a15f8571bfd7e48e2]
sql/deploy/bar_table.sql

[7f89e23c9f1e7fc298c69400f6869d701f76759e]
sql/deploy/add_foo.sql

[2330da1caae9a46ea84502bd028ead399ca3ca02 (feature_bar)]
sql/deploy/add_bar.sql

[73979ede2c8589cfe24c9213a9538f305e6f508f (HEAD, master, feature_foo)]

Note that bar_table comes before add_foo. In other words, the feature_foo and feature_bar commits are interleaved. If we were to deploy to HEAD, and then need to revert feature_bar, bar_table would not be reverted. This is, shall we say, less than desirable.

There are at least two ways to avoid this issue. One is to squash the merge into a single commit using git merge —squash feature_bar. This would be similar to accepting a single patch and applying it. The resulting history would look like this:

> git log -p —format=’[%H%d]’ —name-only —reverse sql/deploy \
| awk ‘/^\[/ {print ""} /./’

[8920aaf7947a56f6777e69a21b70fd877c8fd6dc (alpha)]
sql/deploy/users_table.sql

[f7da5fd4b7391747f75d85db6fa82de47b9e4c00 (beta)]
sql/deploy/add_widget.sql
sql/deploy/widgets_table.sql

[ea10b9e566934ef256debe8752504189436e162a (gamma)]
sql/deploy/add_user.sql

[89e85f98d891a2984ad4e3c42d8ca8cf31f3b2b4 (delta)]
sql/deploy/add_widget_v2.sql
sql/deploy/widgets_created_at.sql

[cbb48144065dd345c5248e5f1e42c1c7391a88ed]
sql/deploy/foo_table.sql

[7f89e23c9f1e7fc298c69400f6869d701f76759e]
sql/deploy/add_foo.sql

[91a048c05e0444682e2e4763e8a7999a869b4a77 (HEAD, master)]
sql/deploy/add_bar.sql
sql/deploy/bar_table.sql

Now both of the feature_bar change scripts come after the feature_foo changes. But it might be nice to keep the history. So a better solution (and the best practice, I believe), is to rebase the feature_bar branch before merging it into master, like so:

> git rebase master
First, rewinding head to replay your work on top of it...
Applying: Add bar.
Applying: Add add_bar().
> git checkout master
Switched to branch ‘master’
> git merge feature_bar
Updating 7f89e23..0fab7a0
Fast-forward
 0 files changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 sql/deploy/add_bar.sql
 create mode 100644 sql/deploy/bar_table.sql
 create mode 100644 sql/revert/add_bar.sql
 create mode 100644 sql/revert/bar_table.sql

And now we should have:

> git log -p —format=’[%H%d]’ —name-only —reverse sql/deploy \
| awk ‘/^\[/ {print ""} /./’

[8920aaf7947a56f6777e69a21b70fd877c8fd6dc (alpha)]
sql/deploy/users_table.sql

[f7da5fd4b7391747f75d85db6fa82de47b9e4c00 (beta)]
sql/deploy/add_widget.sql
sql/deploy/widgets_table.sql

[ea10b9e566934ef256debe8752504189436e162a (gamma)]
sql/deploy/add_user.sql

[89e85f98d891a2984ad4e3c42d8ca8cf31f3b2b4 (delta)]
sql/deploy/add_widget_v2.sql
sql/deploy/widgets_created_at.sql

[cbb48144065dd345c5248e5f1e42c1c7391a88ed]
sql/deploy/foo_table.sql

[7f89e23c9f1e7fc298c69400f6869d701f76759e]
sql/deploy/add_foo.sql

[0e53c29eb47c618d0a8818cc17bd5a0aab0acd6d]
sql/deploy/bar_table.sql

[0fab7a0ba928b34a46a9495d4efc1c73d9133d37 (HEAD, master, feature_bar)]
sql/deploy/add_bar.sql

Awesome, now everything is in the correct order. We did lose the feature_foo “tag,” though. That’s because it wasn’t a tag, and neither is feature_bar here. They are, rather, branch names, which we becomes obvious when using “full” decoration:

git log —format=’%d’ —decorate=full HEAD^..      
 (HEAD, refs/heads/master, refs/heads/feature_foo)

After the next commit, it will disappear from the history. So let’s just tag the relevant commits ourselves:

> git tag feature_foo 7f89e23c9f1e7fc298c69400f6869d701f76759e
> git tag feature_bar
> git log -p —format=’[%H%d]’ —name-only —reverse sql/deploy \
| awk ‘/^\[/ {print ""} /./’

[8920aaf7947a56f6777e69a21b70fd877c8fd6dc (alpha)]
sql/deploy/users_table.sql

[f7da5fd4b7391747f75d85db6fa82de47b9e4c00 (beta)]
sql/deploy/add_widget.sql
sql/deploy/widgets_table.sql

[ea10b9e566934ef256debe8752504189436e162a (gamma)]
sql/deploy/add_user.sql

[89e85f98d891a2984ad4e3c42d8ca8cf31f3b2b4 (delta)]
sql/deploy/add_widget_v2.sql
sql/deploy/widgets_created_at.sql

[cbb48144065dd345c5248e5f1e42c1c7391a88ed]
sql/deploy/foo_table.sql

[7f89e23c9f1e7fc298c69400f6869d701f76759e (feature_foo)]
sql/deploy/add_foo.sql

[0e53c29eb47c618d0a8818cc17bd5a0aab0acd6d]
sql/deploy/bar_table.sql

[0fab7a0ba928b34a46a9495d4efc1c73d9133d37 (HEAD, feature_bar, master, feature_bar)]
sql/deploy/add_bar.sql

Ah, there we go! After the next commit, one of those feature_bars will disappear, since the branch will have been left behind. But we’ll still have the tag.

Not Dead Yet

Clearly we can intelligently use Git to manage SQL change management. (Kind of stands to reason, doesn’t it?) Nevertheless, I believe that a configuration file still might have its uses. Not only because not every project is in a VCS (it ought to be!), but because oftentimes a project is not deployed to production as a git clone. It might be distributed as a source tarball or an RPM. In such a case, including a configuration file in the distribution would be very useful. But there is still no need to manage it by hand; our deployment app can generate it from the VCS history before packaging for release.

More to Come

I’d planned to cover the elimination of duplication, but I think this is enough for one post. Watch for that idea in my next post.

  • E-mail this story to a friend!
  • Sphinn
  • StumbleUpon
  • Facebook
  • del.icio.us
  • LinkedIn
  • TwitThis
  • Digg
  • Google
  • MySpace
  • Reddit
  • StumbleUpon
  • Technorati
  • Yahoo! Buzz

Simple SQL Change Management

I’ve been thinking a lot about SQL change management. I know I have written about this before. But I was never satisfied with that idea, mostly because it required managing database changes in two separate but interdependent ways. Blargh. So for my Perl projects the last couple of years, I have stuck to the very simple but ugly Rails-style migration model, as implemented in Module::Build::DB.

But it has been on my brain more lately because I’m writing more and more database applications at work, and managing changes over time is becoming increasingly annoying. I’ve been using a variation on Depesz’s Versioning package, mainly because its idea of specifying dependencies instead of ordered deployment scripts is so useful. However, its implementation in pure SQL, with accompanying shell and Perl scripts, is not entirely satisfying. Worse, one cannot easily include the contents of an earlier deployment script in a reversion script, because the dependency registration function embedded in a script will throw an error if it has been run before. The upshot is that if you make a one-line change to a database function, you still have to paste the entire thing into a new file and commit it to your source code repository. This makes tracking diffs annoying.

Oh, and did I mention that there is no simple built-in way to revert changes, and even if there were, because there are no named releases, it can be difficult to decide what to revert to? I don’t often need that capability, but when I need it, I need it.

Then, this week, Robert Haas described a deployment implementation he implemented. It was simple:

My last implementation worked by keeping a schema_versions table on the server with one column, a UUID. The deployment tarball contained a file with a list of UUIDs in it, each one associated to an SQL script. At install time, the install script ran through that file in order and ran any scripts whose UUID didn’t yet appear in the table, and then added the UUIDs of the run scripts to the table.

I like this simplicity, but there are some more things I think could be done, including dependency reslolution and reversion. And it seems silly to have a UUID stand for a script name; why not just list script names? Better yet, tag groups of changes for easy reference.

Yet Another SQL Deployment Strategy

So here’s my proposal. Following Robert, we create a configuration file, but instead of just listing changes, we fill it with tags and the names of the changes are associated with each. An example:

[alpha]
users_table

[beta]
add_widget
widgets_table

[gamma]
add_user

Our change management app will parse this file, finding the tag for each stage of the migration in brackets, and apply the associated changes, simply finding each of them in sql/deploy/$change.sql. If it’s reverting changes, it finds the reversion scripts named sql/revert/$change.sql. The tags can be anything you want; release tags might be useful. Easy so far, right?

Except notice that I have a minor ordering problem here. The add_widget change, which adds a function to insert a record into the widgets table, comes before the widgets_table script. If we run the add_widget change first, it will fail, because the widgets table does not yet exist.

Of course we can re-order the lines in the configuration file. But given that one might have many changes for a particular tag, with many cross-referenceing dependencies, I think it’s better to overcome this problem in the scripts themselves. So I suggest that the sql/deploy/add_widget.sql file look something like this:

— requires: widgets_table

CREATE OR REPLACE FUNCTION add_widget(
    username   TEXT,
    widgetname TEXT
) RETURNS VOID LANGUAGE SQL AS $$
    INSERT INTO widgets (created_by, name) VALUES ($1, $2);
$$;

Here I’m stealing Depesz’s dependency tracking idea. With a simple comment at the top of the script, we specify that this change requires that the widgets_table change be run first. So let’s look at sql/deploy/widgets_table.sql:

— requires: users_table

CREATE TABLE widgets (
    created_by TEXT NOT NULL REFERENCES users(name),
    name       TEXT NOT NULL
);

Ah, now here we also require that the users_table change be deployed first. Of course, it likely would be, given that it appears under a tag earlier in the file, but it’s best to be safe and explicitly spell out dependencies. Someone might merge the two tags at some point before release, right?

The users_table change has no dependencies, but the later add_user change of course does; our sql/deploy/add_user.sql:

— requires: users_table

CREATE OR REPLACE FUNCTION add_user(
    name TEXT
) RETURNS VOID LANGUAGE SQL AS $$
    INSERT INTO users (name) VALUES ($1);
$$;

Our deployment app can properly resolve these dependencies. Of course, we also need reversion scripts in the sql/revert directory. They might look something like:

— sql/revert/users_table.sql
DROP TABLE IF EXISTS users;

— sql/revert/add_widget.sql
DROP FUNCTION IF EXISTS add_widget(text, text);

— sql/revert/widgets_table.sql
DROP TABLE IF EXISTS widgets;

— sql/revert/add_user.sql
DROP FUNCTION IF EXISTS add_user(text);

So far so good, right? Our app can resolve dependencies in both directions, so that if we tell it to revert to beta, it can do so in the proper order.

Now, as the deployment app runs the scripts, deploying or reverting changes, it tracks them and their dependencies in its own metadata table in the database, not unlike Depesz’s Versioning package. But because dependencies are parsed from comments in the scripts, we are free to include the contents of one script in another. For example, say that we later need to revise the add_widget() function to log the time a widget is created. First we add a new script to add the necessary column:

— requires: widgets_table
ALTER TABLE widgets ADD created_at TIMESTAMPTZ;

Call that script sql/deploy/widgets_created_at.sql. Next we add a script that changes add_widgets():

— requires widgets_created_at
CREATE OR REPLACE FUNCTION add_widget(
    username   TEXT,
    widgetname TEXT
) RETURNS VOID LANGUAGE SQL AS $$
    INSERT INTO widgets (created_by, name, created_at)
    VALUES ($1, $2, NOW());
$$;

Call it sql/deploy/add_widget_v2.sql. Then update the deployment configuration file with a new tag and the associated changes:

[delta]
widgets_created_at
add_widget_v2

With me so far? Now, what about reversion? sql/revert/widgets_created_at.sql is simple, of course:

ALTER TABLE widgets DROP COLUMN IF EXISTS created_at;

But what should sql/revert/add_widget_v2.sql look like? Why, to go back to the first version of add_widget(), it would be identical to sql/deploy/add_widget.sql. But it would be silly to copy the whole file, wouldn’t it? Why duplicate when we can just include?

\i sql/deploy/add_widget.sql

Boom, we get the reversion script for free. No unnecessary duplication between deployment and reversion scripts, and all dependencies are nicely resolved. Plus, the tags in the configuration file make it easy to deploy and revert change sets as necessary, with dependencies properly followed.

There’s More!

To recap, I had two primary challenges with Depesz’s Versioning package to overcome: inability to easily revert to an earlier implementation; and the inability to easily include one script in another. Both of course are do-able with workarounds, but I think that the addition of a deployment configuration file with tagged sets of changes and the elimination of SQL-embedded dependency specification overcome these issues much more effectively and intuitively.

Still, there are two more challenges I would like to overcome:

  1. It would be nice not to need the configuration file at all. Maintaining such a thing can be finicky and error-prone.

  2. I still had to duplicate the entire add_widget() function in the add_widget_v2 script for a very simple change. This means no easy way to simply see the diff for this change in my VCS. It would be nice not to have to copy the entire function.

I think I have solutions for these issues, as well. More in my next post.

  • E-mail this story to a friend!
  • Sphinn
  • StumbleUpon
  • Facebook
  • del.icio.us
  • LinkedIn
  • TwitThis
  • Digg
  • Google
  • MySpace
  • Reddit
  • StumbleUpon
  • Technorati
  • Yahoo! Buzz

DBIx::Connector and Serializable Snapshot Isolation

I was at Postgres Open week before last. This was a great conference, very welcoming atmosphere and lots of great talks. One of the more significant, for me, was the session on serializable transactions by Kevin Grittner, who developed SSI for PostgreSQL 9.1. I hadn’t paid much attention to this feature before now, but it became clear to me, during the talk, that it’s time.

So what is SSI? Well, serializable transactions are almost certainly how you think of transactions already. Here’s how Kevin describes them:

True serializable transactions can simplify software development. Because any transaction which will do the right thing if it is the only transaction running will also do the right thing in any mix of serializable transactions, the programmer need not understand and guard against all possible conflicts. If this feature is used consistently, there is no need to ever take an explicit lock or SELECT FOR UPDATE/SHARE.

This is, in fact, generally how I’ve thought about transactions. But I’ve certainly run into cases where it wasn’t true. Back in 2006, I wrote an article on managing many-to-many relationships with PL/pgSQL which demonstrated a race condition one might commonly find when using an ORM. The solution I offered was to always use a PL/pgSQL function that does the work, and that function executes a SELECT...FOR UPDATE statement to overcome the race condition. This creates a lock that forces conflicting transactions to be performed serially.

Naturally, this is something one would rather not have to think about. Hence SSI. When you identify a transaction as serializable, it will be executed in a truly serializable fashion. So I could actually do away with the SELECT...FOR UPDATE workaround — not to mention any other race conditions I might have missed — simply by telling PostgreSQL to enforce transaction isolation. This essentially eliminates the possibility of unexpected side-effects.

This comes at a cost, however. Not in terms of performance so much, since the SSI implementation uses some fancy, recently-developed algorithms to keep things efficient. (Kevin tells me via IRC: “Usually the rollback and retry work is the bulk of the additional cost in an SSI load, in my testing so far. A synthetic load to really stress the LW locking, with a fully-cached database doing short read-only transactions will have no serialization failures, but can run up some CPU time in LW lock contention.”) No, the cost is actually in increased chance of transaction rollback. Because SSI will catch more transaction conflicts than the traditional “read committed” isolation level, frameworks that expect to work with SSI need to be prepared to handle more transaction failures. From the fine manual:

The Serializable isolation level provides the strictest transaction isolation. This level emulates serial transaction execution, as if transactions had been executed one after another, serially, rather than concurrently. However, like the Repeatable Read level, applications using this level must be prepared to retry transactions due to serialization failures.

And that brings me to DBIx::Connector, my Perl module for safe connection and transaction management. It currently has no such retry smarts built into it. The feature closest to that is the “fixup” connection mode, wherein if a execution of a code block fails due to a connection failure, DBIx::Connector will re-connect to the database and execute the code reference again.

I think I should extend DBIx::Connector to take isolation failures and deadlocks into account. That is, fixup mode would retry a code block not only on connection failure but also on serialization failure (SQLSTATE 40001) and deadlocks (SQLSTATE 40P01). I would also add a new attribute, retries, to specify the number of times to retry such execution, with a default of three (which likely will cover the vast majority of cases). This has actually been an oft-requested feature, and I’m glad to have a new reason to add it.

There are a few design issues to overcome, however:

  • Fixup mode is supported not just by txn(), which scopes the execution of a code reference to a single transaction, but also run(), which does no transaction handling. Should the new retry support be added there, too? I could see it either way (a single SQL statement executed in run() is implicitly transaction-scoped).
  • Fixup mode is also supported by svp(), which scopes the execution of a code reference to a savepoint (a.k.a. a subtransaction). Should the rollback and retry be supported there, too, or would the whole transaction have to be retried? I’m thinking the latter, since that’s currently the behavior for connection failures.
  • Given these issues, will it make more sense to perhaps create a new mode? Maybe it would be supported only by txn().

This is do-able, will likely just take some experimentation to figure it out and settle on the appropriate API. I’ll need to find the tuits for that soon.

In the meantime, given currently in-progress changes, I’ve just released a new version of DBIx::Connector with a single change: All uses of the deprecated catch syntax now throw warnings. The previous version threw warnings only the first time the syntax was used in a particular context, to keep error logs from getting clogged up. Hopefully most folks have changed their code in the two months since the previous release and switched to Try::Tiny or some other model for exception handling. The catch syntax will be completely removed in the next release of DBIx::Connector, likely around the end of the year. Hopefully the new SSI-aware retry functionality will have been integrated by then, too.

In a future post I’ll likely chew over whether or not to add an API to set the transaction isolation level within a call to txn() and friends.

  • E-mail this story to a friend!
  • Sphinn
  • StumbleUpon
  • Facebook
  • del.icio.us
  • LinkedIn
  • TwitThis
  • Digg
  • Google
  • MySpace
  • Reddit
  • StumbleUpon
  • Technorati
  • Yahoo! Buzz

Fixing Foreign Key Deadlocks in PostgreSQL

PGX had a client come to us recently with a rather nasty deadlock issue. It took far longer than we would have liked to figure out the issue, and once we did, they were able to clear it up by dropping an unnecessary index. Still, it shouldn’t have been happening to begin with. Joel Jacobson admirably explained the issue on pgsql-hackers (and don’t miss the screencast).

Some might consider it a bug in PostgreSQL, but the truth is that PostgreSQL can obtain stronger than necessary locks. Such locks cause some operations to block unnecessarily and some other operations to deadlock, especially when foreign keys are used in a busy database. And really, who doesn’t use FKs in their busy database?

Fortunately, Simon Riggs proposed a solution. And it’s a good one. So good that PGX is partnering with Glue Finance and Command Prompt as founding sponsors on a new FOSSExperts project to actually get it done. Álvaro Herrera is doing the actual hacking on the project, and has already blogged about it here and here.

If you use foreign key constraints (and you should!) and you have a high transaction load on your database (or expect to soon!), this matters to you. In fact, if you use ActiveRecord with Rails, there might even be a special place in your heart for this issue, says Mina Naguib. We’d really like to get this done in time for the PostgreSQL 9.1 release. But it will only happen if the project can be funded.

Yes, that’s right, as with PGXN, this is community project for which we’re raising funds from the community to get it done. I think that more and more work could be done this way, as various interested parties contribute small amounts to collectively fund improvements to the benefit of us all. So can you help out? Hit the FOSSExperts project page for all the project details, and to make your contribution.

Help us help the community to make PostgreSQL better than ever!

  • E-mail this story to a friend!
  • Sphinn
  • StumbleUpon
  • Facebook
  • del.icio.us
  • LinkedIn
  • TwitThis
  • Digg
  • Google
  • MySpace
  • Reddit
  • StumbleUpon
  • Technorati
  • Yahoo! Buzz

Managing Key/Value Pairs in PostgreSQL

Let’s say that you’ve been following the latest research in key/value data storage and are interested in managing such data in a PostgreSQL database. You want to have functions to store and retrieve pairs, but there is no natural way to represent pairs in SQL. Many languages have hashes or or data dictionaries to fulfill this role, and you can pass them to functional interfaces. SQL’s got nothin’. In PostgreSQL, have two options: use nested arrays (simple, fast) or use a custom composite data type (sugary, legible).

Let’s assume you have this table for storing your pairs:

CREATE TEMPORARY TABLE kvstore (
    key        TEXT PRIMARY KEY,
    value      TEXT,
    expires_at TIMESTAMPTZ DEFAULT NOW() + '12 hours'::interval
);

To store pairs, you can use nested arrays like so:

SELECT store(ARRAY[ ['foo', 'bar'], ['baz', 'yow'] ]);

Not too bad, and since SQL arrays are a core feature of PostgreSQL, there’s nothing special to do. Here’s the store() function:

CREATE OR REPLACE FUNCTION store(
    params text[][]
) RETURNS VOID LANGUAGE plpgsql AS $$
BEGIN
    FOR i IN 1 .. array_upper(params, 1) LOOP
        UPDATE kvstore
           SET value      = params[i][2],
               expires_at = NOW() + '12 hours'::interval
         WHERE key        = param[i][1];
        CONTINUE WHEN FOUND;
        INSERT INTO kvstore (key, value)
        VALUES (params[i][1], params[i][2]);
    END LOOP;
END;
$$;

I’ve seen worse. The trick is to iterate over each nested array, try an update for each, and insert when no row is updated. Alas, you have no control over how many elements a user might include in a nested array. One might call it as:

SELECT store(ARRAY[ ['foo', 'bar', 'baz'] ]);

Or:

SELECT store(ARRAY[ ['foo'] ]);

No errors will be thrown in either case. In the first the "baz" will be ignored, and in the second the value will default to NULL. If you really didn’t like these behaviors, you could add some code to throw an exception if array_upper(params, 2) returns anything other than 2.

Let’s look at fetching values for keys. PostgreSQL 8.4 added variadic function arguments, so it’s easy to provide a nice interface for retrieving one or more values. The obvious one fetches a single value:

CREATE OR REPLACE FUNCTION getval(
    text
) RETURNS TEXT LANGUAGE SQL AS $$
    SELECT value FROM kvstore WHERE key = $1;
$$;

Nice and simple:

SELECT getval('baz');

 getval 
--------’
 yow

The variadic version looks like this:

CREATE OR REPLACE FUNCTION getvals(
    variadic text[]
) RETURNS SETOF text LANGUAGE SQL AS $$
    SELECT value
      FROM kvstore
      JOIN (SELECT generate_subscripts($1, 1)) AS f(i)
        ON kvstore.key = $1[i]
     ORDER BY i;
$$;

Note the use of ORDER BY i to ensure that the values are returned in the same order as the keys are passed to the function. So if I’ve got the key/value pairs ’foo’ => ‘bar’ and ’baz’ => ‘yow’, the output is:

SELECT * FROM getvals('foo', 'baz');

 getvals 
---------
 bar
 yow

If we want to the rows to have the keys and values together, we can return them as arrays, like so:

CREATE OR REPLACE FUNCTION getpairs(
    variadic text[]
) RETURNS SETOF text[] LANGUAGE SQL AS $$
    SELECT ARRAY[key, value]
      FROM kvstore
      JOIN unnest($1) AS k ON kvstore.key = k
$$;

Here I’m assuming that order isn’t important, which means we can use unnest to "flatten" the array, instead of the slightly more baroque generate_subscripts() with array access. The output:

SELECT * FROM getpairs('foo', 'baz');

  getpairs   
-------------
 {baz,yow}
 {foo,bar}

Now, this is good as far as it goes, but the use of nested arrays to represent key/value pairs is not exactly ideal: just looking at the use of a function, there’s nothing to indicate that you’re using key/value pairs. What would be ideal is to use row constructors to pass arbitrary pairs:

SELECT store( ROW('foo', 'bar'), ROW('baz', 42) );

Alas, one cannot pass RECORD values (the data type returned by ROW()) to non-C functions in PostgreSQL.1 But if you don’t mind your keys and values always being TEXT, we can get almost all the way there by creating an "ordered pair" data type as a composite type like so:

CREATE TYPE pair AS ( k text, v text );

Then we can create store() with a signature of VARIADIC pair[] and pass in any number of these suckers:

CREATE OR REPLACE FUNCTION store(
    params variadic pair[]
) RETURNS VOID LANGUAGE plpgsql AS $$
DECLARE
    param pair;
BEGIN
    FOR param IN SELECT * FROM unnest(params) LOOP
        UPDATE kvstore
           SET value = param.v,
               expires_at = NOW() + '12 hours'::interval
         WHERE key = param.k;
        CONTINUE WHEN FOUND;
        INSERT INTO kvstore (key, value) VALUES (param.k, param.v);
    END LOOP;
END;
$$;

Isn’t it nice how we can access keys and values as param.k and param.v? Call the function like this:

SELECT store( ROW('foo', 'bar')::pair, ROW('baz', 'yow')::pair );

Of course, that can get a bit old, casting to pair all the time, so let’s create some pair constructor functions to simplify things:

CREATE OR REPLACE FUNCTION pair(anyelement, text)
RETURNS pair LANGUAGE SQL AS 'SELECT ROW($1, $2)::pair';

CREATE OR REPLACE FUNCTION pair(text, anyelement)
RETURNS pair LANGUAGE SQL AS 'SELECT ROW($1, $2)::pair';

CREATE OR REPLACE FUNCTION pair(anyelement, anyelement)
RETURNS pair LANGUAGE SQL AS 'SELECT ROW($1, $2)::pair';

CREATE OR REPLACE FUNCTION pair(text, text)
RETURNS pair LANGUAGE SQL AS 'SELECT ROW($1, $2)::pair;';

I’ve created four variants here to allow for the most common combinations of types. So any of the following will work:

SELECT pair('foo', 'bar');
SELECT pair('foo', 1);
SELECT pair(12.3, 'foo');
SELECT pair(1, 43);

Alas, you can’t mix any other types, so this will fail:

SELECT pair(1, 12.3);

ERROR:  function pair(integer, numeric) does not exist
LINE 1: SELECT pair(1, 12.3);

We could create a whole slew of additional constructors, but since we’re using a key/value store, it’s likely that the keys will usually be text anyway. So now we can call store() like so:

SELECT store( pair('foo', 'bar'), pair('baz', 'yow') );

Better, eh? Hell, we can go all the way and create a nice binary operator to make it still more sugary. Just map each of the pair functions to the operator like so:

CREATE OPERATOR -> (
    LEFTARG   = text,
    RIGHTARG  = anyelement,
    PROCEDURE = pair
);

CREATE OPERATOR -> (
    LEFTARG   = anyelement,
    RIGHTARG  = text,
    PROCEDURE = pair
);

CREATE OPERATOR -> (
    LEFTARG   = anyelement,
    RIGHTARG  = anyelement,
    PROCEDURE = pair
);

CREATE OPERATOR -> (
    LEFTARG   = text,
    RIGHTARG  = text,
    PROCEDURE = pair
);

Looks like a lot of repetition, I know, but checkout the new syntax:

SELECT store( 'foo' -> 'bar', 'baz' -> 1 );

Cute, eh? I chose to use -> because => is deprecated as an operator in PostgreSQL 9.0: SQL 2011 reserves that operator for named parameter assignment.2

As a last twist, let’s rewrite getpairs() to return pairs instead of arrays:

CREATE OR REPLACE FUNCTION getpairs(
    variadic text[]
) RETURNS SETOF pair LANGUAGE SQL AS $$
    SELECT key -> value
      FROM kvstore
      JOIN unnest($1) AS k ON kvstore.key = k
$$;

Cute, eh? Its use is just like before, only now the output is more table-like:

SELECT * FROM getpairs('foo', 'baz');

  k  |   v   
-----+-------
 baz | yow
 foo | bar

You can also get them back as composites by omitting * FROM:

SELECT getpairs('foo', 'baz');

  getpairs   
-------------
 (foo,bar)
 (baz,yow)

Anyway, just something to consider the next time you need a function that allows any number of key/value pairs to be passed. It’s not perfect, but it’s pretty sweet.


  1. In the recent pgsql-hackers discussion that inspired this post, Pavel Stehule suggested adding something like Oracle COLLECTIONs to address this shortcoming. I don’t know how far this idea will get, but it sure would be nice to be able to pass objects with varying kinds of data, rather than be limited to data all of one type (values in an SQL array must all be of the same type).

  2. No, you won’t be able to use named parameters for this application because named parameters are inherently non-variadic. That is, you can only pre-declare so many named parameters: you can’t anticipate every parameter that’s likely to be wanted as a key in our key/value store.

  • E-mail this story to a friend!
  • Sphinn
  • StumbleUpon
  • Facebook
  • del.icio.us
  • LinkedIn
  • TwitThis
  • Digg
  • Google
  • MySpace
  • Reddit
  • StumbleUpon
  • Technorati
  • Yahoo! Buzz

PGXN Blog and Twitterstream

I crated the PGXN Blog yesterday. Tune in there for news and announcements. I’ll also be posting status reports once development gets underway, so that all you fans out there can follow my progress. Once the site is done (or at 1.0 anyway), the blog will be used for announcements, discussion of support issues, etc. So tune in!

Oh, and I created a PGXN Twitterstream, too. You should follow it! New blog posts will be tweeted, and once the site gets going, new uploads will be tweeted, too. Check it out!

  • E-mail this story to a friend!
  • Sphinn
  • StumbleUpon
  • Facebook
  • del.icio.us
  • LinkedIn
  • TwitThis
  • Digg
  • Google
  • MySpace
  • Reddit
  • StumbleUpon
  • Technorati
  • Yahoo! Buzz

Powered by KinoSearch