home :: computers :: databases :: postgresql :: reducing view calculations

Need Help Reducing View Calculations

I could use some advice and suggestions for how to solve a performance problem due to the highly redundant calculation of values in a view. Sorry for the longish explanation. I wanted to make sure that I omitted no details in desribing the problem.

In order to support recurring events in an application I’m working on, we have a lookup table that maps dates to their daily, weekly, monthly, and yearly recurrences. It looks something like this:

try=# \d recurrence_dates
   Table "public.recurrence_dates"
   Column   |    Type    | Modifiers 
------------+------------+-----------
 date       | date       | not null
 recurrence | recurrence | not null
 next_date  | date       | not null
Indexes:
    "recurrence_dates_pkey" PRIMARY KEY, btree (date, recurrence, next_date)
    "index_recurrence_dates_on_date_and_recurrence" btree (date, recurrence)

try=# select * from recurrence_dates
try-# where date = '2007-11-04'
try-# order by recurrence, next_date;
    date    | recurrence | next_date  
------------+------------+------------
 2007-11-04 | daily      | 2007-11-04
 2007-11-04 | daily      | 2007-11-05
 2007-11-04 | daily      | 2007-11-06
 2007-11-04 | weekly     | 2007-11-04
 2007-11-04 | weekly     | 2007-11-11
 2007-11-04 | weekly     | 2007-11-18
 2007-11-04 | monthly    | 2007-11-04
 2007-11-04 | monthly    | 2007-12-04
 2007-11-04 | monthly    | 2008-01-04
 2007-11-04 | annually   | 2007-11-04
 2007-11-04 | annually   | 2008-11-04
 2007-11-04 | annually   | 2009-11-04
 2007-11-04 | none       | 2007-11-04

To get all of the permutations of recurring events, we simply select from a view rather than from the events table that contains the actual event data. The view joins events to recurrence_dates table like so:

CREATE OR REPLACE VIEW recurring_events AS
SELECT id, name, user_id, duration,
       (rd.next_date || ' ' ||
       (starts_at::timestamptz at time zone start_tz)::time)::timestamp
       at time zone start_tz AS starts_at,
       start_tz
  FROM events LEFT JOIN recurrence_dates rd
    ON (events.starts_at::timestamptz at time zone events.start_tz)::date = rd.date
   AND events.recurrence = rd.recurrence;

Then, to get all of the recurrences of events for a user within a week, we do something like this in the client code:

SELECT *
  FROM recurring_events
 WHERE user_id = 2
   AND starts_at BETWEEN '2007-11-04 07:00:00' AND '2007-11-10 07:59:59';

This works perfectly, as all of our dates and times are stored in UTC in timestamp columns. We pass UTC times for the appropriate offset to the query (Pacific Time in this example) and, because the view does the right thing in mapping the starts_at time for each event to its proper time zone, we get all of the events within the date range, even if they are recurrences of an earlier event, and with their times properly set.

The trouble we’re having, however, is all of those conversions. Until last week, the view just kept everything in UTC and left it to the client to convert to the proper zone in the start_tz column. But that didn’t work so well when an event’s starts_at was during daylight savings time and recurrences were in standard time: the standard time recurrences were all an hour off! So I added the repeated instances of events.starts_at::timestamptz at time zone events.start_tz. But now the view is really slow.

Since the only thing that has changed is the addition of the time zone conversions, I believe that the performance penalty is because of them. The calculation executes multiple times per row: once for the join and once again for the starts_at column. We can have an awful lot of events for a given user, and an awful lot of recurrences of a given event. If, for example, an event recurs daily for 2 years, there will be around 730 rows for that one event. And the calculation has to be executed for every one of them before the WHERE clause can be properly evaluated. Ouch! Worse still, we actually have three columns that do this in our application, not just one as in the example here.

So what I need is a way to execute that calculation just once for each row in the events table, rather than once for each row in the recurring_events view. I figure 1 calculation will be a heck of a lot faster than 730! So the question is, how do I do this? How do I get the view to execute the conversion of the starts_at to the start_tz time zone only once for each row in events, regardless of how many rows it ends up generating in the recurring_events view?

Suggestions warmly welcomed. This is a bit of a tickler for me, and since the query performance on these views is killing us, I need to get this adjusted post haste!

Meanwhile, tomorrow I’ll post a cool hack I came up with for validating time zones in the database. Something to look forward to as you ponder my little puzzle, eh?

Update 2008-01-30: Thanks to help from depesz, I came figured out what the underling problem was and solved it much more elegantly using PL/pgSQL. I’ve now written up the basic recipe. Enjoy!

Comments & Trackbacks

depesz@depesz.com wrote:

it would help a lot if you could show:

  1. \d events
  2. explain analyze of the problematic query
  3. version of pg

EdwinF wrote:

More info needed.

Hello. As Depesz said, it is still not enough information to trace the performance issue (explain analyze, column type...). For now, what I can recommend is doing right side transformation before comparing and column redundace with the already calculated field using one index by that column.

instead of: starts_at Between 'val1' and 'val2' try: starts_at between 'val1'::date and 'val2'::date and make sure you have an index on that column only.

These are my two cents.

Scott Woods wrote:

Don't forget the PostgreSQL performance mailling list

If you haven't posted the question there already, you should check out the pgsql-performance list. The content is generally excellent:

http://archives.postgresql.org/pgsql-performance/

hernan wrote:

So, if I understand righly, you must match the date part of the events.starts_at (stored as timestimes UTC) as seen by the client (i.e. non UTC) to a recurrence_dates.date (plain date). The convertion seems unavoidable, and, as you see, the performance must suffer (dont know if postgresql has some sort of function results cache).

My first simple hack (obvious and ugly, of course; redundant data...) would be to add a column to events table, starts_at_local = starts_at with the "local" (clientside) timestamp (or perhaps just the date part). It seems you really need that data and it's too costly to compute dynamically. If the view were used only ocassionally (say, for reports and/or in batch mode) you might even produce this column by SELECTing INTO a temporary table, from the events table, perhaps restricting by date range.

Theory wrote:

More Info

depesz:

  1. try=# \d events
                                            Table "public.events"
         Column      |            Type             |                       Modifiers
    
    -----------------+-----------------------------+-------------------------------------------------------
     id              | integer                     | not null default nextval('events_id_seq'::regclass)
     name            | text                        | not null default ''::text
     user_id         | integer                     | not null
     starts_at       | timestamp without time zone | 
     start_tz        | text                        | not null default ''::text
     recurrence      | recurrence                  | not null
    Indexes:
        "events_pkey" PRIMARY KEY, btree (id)
        "index_events_on_starts_at" btree (starts_at)
        "index_events_on_user_id" btree (user_id)
    Check constraints:
    Foreign-key constraints:
        "fk_events_users" FOREIGN KEY (user_id) REFERENCES users(id) ON UPDATE CASCADE ON DELETE CASCADE
    
  2. Here is is the EXPLAIN ANALYZE output using the view two times:

    try=# EXPLAIN ANALYZE SELECT *
    try-#   FROM recurring_events
    try-#  WHERE user_id = 2
    try-#    AND starts_at BETWEEN '2007-11-04 07:00:00' AND '2007-11-10 07:59:59';
                                                                                                                                                                                                                                                                                     QUERY PLAN                                                                                                                                                                                                                                                                                 
    ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
     Nested Loop  (cost=0.01..701867.56 rows=686202 width=712) (actual time=67.627..13572.904 rows=212 loops=1)
       Join Filter: ((timezone(events.start_tz, ((((rd.next_date)::text || ' '::text) || ((timezone(events.start_tz, (events.starts_at)::timestamp with time zone))::time without time zone)::text))::timestamp without time zone) >= '2007-11-04 07:00:00+00'::timestamp with time zone) AND (timezone(events.start_tz, ((((rd.next_date)::text || ' '::text) || ((timezone(events.start_tz, (events.starts_at)::timestamp with time zone))::time without time zone)::text))::timestamp without time zone) <= '2007-11-10 07:59:59+00'::timestamp with time zone))
       ->  Index Scan using index_events_on_user_id on events  (cost=0.00..7333.47 rows=4459 width=708) (actual time=16.901..9763.273 rows=5461 loops=1)
             Index Cond: (user_id = 2)
       ->  Index Scan using recurrence_dates_pkey on recurrence_dates rd  (cost=0.01..42.65 rows=1385 width=10) (actual time=0.411..0.440 rows=5 loops=5461)
             Index Cond: (((timezone(events.start_tz, (events.starts_at)::timestamp with time zone))::date = rd.date) AND (events.recurrence = rd.recurrence))
     Total runtime: 13575.670 ms
    (7 rows)
    
    try=# EXPLAIN ANALYZE SELECT *
      FROM recurring_events
     WHERE user_id = 2
       AND starts_at BETWEEN '2007-11-04 07:00:00' AND '2007-11-10 07:59:59';
                                                                                                                                                                                                                                                                                     QUERY PLAN                                                                                                                                                                                                                                                                                 
    ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
     Nested Loop  (cost=0.01..701867.56 rows=686202 width=712) (actual time=0.873..1653.552 rows=212 loops=1)
       Join Filter: ((timezone(events.start_tz, ((((rd.next_date)::text || ' '::text) || ((timezone(events.start_tz, (events.starts_at)::timestamp with time zone))::time without time zone)::text))::timestamp without time zone) >= '2007-11-04 07:00:00+00'::timestamp with time zone) AND (timezone(events.start_tz, ((((rd.next_date)::text || ' '::text) || ((timezone(events.start_tz, (events.starts_at)::timestamp with time zone))::time without time zone)::text))::timestamp without time zone) <= '2007-11-10 07:59:59+00'::timestamp with time zone))
       ->  Index Scan using index_events_on_user_id on events  (cost=0.00..7333.47 rows=4459 width=708) (actual time=0.261..152.600 rows=5461 loops=1)
             Index Cond: (user_id = 2)
       ->  Index Scan using recurrence_dates_pkey on recurrence_dates rd  (cost=0.01..42.65 rows=1385 width=10) (actual time=0.016..0.026 rows=5 loops=5461)
             Index Cond: (((timezone(events.start_tz, (events.starts_at)::timestamp with time zone))::date = rd.date) AND (events.recurrence = rd.recurrence))
     Total runtime: 1654.539 ms
    (7 rows)
    
  3. try=# select version();
                                                                       version                                                                    
    ----------------------------------------------------------------------------------------------------------------------------------------------
     PostgreSQL 8.2.3 on powerpc-apple-darwin8.8.0, compiled by GCC powerpc-apple-darwin8-gcc-4.0.1 (GCC) 4.0.1 (Apple Computer, Inc. build 5250)
    (1 row)
    

Thanks!

—Theory

Theory wrote:

Yet More Info

And here is the EXPLAIN ANALYZE output for the same query, but without the date comparisons in the WHERE:

try=# explain analyze SELECT *
  FROM recurring_events
 WHERE user_id = 2;
                                                                           QUERY PLAN                                                                            
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop Left Join  (cost=0.01..845975.05 rows=6175817 width=712) (actual time=0.207..2493.438 rows=30836 loops=1)
   ->  Index Scan using index_events_on_user_id on events  (cost=0.00..7333.47 rows=4459 width=708) (actual time=0.148..45.016 rows=5461 loops=1)
         Index Cond: (user_id = 2)
   ->  Index Scan using recurrence_dates_pkey on recurrence_dates rd  (cost=0.01..42.65 rows=1385 width=10) (actual time=0.013..0.022 rows=5 loops=5461)
         Index Cond: (((timezone(events.start_tz, (events.starts_at)::timestamp with time zone))::date = rd.date) AND (events.recurrence = rd.recurrence))
 Total runtime: 2502.230 ms
(6 rows)

Doesn't tell you much, does it? I think it's because the cost is all in the calculation of the values for the starts_at column (and two other columns that make the same calculations).

—Theory

Theory wrote:

hernan,

Yes, that's a good point. By creating a local_starts_at in the events table and putting in a trigger to keep it up-to-date, I could indeed eliminate about half the calculations per row. That would be a bonus, but I'm hoping that it can be even better than that. And more normalized.

That said, I will likely do it if I can't figure out anything better. Or else create a trigger to create all of the recurrences of any event in a table and then just join on that. Only trouble there is if we add more dates to the recurrence_dates table in the future, we'd need to re-run the trigger for every row in the table in order to add the new recurrences. With the view, however, they're just there. It's a trade-off, and frankly, I'm not perfectly happy with either solution.

Thanks for the suggestion!

—Theory

Frakkle wrote:

in response to: Need Help Reducing View Calculations

I'm posting this via trackback because of a strange posting problem:  (layout is strange and missing the email field?  Yes I am using a crud browser from here... that could be it) 

I have to agree with Hernan. The simplest saving I can see is a…

depesz wrote:

i think i have a solution for your problem.

if i might suggest - simply skip the recurrence_dates table, and use generate_series function. it will work much better.

if you have doubts about how to do it, please contact me on jabber (depesz[at]depesz.com) as i have some more questions about the case, but from what i see, it should be trivial and very fast solution.

Theory wrote:

Hi depesz,

The reason we have a table of recurrence dates is that date math is complex. Daily events are simple, and weekly events not bad to calculate, but monthly and yearly events can be a PITA. For monthly events, if you make an event for Oct 31, there won't be one in November, because there is no November 31. For annual, if you make an event for February 29, 2008, there wont' be another one until 2012, because there wont' be another Feb 29 until then.

So I wrote a Perl script that generates all the required dates for a length of time, and stores them in a table. Now if this could be replicated using generate_series, that would be great (our recurrence_dates table is huge). But I'm not at all clear on how it could be done. I'd more than welcome your suggestions; I've learned a lot from your blog already, and am thrilled to learn more (PostgreSQL hacking being one of the things I really enjoy doing these days).

—Theory

Infernoz wrote:

Theory:

My trick is, to get the end of month would be:

  1. set the day of month to 1
  2. set the month number
  3. increment the month number
  4. decrement the day of month

Easy!

Mage2k wrote:

Theory, you'd be surprised at how well generate series can work with date math:

test=# select '2007-10-31'::timestamp + (s.a || ''month')::interval from generate_series(0,6) as s(a);

     ?column?       
--------------------- 
2007-10-31 00:00:00 
2007-11-30 00:00:00 
2007-12-31 00:00:00 
2008-01-31 00:00:00 
2008-02-29 00:00:00 
2008-03-31 00:00:00 
2008-04-30 00:00:00

Mage2k wrote:

Sorry, didn't realize I needed to handle my own formatting (you should probably offer the ability to preview comments if you're going to force that):

test=# select '2007-10-31'::timestamp + (s.a || ' month')::interval from generate_series(0,6) as s(a);
      
      ?column?       
--------------------- 
2007-10-31 00:00:00 
2007-11-30 00:00:00 
2007-12-31 00:00:00 
2008-01-31 00:00:00 
2008-02-29 00:00:00 
2008-03-31 00:00:00 
2008-04-30 00:00:00

Theory wrote:

Infernoz: Not sure what you're getting at. Nowhere here am I trying to find the first of the month.

Mage2K: Yep, that's exactly what I've come to thanks to off-site help from depesz. I'll post an update in a separate blog entry describing the solution I finally settled on. And yes, the site doesn't handle any comment formatting, but I always go back and clean up the comments. :-)

Thanks!

—Theory

Powered by KinoSearch