Just a Theory

Trans rights are human rights

JSON Path Operator Confusion

The CipherDoc service offers a robust secondary key lookup API and search interface powered by SQL/JSON Path queries run against a GIN-indexed JSONB column. SQL/JSON Path, introduced in SQL:2016 and added to Postgres in version 12 in 2019, nicely enables an end-to-end JSON workflow and entity lifecycle. It’s a powerful enabler and fundamental technology underpinning CipherDoc. I’m so happy to have found it.
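
None of the CipherDoc schema appears in this post, so here’s a minimal sketch, with hypothetical table and column names, of the kind of GIN index involved; the jsonb_path_ops operator class supports the jsonpath operators explored below:

create table documents (
    id     bigint generated always as identity primary key,
    search jsonb not null  -- subset of the document queried by the search API
);

-- jsonb_path_ops supports the jsonpath operators @? and @@.
create index documents_search_idx on documents using gin (search jsonb_path_ops);

-- Queries like this one can use the index:
select id from documents where search @? '$.a[*] ? (@ > 2)';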

Confusion

However, the distinction between the SQL/JSON Path operators @@ and @? confused me. Even as I found that the @? operator worked for my needs and @@ did not, I tucked the problem into my mental backlog for later study.

The question arose again on a recent work project, and I can take a hint. It’s time to figure this thing out. Let’s see where it goes.

The docs say:

jsonb @? jsonpath → boolean
Does JSON path return any item for the specified JSON value?

'{"a":[1,2,3,4,5]}'::jsonb @? '$.a[*] ? (@ > 2)' → t


jsonb @@ jsonpath → boolean
Returns the result of a JSON path predicate check for the specified JSON value. Only the first item of the result is taken into account. If the result is not Boolean, then NULL is returned.

'{"a":[1,2,3,4,5]}'::jsonb @@ '$.a[*] > 2' → t

These read quite similarly to me: Both return true if the path query returns an item. So what’s the difference? When should I use @@ and when @?? I went so far as to ask Stack Overflow about it. The one answer directed my attention back to the jsonb_path_query() function, which returns the results from a path query.

So let’s explore how various SQL/JSON Path queries work and what values various expressions return.

Queries

The docs for jsonb_path_query say:1

jsonb_path_query ( target jsonb, path jsonpath [, vars jsonb [, silent boolean ]] ) → setof jsonb
Returns all JSON items returned by the JSON path for the specified JSON value. If the vars argument is specified, it must be a JSON object, and its fields provide named values to be substituted into the jsonpath expression. If the silent argument is specified and is true, the function suppresses the same errors as the @? and @@ operators do.
select * from jsonb_path_query(
    '{"a":[1,2,3,4,5]}',
    '$.a[*] ? (@ >= $min && @ <= $max)',
    '{"min":2, "max":4}'
);
 jsonb_path_query
------------------
 2
 3
 4

The first thing to note is that a SQL/JSON Path query may return more than one value. This feature matters for the @@ and @? operators, which return a single boolean value based on the values returned by a path query. And path queries can return a huge variety of values. Let’s explore some examples, derived from the sample JSON value and path query from the docs.2

select jsonb_path_query('{"a":[1,2,3,4,5]}', '$ ?(@.a[*] > 2)');
    jsonb_path_query    
------------------------
 {"a": [1, 2, 3, 4, 5]}
(1 row)

This query returns the entire JSON value, because that’s what $ selects at the start of the path expression. The ?() filter returns true because its predicate expression finds at least one value in the $.a array greater than 2. Here’s what happens when the filter returns false:

select jsonb_path_query('{"a":[1,2,3,4,5]}', '$ ?(@.a[*] > 5)');
 jsonb_path_query 
------------------
(0 rows)

None of the values in the $.a array are greater than five, so the query returns no value.

To select just the array, append it to the path expression after the ?() filter:

select jsonb_path_query('{"a":[1,2,3,4,5]}', '$ ?(@.a[*] > 2).a');
 jsonb_path_query 
------------------
 [1, 2, 3, 4, 5]
(1 row)

Path Modes

One might think you could select $.a at the start of the path query to get the full array if the filter returns true, but look what happens:

select jsonb_path_query('{"a":[1,2,3,4,5]}', '$.a ?(@[*] > 2)');
 jsonb_path_query 
------------------
 3
 4
 5
(3 rows)

That’s not the array, but the individual array values that each match the predicate. Turns out this is a quirk of the Postgres implementation of path modes. From what I can glean, the SQL:2016 standard dictates something like these SQL Server descriptions:

  • In lax mode, the function returns empty values if the path expression contains an error. For example, if you request the value $.name, and the JSON text doesn’t contain a name key, the function returns null, but does not raise an error.
  • In strict mode, the function raises an error if the path expression contains an error.

But the Postgres lax mode does more than suppress errors. From the docs (emphasis added):

The lax mode facilitates matching of a JSON document structure and path expression if the JSON data does not conform to the expected schema. If an operand does not match the requirements of a particular operation, it can be automatically wrapped as an SQL/JSON array or unwrapped by converting its elements into an SQL/JSON sequence before performing this operation. Besides, comparison operators automatically unwrap their operands in the lax mode, so you can compare SQL/JSON arrays out-of-the-box.

There are a few more details, but this is the crux of it: In lax mode, which is the default, Postgres always unwraps an array. Hence the unexpected list of results.3 This could be particularly confusing when querying multiple rows:

select jsonb_path_query(v, '$.a ?(@[*] > 2)')
        from (values ('{"a":[1,2,3,4,5]}'::jsonb), ('{"a":[3,5,8]}')) x(v);
 jsonb_path_query 
------------------
 3
 4
 5
 3
 5
 8
(6 rows)

Switching to strict mode by prepending strict to the JSON Path query restores the expected behavior:

select jsonb_path_query(v, 'strict $.a ?(@[*] > 2)')
        from (values ('{"a":[1,2,3,4,5]}'::jsonb), ('{"a":[3,5,8]}')) x(v);
 jsonb_path_query 
------------------
 [1, 2, 3, 4, 5]
 [3, 5, 8]
(2 rows)

Important gotcha to watch for, and a good reason to test path queries thoroughly to ensure you get the results you expect. Lax mode nicely prevents errors when a query references a path that doesn’t exist, as this simple example demonstrates:

select jsonb_path_query('{"a":[1,2,3,4,5]}', 'strict $.b');
ERROR:  JSON object does not contain key "b"

select jsonb_path_query('{"a":[1,2,3,4,5]}', 'lax $.b');
 jsonb_path_query 
------------------
(0 rows)

In general, I suggest always using strict mode when executing queries. Better still, perhaps always prefer strict mode with our friends the @@ and @? operators, which suppress some errors even in strict mode:

The jsonpath operators @? and @@ suppress the following errors: missing object field or array element, unexpected JSON item type, datetime and numeric errors. The jsonpath-related functions described below can also be told to suppress these types of errors. This behavior might be helpful when searching JSON document collections of varying structure.

Have a look:

select '{"a":[1,2,3,4,5]}' @? 'strict $.a';
 ?column? 
----------
 t
(1 row)

select '{"a":[1,2,3,4,5]}' @? 'strict $.b';
 ?column? 
----------
 <null>
(1 row)

No error for the unknown JSON key b in that second query! As for the error suppression in the jsonpath-related functions, that’s what the silent argument does. Compare:

select jsonb_path_query('{"a":[1,2,3,4,5]}', 'strict $.b');
ERROR:  JSON object does not contain key "b"

select jsonb_path_query('{"a":[1,2,3,4,5]}', 'strict $.b', '{}', true);
 jsonb_path_query 
------------------
(0 rows)

Boolean Predicates

The Postgres SQL/JSON Path Language docs briefly mention a pretty significant deviation from the SQL standard:

A path expression can be a Boolean predicate, although the SQL/JSON standard allows predicates only in filters. This is necessary for implementation of the @@ operator. For example, the following jsonpath expression is valid in PostgreSQL:

$.track.segments[*].HR < 70

This pithy statement has pretty significant implications for the return value of a path query. The SQL standard allows predicate expressions, which are akin to an SQL WHERE expression, only in ?() filters, as seen previously:

select jsonb_path_query('{"a":[1,2,3,4,5]}', '$ ?(@.a[*] > 2)');
    jsonb_path_query    
------------------------
 {"a": [1, 2, 3, 4, 5]}
(1 row)

This can be read as “return the path $ if @.a[*] > 2 is true.” But have a look at a predicate-only path query:

select jsonb_path_query('{"a":[1,2,3,4,5]}', '$.a[*] > 2');
 jsonb_path_query 
------------------
 true
(1 row)

This path query can be read as “Return the result of the predicate $.a[*] > 2”, which in this case is true. This is quite the divergence from the standard, which returns contents from the JSON queried, while a predicate query returns the result of the predicate expression itself. It’s almost like they’re two different things!

Don’t confuse the predicate path query return value with selecting a boolean value from the JSON. Consider this example:

select jsonb_path_query('{"a":[true,false]}', '$.a ?(@[*] == true)');
 jsonb_path_query 
------------------
 true
(1 row)

Looks the same as the predicate-only query, right? But it’s not, as shown by adding another true value to the $.a array:

select jsonb_path_query('{"a":[true,false,true]}', '$.a ?(@[*] == true)');
 jsonb_path_query 
------------------
 true
 true
(2 rows)

This path query returns the trues it finds in the $.a array. The fact that it returns values from the JSON rather than the filter predicate becomes more apparent in strict mode, which returns all of $.a if one or more elements of the array has the value true:

select jsonb_path_query('{"a":[true,false,true]}', 'strict $.a ?(@[*] == true)');
  jsonb_path_query   
---------------------
 [true, false, true]
(1 row)

This brief aside, and its mention of the @@ operator, turns out to be key to understanding the difference between @? and @@. Because it’s not just that this feature is “necessary for implementation of the @@ operator”. No, I would argue that it’s the only kind of expression usable with the @@ operator.

Match vs. Exists

Let’s get back to the @@ operator. We can use a boolean predicate JSON Path like so:

select '{"a":[1,2,3,4,5]}'::jsonb @@ '$.a[*] > 2';
 ?column? 
----------
 t
(1 row)

It returns true because the predicate JSON path query $.a[*] > 2 returns true. And when it returns false?

select '{"a":[1,2,3,4,5]}'::jsonb @@ '$.a[*] > 6';
 ?column? 
----------
 f
(1 row)

So far so good. What happens when we try to use a filter expression that returns a true value selected from the JSONB?

select '{"a":[true,false]}'::jsonb @@ '$.a ?(@[*] == true)';
 ?column? 
----------
 t
(1 row)

Looks right, doesn’t it? But recall that this query returns all of the true values from $.a, while @@ wants only a single boolean. What happens when we add another?

select '{"a":[true,false,true]}'::jsonb @@ 'strict $.a ?(@[*] == true)';
 ?column? 
----------
 <null>
(1 row)

Now it returns NULL, even though it’s clearly true that @[*] == true matches. This is because it returns all of the values it matches, as jsonb_path_query() demonstrates:

select jsonb_path_query('{"a":[true,false,true]}'::jsonb, '$.a ?(@[*] == true)');
 jsonb_path_query 
------------------
 true
 true
(2 rows)

This clearly violates the @@ documentation claim that “Only the first item of the result is taken into account”. If that were true, it would see the first value is true and return true. But it doesn’t. Turns out, the corresponding jsonb_path_match() function shows why:

select jsonb_path_match('{"a":[true,false,true]}'::jsonb, '$.a ?(@[*] == true)');
ERROR:  single boolean result is expected

Conclusion: The documentation is inaccurate. Only a single boolean is expected by @@. Anything else is an error.
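
For contrast, hand jsonb_path_match() a predicate path query that yields a single boolean — the same predicate as the docs’ @@ example — and it behaves exactly as advertised:

select jsonb_path_match('{"a":[1,2,3,4,5]}'::jsonb, '$.a[*] > 2');
 jsonb_path_match 
------------------
 t
(1 row)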

Furthermore, it’s dangerous, at best, to use an SQL standard JSON Path expression with @@. If you need to use it with a filter expression, you can turn it into a boolean predicate by wrapping it in exists():

select jsonb_path_match('{"a":[true,false,true]}'::jsonb, 'exists($.a ?(@[*] == true))');
 jsonb_path_match 
------------------
 t
(1 row)

But there’s no reason to do so, because that’s effectively what the @? operator (and the corresponding, cleverly-named jsonb_path_exists() function) does: it returns true if the SQL standard JSON Path expression returns any results:

select '{"a":[true,false,true]}'::jsonb @? '$.a ?(@[*] == true)';
 ?column? 
----------
 t
(1 row)

Here’s the key thing about @?: you don’t want to use a boolean predicate path query with it, either. Consider this predicate-only query:

select jsonb_path_query('{"a":[1,2,3,4,5]}'::jsonb, '$.a[*] > 6');
 jsonb_path_query 
------------------
 false
(1 row)

But see what happens when we use it with @?:

select '{"a":[1,2,3,4,5]}'::jsonb @? '$.a[*] > 6';
 ?column? 
----------
 t
(1 row)

It returns true even though the query itself returns false! Why? Because false is a value that exists and is returned by the query. Even a query that returns null is considered to exist, as it will when a strict query encounters an error:

select jsonb_path_query('{"a":[1,2,3,4,5]}'::jsonb, 'strict $[*] > 6');
 jsonb_path_query 
------------------
 null
(1 row)

select '{"a":[1,2,3,4,5]}'::jsonb @? 'strict $[*] > 6';
 ?column? 
----------
 t
(1 row)

The key thing to know about the @? operator is that it returns true if anything is returned by the path query, and returns false only if nothing is selected at all.
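
And a quick demonstration of the flip side: a standard path query whose filter selects nothing at all is the one way to get false out of @?:

select '{"a":[1,2,3,4,5]}'::jsonb @? '$.a[*] ? (@ > 5)';
 ?column? 
----------
 f
(1 row)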

The Difference

In summary, the difference between the @? and @@ JSONB operators is this:

  • @? (and jsonb_path_exists()) returns true if the path query returns any values — even false or null — and false if it returns no values. This operator should be used only with SQL-standard JSON path queries that select data from the JSONB. Do not use predicate-only JSON path expressions with @?.
  • @@ (and jsonb_path_match()) returns true if the path query returns the single boolean value true and false otherwise. This operator should be used only with Postgres-specific boolean predicate JSON path queries, which return the result of the predicate expression rather than data selected from the JSONB. Do not use SQL-standard JSON path expressions with @@. The recap below shows the two pairings side by side.
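
Here’s the recap, pairing each operator with the kind of path query it’s meant for, using the sample value from the docs:

-- Exists: SQL-standard path query that selects data from the JSONB.
select '{"a":[1,2,3,4,5]}'::jsonb @? '$.a[*] ? (@ > 2)';
 ?column? 
----------
 t
(1 row)

-- Match: Postgres-specific boolean predicate path query.
select '{"a":[1,2,3,4,5]}'::jsonb @@ '$.a[*] > 2';
 ?column? 
----------
 t
(1 row)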

Applying this difference, of course, requires awareness of the distinction between predicate path queries and SQL standard path queries. To that end, I submitted a patch that expounds the difference between these types of JSON Path queries, and plan to submit another linking these differences in the docs for @@ and @?.

Oh, and probably another to explain the difference in return values between strict and lax queries due to array unwrapping.

Thanks

Many thanks to Erik Wienhold for patiently answering my pgsql-hackers questions and linking me to a detailed pgsql-general thread in which the oddities of @@ were previously discussed in detail.


  1. Well almost. The docs for jsonb_path_query actually say, about the last two arguments, “The optional vars and silent arguments act the same as for jsonb_path_exists.” I replaced that sentence with the relevant sentences from the jsonb_path_exists docs, about which more later. ↩︎

  2. Though omitting the vars argument, as variable interpolation just gets in the way of understanding basic query result behavior. ↩︎

  3. In fairness, the Oracle docs also discuss “implicit array wrapping and unwrapping”, but I don’t have a recent Oracle server to experiment with at the moment. ↩︎

CipherDoc: A Searchable, Encrypted JSON Document Service on Postgres

Over the last year, I designed and implemented a simple web service, code-named “CipherDoc”, that provides a CRUD API for creating, updating, searching, and deleting JSON documents. The app enforces document structure via JSON schema, while JSON/SQL Path powers the search API by querying a hashed subset of the schema stored in a GIN-indexed JSONB column in Postgres.

In May I gave a public presentation on the design and implementation of the service at PGCon: CipherDoc: A Searchable, Encrypted JSON Document Service on Postgres. Links:

I enjoyed designing this service. The ability to dynamically change the JSON schema at runtime without database changes enables more agile development cycles for busy teams. Its data privacy features required a level of intellectual challenge and raw problem-solving (a.k.a., engineering) that challenges and invigorates me.

Two minor updates since May:

  1. I re-implemented the JSON/SQL Path parser using the original Postgres path grammar and goyacc, replacing the hand-written parser roundly castigated in the presentation.
  2. The service has yet to be open-sourced, but I remain optimistic, and continue to work with leadership at The Times towards an open-source policy to enable its release.

UK Architecture

We saw some spectacular architecture on a whirlwind trip to London, Bath, Edinburgh, Glasgow, and Canterbury last month. Among the long list of museums, churches, and universities we visited, I managed to capture a few photos I quite like.

Collective Decision-Making with AHP

Me, writing for NYT Open:

The Identity Team at the Times, responsible for building and maintaining identity and authentication services for all of our users, has embarked on an ambitious project to build a centralized identity platform. We’re going to make a lot of decisions, such as what languages we should use, what database, how we can best protect personal information, what the API should look like, and so much more. Just thinking about the discussions and consensus-building required for a project of this scope daunts even the most experienced decision-makers among us. Fortuitously, a presentation at StaffPlus NYC by Comcast Fellow John Riviello introduced a super fascinating approach to collective decision-making, the Analytic Hierarchy Process (AHP).

I quite enjoyed our experiment with AHP, a super useful tool for collective decision-making. For a less technical primer, Wikipedia has some great examples:

RFC: Restful Secondary Key API

I’ve been working on a simple CRUD API at work, with an eye to make a nicely-designed REST interface for managing a single type of resource. It’s not a complicated API, following best practices recommended by Apigee and Microsoft. It features exactly the sorts of APIs you’d expect if you’re familiar with REST, including:

  • POST /users: Create a new user resource
  • GET /users/{uid}: Read a user resource
  • PUT /users/{uid}: Update a user resource
  • DELETE /users/{uid}: Delete a user resource
  • GET /users?{params}: Search for user resources

If you’re familiar with REST, you get the idea.

There is one requirement that proved a bit of a design challenge. We will be creating a canonical ID for all resources managed by the service, which will function as the primary key. The APIs above reference that key by the {uid} path variable. However, we also need to support fetching a single resource by a number of existing identifiers, including multiple legacy IDs, and natural keys like, sticking to the users example, usernames and email addresses. Unlike the search API, which returns an array of resources, we need a nice single API like GET /users/{uid} that returns a single resource, but for a secondary key. What should it look like?

None of my initial proposals were great (using username as the sample secondary key, though again, we need to support a bunch of these):

  • GET /users?username={username} — consistent with search, but does it return a collection like search or just a single entry like GET /users/{uid}? It would be weird to return an array or not depending on which parameters were used.
  • GET /users/by/username/{username} — bit weird to put a preposition in the URL. Besides, it might conflict with a planned API to fetch subsets of info for a single resource, e.g., GET /users/{uid}/profile, which might return just the profile object.
  • GET /user?username={username} — Too subtle to have the singular rather than plural, but perhaps the most REST-ish.
  • GET /lookup?obj=user&username={username} — Uses a special verb; not very RESTful.

I asked around a coding Slack, posting a few possibilities, and friendly API designers suggested some others. We agreed it was an interesting problem, easily solved if there were just one alternate key that never conflicts with the primary key ID, such as GET /users/{uid || username}. But of course that’s not the problem we have: there are a bunch of these fields, and they may well overlap!

There was some interest in GET /users/by/username/{username} as an aesthetically-pleasing URL, plus it allows for

  • /by => list of unique fields
  • /by/username/ => list of all usernames?

But again, it runs up against the planned use of subdirectories to return sub-objects of a resource. One other pattern I played around with was GET /users/user?username={username}: the user sub-path indicates we want just one user much more than /by does, and it’s unlikely we’d ever use user to name an object in a user resource. But still, it overloads the path to mean one thing when it’s user and another when it’s a UID.

Looking back through the options, I realized that what we really want is an API that is identical to GET /users/{uid} in its behaviors and response, just with a different key. So what if we just keep using that, as originally suggested by a colleague as GET /users/{uid || username}, but instead of just the raw value, we encode the key name in the URL? Turns out, colons (:) are valid in paths, so I defined this route:

  • GET /users/{key}:{value}: Fetch a single resource by looking up the {key} with the {value}. Supported {key} params are legacy_id, username, email_address, and even uid. This then becomes the canonical “look up a user resource by an ID” API.

The nice thing about this API is that it’s consistent: all keys are treated the same, as long as no key name contains a colon. Best of all, we can keep the original GET /users/{uid} API around as an alias for GET /users/uid:{value}. Or, better, continue to refer to it as the canonical path, since the PUT and DELETE actions map only to it, and document the GET /users/{key}:{value} API as accessing an alias or symlink for GET /users/{uid}. Perhaps return a Location header to the canonical URL, too?

In any event, as far as I can tell this is a unique design, so maybe it’s too weird or not properly RESTful? Would love to know of any other patterns designed to solve the problem of supporting arbitrarily-named secondary unique keys. What do you think?

Update: Aristotle Pagaltzis started a discussion on this pattern in a Gist.

Bryce Canyon 1987

Back in 1987, my mom and I went on a trip around the American Southwest. I was 18, freshly graduated from high school. We had reservations to ride donkeys down into the Grand Canyon, but, sadly, I got the flu, which kept us in the hotel along the rim.

The highlight of the trip turned out to be Bryce Canyon, where I made this photo of its famous hoodoos. Likely shot with Kodachrome 64, my go-to for sunny summer shots at the time, on a Pentax ME Super SLR with, as I recall, a 28-105mm lens. Mom asked me yesterday if I’d scanned photos from that trip, and when I dug into my scans, the deeply saturated colors with those lovely evergreens took my breath away.

Feynman’s Genius

Yours truly, in a 2018 review of Genius, by James Gleick:

Because our ways of understanding the universe are not the universe itself. They’re explanatory tools we develop, use, and sometimes discard in favor of newer, more effective tools. They’re imperfect, products of their times and cultures. But sometimes, in the face of an intractable problem, a maverick mind, cognizant of this reality, will take the radical step of discarding some part of the prevailing doctrine in an attempt to simplify the problem, or just to see what might happen. Feynman was such a mind, as Gleick shows again and again.

In case you’re wondering why I’m linking to my own blog, while this piece dates from 2018, I posted it only a few weeks ago. Originally I posted it on Goodreads, but when Goodreads unceremoniously deleted my account I thought it was gone for good. But two months later, Goodreads sent me my content. I was back in business! With my data recovered and added to my StoryGraph profile, I also took the opportunity to post the one review I had put some effort into on my own site. So here we are.

In other words, I’m more likely to post book reviews on Just a Theory from here on, but meanwhile, I’d be happy to be your friend on StoryGraph.

How Goodreads Deleted My Account

At 12:31pm on February 2, I got an email from Goodreads:

Hi David,

This is a notice to let you know that the password for your account has been changed.

If you did not recently reset or change your password, it is possible that your account has been compromised. If you have any questions about this, please reach out to us using our Contact Us form. Alternatively, visit Goodreads Help.

Since I had not changed my password, I immediately hit the “Goodreads Help” link (not the one in the email, mind you) and reported the issue. At 2:40pm I wrote:

I got an email saying my password had been changed. I did not change my password. I went to the site and tried to log in, but the login failed. I tried to reset my password, but got an email saying my email is not in the system.

So someone has compromised the account. Please help me recover it.

I also tried to log in, but failed. I tried the app on my phone, and had been logged out there, too.

The following day at 11:53am, Goodreads replied asking me for a link to my account. I had no idea what the link to my account was, and since I assumed that all my information had been changed by the attackers, I didn’t think to search for it.

Three minutes later, at 11:56, I replied:

No, I always just used the domain and logged in, or the iOS app. I’ve attached the last update email I got around 12:30 EST yesterday, in case that helps. I’ve also attached the email telling me my password had been changed around 2:30 yesterday. That was when I became aware of the fact that the account was taken over.

A day and a half later, at 5:46pm on the 4th, Goodreads support replied to say that they needed the URL in order to find it and investigate and asked if I remembered the name on the account. This seemed odd to me, since until at least February 2nd it was associated with my name and email address.

I replied 3 minutes later at 5:49:

The name is mine. The username maybe? I’m usually “theory”, “itheory”, or “justatheory”, though if I set up a username for Goodreads it was ages ago and never really came up. Where could I find an account link?

Over the weekend I can log into Amazon and Facebook and see if I see any old integration messages.

The following day was Saturday the fifth. I logged into Facebook to see what I could find. I had deleted the link to Goodreads in 2018 (when I also ceased to use Facebook), but there was still a record of it, so I sent the link ID Facebook had. I also pointed out that my email address had been associated with the account for many years until it was changed on Feb 2. Couldn’t they find it in the history for the account?

I still didn’t know the link to my account, but forwarded the marketing redirect links that had been in the password change email, as well as an earlier email with a status on my reading activity.

After I sent the email, I realized I could ask some friends who I knew followed me on Goodreads to see if they could dig up the link. Within a few minutes my pal Travis had sent it to me, https://www.goodreads.com/user/show/7346356-david-wheeler. I was surprised, when I opened it, to see all my information there as I’d left it, no changes. I still could not log in, however. I immediately sent the link to Goodreads support (at 12:41pm).

That was the fifth. I did not hear back again until February 9th, when I was asked if I could provide some information about the account so they could confirm it was me. The message asked for:

  • Any connected apps or devices
  • Pending friend requests to your account
  • Any accounts linked to your Goodreads account (Goodreads accounts can be linked to Amazon, Apple, Google, and/or Facebook accounts)
  • The name of any private/secret groups of which you are a part
  • Any other account-specific information you can recall

Since I of course had no access to the account, I replied 30 minutes later with what information I could recall from memory: my devices, Amazon Kindle connection (Kindle would sometimes update my reading progress, though not always), membership in some groups that may or may not have been public, and the last couple books I’d updated.

Presumably, most of that information was public, and the devices may have been changed by the hackers. I heard nothing back. I sent followup inquiries on February 12th and 16th but got no replies.

On February 23rd I complained on Twitter. Four minutes later @goodreads replied and I started to hope there might be some progress again. They asked me to get in touch with Support again, which I did at 10:59am, sending all the previous information and context I could.

Then, at 12:38am, this bombshell arrived in my inbox from Goodreads support:

Thanks for your your patience while we looked into this. I have found that your account was deleted due to suspected suspicious activity. Unfortunately, once an account has been deleted, all of the account data is permanently removed from our database to comply with the data regulations which means that we are unable to retrieve your account or the related data. I know that’s not the news you wanted and I am sincerely sorry for the inconvenience. Please let me know if there’s anything else I can assist you with.

I was stunned. I mean of course there was suspicious activity, the account was taken over 19 days previously! As of the 5th when I found the link it still existed, and I had been in touch a number of times previously. Goodreads knew that the account had been reported stolen and still deleted it?

And no chance of recovery due to compliance rules? I don’t live in the EU, and even if I was subject to the GDPR or CCPA, there is no provision to delete my data unless I request it.

WTAF.

So to summarize:

  • Someone took control of my account on February 2
  • I reported it within hours
  • On February 5 my account was still on Goodreads
  • We exchanged a number of messages
  • By February 23 the account was deleted with no chance of recovery due to suspicious activity

Because of course there was suspicious activity. I told them there was an issue!

How did this happen? What was the security configuration for my account?

  • I created an entry for Goodreads in 1Password on January 5, 2012. The account may have been older than that, but for at least 10 years I’ve had it, and used it semi-regularly.
  • The password was 16 random ASCII characters generated by 1Password on October 27, 2018. I create unique random passwords for all of my accounts, so it would not be found in a breached database (and I have updated all breached accounts 1Password has identified).
  • The account had no additional factors of authentication or fallbacks to something like SMS, because Goodreads does not offer them. There was only my email address and password.
  • On February 2nd someone changed my password. I had clicked no links in emails, so phishing is unlikely. Was Goodreads support social-engineered to let someone else change the password? How did this happen?
  • I exchanged multiple messages with Goodreads support between February 2 and 23rd, to no avail. By February 23rd, my account was gone with all my reviews and reading lists.

Unlike Nelson, whose account was also recently deleted without chance of recovery, I had not been making any backups of my data. Never occurred to me, perhaps because I never put a ton of effort into my Goodreads account, mostly just tracked reading and a few brief reviews. I’ll miss my reading list the most. Will have to start a new one on my own machines.

Through all this, Goodreads support was polite but not particularly responsive. Days and then weeks went by without response. The company deleted the account for suspicious activity and claims no path to recovery for the original owner. Clearly the company doesn’t give its support people the tools they need to adequately support cases such as this.

I can think of a number of ways in which these situations can be better handled and even avoided. In fact, given my current job designing identity systems I’m going to put a lot of thought into it.

But sadly I’ll be trusting third parties less with my data in the future. Redundancy and backups are key, but so is adequate account protection. Letterboxd, for example, has no multifactor authentication features, making it vulnerable should someone decide it’s worthwhile to steal accounts to spam reviews or try to artificially pump up the scores for certain titles. Just made a backup.

You should, too, and back up your Goodreads account regularly. Meanwhile, I’m on the lookout for a new social reading site that supports multifactor authentication. But even with that, in the future I’ll post reviews here on Just a Theory and just reference them, at best, from social sites.


Update April 3, 2022: This past week, I finally got some positive news from Goodreads, two months after this saga began:

The Goodreads team would like to apologize for your recent poor experience with your account. We sincerely value your contribution to the Goodreads community and understand how important your data is to you. We have investigated this issue and attached is a complete file of your reviews, ratings, and shelvings.

And that’s it, along with some instructions for creating a new account and loading the data. Still no account recovery, so my old URL is dead and there is no information about my Goodreads friends. Still, I’m happy to at least have my lists and reviews recovered. I imported them into a new Goodreads account, then exported them again and imported them into my new StoryGraph profile.

Every Day Is Jan 6 Now

The New York Times Editorial Board in an unusually direct piece last week:

It is regular citizens [who threaten election officials] and other public servants, who ask, “When can we use the guns?” and who vow to murder politicians who dare to vote their conscience. It is Republican lawmakers scrambling to make it harder for people to vote and easier to subvert their will if they do. It is Donald Trump who continues to stoke the flames of conflict with his rampant lies and limitless resentments and whose twisted version of reality still dominates one of the nation’s two major political parties.

In short, the Republic faces an existential threat from a movement that is openly contemptuous of democracy and has shown that it is willing to use violence to achieve its ends. No self-governing society can survive such a threat by denying that it exists. Rather, survival depends on looking back and forward at the same time.

See also this Vox piece. Great to see these outlets sound the alarm about the dangers to American democracy. The threats are very real, and clear-eyed discussions should very much be dominating the public sphere.

More of this, please.

Review: Project Hail Mary

Cover of “Project Hail Mary” by Andy Weir

Project Hail Mary by Andy Weir

2021 Ballantine Books

Project Hail Mary follows the success of Andy Weir’s first novel, The Martian, and delivers the same kind of enjoyment. If a harrowing story of a solitary man in extreme environments using science and his wits to overcome one obstacle after another appeals to you, then this is the book for you. No super powers, no villains, no other people, really — just a competent scientist overcoming the odds through experimentation, constant iteration, and sheer creativity. Personally I can’t get enough of it. Shoot it right into my veins.

Andy Weir seems to know his strengths and weaknesses, given these two books. If you want to read stories of a diverse array of people interacting and growing through compelling character arcs, well, look elsewhere. Project Hail Mary doesn’t feature characters, really, but archetypes. No one really grows in this story: Ryland Grace, our protagonist and narrator, displays a consistent personality from start to finish. The book attempts to show him overcoming a character flaw, but it comes so late and at such variance to how he behaves and speaks to us that it frankly makes no sense.

But never mind, I can read other books for character growth and interaction. I’m here for the compelling plot, super interesting ideas and challenges (a whole new species that lives on the sun and migrates to Venus to breed? Lay it on me). It tickles my engineering and scientist inclinations, and we could use more of that sort of plotting in media.

So hoover it up. Project Hail Mary is a super fun adventure with compelling ideas, creative, competent people overcoming extreme circumstances without magic or hand-waving, and an unexpected friendship between two like-minded nerds in space.

I bet it’ll make a good movie, too.

Accelerate Perl Github Workflows with Caching

I’ve spent quite a few hours on evenings and weekends recently building out a comprehensive suite of GitHub Actions for Sqitch. They cover a dozen versions of Perl, nearly 70 database versions amongst nine database engines, plus a coverage test and a release workflow. A pull request can expect over 100 actions to run. Each build requires over 100 direct dependencies, plus all their dependencies. Installing them for every build would make any given run untenable.

Happily, GitHub Actions include a caching feature, and thanks to a recent improvement to shogo82148/actions-setup-perl, it’s quite easy to use in a version-independent way. Here’s an example:

name: Test
on: [push, pull_request]
jobs:
  OS:
    strategy:
      matrix:
        os: [ ubuntu, macos, windows ]
        perl: [ 'latest', '5.34', '5.32', '5.30', '5.28' ]
    name: Perl ${{ matrix.perl }} on ${{ matrix.os }}
    runs-on: ${{ matrix.os }}-latest
    steps:
      - name: Checkout Source
        uses: actions/checkout@v3
      - name: Setup Perl
        id: perl
        uses: shogo82148/actions-setup-perl@v1
        with: { perl-version: "${{ matrix.perl }}" }
      - name: Cache CPAN Modules
        uses: actions/cache@v3
        with:
          path: local
          key: perl-${{ steps.perl.outputs.perl-hash }}
      - name: Install Dependencies
        run: cpm install --verbose --show-build-log-on-failure --no-test --cpanfile cpanfile
      - name: Run Tests
        env: { PERL5LIB: "${{ github.workspace }}/local/lib/perl5" }
        run: prove -lrj4

This workflow tests every permutation of OS and Perl version specified in jobs.OS.strategy.matrix, resulting in 15 jobs. The runs-on value determines the OS, while the steps section defines steps for each permutation. Let’s take each step in turn:

  • “Checkout Source” checks the project out of GitHub. Pretty much required for any project.
  • “Setup Perl” sets up the version of Perl using the value from the matrix. Note the id key set to perl, used in the next step.
  • “Cache CPAN Modules” uses the cache action to cache the directory named local with the key perl-${{ steps.perl.outputs.perl-hash }}. The key lets us keep different versions of the local directory based on a unique key. Here we’ve used the perl-hash output from the perl step defined above. The actions-setup-perl action outputs this value, which contains a hash of the output of perl -V, so we’re tying the cache to a very specific version and build of Perl. This is important since compiled modules are not compatible across major versions of Perl.
  • “Install Dependencies” uses cpm to quickly install Perl dependencies. By default, it puts them into the local subdirectory of the current directory — just where we configured the cache. On the first run for a given OS and Perl version, it will install all the dependencies. But on subsequent runs it will find the dependencies already present, thanks to the cache, and quickly exit, reporting “All requirements are satisfied.” In this Sqitch job, it takes less than a second.
  • “Run Tests” runs the tests that require the dependencies. It requires the PERL5LIB environment variable to point to the location of our cached dependencies.

That’s the whole deal. The first run will be the slowest, depending on the number of dependencies, but subsequent runs will be much faster, up to the seven-day caching period. For a complex project like Sqitch, which uses the same OS and Perl version for most of its actions, this results in a tremendous build time savings. CI configurations we’ve used in the past often took an hour or more to run. Today, most builds take only a few minutes to test, with longer times determined not by dependency installation but by container and database latency.

Sign O’ The Times

The New York Times T Logo

Some news: I’m super happy to report that I started a new job last week at The New York Times.

After ten years at iovation, the last four working remotely from New York City and the last three under the ownership of TransUnion, I felt it was time to find something new. At The Times, I’ve taken the role of Staff Engineer on a new team, User Systems. I’m particularly stoked for this gig, as it falls right into areas of abiding interest, including privacy-by-design, personal data protection, encryption, authentication, credential management, and scaling a vital app for the whole business. Add that to the straightforward commute once the office reopens, and it’s hard to find something more ideal.

I truly appreciate the extraordinary experience of my ten years at iovation. I originally thought I’d stay a couple years, but was so engaged by the people and the great work we did that I kept at it. I learned a ton about product engineering, product design, and scalable architectures, but especially about working with terrific colleagues who made me a better person even as I tried to be of service to them. I will especially miss working with Scott, Kurk, Clara, Travis, John, and Eric — and countless others. I wish them all the best, and would enjoy working with any and all of them again anytime.

Now I’m excited to make new connections working with my amazing new colleagues at The Times. I expect we’ll collaborate on fulfilling work building super useful tools that advance The Times mission to inform and empower its readers. I’m delighted to be jumping on this ride with them.

Assume Positive Intensifies

Let’s talk about that well-worn bit of wisdom: “assume positive intent.” On the surface it’s excellent advice: practice empathy by mindfully assuming that people may create issues despite their best intentions. You’ve heard the parables, from Stephen Covey’s paradigm shift on the subway to David Foster Wallace’s latent condemnation of gas-guzzling traffic and soul-sucking supermarkets. Pepsi CEO Indra Nooyi has popularized the notion to ubiquity in corporate America.

In practice, the assumption of positive intent enables some pretty serious anti-patterns.

First, focusing on intent downplays impact. Good intentions don’t change the outcomes of one’s actions: we still must deal with whatever broke. At best, good intentions enable openness to feedback and growth, but do not erase those mistakes.

Which leads us to a more fundamental dilemma. In a piece for Medium last year, Ruth Terry, quoting the Kirwan Institute’s Lena Tenney, summarizes it aptly:

By downplaying actual impact, assuming positive intent can deprioritize the experience of already marginalized people.

“All of this focus on intention essentially remarginalizes a person of color who’s speaking up about racism by telling them that their experience doesn’t matter because the person didn’t mean it that way,” says Tenney, who helped create interactive implicit bias learning tools for the Kirwan Institute.

This remarginalization of the vulnerable seriously undermines the convictions behind “assume positive intent,” not to mention the culture at large. But the impact transcends racial contexts: it appears wherever people present uncomfortable issues to people in a dominant position.

Take the workplace. A brave employee publicly calls out a problematic behavior or practice, often highlighting implicit bias or, at the very least, patterns that contradict the professed values of the organization. Management nods and says, “I’m glad you brought that up, but it’s important for us all to assume positive intent in our interactions with our co-workers.” Then they explain the context for the actions, or, more likely, list potential mitigating details — without the diligence of investigation or even consequences. Assume positive intent, guess at or manufacture explanations, but little more.

This response minimizes the report’s impact to management while simultaneously de-emphasizing the experience of the worker who voiced it. Such brave folks, speaking just a little truth to power, may start to doubt themselves or what they’ve seen. The manager has successfully gaslighted the worker.

Leaders: please don’t do this. The phrase is not “Assume positive intent for me, but not for thee.” Extend the assumption only to the people reporting uncomfortable issues. There’s a damn good chance they came to you only by the assumption of positive intent: if your coworkers thought you had ill-intent, they would not speak at all.

If you feel inclined to defend behavior or patterns based on presumption of good intent, avoid that reflex, too. Good intent may be key to transgressors accepting difficult feedback, but hold them accountable and don’t let assumptions stand on their own. Impact matters, and so must consequences.

Most importantly, never use the assumption of good intent to downplay or dismiss the crucial but uncomfortable or inconvenient feedback brave souls bring to you.

Assume positive intent in yourself, never assert it in others, and know that, regardless of intent, problems still must be addressed without making excuses or devaluing or dismissing the people who have suffered them.

Central Park Autumn

It’s that most magical time of year in Central Park: Autumn. I spent a lot of time wandering around The Pool yesterday. Lots of folks were about, taking in the views, shooting photos. The spectacular foliage photographed best backlit by the sun. Here’s another one.

Both shot with an iPhone 12 Pro.

Automate Postgres Extension Releases on GitHub and PGXN

Back in June, I wrote about testing Postgres extensions on multiple versions of Postgres using GitHub Actions. The pattern relies on a Docker image, pgxn/pgxn-tools, which contains scripts to build and run any version of PostgreSQL, install additional dependencies, build, test, bundle, and release an extension. I’ve since updated it to support testing on the latest development release of Postgres, meaning one can test on any major version from 8.4 to (currently) 14. I’ve also created GitHub workflows for all of my PGXN extensions (except for pgTAP, which is complicated). I’m quite happy with it.

But I was never quite satisfied with the release process. Quite a number of Postgres extensions also release on GitHub; indeed, Paul Ramsey told me straight up that he did not want to manually upload extensions like pgsql-http and PostGIS to PGXN, but for PGXN to automatically pull them in when they were published on GitHub. It’s pretty cool that newer packaging systems like pkg.go.dev auto-index any packages on GitHub. Adding such a feature to PGXN would be an interesting exercise.

But since I’m low on TUITs for such a significant undertaking, I decided instead to work out how to automatically publish a release on GitHub and PGXN via GitHub Actions. After experimenting for a few months, I’ve worked out a straightforward method that should meet the needs of most projects. I’ve proven the pattern via the pair extension’s release.yml, which successfully published the v0.1.7 release today on both GitHub and PGXN. With that success, I updated the pgxn/pgxn-tools documentation with a starter example. It looks like this:

name: Release
on:
  push:
    tags:
      - 'v*' # Push events matching v1.0, v20.15.10, etc.
jobs:
  release:
    name: Release on GitHub and PGXN
    runs-on: ubuntu-latest
    container: pgxn/pgxn-tools
    env:
      # Required to create GitHub release and upload the bundle.
      GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
    steps:
    - name: Check out the repo
      uses: actions/checkout@v3
    - name: Bundle the Release
      id: bundle
      run: pgxn-bundle
    - name: Release on PGXN
      env:
        # Required to release on PGXN.
        PGXN_USERNAME: ${{ secrets.PGXN_USERNAME }}
        PGXN_PASSWORD: ${{ secrets.PGXN_PASSWORD }}
      run: pgxn-release
    - name: Create GitHub Release
      id: release
      uses: actions/create-release@v1
      with:
        tag_name: ${{ github.ref }}
        release_name: Release ${{ github.ref }}
        body: |
          Changes in this Release
          - First Change
          - Second Change          
    - name: Upload Release Asset
      uses: actions/upload-release-asset@v1
      with:
        # Reference the upload URL and bundle name from previous steps.
        upload_url: ${{ steps.release.outputs.upload_url }}
        asset_path: ./${{ steps.bundle.outputs.bundle }}
        asset_name: ${{ steps.bundle.outputs.bundle }}
        asset_content_type: application/zip

Here’s how it works:

  • Lines 4-5 trigger the workflow only when a tag starting with the letter v is pushed to the repository. This follows the common convention of tagging releases with version numbers, such as v0.1.7 or v4.6.0-dev. This assumes that the tag represents the commit for the release.

  • Line 10 specifies that the job run in the pgxn/pgxn-tools container, where we have our tools for building and releasing extensions.

  • Line 13 passes the GITHUB_TOKEN variable into the container. This is the GitHub personal access token that’s automatically set for every build. It lets us call the GitHub API via actions later in the workflow.

  • Step “Bundle the Release”, on Lines 17-19, validates the extension META.json file and creates the release zip file. It does so by simply reading the distribution name and version from the META.json file and archiving the Git repo into a zip file. If your process for creating a release file is more complicated, you can do it yourself here; just be sure to include an id for the step, and emit a line of text so that later actions know what file to release. The output should be appended to the $GITHUB_OUTPUT file like this, with $filename representing the name of the release file, usually $extension-$version.zip:

    echo bundle=$filename >> $GITHUB_OUTPUT
    
  • Step “Release on PGXN”, on lines 20-25, releases the extension on PGXN. We take this step first because it’s the strictest, and therefore the most likely to fail. If it fails, we don’t end up with an orphan GitHub release to clean up once we’ve fixed things for PGXN.

  • With the success of a PGXN release, step “Create GitHub Release”, on lines 26-35, uses the GitHub create-release action to create a release corresponding to the tag. Note the inclusion of id: release, which will be referenced below. You’ll want to customize the body of the release; for the pair extension, I added a simple make target to generate a file, then pass it via the body_path config:

    - name: Generate Release Changes
      run: make latest-changes.md
    - name: Create GitHub Release
      id: release
      uses: actions/create-release@v1
      with:
        tag_name: ${{ github.ref }}
        release_name: Release ${{ github.ref }}
        body_path: latest-changes.md
    
  • Step “Upload Release Asset”, on lines 36-43, adds the release file to the GitHub release, using output of the release step to specify the URL to upload to, and the output of the bundle step to know what file to upload.

Lotta steps, but works nicely. I only wish I could require that the testing workflow finish before doing a release, but I generally tag a release once it has been thoroughly tested in previous commits, so I think it’s acceptable.

Now if you’ll excuse me, I’m off to add this workflow to my other PGXN extensions.

George Washington Bridge Pier

View of the Manhattan pier of the George Washington Bridge, taken on 20 September 2020 with an iPhone XS.