Seldomly Used SQL: The Bool Aggregate Functions

neovintage.org

45 points by neovintage 10 years ago · 28 comments

nickpeterson 10 years ago

I don't know if postgres allows one to treat boolean as 0 and 1 natively but in t-sql, the equivalent would be something like:

  DECLARE @orders TABLE (
      category VARCHAR(5),
      express_delivered INT 
  );
  
  INSERT INTO @orders (category, express_delivered) VALUES ( 'food', 1);
  INSERT INTO @orders (category, express_delivered) VALUES ( 'food', 0);
  INSERT INTO @orders (category, express_delivered) VALUES ( 'shoes',0);
  INSERT INTO @orders (category, express_delivered) VALUES ( 'shoes',0);
  INSERT INTO @orders (category, express_delivered) VALUES ( 'auto', 1);
  INSERT INTO @orders (category, express_delivered) VALUES ( 'auto', 1);
  INSERT INTO @orders (category, express_delivered) VALUES ( 'book', 1);
  INSERT INTO @orders (category, express_delivered) VALUES ( 'book', 0);
  
  SELECT 
      category, 
      CASE WHEN SUM(express_delivered) = 0 THEN 0 ELSE 1 END AS ever_been_express_delivered
  FROM @orders
  GROUP BY category
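The same 0/1 trick runs unchanged in SQLite, which also lacks a true boolean type. A runnable sketch via Python's sqlite3, reusing the table and column names from the snippet above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (category TEXT, express_delivered INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("food", 1), ("food", 0), ("shoes", 0), ("shoes", 0),
     ("auto", 1), ("auto", 1), ("book", 1), ("book", 0)],
)

# SUM > 0 means at least one row in the group was "true"
rows = conn.execute("""
    SELECT category,
           CASE WHEN SUM(express_delivered) = 0 THEN 0 ELSE 1 END
               AS ever_been_express_delivered
    FROM orders
    GROUP BY category
    ORDER BY category
""").fetchall()
print(rows)  # [('auto', 1), ('book', 1), ('food', 1), ('shoes', 0)]
```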
  • Pxtl 10 years ago

    Goddamnit I hate how there is no true boolean datatype in t-sql. I can't write a simple predicate without comparing its output to 1 or 0.

  • gary__ 10 years ago

    Since SQL Server 2012, you can use the lovely IIF function, which is translated into a CASE expression behind the scenes.

    So:

    CASE WHEN SUM(express_delivered) = 0 THEN 0 ELSE 1 END AS ever_been_express_delivered

    becomes:

    IIF(SUM(express_delivered) = 0, 0, 1) AS ever_been_express_delivered

    https://msdn.microsoft.com/en-GB/library/hh213574.aspx

  • aetherson 10 years ago

    That's fine for an "or" aggregation, but it's harder for an "and" aggregation, in which case I guess you'd have to say:

    select category, case when sum(express_delivered) = count(express_delivered) then 1 else 0 end as always_been_express_delivered from etc.

    • MSM 10 years ago

      I think instead of trying to shoehorn the "AND" that he discusses in the article into a CASE statement, you'd just check each category to see whether there has been a non-express-delivered package.

        SELECT 
            category, 
            ISNULL(NonExpress.NonExpress, 0) AS ever_been_non_express_delivered
        FROM @orders O
        OUTER APPLY (
           SELECT TOP 1 1 as [NonExpress]
           FROM @Orders O2
           WHERE O2.Category = O.Category
              AND express_delivered = 0
        ) NonExpress
        GROUP BY category
      
      Edit: Changed the logic up, the original code's field was contradicting my logic.
    • nickpeterson 10 years ago

      How about:

        CASE WHEN MIN(express_delivered) = 0 THEN 0 ELSE 1 END AS always_been_express_delivered
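Since the flags are 0/1 integers, MIN() behaves as an "and" aggregate and MAX() as an "or" aggregate, which is essentially what Postgres's bool_and()/bool_or() do over real booleans. A sketch in SQLite via Python, with sample data borrowed from the thread's example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (category TEXT, express_delivered INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("auto", 1), ("auto", 1), ("book", 1), ("book", 0)])

# On 0/1 values: MIN() = every row true (AND), MAX() = any row true (OR)
rows = conn.execute("""
    SELECT category,
           MIN(express_delivered) AS always_express,
           MAX(express_delivered) AS ever_express
    FROM orders
    GROUP BY category
    ORDER BY category
""").fetchall()
print(rows)  # [('auto', 1, 1), ('book', 0, 1)]
```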
  • netcraft 10 years ago

    It's one of my few aggravations with pgsql, but you can't use 0 and 1 unless they are quoted. The only non-quoted values you can use are TRUE and FALSE. http://www.postgresql.org/docs/9.1/static/datatype-boolean.h...

netcraft 10 years ago

Very cool. It looks like you can also use `every()` instead of `bool_and()` - and also note that you can provide any boolean value to these, so you can do something like `every(column = 'value')...`

rhinoceraptor 10 years ago

I love `json_build_object()`.

With `json_build_object()`, you can select fields as a JSON object and not have to do that JSON building in your app.

Here's a snippet demonstrating building, and then querying a JSON object:

  with location as (select json_build_object(
  	'street1', street1,
  	'street2', street2,
  	'city', city,
  	'state', state,
  	'zip', zip
  ) as loc from my_table)
  
  select
  	loc ->> 'street1' as street1,
  	loc ->> 'street2' as street2,
  	loc ->> 'city' as city,
  	loc ->> 'state' as state,
  	loc ->> 'zip' as zip
  from location;
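For readers without Postgres handy: SQLite's JSON functions give a rough analog, with json_object() standing in for json_build_object() and json_extract() for the ->> operator. A sketch via Python, using a hypothetical table with a subset of the columns above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (street1 TEXT, city TEXT, zip TEXT)")
conn.execute("INSERT INTO my_table VALUES ('1 Main St', 'Springfield', '12345')")

# Build the JSON object in the database, then pull fields back out of it
row = conn.execute("""
    WITH location AS (
        SELECT json_object('street1', street1, 'city', city, 'zip', zip) AS loc
        FROM my_table
    )
    SELECT json_extract(loc, '$.street1'),
           json_extract(loc, '$.city'),
           json_extract(loc, '$.zip')
    FROM location
""").fetchone()
print(row)  # ('1 Main St', 'Springfield', '12345')
```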
  • kbenson 10 years ago

    Honest question, why do you consider this better? I know there are development strategies that prefer to put as much model control as possible in the DB, but there are others that prefer to keep the DB more clearly as data, and not the end representation of that data (it's a trade-off, I don't see one as clearly better than the other. One allows for tighter control of data, the other allows more scalability and the use of more tooling).

    I guess my question is, to you, what's the main benefit of building JSON in the DB instead of your app?

    • matthewmacleod 10 years ago

      There are some quite impressive performance gains that can be exploited - especially beneficial if you're using a framework with higher overheads. Using Rails, for example, it's possible to generate JSON output directly from Postgres and stream it to a client, instead of instantiating zillions of ActiveRecord objects just to query flat fields.

      There is an obvious trade off of flexibility, but this technique is beneficial in that it allows rich object graphs through the ORM where required, while performing well for simple operations.

      • kbenson 10 years ago

        I see that more as working around the limitations of the ORM, not really as an inherent strength of doing it in the DB. For example, in Perl and using DBIx::Class, I would get around that problem by applying another ResultClass to the ResultSet (the HashRefInflater ResultClass), so instead of inflating results into DBIx::Class objects, it returns a hash directly. With some of the helper classes, this is as simple as changing $result_set->search( \%criteria )->all; to $result_set->search( \%criteria )->hri->all;. This allows you to move the serialization task to the controller, in what should be a fairly efficient way. I can't imagine ActiveRecord doesn't have something equivalent.

        Now, where I could see returning JSON being really useful is if you can build complex structures out of it, in a way that follows constraints. E.g.

            {
              "title" : "Cool Hand Luke",
              "released" : "1967-11-01",
              "cast" : [
                { "name": "Paul Newman" },
                { "name": "George Kennedy" },
                ...
              ],
            }
        
        If that can be efficiently generated on the DB, that could help immensely with the current state of prefetching relationships, which to my knowledge currently requires either redundant data through joins, which is increasingly inefficient the more items you relate (such as with DBIx::Class's prefetch), or multiple queries (ActiveRecord's :include), which involve either processing between queries or duplicating portions of results in subsequent queries to get the correct subset of relations.
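The nested-document idea above can indeed be produced in one pass: Postgres can do it with json_agg() plus json_build_object(). The same shape in SQLite (json_group_array()/json_object()) via Python, with hypothetical films and cast tables:

```python
import sqlite3, json

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE films (id INTEGER, title TEXT, released TEXT);
    CREATE TABLE cast_members (film_id INTEGER, name TEXT);
    INSERT INTO films VALUES (1, 'Cool Hand Luke', '1967-11-01');
    INSERT INTO cast_members VALUES (1, 'Paul Newman'), (1, 'George Kennedy');
""")

# One aggregated JSON document per film, with the cast nested as an array
doc = conn.execute("""
    SELECT json_object(
        'title', f.title,
        'released', f.released,
        'cast', json_group_array(json_object('name', c.name))
    )
    FROM films f JOIN cast_members c ON c.film_id = f.id
    GROUP BY f.id
""").fetchone()[0]
print(doc)
```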
      • treebeard901 10 years ago

        Another advantage from a performance standpoint is that the data used to generate the JSON output is likely already cached in memory by your database engine. With SQL Server (and presumably with Postgres) JSON can be indexed in various ways, leading to even more performance gains.

    • rhinoceraptor 10 years ago

      I have to tell the database which columns I want in either case. The only difference (in my opinion) is now it's in a more convenient form, especially if you have a Node/Ruby/etc app where json/jsonb types are put into a native hash object by the PG library, and you would have to build that object anyways.

      • kbenson 10 years ago

        Are you telling me there's not a way in Ruby or Node to query the database and put each result directly into the native hash object equivalent? That seems like the sort of thing that gets optimized fairly early in a database interface's existence. That JSON ends up being faster would surprise me, as there's still a deserialization step that needs to take place (JSONB might indeed help with this aspect if available though).

        Now, if you are indeed just passing the JSON directly through to some client further up the stack, there probably is some benefit to generating JSON on the DB, as long as your DB isn't CPU constrained. For any other purpose, I imagine generating JSON would be less space efficient and require additional and/or more onerous steps to convert the record to the appropriate language construct.

xxbondsxx 10 years ago

Doesn't this become less valuable as the data grows? You'll essentially always have at least one "true" value, and at that point you're basically doing the query of:

SELECT category FROM orders WHERE express_delivery=TRUE GROUP BY category

Also are these aggregate functions as efficient?
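Worth noting that the two queries aren't quite equivalent: the WHERE version silently drops categories that never had an express delivery, while the aggregate version reports every category along with its flag. A quick illustration in SQLite via Python, with MAX() over 0/1 standing in for bool_or():

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (category TEXT, express_delivery INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("food", 1), ("shoes", 0), ("shoes", 0)])

# Filter-then-group: categories with no TRUE rows vanish entirely
filtered = conn.execute("""
    SELECT category FROM orders
    WHERE express_delivery = 1 GROUP BY category
""").fetchall()

# Aggregate version: every category appears, with its flag
flagged = conn.execute("""
    SELECT category, MAX(express_delivery) FROM orders
    GROUP BY category ORDER BY category
""").fetchall()
print(filtered)  # [('food',)]
print(flagged)   # [('food', 1), ('shoes', 0)]
```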

  • mjevans 10 years ago

    It still has to get the data from every row, unless you happened to make an index that correlates the two columns in just the right order.

    This is where a more specific question would be more useful.

    A question like 'in the last quarter, how many sales did we have for each category and shipping type?' You can then take the results and calculate more useful values like the percentage of express shipments, etc.
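That kind of question reduces to an ordinary aggregate query: count per group, then derive the share of express shipments. A sketch in SQLite via Python, with hypothetical data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (category TEXT, express_delivered INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("food", 1), ("food", 0), ("food", 0), ("auto", 1)])

# Orders per category, plus the percentage that shipped express
rows = conn.execute("""
    SELECT category,
           COUNT(*) AS orders,
           100.0 * SUM(express_delivered) / COUNT(*) AS pct_express
    FROM orders
    GROUP BY category
    ORDER BY category
""").fetchall()
print(rows)  # [('auto', 1, 100.0), ('food', 3, 33.33...)]
```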

  • mcherm 10 years ago

    I don't think that author's examples are particularly useful because, just as you point out, on any decent-sized dataset at least a COUPLE of customers will have chosen to behave oddly.

    But they seem quite handy for searching for bad data in the database, rather than analyzing customer behavior.

cjauvin 10 years ago

It's actually the first time I'm seeing the `group by 1, 2` syntax... is that Postgres-specific?

  • nickpeterson 10 years ago

    MSSQL doesn't support it, but does support the ORDER BY 1,2 syntax that is very similar.

    I don't believe GROUP BY n1,n2.. should be allowed, because it inverts the flow of control of the query. If you think about it, you're telling SQL to group by a column you haven't specified yet, using the order it appears in the SELECT clause. Someone could edit the query and add a new column at the front of the select list and completely change the meaning of the query. I think the ORDER BY clause has similar issues, but at least that occurs after the SELECT happens.

    tl;dr I shouldn't have to look at the ordering of the select to understand what has been grouped by.

    Order of SQL operations for the uninitiated; this explains why you can't, for instance, reference a column alias defined in the SELECT clause from the WHERE clause.

      6 SELECT
      1 FROM
      2 JOIN
      3 WHERE
      4 GROUP BY
      5 HAVING
      7 ORDER BY
  • coredog64 10 years ago

    "group by 1,2" is for lazy people who never have to support an application in production. It tells the database to group by the columns at those positions in the select list and is supported by most engines.

    • ak4g 10 years ago

      Much like SELECT * or JOIN foo USING(bar), it's the kind of thing that greatly speeds up interactive queries, but shouldn't ever end up in source control.

    • nimish 10 years ago

      it'd be bad style to use it for saved queries but for experimentation it's great

  • atwebb 10 years ago

    Don't believe you can do it in MSSQL, though an order by outer reference would work.

jnpatel 10 years ago

Neat! I use the Python equivalents, `any()` and `all()` quite frequently.
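The parallel can be made concrete: grouping app-side, any() plays the role of bool_or() and all() the role of bool_and(). A small sketch with hypothetical data:

```python
orders = [
    ("food", True), ("food", False),
    ("auto", True), ("auto", True),
]

categories = {c for c, _ in orders}
# any() ~ bool_or, all() ~ bool_and, computed per category
ever = {c: any(flag for cat, flag in orders if cat == c) for c in categories}
always = {c: all(flag for cat, flag in orders if cat == c) for c in categories}
print(ever)    # both categories have at least one True
print(always)  # only 'auto' is True on every row
```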

kristianp 10 years ago

The title should be "seldomly used Postgres-specific SQL".
