Seldomly Used SQL: The Bool Aggregate Functions
I don't know if Postgres allows one to treat boolean as 0 and 1 natively, but in T-SQL the equivalent would be something like:
DECLARE @orders TABLE (
category VARCHAR(5),
express_delivered INT
);
INSERT INTO @orders (category, express_delivered) VALUES ( 'food', 1);
INSERT INTO @orders (category, express_delivered) VALUES ( 'food', 0);
INSERT INTO @orders (category, express_delivered) VALUES ( 'shoes',0);
INSERT INTO @orders (category, express_delivered) VALUES ( 'shoes',0);
INSERT INTO @orders (category, express_delivered) VALUES ( 'auto', 1);
INSERT INTO @orders (category, express_delivered) VALUES ( 'auto', 1);
INSERT INTO @orders (category, express_delivered) VALUES ( 'book', 1);
INSERT INTO @orders (category, express_delivered) VALUES ( 'book', 0);
SELECT
category,
CASE WHEN SUM(express_delivered) = 0 THEN 0 ELSE 1 END AS ever_been_express_delivered
FROM @orders
GROUP BY category
Goddamnit, I hate that there is no true boolean datatype in T-SQL. I can't write a simple predicate without comparing its output to 1 or 0.
Since SQL Server 2012, you can use the lovely IIF statement, which is translated into a CASE behind the scenes.
So:
CASE WHEN SUM(express_delivered) = 0 THEN 0 ELSE 1 END AS ever_been_express_delivered
becomes:
IIF(SUM(express_delivered) = 0, 0, 1) AS ever_been_express_delivered
That's fine for an "or" aggregation, but it's harder for an "and" aggregation, in which case I guess you'd have to say:
SELECT category, CASE WHEN SUM(express_delivered) = COUNT(express_delivered) THEN 1 ELSE 0 END AS always_been_express_delivered FROM etc.
I think instead of trying to shoehorn the "AND" that he discusses in the article into a CASE statement, you'd just check each category to see whether there has been a non-express-delivered package.
Edit: Changed the logic up; the original code's field was contradicting my logic.
SELECT category, ISNULL(MAX(NonExpress.NonExpress), 0) AS ever_been_non_express_delivered
FROM @orders O
OUTER APPLY (
    SELECT TOP 1 1 AS [NonExpress]
    FROM @orders O2
    WHERE O2.category = O.category AND express_delivered = 0
) NonExpress
GROUP BY category
How about:
CASE WHEN MIN(express_delivered) = 0 THEN 0 ELSE 1 END AS always_been_express_delivered
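Run end to end, the MIN/MAX trick checks out. A minimal sketch using Python's sqlite3 module as a stand-in for T-SQL, with the same sample data as above; over 0/1 flags, MAX() behaves like an "or" aggregate ("ever") and MIN() like an "and" aggregate ("always"):

```python
import sqlite3

# In-memory database loaded with the same sample rows as the T-SQL example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (category TEXT, express_delivered INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("food", 1), ("food", 0), ("shoes", 0), ("shoes", 0),
     ("auto", 1), ("auto", 1), ("book", 1), ("book", 0)],
)

# MAX over 0/1 flags = "ever express delivered";
# MIN over 0/1 flags = "always express delivered".
rows = conn.execute("""
    SELECT category,
           MAX(express_delivered) AS ever_express,
           MIN(express_delivered) AS always_express
    FROM orders
    GROUP BY category
    ORDER BY category
""").fetchall()

for row in rows:
    print(row)
```

Only `auto` comes back with `always_express = 1`, and only `shoes` with `ever_express = 0`, which matches what `bool_or()`/`bool_and()` would report in Postgres.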
It's one of my few aggravations with pgsql, but you can't use 0 and 1 unless they are quoted. The only non-quoted values you can use are TRUE and FALSE. http://www.postgresql.org/docs/9.1/static/datatype-boolean.h...
Very cool. It looks like you can also use `every()` instead of `bool_and()`. Also note that you can provide any boolean expression to these, so you can do something like `every(column = 'value')`...
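SQLite has no `every()`/`bool_and()`, but the same `every(column = 'value')` pattern can be sketched there, because a comparison evaluates to 0 or 1: MIN over the comparison is true only when every row in the group matches. (The orders/status table here is made up for illustration.)

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (category TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("food", "shipped"), ("food", "shipped"),
     ("auto", "shipped"), ("auto", "pending")],
)

# Postgres would be: every(status = 'shipped').
# In SQLite a comparison yields 0 or 1, so MIN() of it
# is 1 only when every row in the group matches.
rows = conn.execute("""
    SELECT category, MIN(status = 'shipped') AS all_shipped
    FROM orders
    GROUP BY category
    ORDER BY category
""").fetchall()
```

`food` (all shipped) comes back with 1, `auto` (one pending) with 0.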
I love `json_build_object()`.
With `json_build_object()`, you can select fields as a JSON object and not have to do that JSON building in your app.
Here's a snippet that builds, then queries, a JSON object:
with location as (select json_build_object(
'street1', street1,
'street2', street2,
'city', city,
'state', state,
'zip', zip
) as loc from my_table)
select
loc ->> 'street1' as street1,
loc ->> 'street2' as street2,
loc ->> 'city' as city,
loc ->> 'state' as state,
loc ->> 'zip' as zip
from location;
Honest question: why do you consider this better? I know there are development strategies that prefer to put as much model control as possible in the DB, but there are others that prefer to keep the DB more clearly as data, and not the end representation of that data (it's a trade-off; I don't see one as clearly better than the other: one allows for tighter control of the data, the other allows more scalability and the use of more tooling).
I guess my question is: to you, what's the main benefit of building JSON in the DB instead of in your app?
There are some quite impressive performance gains that can be exploited - especially beneficial if you're using a framework with higher overheads. Using Rails, for example, it's possible to generate JSON output directly from Postgres and stream it to a client, instead of instantiating zillions of ActiveRecord objects just to query flat fields.
There is an obvious trade off of flexibility, but this technique is beneficial in that it allows rich object graphs through the ORM where required, while performing well for simple operations.
I see that more as working around the limitations of the ORM, not really as an inherent strength of doing it in the DB. For example, in Perl and using DBIx::Class, I would get around that problem by applying another ResultClass to the ResultSet (the HashRefInflator ResultClass), so instead of inflating results into DBIx::Class objects, it returns a hash directly. With some of the helper classes, this is as simple as changing $result_set->search( \%criteria )->all; to $result_set->search( \%criteria )->hri->all;. This allows you to move the serialization task to the controller, in what should be a fairly efficient way. I can't imagine ActiveRecord doesn't have something equivalent.
Now, where I could see returning JSON being really useful is if you can build complex structures out of it, in a way that follows constraints. E.g.:
{
  "title": "Cool Hand Luke",
  "released": "1967-11-01",
  "cast": [
    { "name": "Paul Newman" },
    { "name": "George Kennedy" },
    ...
  ]
}
If that can be efficiently generated on the DB, it could help immensely with the current state of prefetching relationships, which to my knowledge currently requires either redundant data through joins, and is thus increasingly inefficient the more items you relate (such as with DBIx::Class's prefetch), or multiple queries (ActiveRecord's :include), which involves either processing between queries or duplicating portions of results in subsequent queries to get the correct subset of relations.
Another advantage from a performance standpoint is that the data used to generate the JSON output is likely already cached in memory by your database engine. With SQL Server (and presumably with Postgres) JSON can be indexed in various ways, leading to even more performance gains.
I have to tell the database which columns I want in either case. The only difference (in my opinion) is that now it's in a more convenient form, especially if you have a Node/Ruby/etc. app where the json/jsonb types are put into a native hash object by the PG library, and you would have to build that object anyway.
Are you telling me there's not a way in Ruby or Node to query the database and put each result directly into the native hash object equivalent? That seems like the sort of thing that gets optimized fairly early in a database interface's existence. That JSON ends up being faster would surprise me, as there's still a deserialization step that needs to take place (JSONB might indeed help with this aspect if available though).
Now, if you are indeed just passing the JSON directly through to some client further up the stack, there probably is some benefit to generating JSON on the DB, as long as your DB isn't CPU constrained. For any other purpose, I imagine generating JSON would be less space efficient and require additional and/or more onerous steps to convert the record to the appropriate language construct.
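As a rough sketch of the DB-side-JSON idea being debated here, SQLite's json_object() can stand in for Postgres's json_build_object(); the address columns follow the earlier snippet (minus street2), and the sample row is made up:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE my_table (street1 TEXT, city TEXT, state TEXT, zip TEXT)"
)
conn.execute(
    "INSERT INTO my_table VALUES ('1 Main St', 'Springfield', 'IL', '62701')"
)

# Build the JSON object inside the database; json_object() here is the
# SQLite analogue of Postgres's json_build_object().
(loc,) = conn.execute("""
    SELECT json_object('street1', street1, 'city', city,
                       'state', state, 'zip', zip)
    FROM my_table
""").fetchone()

# The driver hands back a JSON string; app-side deserialization is one call.
address = json.loads(loc)
```

This is the shape of the trade-off both sides describe: the DB does the object assembly, and the app is left with a single parse (or, if it's just proxying to a client, no parse at all).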
Doesn't this become less valuable as the data grows? You'll essentially always have at least one "true" value, and at that point you're basically doing the query:
SELECT category FROM orders WHERE express_delivered = TRUE GROUP BY category
Also are these aggregate functions as efficient?
It still has to get the data from every row, unless you happened to make an index that correlates the two columns in just the right order.
This is where asking a more effective question would be useful.
A question like 'in the last quarter, how many sales did we have for each category and shipping type?' You can then take the results and calculate more useful values like the percentage of express shipments, etc.
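A sketch of that per-category/shipping-type question, again using sqlite3 with made-up data; the percentage of express shipments falls out of the same grouped query:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (category TEXT, express_delivered INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("food", 1), ("food", 0), ("food", 0), ("book", 1)],
)

# Counts per category plus the share of express shipments: the kind of
# "more useful values" the comment suggests deriving instead.
rows = conn.execute("""
    SELECT category,
           COUNT(*) AS orders,
           SUM(express_delivered) AS express,
           1.0 * SUM(express_delivered) / COUNT(*) AS express_share
    FROM orders
    GROUP BY category
    ORDER BY category
""").fetchall()
```

From the share column you can immediately see whether a category is "always", "never", or "sometimes" express, which subsumes the bool_and/bool_or answers.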
I don't think that author's examples are particularly useful because, just as you point out, on any decent-sized dataset at least a COUPLE of customers will have chosen to behave oddly.
But they seem quite handy for searching for bad data in the database, rather than analyzing customer behavior.
It's actually the first time I'm seeing the `group by 1, 2` syntax. Is that Postgres-specific?
MSSQL doesn't support it, but does support the ORDER BY 1,2 syntax that is very similar.
I don't believe the GROUP BY n1, n2, ... syntax should be allowed, because it inverts the flow of control of the query. If you think about it, you're telling SQL to group by a column you haven't specified yet, using the order it appears in the SELECT clause. Someone could edit the query and add a new column at the front of the select list and completely change the meaning of the query. I think the ORDER BY clause has similar issues, but at least that occurs after the SELECT happens.
tl;dr I shouldn't have to look at the ordering of the select to understand what has been grouped by.
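The hazard is easy to demonstrate. In this sqlite3 sketch (the region column and data are hypothetical), GROUP BY 1 means "group by whatever happens to be first in the SELECT list," so changing the front of the select list silently changes the grouping:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (category TEXT, region TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("food", "east"), ("food", "west"), ("book", "east")],
)

# GROUP BY 1 groups by the first item in the SELECT list: category here.
by_category = conn.execute(
    "SELECT category, COUNT(*) FROM orders GROUP BY 1 ORDER BY 1"
).fetchall()

# Someone edits the select list and the untouched GROUP BY 1 now
# groups by region instead, quietly changing the query's meaning.
by_region = conn.execute(
    "SELECT region, COUNT(*) FROM orders GROUP BY 1 ORDER BY 1"
).fetchall()
```

Both queries run without error; only the results reveal that the grouping moved, which is exactly the flow-of-control complaint above.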
Order of SQL operations, for the uninitiated; this explains why you can't, for instance, reference a column alias defined in the SELECT clause from the WHERE clause:
1. FROM
2. JOIN
3. WHERE
4. GROUP BY
5. HAVING
6. SELECT
7. ORDER BY
"group by 1,2" is for lazy people who never have to support an application in production. It tells the database to group by the columns in the order that they come back in the result and is supported by most engines.
Much like SELECT * or JOIN foo USING(bar), it's the kind of thing that greatly speeds up interactive queries, but shouldn't ever end up in source control.
Why shouldn't "JOIN foo USING (bar)" be used?
Because with USING, you're precluded from using the full table.column identifier in your JOIN condition. Without that, a future ALTER TABLE that adds another "bar" column to an existing table in the FROM-list will cause a previously-working query to break.
https://gist.github.com/AdamG/1128b86fdafca1e53a4bd5253a6882...
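A small sqlite3 sketch of the two join spellings (the orders/shipments tables are made up). Both return the same rows today, but only the explicit ON form pins the join to specific table.column names, which is the protection the comment above describes:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, category TEXT);
    CREATE TABLE shipments (id INTEGER, express INTEGER);
    INSERT INTO orders VALUES (1, 'food'), (2, 'book');
    INSERT INTO shipments VALUES (1, 1), (2, 0);
""")

# USING (id) is terse, but the join column stays implicit...
terse = conn.execute("""
    SELECT category, express
    FROM orders JOIN shipments USING (id)
    ORDER BY category
""").fetchall()

# ...while an explicit ON names both sides, so a future ALTER TABLE that
# introduces another id-like column elsewhere can't change its meaning.
explicit = conn.execute("""
    SELECT orders.category, shipments.express
    FROM orders JOIN shipments ON orders.id = shipments.id
    ORDER BY orders.category
""").fetchall()
```

For interactive exploration the USING form is a nice shortcut; for saved queries the explicit form is the safer habit.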
It'd be bad style to use it for saved queries, but for experimentation it's great.
Don't believe you can do it in MSSQL, though an order by outer reference would work.
Neat! I use the Python equivalents, `any()` and `all()`, quite frequently.
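For comparison, the Python built-ins over a list of flags (the sample values are arbitrary):

```python
# any() mirrors bool_or()/MAX over 0-1 flags;
# all() mirrors bool_and()/every()/MIN.
flags = [True, False, True]

ever = any(flags)    # "ever been true"
always = all(flags)  # "always been true"

# Like every(column = 'value') in Postgres, the argument can be
# a generator of comparisons:
categories = ["food", "food", "book"]
all_food = all(c == "food" for c in categories)
```

Both built-ins also short-circuit, so like the SQL aggregates they can stop as soon as the answer is determined.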
The title should be "seldomly used Postgres-specific SQL".