Why I Always Use PostgreSQL Functions For Everything

Vedran Bilopavlović

Photo VB, Vicinity of Makarska, Croatia

As a general rule, I wrap every single command or query the application uses in a PostgreSQL user-defined function!

Sometimes it's a PostgreSQL stored procedure instead, when I need fine-grained transaction handling for example, but that is really rare. For the majority of cases, it's PostgreSQL user-defined functions.
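
When a procedure is the right tool, it is usually because procedures, unlike functions, can commit inside their body. Here is a minimal sketch of that, with a hypothetical archive_old_rows procedure and a made-up some_big_table:

create or replace procedure archive_old_rows(_batch_size integer)
language plpgsql as
$$
begin
    loop
        -- delete one batch of old rows (table, columns, and retention period are made up)
        delete from some_big_table
        where id in (
            select id
            from some_big_table
            where created_at < now() - interval '1 year'
            limit _batch_size
        );
        exit when not found;
        -- commit per batch; allowed in a procedure called with CALL outside an explicit
        -- transaction block, not allowed in a function
        commit;
    end loop;
end;
$$;

Called with call archive_old_rows(10000); each batch is committed on its own, which is the kind of fine-grained transaction handling a plain function cannot do.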

To use the terminology from the OOP world — they are my data contracts.

The formal agreements for data exchange between PostgreSQL and the application instances, to be precise.

In particular:

  • Function (or procedure) parameters — are my input data contracts.
  • Function results (or lack thereof) — are my output data contracts.

Functions encapsulate commands, queries, and table access. Those are private parts. Data contracts are public. They do not change unless the application changes, too.

Let me give you a real-world example for a better understanding.

Suppose we have this big table with lots of data:

create table device_measurements (
    id bigint not null generated always as identity primary key,
    device_id int not null references devices(device_id),
    "timestamp" timestamp not null,
    value double precision not null
);

Now, this table is a standard table in a legacy system, created by an ORM tool. Therefore, it has a standard, auto-generated integer primary key (not very useful here, but there it is).

There are a couple of standard indexes, like one on device_id and one on timestamp.
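
For illustration, those indexes might have been created with something like this (the exact definitions and names are an assumption; the ORM could have generated them differently):

create index on device_measurements (device_id);
create index on device_measurements ("timestamp");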

Now, for the purposes of analytical reporting, this legacy application issues a huge number of identical queries with different parameters:

select "timestamp", device_id, value
from device_measurements
where device_id = $1
order by id desc

The first thing to say: when an application runs a large number of identical queries with different parameters, that's a big code smell, and very likely an issue that needs to be refactored and rewritten.

It is very likely that those parameter values are fetched by a different query, and now we're running another query for each result of the first one. Long story short, we are probably looking at the N+1 antipattern at work, which turns the entire application into a slow, sluggish resource hog.

However, resolving those N+1 antipatterns requires very careful and thoughtful analysis first, and then a rewrite of that entire portion of the application, which is a luxury not every legacy system can afford.

N+1 makes the application sluggish at best (it crashes at worst), but sluggish is still usable.
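
To make the N+1 shape concrete, here is a hypothetical sketch of it (the first query and the loop around it are assumptions; only the per-device query comes from above):

-- one query to get the list of devices
select device_id from devices;

-- ...followed by N more queries, one per device returned above
select "timestamp", device_id, value
from device_measurements
where device_id = $1
order by id desc;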

So, the first step is to understand what we want to get with that query. What is thy intention? And after we figure that out, we can extract input and output data contracts. Let’s see:

  • The intention is to return the latest device measurement. The one record that was last to be inserted.
  • The input data contract is just the device_id number.
  • The output data contract is a single record containing the timestamp, device ID, and value.

That gives us the following function:

create or replace function get_latest_device_measurement(
    _device_id integer
)
returns table (
    "timestamp" timestamp,
    device_id integer,
    value double precision
)
security definer
language sql as
$$
select
    "timestamp", device_id, value
from
    device_measurements
where
    device_id = _device_id
order by id desc
limit 1;
$$;

  • The name clearly describes the intention of our function. What are we getting out?
  • Input parameters and the return type define our data contracts.

And now, we can replace that query in our legacy application with a function call like this:

select "timestamp", device_id, value from get_latest_device_measurement($1)

This makes our query and data access readable and maintainable. It's basically the Clean Code version of SQL.

Uncle Bob would be proud.

Now, performance-wise, we didn’t do anything, really. Yet. But we did make our application infinitely more maintainable. Continue reading.

As time goes on, data grows. Who could have predicted that?

Jokes aside, for many different reasons that I will not get into, the decision was made to migrate the system to TimescaleDB and to partition this big table on the timestamp field.

TimescaleDB doesn't allow a unique index that doesn't include the partitioning column, which in our case is the timestamp field.

That means the primary key on the id field has to be dropped in order to build the partitions correctly.

Hell, it wasn’t that useful anyway; it’s there only because of the ORM.
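
For reference, that migration step might look something like this (the constraint name, chunk interval, and migrate_data flag are assumptions about this particular setup):

-- drop the ORM-generated primary key (default constraint name assumed)
alter table device_measurements drop constraint device_measurements_pkey;

-- turn the table into a TimescaleDB hypertable partitioned on "timestamp"
select create_hypertable(
    'device_measurements',
    'timestamp',
    chunk_time_interval => interval '1 month',
    migrate_data => true
);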

Except, of course, when it comes to our “get latest measurement” function. Dropping that primary key index made our reporting analytics unusable because this function from above relied on the primary key index.

Luckily for us, these types of functions can be changed (altered) in an atomic transaction, which doesn’t require any downtime or any application change or redeployment:

create or replace function get_latest_device_measurement(
    _device_id integer
)
returns table (
    "timestamp" timestamp,
    device_id integer,
    value double precision
)
security definer
language sql as
$$
select
    "timestamp", device_id, value
from
    device_measurements
where
    device_id = _device_id
order by "timestamp" desc
limit 1;
$$;

Now, that worked fine; the right index was used, we had a few partitions, and the report was usable again.

However, as time goes on, data grows. Who could have predicted that again?

As it turns out, although this version used indices, it had to do an index scan for each partition separately. As the number of partitions grew, our reporting analytics became unusable again. We obviously need a slightly different strategy.
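
We can see this by running the function's body directly under explain (a sketch with a made-up device id; it is not exactly what runs inside the function, but it is close enough to see the shape of the plan):

explain (analyze, buffers)
select "timestamp", device_id, value
from device_measurements
where device_id = 42
order by "timestamp" desc
limit 1;

The interesting parts are how many chunks show up in the plan and, a bit later in this story, the reported Planning Time versus Execution Time.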

Given the nature of our problem (locate the record with the highest timestamp for a given device) — the following index would be appropriate:

create index on device_measurements (device_id, timestamp desc);

So now, we can query the max timestamp for a given device and later use that to locate the entire record with our existing indexes on timestamp and device_id:

create or replace function get_latest_device_measurement(
    _device_id integer
)
returns table (
    "timestamp" timestamp,
    device_id integer,
    value double precision
)
security definer
language sql as
$$
with cte as (
    select max("timestamp")
    from device_measurements
    where device_id = _device_id
)
select
    m."timestamp", m.device_id, m.value
from
    device_measurements m
    join cte on m."timestamp" = cte.max
where
    m.device_id = _device_id
limit 1;
$$;

The execution plan shows us that only one partition is index scanned to locate the latest record with this approach.

This works like a charm; our reporting analytics are usable again, and everything is fine.

Again, we didn't even have to touch the application, let alone redeploy anything. Zero downtime.

However, as time goes on, data grows. Again! Who could have predicted that!? Crazy, right?

Now, we have a large number of partitions, not just a few. The execution plan shows us that the query scans only the last partition, but it lasts almost a second. That is slow, considering that it runs in an N+1 loop and issues many queries like that, and the overall effect is slow analytics.

Now, if we examine the execution plan carefully, we can see that the planning time for a query is too long, almost an entire second, while the execution time is really fast.

That doesn't tell us much, except that the engine takes a bit longer to figure out what it will do. That may be because we have a large number of partitions, or maybe not, but we can try this:

Query only the physical partition (or chunk, as TimescaleDB calls it). The last partition, the one with the highest timestamp range, should contain the latest record we are looking for.

Now, our function has to execute a bit of dynamic SQL, and it may look like this:

create or replace function get_latest_device_measurement(
    _device_id integer
)
returns table (
    "timestamp" timestamp,
    device_id integer,
    value double precision
)
security definer
language plpgsql as
$$
declare
    _table text;
    _record record;
begin
    -- find the most recent chunk of the hypertable
    select chunk_schema || '.' || chunk_name
    into _table
    from timescaledb_information.chunks
    where
        hypertable_schema = 'public'
        and hypertable_name = 'device_measurements'
    order by range_end desc
    limit 1;

    -- query only that chunk for the latest measurement of the given device
    execute format(
        $sql$
        with cte as (
            select max("timestamp") from %1$s
            where device_id = $1
        )
        select
            m."timestamp", m.device_id, m.value
        from %1$s m
        join cte on m."timestamp" = cte.max
        where m.device_id = $1
        limit 1;
        $sql$, _table
    ) using _device_id into _record;

    return query
    select _record."timestamp", _record.device_id, _record.value;
end;
$$;

And it is a great success; the latest measurements are returned very fast, and reporting analytics are usable again.

However, as time goes on, data grows. Again! Who could have predicted that!?

Nah, I’m just joking. Of course, that data grows, but we have options. For example:

  • I could have created a trigger on this table that keeps another table updated with the latest timestamp and the latest value for each device. That table with redundant data, containing only the latest values, should be really small and fast. Then, we can alter our function to return values from that table (see the first sketch below).
  • I could have created a materialized view containing the latest values for each device. TimescaleDB has a feature called continuous aggregation, which automatically refreshes a materialized view from the hypertable. Then, we can alter our function to return values from that view (see the second sketch below).
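
Both options are only sketches; every object name here is made up, and you would adapt them to your own schema. First, the trigger variant:

-- Option 1 sketch: a tiny "latest values" table maintained by a trigger
create table latest_device_measurements (
    device_id int primary key references devices(device_id),
    "timestamp" timestamp not null,
    value double precision not null
);

create or replace function record_latest_device_measurement()
returns trigger
language plpgsql as
$$
begin
    insert into latest_device_measurements (device_id, "timestamp", value)
    values (new.device_id, new."timestamp", new.value)
    on conflict (device_id) do update
        set "timestamp" = excluded."timestamp",
            value = excluded.value
        where latest_device_measurements."timestamp" < excluded."timestamp";
    return new;
end;
$$;

create trigger trg_latest_device_measurement
after insert on device_measurements
for each row execute function record_latest_device_measurement();

The get_latest function would then shrink to a plain select from latest_device_measurements filtered by device_id. The continuous-aggregate variant could start from something like this (exactly what is allowed inside a continuous aggregate depends on your TimescaleDB version):

-- Option 2 sketch: a continuous aggregate with the latest timestamp per device and day
create materialized view device_measurements_daily
with (timescaledb.continuous) as
select
    device_id,
    time_bucket(interval '1 day', "timestamp") as bucket,
    max("timestamp") as last_timestamp
from device_measurements
group by device_id, bucket;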

And so on. The point is that we can do a lot of different things, and not once does the application have to be changed or redeployed.

Zero downtime, baby.

When various software architects talk about encapsulation and the power of abstraction, they are making a huge mistake if they don't talk about database functions and procedures.

And they usually don’t.

Another big benefit of this approach is security.

Security is important; that’s what everyone says. However, when it comes to database security, it is usually completely ignored because developers will prioritize the use of ORM tools over security.

With this approach, we can easily implement a security principle called the Principle of Least Privilege or simply Least Privilege Security.

That means that every account must have the least number of privileges to perform the core function. In the army, they call this approach a need-to-know basis. Essentially, you are only allowed to know what is necessary to perform your tasks. Because if you are captured by the enemy, you can’t tell them what you don’t know. The army takes security very seriously, and so should we.

In any case, with PostgreSQL, we can easily configure our database users so they can only execute these functions. They don't have access to the tables; those are the private parts. The enemy can steal your application's database credentials, but they can never delete your data. Or at least, they can only do what the application itself can do.

Let’s see what it would look like with PostgreSQL.

First, let’s create an application user that can’t do anything except login:

create role app_user with
    login
    nosuperuser
    nocreatedb
    nocreaterole
    noinherit
    noreplication
    connection limit -1
    password '******';

Note that you can set the real, very secret password later with the ALTER ROLE command.
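
For example (the password here is only a placeholder):

alter role app_user with password 'replace-with-a-real-secret';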

This user doesn’t have any privileges and can only log in. So far, so good.

Now, what I like to do is create my function/procedure layer in a separate schema like this:

create schema if not exists measurements;

create or replace function measurements.get_latest(
    _device_id integer
)
returns table (
    "timestamp" timestamp,
    device_id integer,
    value double precision
)
security definer
language sql as
$$
with cte as (
    select max("timestamp")
    from device_measurements
    where device_id = _device_id
)
select
    m."timestamp", m.device_id, m.value
from
    device_measurements m
    join cte on m."timestamp" = cte.max
where
    m.device_id = _device_id
limit 1;
$$;

Now, we can group our functionality, encapsulated in user-defined functions and procedures, based on similarity. All measurement functionality lives in the measurements schema. We can also adjust the function name to remove redundant wording. Now it's simply:

measurements.get_latest

So much Clean Code 🥰

If I don’t need measurement functionalities anymore for some reason, I can drop them all with one command:

drop schema if exists measurements cascade;

This will also drop all associated objects in that schema (our measurement functions).

Similarly, I can also give an execute privilege to our application user for all functions in this schema:

grant execute on all functions in schema measurements to app_user;

This applies to existing functions only.

We can also alter the default privileges in the schema so that functions created there in the future automatically get execute privileges for a specific user role:

alter default privileges in schema measurements
grant execute on functions to app_user;

It all depends on how strict we want to be with our least-privilege security.

Note that the user doesn't have any access to tables or anything else. That's because it's the least privileged. That is why every function or procedure has to be created with the security definer option. This option makes the function execute with the privileges of the role that owns it, which is usually the superuser or administrator role that created it.
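
To actually reach least privilege, app_user also needs usage on the schema that holds the functions, and nothing else. A sketch of that, assuming the data tables live in the public schema:

-- allow app_user to see and call objects in the measurements schema
grant usage on schema measurements to app_user;

-- make sure app_user has no direct access to the data tables
revoke all on all tables in schema public from app_user;
revoke create on schema public from public;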

PostgreSQL user-defined functions and stored procedures can help us to improve:

  • Readability.
  • Maintainability.
  • Availability.
  • And, last but not least, Security.

It’s the Clean Code for your PostgreSQL, and you should start using them.

A reader commented:

While I like Postgres functions and the example is nice, the fact that you can change them without touching the application code isn't that big of an advantage.

SQL needs to be managed with the same release process and cycle as the app.

My response:

What we usually do is use a Flyway migration mechanism or similar (there are others, too). In this case, all functions go into repeatable migration files, one migration file per function.

And when a function is changed, it is not changed directly on the server, but rather in that repeatable migration file (and tested locally).

That file is then committed and pushed where it needs to be committed and pushed. In that case, we only deploy the migration, which will, in turn, only update (or alter) the function that was changed.

So that solved that problem for us.
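
For illustration, such a repeatable migration could look like this; the file name follows Flyway's R__ prefix convention, and the exact path depends on your Flyway configuration:

-- sql/R__measurements_get_latest.sql
-- Flyway re-applies this file whenever its checksum changes.
create or replace function measurements.get_latest(
    _device_id integer
)
returns table (
    "timestamp" timestamp,
    device_id integer,
    value double precision
)
security definer
language sql as
$$
with cte as (
    select max("timestamp")
    from device_measurements
    where device_id = _device_id
)
select m."timestamp", m.device_id, m.value
from device_measurements m
join cte on m."timestamp" = cte.max
where m.device_id = _device_id
limit 1;
$$;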