Does anyone integrate with their customers' DB directly?

19 points by johnyeocx a year ago · 47 comments


Hey all, integrations with third-party SaaS vendors are super common these days, especially with all the AI tools complementing and enriching SaaS platforms, but as the title states, I'm curious to hear if anyone integrates directly with their customers' data stores like Supabase, Snowflake, etc.

In my last startup, we were building a chargeback management tool for fintechs, and essentially whenever a chargeback came in, we'd only get the transaction ID, so we'd have to query the fintech's database to search for that transaction ID and pull additional data points like name of seller, email of seller, etc.

Now, when we did this, it was more of a scrappy workaround to the fintech not being able to dedicate engineering resources to send data to an API endpoint of ours. However, after a while, I thought to myself, there's nothing wrong with querying the fintech's DB as long as we're treating infosec seriously, and so now I'm curious to hear how common this practice is.

Maybe it's just me, but it feels like integrating with internal DBs could allow a startup to onboard and go live with customers much more quickly. For some reason though, I haven't heard much about this approach, so to be completely honest, I'm thinking of starting a tool to make this process easier / more standardised.

Before that though, I'm trying to wrap my head around the use cases for this, so it'd be amazing to hear about your story of integrating directly with a customer's DB -- what was it for and how was the experience?

drpossum a year ago

> there's nothing wrong with querying the fintech's DB as long as we're treating infosec seriously

This is an unfortunate attitude.

From a software architecture point of view, you're hard-coupling your software to that database, which can create brittleness: you're at the mercy of the original database schema, which may not be optimal, carries its own tech debt, may require translations to your own implementation, and may cause mistakes in interpretation in anything of scale. Even worse, you're at the customer's mercy if they want to change anything, which immediately breaks your code (likely without warning) and now you have to fix it (likely under duress).

Letting an outside org get direct access to a database is an unnecessary security risk, even if you are "treating infosec seriously". I'd argue no organization that treats infosec seriously would want to request access to or grant access to anything more than the minimal information they need. https://en.wikipedia.org/wiki/Principle_of_least_privilege

Modern software design addresses those issues by putting up abstractions and interfaces, which is why this practice isn't particularly common. In fact, one fintech I worked with did this and ruined a team's holiday weekend by making an unfortunate and unsignaled change late on a Friday.

  • akira2501 a year ago

    > You're at the mercy of the original database schema

    SQL has views. As long as you can represent the underlying data through a view you should be fine.

    > and may cause mistakes in interpretation in anything of scale

    All APIs have this concern.

    > direct access to a database is an unnecessary security risk

    You can constrain access to be read-only and to only access the particular view or views necessary.

    > has found putting up abstractions and interfaces to address those issues

    It could just be a crutch to avoid deep understanding of the underlying tools themselves. It's much easier to greenfield a new API. It's much easier to forget and accidentally abandon these APIs.
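The view-plus-read-only-access pattern described above can be made concrete. Here's a minimal sketch using SQLite via Python (in a server RDBMS like Postgres you would additionally `GRANT SELECT` on the view to a dedicated role; the table and column names here are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE transactions (
        txn_id TEXT PRIMARY KEY,
        seller_name TEXT,
        seller_email TEXT,
        internal_risk_score REAL  -- deliberately NOT part of the contract
    )
""")
conn.execute(
    "INSERT INTO transactions VALUES ('tx_1', 'Acme', 'ops@acme.test', 0.3)"
)

# The exposed surface: only the columns the integration actually needs.
conn.execute("""
    CREATE VIEW vendor_transactions AS
    SELECT txn_id, seller_name, seller_email FROM transactions
""")

# The vendor only ever queries the view, never the underlying table.
row = conn.execute(
    "SELECT seller_email FROM vendor_transactions WHERE txn_id = ?",
    ("tx_1",),
).fetchone()
print(row[0])  # ops@acme.test
```

The integration only ever sees `vendor_transactions`; `internal_risk_score` and any future internal columns stay out of the contract.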

    • drdaeman a year ago

      > SQL has views.

      This, and it doesn't have to be SQL; a number of other databases have similar concepts.

      If you have a separate schema with views (in SQL terms), and the rest is either entirely walled off or available on a "you have been warned" basis, then this is simply an API with a clear contract, just a flexible one (not entirely unlike a GraphQL endpoint) and available over a less common (for APIs) protocol.

      It can be risky for the API provider, though, unless it's a proper multi-tenant database. Given that those are less common, and, I guess, potentially harder to manage, it's probably why database access APIs are rare.

  • Vampiero a year ago

    I would gladly take some brittleness over mountains and mountains of equally unmaintainable boilerplate and layers upon layers of indirection and abstractions... For what is essentially the same exact boring old CRUD that we've been doing for the past 40 years.

    Who cares anyway? It's fintech. These projects are meant to become legacy and be replaced every 5-10 years. Those that don't are a PITA to work with regardless of how much the data layer is decoupled from the business layer.

    And even if you do decouple them, these *** projects WILL find a way to break on a Friday night anyway. Because of course they will. For the most part, they're written by juniors on tight deadlines, and there are maybe two people in the entire org who even have a faint clue of what the entire codebase does. You can use all the design principles and patterns you want; they're not going to save you.

  • johnyeocxOP a year ago

    Thanks for sharing your views, these are definitely insightful and valid points. Infosec can never be overstated, and it was perhaps a bit naive of me to just think "treating it seriously" would be enough.

    To argue against both the points you made though, there are reverse ETL platforms like Hightouch and Census today which 1. have specific architectures to make data movement from a central warehouse to third-party SaaS platforms seamless and easy, and 2. also access the company's data stores directly. What's the difference between what they're doing, and this hypothetical third-party unifying data integrations?

  • wackget a year ago

    Your scorn should be addressed at the OP's customer, not the OP themselves. If the customer doesn't have the resources to implement an API or whatever, then OP's choice becomes either "awkwardly refuse the job" or "use the credentials they're given".

bastawhiz a year ago

The only reasonable way to do this safely is by querying a read replica. You could take down your customer's systems very, very easily by running queries without proper indexes. Dealing with that is probably trickier than it sounds, because a DB of appreciable size just won't be queryable without them.

You might go live quicker. But the integration will break every time your customer makes a change or needs to upgrade. Pray there's documentation. You'll spend unending hours debugging your customers' weird data models. Issues will appear when logic changes but the data model doesn't.

I won't tell you not to do this because you've been so light on details of what you're actually doing (maybe it is easier this way!), but it would not fly on my watch.
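One way to reduce the missing-index risk is to inspect query plans before anything runs against a customer system. A small sketch using SQLite's `EXPLAIN QUERY PLAN` (a real setup would more likely use Postgres/MySQL `EXPLAIN`; the schema is hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (txn_id TEXT, seller_email TEXT)")

def plan_for(query):
    # SQLite reports "SCAN ..." for full table scans and
    # "SEARCH ... USING INDEX ..." when an index is used.
    rows = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
    return " ".join(r[-1] for r in rows)

# No index on seller_email yet: this lookup would full-scan the table.
scan = plan_for("SELECT * FROM transactions WHERE seller_email = 'x'")

conn.execute("CREATE INDEX idx_email ON transactions (seller_email)")
indexed = plan_for("SELECT * FROM transactions WHERE seller_email = 'x'")

print(scan)
print(indexed)
```

A check like this could run in CI against a schema dump, so a dropped index shows up before it takes down a customer's primary.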

  • crazygringo a year ago

    Exactly this. It's very easy to accidentally write a query that takes 20 minutes to complete rather than 20 milliseconds. A few of those and boom, your server is dead.

    There is no situation in which I'd let a customer write their own queries against the main shared production database, for performance reasons alone. It's a complete non-starter.

    (Although like you say, if they want to pay for their own dedicated private read replica, and deal with breaking schema changes whenever they arise, let them go at it. That's sure not something I'd want as a customer though.)

  • budleigh_s a year ago

    > The only reasonable way to do this safely is by querying a read replica.

    I agree - this is a very good approach, with the most control around how often you need fresh data, how to trigger that, etc.

    Although, not every integration is customer-facing, and not every one has to be infinitely scalable. Scalability = cost, often a heavy cost. If you query the DB once a day, the customer can come back to you and say: our db's are a little hot at 3AM and you do your queries then - can you add some indexes to the view or run a DESCRIBE query and optimize this a bit?

    And let's agree that the above point does not change whether you use an API layer or query directly.

    • bastawhiz a year ago

      > Although, not every integration is customer-facing and does not have to be infinitely scalable. Scalability = cost, often a heavy cost.

      You're either putting that cost on the client or paying to replicate their DB yourself. If that's too expensive, you shouldn't be using this approach in the first place! Running arbitrary queries on your customer's database primary is simply never safe, nor can it be made safe.

      > If you query the DB once a day, the customer can come back to you and say: our db's are a little hot at 3AM and you do your queries then - can you add some indexes to the view or run a DESCRIBE query and optimize this a bit?

      It's not a question of the DB running a little hot. It's accidentally triggering one or more full table scans on a table with millions of records. You have no idea what time your customer experiences load, or when their own batch jobs run. Or whether the index you built against with EXPLAIN six months ago got dropped because there are no queries in your customer's codebase that could possibly reference it.

      And it's not just about their system staying up. What if you trigger so many IOPS that your customer has a massive bill? Or if you run a query that needs more space to prepare the results, which scales up the instance automatically, resulting in unexpected costs? You're gonna pay for that, right?

      You also might hit a performance cliff: the query planner might choose a bad plan for a query that's otherwise satisfiable with an index only, because the stats suggest (for instance) that the visibility map is stale. There's nothing you can do here to fix this short of tuning settings on the DB itself.

      > And let's agree that the above point does not change whether you use an API layer or query directly.

      It does! Your customer (hopefully) knows their own DB. You don't. Have your customer build the API and they'll support your integration. You can't possibly have a better coupling with the data in your customer's database than your customer does.

      • budleigh_s a year ago

        > Running arbitrary queries on your customer's database primary is simply never safe, nor can it be made safe

        Look, the only thing that's never safe is speaking in absolutes.

        > It's not a question of the DB running a little hot. It's accidentally triggering one or more full table scans on a table with millions of records.

        Does the table have a million records? And I mean, come on, a million records and an FTS once a day on a modern machine - for the sake of inserts I actually would go with it... Besides, may I remind you: materialised views.

        > It does! Your customer (hopefully) knows their own DB. You don't. Have your customer build the API and they'll support your integration.

        I envy you. You live in a fantasy land where your customers don't make mistakes and are full pros. I mean that, or you don't care...

stackskipton a year ago

I've been at companies that do this. It's an extremely brittle process with significant overhead. Normally, we have to set up an S2S VPN with all the security/administrative implications of that, then deal with any database schema changes and the like.

I hope to never work at a company like that again; as an SRE, my life was a nightmare.

wolpoli a year ago

I would get the customer to create (and own) the API endpoint that talks to their database, and we talk to that API endpoint instead. Then it is up to them to test and maintain that endpoint.

This avoids a situation where we get an email out of the blue explaining that there is a database change and we need to dedicate engineering resources to make sure it is compatible by a certain date, or even worse and more likely, an urgent email explaining that the integration has broken and we need it back running last week.

Nican a year ago

Yes. SQL is a form of API, and it carries all the same challenges.

-> Permission control, making sure that the user of API can not see data they are not supposed to.

-> Auditability. Verify that the API is being used correctly.

-> Performance. Do not overload the endpoint. (Read from a read replica? And maybe you are not running hour-long analytics queries on the database)

-> Consistency. Are you using transactions to read your data? Could you be causing contention?

-> API versioning. How do you upgrade the underlying tables without breaking the users?
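The permission and performance points above can be enforced at the connection level rather than by convention. A sketch using SQLite's read-only URI mode via Python (in a server RDBMS you would instead create a role with `SELECT` only, plus a statement timeout; the file name is arbitrary):

```python
import os
import sqlite3
import tempfile

# The "customer" side sets up a database with some data.
path = os.path.join(tempfile.mkdtemp(), "customer.db")
rw = sqlite3.connect(path)
rw.execute("CREATE TABLE transactions (txn_id TEXT PRIMARY KEY)")
rw.execute("INSERT INTO transactions VALUES ('tx_1')")
rw.commit()
rw.close()

# The integration gets a read-only connection: reads work, writes fail.
ro = sqlite3.connect(f"file:{path}?mode=ro", uri=True)
count = ro.execute("SELECT count(*) FROM transactions").fetchone()[0]

try:
    ro.execute("DELETE FROM transactions")
    write_allowed = True
except sqlite3.OperationalError:
    write_allowed = False

print(count, write_allowed)  # 1 False
```

This covers permission control; auditability and versioning still need their own answers (query logging, views as the versioned surface).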

testemailfordg2 a year ago

Something boring but workable is using CSV files, dropped to a location on the customer's end through SFTP, containing the transaction IDs that you are interested in. If it's MS SQL Server on the other side, then a SQL Agent job on that server using some SSIS can read the CSV as input and prepare your required output as a CSV again for you to pick up from their SFTP server.
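The drop-file pattern is simple enough to sketch end to end. A toy version in Python, using in-memory buffers in place of the SFTP server (the file layout and column names are invented):

```python
import csv
import io

# Vendor side: write a request CSV of transaction IDs to enrich.
request = io.StringIO()
writer = csv.writer(request)
writer.writerow(["txn_id"])
writer.writerow(["tx_1"])

# Customer side: a scheduled job (SSIS, cron, etc.) reads the request
# and looks up the extra attributes in its own database.
wanted = [r["txn_id"] for r in csv.DictReader(io.StringIO(request.getvalue()))]
db = {"tx_1": ("Acme", "ops@acme.test")}  # stand-in for the real DB

response = io.StringIO()
writer = csv.writer(response)
writer.writerow(["txn_id", "seller_name", "seller_email"])
for txn in wanted:
    writer.writerow([txn, *db[txn]])

# Vendor side: pick up and parse the response file.
rows = list(csv.DictReader(io.StringIO(response.getvalue())))
print(rows[0]["seller_email"])  # ops@acme.test
```

The interface is just two agreed-upon file layouts, which is exactly why it stays loosely coupled.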

  • dbrueck a year ago

    Yes, we did this at a few places and it's boring and low tech and... works extremely well. Loose coupling, very small interface, and bad behavior on either side doesn't impact the performance of the actual DB, etc.

  • webmonkeyuk a year ago

    This works in some situations, but the OP's use case seems to suggest near real-time access to records that would only recently have been created. I'm unsure if CSV could work well here.

tbrownaw a year ago

Congratulations, schema changes are now directly visible outside your organization.

For some of our outgoing files at $employer, there's a notice period of... I think I remember it being 30 days for additional code values and 90 days for layout changes. That sort of planning ahead becomes much harder if every single schema change is immediately visible to outsiders.

  • deepsun a year ago

    An API is just another schema. You can make a VIEW in a SQL db to keep that VIEW's schema stable while changing the source schema, without downtime.
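That stability property is easy to demonstrate. A hypothetical sketch in SQLite via Python: the source column is renamed, the view is redefined, and the consumer's query never changes:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE txns (id TEXT, seller TEXT)")
conn.execute("INSERT INTO txns VALUES ('tx_1', 'Acme')")
conn.execute(
    "CREATE VIEW api_txns AS SELECT id AS txn_id, seller AS seller_name FROM txns"
)

# The only query the outside integration ever runs.
consumer_query = "SELECT seller_name FROM api_txns WHERE txn_id = 'tx_1'"
before = conn.execute(consumer_query).fetchone()[0]

# Internal migration: rename the column; only the view definition moves.
conn.execute("DROP VIEW api_txns")
conn.execute("ALTER TABLE txns RENAME COLUMN seller TO seller_legal_name")
conn.execute(
    "CREATE VIEW api_txns AS "
    "SELECT id AS txn_id, seller_legal_name AS seller_name FROM txns"
)
after = conn.execute(consumer_query).fetchone()[0]
print(before, after)  # Acme Acme
```

The catch, as other commenters note, is that this discipline only works if the customer actually treats the view as a contract and updates it with every migration.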

    • shermantanktop a year ago

      That’s great for making db operations easier. Just because you make a pseudo-API out of a view doesn’t mean you should.

      The problems are in the realm of non-functional requirements. Quotas, security, protocol support, asynchronous batch and bulk data access, granular access control, throttling…you can try to do this with your db, and many have, but the square peg will eventually not fit anymore.

BurningFrog a year ago

Seems fine to me, provided some sanity constraints, like having read-only access, and some guarantee that the fields you're interested in will not be removed/changed.

I've worked on systems that use a SQL DB as a communication layer. A main advantage is that every language comes with good SQL read/write libraries, so things just work right away.

  • klysm a year ago

    A SQL database is an excellent integration tool that I think is very underutilized. You get strong semantics around transaction isolation, and can easily bake in permissions. Language support is nearly universal and you can avoid a ton of API decisions that are already made for you.

anitil a year ago

I've had to do this due to tech not getting a say in vendor choice. We were given an SSH tunnel, a DB user and a query to run by the vendor and sent on our way. Of course, periodically that query stops working because 'woops we changed our schema', and we have to wait for a new query for us to run.

Pros:

- It was quick to set up and _seems_ to work, mostly

Cons:

- We have no guarantees that this query will continue to work
- We have no understanding of the model behind this data (why are these records in this state, and those in another? No idea, and the vendor doesn't seem to understand the question)
- We need to poll this data source. How often? No idea!

Culturally I think it suggests an immature tech organisation in general, so while not a red flag I'd suggest it is an orange flag.

mr_toad a year ago

Many DBAs don’t even like internal analysts directly querying production line of business databases. Usually they’d set up replicas, or even a full blown data warehouse.

segmondy a year ago

I've had companies ask to read our DB; the answer is no, and over my dead body. They get API access or a data dump. I've been asked to read external companies' DBs; I always refuse - give me an API or a data dump. I don't even agree to internal teams sharing DBs. It's a bad idea, but I can see the draw for startups and small companies without the skills. BTW, can we chat about your former chargeback-oriented startup? My info is in my profile.

airpodsman a year ago

I was thinking the same thing. I have an idea that I've been thinking about for some time, but it requires access to the customer's data. They'd have to let us ingest it either through: 1) an API that the customer exposes to you, or 2) connecting to the data source directly in a secure manner.

Option 1 seems like a lot of work for the customer ... probably not viable.

Option 2: you can securely connect to their data source using OAuth 2.0. In fact, I know MongoDB offers this (https://www.mongodb.com/docs/atlas/security-oidc/) for this exact use case, I believe. I recall using Vercel to integrate with my MongoDB.

MongoDB: https://www.mongodb.com/docs/atlas/security-oidc/ Supabase: https://supabase.com/docs/guides/auth

Though I haven't done it directly, I think integrating with the DB through a secure protocol like OAuth is the way to go. DB services offer this, presumably for this exact use case.

  • johnyeocxOP a year ago

    that's what I was thinking actually. Using OAuth to connect to a DB is definitely a new concept though, and most larger enterprises either use data warehouses or self-hosted DBs (just a guess), so OAuth may not be an option

    If the SaaS target customers are smaller companies / startups, then this may be viable

    I'd love to hear more about your idea though! would you mind sharing a little bit about it?

    • cj a year ago

      > If the SaaS target customers are smaller companies / startups, then this may be viable

      I’m your target audience.

      The #1 issue I have with this concept is, as a customer, I would have to trust that you understand the performance limits of my production database and that you won’t run (even accidentally) a burst of inefficient queries that takes down or impairs production.

      One way around this would be to query against a replica / secondary, but even then you risk overloading a secondary/replica, which could have negative effects in a failover scenario.

    • airpodsman a year ago

      Let me know if you end up researching this further and what the conclusion is. I think the best way to tackle this would be via an API the customers can give you.

      The idea is a simple marketing tool to help businesses reach out to customers that didn’t convert fully through the funnel. So for example, user created a profile, but didn’t end up purchasing.

      If interested, I can share the link (once it’s up) and would love to get your feedback.

hluska a year ago

First off, this is creative and you did a great job of writing it up and explaining the benefits to your approach. And I think it is a good idea in extremely early stages of a software product because as you mention, it’s a quick way to get something up and running.

Unfortunately, the longer into the project you are, the more brittle that connection becomes. You’re tightly coupling a process to the state of the database at time of build. When you do that, you create a situation where your customers can kill your code.

So fundamentally, when you integrate directly, you create a situation where your customers’ engineers control how reliable your software can be. They likely won’t fess up when they break your tool so the problems will come back to you.

I know that sounds a lot like typing ‘npm install’ or ‘building an API’ but in this case, you’re handing the keys over to your customers. At best, it will cause reliability problems. At worst, it will add friction to a sales process. As companies scale and sales gains power, that can become a career issue for you. As a rule, it’s better to have difficult technical decisions when you’re employed than difficult financial decisions when you’re not.

So good writing and excellent analysis. But if you choose to go down that path, at some point in the future, I feel like that integration will become someone’s headache.

If you have to do it, it would be worth documenting that the customer in this case won’t provide proper access through an API. That’s a good reason to do something like this because that’s quite unreasonable of the customer. If they’re unwilling to provide API access, I feel like they’re just as unlikely to provide a read replica so aside from the integration’s brittleness, your queries have to be rock solid or you could slow down prod.

It’s all risky and you have a big decision to make. Good luck and have fun.

superice a year ago

We’ve done this for an SME customer of ours. We were essentially building a new module on top of their existing in-house systems. We laid out our query patterns in advance so they could index properly, and only query specific views. Write-out is only in tables dedicated for our product.

It works surprisingly well, and is pretty resilient since it essentially acts as a message queue too. Then again, this is all low traffic stuff in a SME / B2B setting, with zero multitenancy involved.

I’ll still take a proper API or message queue any day though.

Edit: I suppose our biggest benefit is that our customer can actually change the interface with us fairly quickly. They have database experts in-house, but no devs who could do APIs. So the collaboration has been mostly smooth on account of that, and that is a huge advantage compared to them having to outsource any API work. Technically speaking I'm not a fan, but the non-technical results have been useful.

drewcoo a year ago

Even if you're only doing reads, you'll be relying on interfaces that can change at any time with no notice. That's a kind of tech debt time bomb waiting to blow up your service and make a big pager party!

If you're doing writes, this is potentially harmful to the customer, too, not just data exfiltration but also potentially software breaking and that could be lawsuit territory.

Even doing this with your own services' data stores is bad practice. Direct reads/writes to a service's data store without going through its defined interfaces means unexpected, often unmonitored changes happening. I strongly advise against this pattern of "secret APIs," as opposed to overt ones.

Also, please note that this applies to any data store and not just a DB.

tw04 a year ago

If they don’t have staff to hit an API, tying into their database sounds like a one-way trip to a lawsuit. The second something goes wrong they’re going to blame you, and they won’t have the technical expertise to understand your explanation of why it’s not your fault.

RajT88 a year ago

Not me, but have seen this with a number of vendors.

- Customer buys a service from a vendor

- Customer creates a Vnet peering in Azure between their subscriptions

- Customer queries the vendor's DB's directly

I will not name names for the vendors or customers, but you have heard of all of them.

wackget a year ago

It's really funny seeing some of the outraged, self-righteous responses here from developers who probably take themselves too seriously.

The OP already said that their customer didn't have the resources to even connect to an API, so what on earth makes you think they're gonna be able to create some highly-abstracted API layer or some other "bEsT PrAcTiCe" way of exposing the data?

And even if they did create a view or API for the database instead of giving direct access, what makes you think they'd keep that view or API up-to-date? It's just as likely to break as a changing database schema is.

rco8786 a year ago

> the fintech not being able to dedicate engineering resources to send data to an API endpoint of ours

There's an extremely real chance that this fintech giving you direct DB access was breaking the law by doing so.

rukuu001 a year ago

Yeah, that's a black-ops, under-the-radar, get-it-done-quick solution that should be flagged like crazy for future spend on getting an API in place.

Of course, no one will dedicate time or money to that API until something breaks. Then you pull out all the emails, risk/assumption statements etc to highlight your due-diligence on informing the client. Then fix the problem, pushing the whole issue down the road again until the next time.

But seriously, expect to see this anywhere there are budget and time constraints. As soon as the fintech is big enough to get more worried about risk than growth, they should start taking this kind of thing seriously.

breadwinner a year ago

Power BI does this using on-premise data gateway: https://learn.microsoft.com/en-us/power-bi/connect-data/serv...

486sx33 a year ago

Trigger my database and you’re fired, software scab!

But in reality, any big corp assumes if you have the access, you’ve used it. So it should be pretty common

gfody a year ago

in telecom, RespOrgs query Somos by directly connecting to their Oracle database. it's pretty nice compared to, say, some REST API, especially if you're integrating it into your own db as a linked server or remote table.

mandeepj a year ago

How about recording all those attributes (name of seller, email of seller, etc.) when a sale is made? Either by calling an API on your side, FTP file upload, or email. I prefer an API call though.

thehappypm a year ago

Data visualization products do this regularly

  • lloydatkinson a year ago

    Are there any good open source/free ones? I tried Apache Superset a few months ago, which seemed promising but was appallingly bad when it came to trying to use it.

budleigh_s a year ago

I'm thinking that much of the architecture around building APIs in front of databases stems from the religious approach to SOA over HTTP/REST. I don't see much wrong with this as to me the database is still a sort of an API with a different protocol.

At Scramjet we've built a Platform+Agent engine and put it to use, and in many cases we do integrate into DBs directly, without APIs, through long-lived integration functions.

Let me address some of the comments below:

- hard coupling of the DB is bad - I really don't see how adding additional middleware lessens this; you're hard-coupling to something else. It might help if a database structure is changed on a whim, but from my experience, most schema changes reflect changed requirements.

- hard coupling is bad in general - is it though? As a platform engineer I fell into the pitfall of making everything reusable, only to learn that 90% of the stack has never changed, and the XML-based-protocol projects developed in 2004 are still being developed without any change to the protocol at all. A reactive approach with a separated client/data model is much better in my opinion, and quite frankly it's mostly impossible to write a system these days without one.

- schema changes are unavoidable - but hey, as someone pointed out, there are views and materialised views. The RDBs do offer a good framework for keeping legacy schemas during the development cycle, and this framework will save you the most money as it affects the least amount of moving parts in your system. It's been there since the 90s at least, so why not use it?

- the security mumbo jumbo - this seems to be the most uninformed opinion - all leading RDBMSs have an extremely robust and granular way to grant access to data. They are battle-tested for nearly half a century. They are well known and easily tested. And, unlike the REST frameworks, HTTP servers, relays, caches and whatnot, they are contained in a single piece of software, limiting the attack vectors.

If you can deploy the client to the DB remotely, near the database, then I'd say you're doing this right - and not only that, you're limiting the amount of code you need to deploy, maintain and depend on. If you separate the data, the access code and the logic from each other, you have achieved decoupling. But that does not mean that REST APIs are the "one correct way".

Such an approach was the underlying idea of our Scramjet Platform, and we've taken it not only in integrating into DBs, but also file storage, direct protocols and even GPIOs in some cases. And though we have a trick up our sleeve of being able to split the code across distributed environments, even in general terms I'd say: it works, it's maintainable and way less expensive.

white_dragon88 a year ago

> integrating with internal DBs could allow a startup to onboard and go live with customers much more quickly

The technical debt you'd accrue from this would be MASSIVE, and screw your entire architecture down to its roots. Hard pass from me.
