Show HN: A Schemaless Data Store Within Your SQL Database
schemafreesql.com

Hi HN, we're Eric and Dean, creators of SchemafreeSQL. Its roots go back to an online Web App Development Environment we built in 1999. It comprised an IDE, web server, object store, virtual file system, template system, and polyglot scripting (Java, JavaScript, and Python). Of course, we named it ".OS". We ended up dropping it, but that's a story for another time. It was the ease of use of the Object Store from .OS that we really missed, which brings us back to SchemafreeSQL. It provides an enhanced API to your SQL database that allows it to function as a Schemaless Objectstore. Yes, "Schemaless Objectstore" might be an oxymoron if you consider that objects have strict "schemas". So what are we talking about here?
It started with a simple goal. We wanted to store data just as simply as one could throw a JSON document into a doc store, but while maintaining the power of the data's relations and references in an object sense.
What we ended up with was a JSON-over-HTTP API (for ubiquitous access from any language) that maps any JSON structure into an optimized, static set of indexed tables in any SQL database (MySQL being our first port). Using it, we can immediately store any data we want, without having to create tables and indexes or optimize them.
Inserting data is as easy as sending a JSON doc to a doc store, but in SFSQL all nested relations inherent in that document are stored as objects and primitives. And with a little extra JSON syntax we can use references and eliminate duplication of data (e.g. "#ref": "oid-goes-here"), all from within the same source JSON doc. There is no added speed penalty for referencing an object. In fact, all parent nodes in the JSON source data are stored as objects, so any object can be referenced from anywhere, allowing complex (any-to-any) relations, rapid prototyping of data structures (no tables to create), and unique on-the-fly relations if need be.
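To give a feel for it, here is a sketch of a source doc using a reference (illustrative only - the attribute names are invented, and the exact request envelope is described in our docs):

{
  "person": {
    "name": "Ada",
    "address": { "city": "Albany", "state": "NY" },
    "employer": { "#ref": "oid-goes-here" }
  }
}

Here 'person' and 'address' are each stored as objects with their primitives indexed, while 'employer' points at an existing object by its oid rather than duplicating it.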
Queries are issued using a JOIN-free SQL query syntax (e.g. SELECT COUNT(*) WHERE $s:person.address.state='NY') that eliminates the verbosity of standard SQL JOINs, yet still maintains access to the primitive and aggregate functions of the underlying SQL database (e.g. FORMAT(), COUNT(), etc.), which simply get passed through.
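For instance (extrapolating from the syntax above - 'employer' is an invented attribute that holds a reference):

SELECT $s:person.name WHERE $s:person.employer.address.state='NY'

The path walks through the 'employer' reference exactly as it walks through a nested object; no JOIN appears in the query.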
What it's not? It's not a standalone database, and it's not a horizontally scalable big-data store. That said, since it can extend practically any SQL database, its ability to scale is in large part tied to the scaling ability of the SQL database it extends. It's also not an ORM, as there is no strict mapping of objects in code to objects in the datastore. And yes, there were trade-offs in its design. One design choice was that every primitive be indexed, which results in better performance than a straight SQL table without the proper indexes, though slower than a straight SQL table with a composite index tailored to that query. But the ability to JOIN to an existing SQL table from within a SFSQL query is on the roadmap, which will let you have the best of both worlds (custom tables with custom indexes referenced from within a SFSQL query). There is also a situation one encounters in indexing choices (but that you won't encounter with SFSQL) that we like to call the "Index Permutation Limitation". Simply put, there is a real limit to the number of composite indexes you can create for any table/collection. This is especially a problem when, for instance, you want to give end users the ability to query across any of the possible attributes across the myriad product types a big online store may carry. We go into detail on this point in our FAQ.
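To put rough numbers on the permutation problem (our back-of-the-envelope here, not the FAQ's): with just 20 queryable attributes there are C(20,3) = 1,140 distinct 3-attribute combinations, and 20 × 19 × 18 = 6,840 orderings if you count column order, which matters for composite indexes. No realistic set of indexes covers that, and every index you do add slows inserts further.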
So what is it? It's an extension to SQL databases that makes storing, querying, modeling, and just generally using a datastore... easier. We're hoping it will stir up some interesting use cases.
Thanks in advance for the feedback!

> Web App Development Environment we built in 1999. It comprised an IDE, web server, object store, virtual file system, template system, and polyglot scripting (Java, JavaScript, and Python).

When you made this, why did you believe building this was what people needed? And what did you end up observing that made you discontinue the project?

Why isn't it horizontally scalable? Are there tons of joins? I wrote a JSON/Document/Property Graph database front end with Cassandra and DynamoDB backends, so that is ... kinda ... schemaless. I ended up doing a lot of things that RethinkDB and OrientDB did, but it didn't scale like Cassandra; I was trying to do a streaming document layer. Also, a JSON document database with simple relations is a property graph database. But Elassandra and similar projects kinda did the same thing, so I dropped it. Still, I came out of it thinking it shouldn't be hard to scale JSON, which is basically schemaless. I had schema-optional things too, where certain named properties/elements of a JSON doc would be rendered onto a fixed table and schema - are you guys doing anything like that? I think schema-optional, or options for partial schemas for documents, would enable the transition of schemaless data to schema'd stores in enterprise system evolution. Do you guys intermix all documents like graph databases do? That seems a bit insane if you have to do some large-scale table scan or migration and have to sift through the data, or build an index on a heap of documents with only some of them needing indexing.

Hey, quick question: would this by any chance work as an AWS Lambda container image [1] connected to Aurora Serverless [2] (MySQL or Postgres), perhaps using the Data API [3]? If yes, I think this is a big plus for teams looking for managed infra.

[1] https://docs.aws.amazon.com/lambda/latest/dg/lambda-images.h...

[2] https://aws.amazon.com/rds/aurora/serverless/

[3] https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide...

It would not work in its current state, but we have plans for a Lambda-compatible offering. We are also looking at fly.io and other "serverless" function providers.

Interesting project! I'm curious how this compares to other tech:

- How is data stored internally?

- What are the tradeoffs with this implementation?

- Do you generate indexes on data as it comes in?

- How are nested keys handled?

- Are indexes eventually consistent?

Thanks!

> How is data stored internally?

During our beta we will be hosting the databases, and the underlying table structures are not viewable.
However, on release, you will have full access to inspect the tables and indexes, as they will be created within your own database.

> What are the tradeoffs with this implementation?

Depends on what you compare it to, but let's assume we are looking at the tradeoffs of SFSQL vs. straight SQL tables and indexes.
If you needed to store a number of columns in a SQL table and you knew exactly which columns would be queried on (as in filtered on, not just selected), then you would not be able to beat the speed of a composite index on those exact SQL columns. However, as more columns are added to that table, and the demand grows for more of those columns to be queryable at index speeds, the number of possible covering indexes grows with it. In general, more queryable columns result in more indexes, and if you can't accommodate every possible combination of columns across those indexes, then someone will hit a query that ends up table scanning. And of course, more indexes mean slower inserts as well.
With SFSQL there is a static number of indexes. Its insert speed and query speed might not be the fastest compared to custom composite indexes, but query speed degrades predictably as the number of columns queried grows. You might be surprised at the performance, given that you can add and query new attributes on the fly without any tables or indexes to create.

> Do you generate indexes on data as it comes in?

No, we don't have to do that. Everything is indexed.

> How are nested keys handled?

Every primitive attribute (non-object) is indexed by its value, and the relations between objects use oid/poid indexes.
Joins are constructed on the fly that make use of the indexes on the primitives and between the objects.

> Are indexes eventually consistent?

The indexes are immediately consistent.

It sounds like you have single-column indexes on every column and would then rely on an intersecting query plan (e.g. bitmap heap scan) if you query multiple columns. This isn't just slightly slower; how slow it is depends a lot on your data. It's potentially linear in the length of the single-column matches. E.g. a query plan for (region='us' AND product='1234' AND year='2021') could scan the region index for all 'us' tuples, the product index for all '1234' tuples, and the year index for all '2021' tuples. The result might be only 0.0001% of rows, but this might have to scan more index entries than you have rows in the database. In some cases it's probably better to use one index and apply the remaining column filters as you fetch rows from the heap. If you don't care about performance or only have a few MB of data then maybe you will get away with it, but I've never worked on a project where database performance wasn't an issue.

That WOULD be horrible, but we aren't doing that. In essence, we found that JOINs using optimized indexes are less costly than full table scans (on unindexed attributes).
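Roughly, and simplifying a lot - this uses a generic EAV layout, not our actual table design - the multi-attribute filter from your example becomes index-backed self-joins instead of an index intersection:

-- Generic EAV sketch, NOT SFSQL's actual schema: assume attribs(oid, name, sval)
-- with a composite index on (name, sval) and an index on oid.
SELECT COUNT(DISTINCT a1.oid)
FROM attribs a1
JOIN attribs a2 ON a2.oid = a1.oid
JOIN attribs a3 ON a3.oid = a1.oid
WHERE a1.name = 'region'  AND a1.sval = 'us'
  AND a2.name = 'product' AND a2.sval = '1234'
  AND a3.name = 'year'    AND a3.sval = '2021';

Each (name, sval) pair is a tight index range scan, and the joins on oid are index lookups, so nothing degenerates into a full scan or a wide intersection.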
And just to be clear: this is NOT a mutually exclusive solution. Just as people use a combination of different databases for different projects, and even within the same project, SFSQL can be used alongside other solutions.
It is especially convenient as an additional datastore when you are already storing your data in a SQL DB, because there is no additional devops beyond what was already in place for that SQL DB.

> During our beta we will be hosting the databases, and the underlying table structures are not viewable.

A problem here is that during your beta I would imagine potential customers will want to understand exactly how the table structures work - I know that I wouldn't move forward with evaluating this product without understanding what it's actually doing for me.

I would hope that the "what" SFSQL is doing for you is answered on our home page; if not, our apologies. We are constantly working on describing the benefits of using SFSQL. "How" we do it is a bit of a moving target for now; the purpose of the beta is to assess our implementation and make any needed tweaks. With that said, we will provide a better answer to the "how" - stay tuned. As previously stated, upon the release of our self-hosted offering all will be revealed.

Sounds interesting and useful. How likely is it that this product will be alive and well maintained for at least the next 5 years?

Great question. We have shut down our consulting business and are all in on SFSQL. We will not be offering a FOSS solution. Our plan is to provide SFSQL via a subscription-based self-hosted model, delivered initially as a Docker container. SaaS offerings would be provided through infrastructure-provider partners; we would prefer not to be the SaaS provider of SFSQL ourselves. We have other business models that we plan to deploy, but we don't want to get ahead of ourselves.

I don't mean to be negative, but just two cents of feedback: I believe (could be wrong) that many developers would be worried about using this. Seems to me you're a small shop (?). AWS can get away with a proprietary DB API like DynamoDB's; they are big and can provide strong assurances. Since SFSQL is just starting out, by a small shop, it would be a high risk to rely on a closed-source, proprietary API. Might be worth thinking about how you're going to address such concerns. You must have something really strong to support it.

In a self-hosted model, developers would host SFSQL on their own infrastructure, which could be AWS, Google, Azure, DO, etc. There is no reliance on us. Yes, we are maintaining the API and underlying implementation - software as a service, thus the subscription model.

Yeah, I totally agree that providing the self-hosted option is a big plus; it covers maintainability concerns, at least in the short term. Some might still be worried about long-term security and bug fixing, support, etc.

Are you unrolling the nested JSON data structures and storing them as traditional K/Vs in an EAV pattern? Possibly using one table for each datatype, or a sparse table? I'd be curious how this performs for complex queries - does it rely heavily on index intersection?

Yes, an EAV pattern is being used. The end result in query speed, with the particular table designs and indexes, is essentially just like what index intersection gives you, but without the setup.
And of course, EAV is not a pattern that you can easily roll out by hand when you need it. Please try this demo, which runs a sample of queries against Clinical Trials data: https://schemafreesql.com/demo.html#clinicalTrial
Although this demo data set is limited in size, the same queries run against the entire Clinical Trials dataset performed very respectably.
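For readers who haven't met the pattern, the core of a hand-rolled EAV layout looks something like this (a minimal sketch in MySQL flavor; our actual tables, typed value storage, and indexes differ):

-- Minimal EAV sketch (illustrative; not SFSQL's actual schema)
CREATE TABLE objects (
  oid  BIGINT PRIMARY KEY,
  poid BIGINT,                -- parent object, for nested structures
  KEY idx_poid (poid)
);

CREATE TABLE attribs (
  oid  BIGINT NOT NULL,       -- owning object
  name VARCHAR(255) NOT NULL, -- attribute name
  sval VARCHAR(255),          -- value (real designs use typed columns/tables)
  KEY idx_name_sval (name, sval),
  KEY idx_oid (oid)
);

Every attribute lands in one indexed table, so new attributes are queryable on arrival; the hard part is the query/update layer on top, which is what SFSQL provides.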
EAV is a pattern you can easily roll out: https://docs.sqlalchemy.org/en/14/orm/examples.html#module-e... Another commenter illustrated some of the issues you will run into with this pattern at scale. It would probably help to benchmark some complex queries on a sizable data set, and compare against Mongo, Postgres JSONB, vanilla EAV, ClickHouse, etc. Without much information to go on, it's hard to know what this is.

I agree with getting some benchmarks up there. Until that's done, I can share what some preliminary tests revealed, just to give you more of a feel for where this tool might fit. We imported the complete Clinical Trials dataset (380k docs, ~200 attributes each) into SFSQL, Mongo, and a Postgres JSON column.

Import speed with raw documents: Mongo and Postgres win hands down.
Why? Very little processing to be done on their part, while SFSQL unravels the structures and stores everything with indexes already in place. Excluded use cases for SFSQL: high-volume logging, data sinks, etc.

Query speed against raw document elements: Mongo and Postgres very fast, SFSQL respectable. Why? Mongo is optimized for querying raw documents, and Postgres obviously did their work as well. Excluded use cases for SFSQL: storage of raw, unprocessed JSON documents.

Then we extracted all unique instances of a particular attribute from that data and put them into their own collection in Mongo and their own table in Postgres.
The number of distinct objects extracted to external collection/table was just 11.
Then we modified the queries so that they JOINed to the external collection/table.
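The Postgres variant looked conceptually like this (illustrative shape only - the real benchmark tables and attribute names differ):

-- Conceptual shape of the modified Postgres query, not the exact benchmark SQL
SELECT count(*)
FROM trials t
JOIN sponsors s ON s.id = (t.doc ->> 'sponsor_id')::int
WHERE s.doc ->> 'name' = 'Example Sponsor';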
Result: SFSQL still respectable (nothing changed internally or speed-wise), while Postgres and Mongo displayed a huge slowdown (and this was just a single join).
Included use cases for SFSQL: complex relational/referential data.

What about applying a GIN index on the JSONB?

Did not try that kind of index on Postgres. Thx - will keep that on the list for benchmarking. A more detailed response will be provided shortly, but I do encourage you to launch a demo: https://schemafreesql.com/demo.html. You will be provided with an endpoint and access key to your own dedicated SFSQL service. No login required. You can start trying it out from within your own environment.

Want schemaless SQL? Use an EAV (entity-attribute-value) schema.

Exactly what you are getting! But without having to code, optimize, and maintain your own EAV structure and API.

Pretty good approach, provided you don't want to query your data.

So right. It is VERY difficult to query a hand-rolled EAV structure. Kinda like editing a binary image with a text editor, lol.
Please check out some of the demo queries. Hope you think it's as slick as we do.

Is it similar to this: https://eng.uber.com/logging ?

Not at all. However, ClickHouse is one of the underlying DBs on our list of test ports, and we have been VERY curious how it would perform there. In fact, some particular features of SFSQL, such as two-stage deletes and the general structure of the tables, could lend themselves to working well within ClickHouse's mutations mechanism. We don't know yet, but if performance proves good on ClickHouse, then a whole new set of use cases opens up.

There's something wrong with your account creation process: it rejects just about any auto-generated (by Chrome) password, even though they easily meet the listed criteria.

Looks like the Chrome password generator does not generate a special character; we have removed that restriction. Please try again.

I think it does, but not always. It works now, thanks!

I don't know if you've seen https://news.ycombinator.com/showhn.html - one of the things it talks about is minimizing 'try it' friction, and this one felt pretty high, even without the password thing. People bounce not just at account creation but also at stuff like 'beta access', especially if it's not clear from the 'beta access' verbiage whether it means 'access right now' or just ending up on some list.

I see, thanks - we will change the verbiage. We tried to remove friction by allowing beta account creation without a password and one-click demo account creation. You can start using the beta service without creating a password; maybe that was not clear? We have "beta" there just so you know it's not production ready. We can put the beta verbiage in the account creation page description, but not in the Create Account button call to action. Thanks for the feedback. Checking into this now.

This sounds like a lot of reinvention of the JSON/JSONB support that Postgres has already had since 9.2/9.4. They have a slide deck with a variety of in-depth examples: https://wiki.postgresql.org/images/7/7c/JSON-by-example_FOSD...

Glad you asked about the JSON datatype.
The Postgres JSON type is a great addition and certainly works well, especially for situations where you have some common traits shared across something like 'products' (e.g. price): store those common traits in columns and use the JSON type for the uncommon ones (e.g. 'flavor'). Where SFSQL really helps is:

- When new attributes are created that need indexing, you still have to be aware of those new columns before you can index them. An app might let users create their own attributes on the fly. In SFSQL all new attributes are indexed; see the sketch after this list.

- When that JSON document contains some deeper structures (e.g. 'Brand', which contains 'Address', 'Support Phone', etc.) that you want to pull out into their own tables (now or eventually). SFSQL stores all objects (aka nodes in the document) as objects.

- SFSQL updates individual attributes of a document independently of the other attributes, meaning you could even nest a 'counter' within the original document and constantly update it efficiently.

- Simplicity. You can still store anything (just like you would in a JSON type column), yet you gain the ability to reference objects as objects and to query even new attributes without first indexing them.
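For contrast, here is what that first point looks like on the Postgres side (a sketch with hypothetical table and attribute names): each newly queried JSON attribute needs its own index before it is fast:

-- Postgres sketch (hypothetical names): every new hot attribute gets an index
CREATE INDEX idx_products_flavor ON products ((doc ->> 'flavor'));
-- ...and another next month, when users start filtering on 'roast':
CREATE INDEX idx_products_roast ON products ((doc ->> 'roast'));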
How do you protect against user input that produces a psychopathic volume of indexes?

That's a problem that standard SQL tables encounter - chasing new columns with new/modified indexes (more info in our FAQ). SFSQL never runs into it, since the indexes are static in structure: no new indexes are made, even if every attribute name were unique. The result is very consistent query performance.

I guess that would be on the developer's side to sanitize/limit/clean in some way, like limiting the number of key-value pairs, object depth, etc.?

MySQL and Oracle have had similar features for quite a while as well. I believe even SQLite added something like this recently, but I haven't gotten around to trying it.
Definitely curious what the advantages of this product are over the native capabilities. The page does describe the benefit of this solution indexing everything, but that sounds a little terrifying to me…

"Terrifying" is understood. However, the system is NOT indexing every possible permutation of attributes - that would be impossible, of course. Please see the FAQ on indexing for more info: https://schemafreesql.com/faq.html#optimization-free
(The Postgres JSONB/GIN approach referenced above, for comparison:)

CREATE INDEX review_review_jsonb ON reviews USING GIN (review_jsonb);

-- Select data with JSON
SELECT review_jsonb #>> '{product,title}' AS title,
       avg((review_jsonb #>> '{review,rating}')::int) AS average_rating
FROM reviews
WHERE review_jsonb @> '{"product": {"category": "Sheet Music & Scores"}}'
GROUP BY title
ORDER BY average_rating DESC;