Ask HN: Is KDB a sane choice for a datalake in 2024?

40 points by sonthonax 2 years ago · 42 comments


Pardon the vague question, but KDB is very much institutional knowledge hidden from the outside world. People have built their livelihoods around it and use it as a hammer for all sorts of nails.

It's also extremely expensive and written in a language with origins so obtuse that its progenitor, APL, needed a custom keyboard laden with mathematical symbols.

Within my firm it's very hard to get an outside perspective: the KDB developers are true believers in KDB, but they obviously don't want to be professionally replaced. So I'm asking the more forward-leaning HN.

One nail in my job is KDB as a data lake, and it's driving me nuts. I write code in Rust that prices options. There's a lot of complex code involved in this; I use a mix of numeric simulations to calculate Greeks and somewhat lengthy analytical formulas.

The data that I save to KDB is quite raw: I save the market data and derived volatility surfaces, which are themselves complex-ish models needing some carefully unit-tested code to convert into implied vols.

Right now my desk has no proper tooling for backtesting that uses our own data. And I'm constantly being asked to do something about it, and I don't know what to do!

I'm 99% sure KDB is the wrong tool for the job, because of three things:

- It's not horizontally scalable. A divide and conquer algo on N<{small_number} cores is pointless.

- I'm scared to do queries that return a lot of data. It's non-trivial to get a day's worth of data. The query will often just freeze; it doesn't even buffer. Even if I'm just trying to fetch what should be a logical partition, the wire format is really inefficient and uncompressed. I feel like I need to do engineering work for trivial things.

- The main thing is that I need to do complex math to convert my raw data, order-books and vol-surfaces into useful data to backtest.

I have no idea how to do any of this in KDB. My firm is primarily a spot desk, and while I respect my colleagues, their answer is:

> Other firms are really invested in KDB and use KDB for this, just figure it out.

I'm going nuts because I'm under the assumption that these other firms are way larger and have teams of KDB quants doing the actual research, while we have some quant traders who know a bit of KDB, but they work on the spot side with far simpler math.

I keep on advocating for some Parquet-style data store with Spark/Dask/Arrow/Polars running on top of it that can be horizontally scaled and, most importantly, with Polars I can write my backtests in Rust and leverage the libraries I've already written.

I get shot down with "we use KDB here". I just don't know how I can deliver a maintainable solution to my traders with the current infrastructure. Bizarrely, and this is a financial firm, no one in a team of ~100 devs other than me has ever touched Spark-style tech.

What should I do? Are my concerns overblown? Am I misunderstanding the power of KDB?

vessenes 2 years ago

Long time Kdb/q enthusiast, absolutely NO enterprise deployment experience whatsoever.

This feels like a ‘pick your poison’ situation. You’ve been told already you won’t be allowed to dump kdb; it’s probably embedded in your infra in a bunch of ways, and ripping it out is a no-go.

OK, so, you have data in kdb. What you’re doing right now (it sounds like) is using it as literally just a raw data store. That’s the worst way to use it; a lot of work went into making it very fast to run summarization/grouping/sorting/etc all right on the kdb servers. Note that this is very unlike how an Apache project works.

Unfortunately, you wrote a rust library that probably doesn’t really distinguish your kdb storage from, say, JSON files, so you are at a crossroads.

Option 1: Get some good data cloning up, clone data over to your preferred generalized data lake tech, run rust against it.

Option 2: Go through your rust code with a fine-tooth comb and figure out where exactly it's doing things that cannot be done semantically in q/k. Start slimming down your Rust lib, or more exactly, rework what queries it's sending and what state of data it expects.

Option 3: dump your rust library and rewrite it in q or k.

Of these, I would be willing to bet that for an ‘ideal’ developer, meaning a 160+ IQ dev skilled in Rust, vs a 160+ IQ dev skilled in kdb, vs a 160+IQ dev skilled in say Java + Spark, Option 3 is going to be by far the least resource intensive in terms of deployed hardware, and the fastest / lowest latency.

That said, given where you’re at, a principled Rustacean who’s looking at coming to grips with kdb realtime, I think I’d recommend you think hard about Option 2. By the end of Option 2, you will probably be like “Yeah, this could be all k, or nearly all,” but you’re likely going to have some learning to do.

Think of it this way, when you’re done, you’ll be on the other side of the cabal, and can double your base rate for your next gig. :)

alexpotato 2 years ago

I'm speaking as someone who:

- has worked in finance for both hedge funds and banks

- has managed a project where KDB was mandated to be used by mgmt.

- on the above project, tasked one of the smartest developers I've ever worked with as the person to learn KDB and use it in the application

- I'm a SRE (and former Operations) so that colors my perspective.

Given the above, I list out the pros and cons:

PROS

- KDB is pretty fast (on some metrics of fast)

CONS

- VERY few people can write/read good Q (compared to say people who know Pandas/SKlearn etc)

- The learning curve is INCREDIBLY steep. Even the most cited documentation and tutorials have something like this in the intro "are you really sure you need KDB? B/c Q is REALLY hard to learn"

- As you mention, open source industry standards have come a LONG way since it made sense to have KDB (e.g. in the late 2000s/early 2010s)

Conclusion:

If you have a lot of in house expertise, then sure, it probably makes sense. If you are starting from scratch, I would not recommend it.

On that note, this point stood out:

> People have built their livelihoods around it and use it as a hammer for all sorts of nails.

If you work in the industry long enough, you will find a lot of complexity added to systems for three reasons:

1. Some things in finance really do need to be complex due to the math etc

2. Smart people with quant backgrounds tend to LOVE complex things.

3. Smart, rational people realize that adding complexity is one way to build a fortress around their job. This is particularly true in high-paying firms where people realize that it's their knowledge of the complex systems that keeps them in that high-paying job.

Given that, if you are looking to make a name for yourself at your firm, making things run faster, with fewer issues etc., is a good way to stand out. Just be careful that you don't eliminate so much complexity that people get mad at you.

steveBK123 2 years ago

Longtime KDB user here. I think you have some misunderstandings personally, and some poor engineering at your firm around the tech/data. Time-series data, particularly market data, is exactly the use case the product excels at.

The wire format is compressed.

KDB horizontally scales (even their competitors' comparison pages state this - https://www.influxdata.com/comparison/kdb-vs-tsdb/)

A few things to consider that might help - you do not want a solution (in any language/tech) that involves pulling an entire day of market data off disk, across the wire and over to your process for analysis. KDB will not excel for this, nor will anything else. KDB shines when you learn to move your code to the data rather than your data to the code.

What does "move the code to the data" mean in practice?

You can do things like use PyKX which allows you to run your python & kdb code together on top of the data directly in the same process.

You should do as much of the filter/aggregation/joins/etc over on the KDB side before pulling the results back. You should also define, generate and use pre-aggregated data where it makes sense for your use case (second / minute / day bars).
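For example (the table and column names below are purely illustrative, not your schema), a five-minute bar query pushed to the server rather than computed client-side might look like:

    / hypothetical trade table with date, sym, time, price, size columns
    / filtering and aggregation happen on the kdb side; only the bars come back
    select o:first price, h:max price, l:min price, c:last price, v:sum size
      by sym, bar:5 xbar time.minute
      from trade where date=2024.01.02, sym=`AAPL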

Backtesting in KDB is relatively trivial as you have historical data organized by day and symbol. Any half decent KDB dev should be able to cook one up of increasing complexity for you.

Nick Psaris has a couple books that cover more advanced topics that may be of use.

  • reisse 2 years ago

    > you do not want a solution (in any language/tech) that involves pulling an entire day of market data off disk, across the wire and over to your process for analysis.

    Honest question - why? An entire day of market data for a busy option series will be in the low hundreds of gigabytes with a proper wire format; maybe with some compression it'd be tens of gigabytes. Even with 10 Gbit/s networking (which is kinda slow - I believe you can get at least 40 Gbit/s for Amazon EC2<->EBS) the whole day of data will be transferred in a few minutes, which means your bottleneck will be compute, not IO/network. And compute can be parallelized pretty easily.

    • kragen 2 years ago

      because 10 gigabit per second networking is 204 times slower than the hbm2 memory-to-cpu interface, which is 2048 gigabits per second. that means that some computations over the whole dataset will be 204 times faster, running in a few hundred milliseconds instead of a few minutes. your question implies that no such computations exist, or at least could be of interest, but that's self-evidently false

      that's assuming the data is in ram, but even a single nvme flash drive can reach 60 gigabits per second

      (disclaimer, i've never used kdb, just numpy, pandas, glsl, etc.)

      • reisse 2 years ago

        > your question implies that no such computations exist, or at least could be of interest, but that's self-evidently false

        My question implies the specific use case being discussed here. Backtesting is mostly about doing a lot of computations over the same data with different parameters, so you can prefetch data once and then iterate over it multiple times - the network penalty is paid only once.

        • kragen 2 years ago

          my experience is that you can often compute a conservative approximation to the signals you're looking for that's valid over a range of parameters, vastly decreasing the data you have to ship across the wire

    • sonthonaxOP 2 years ago

      If it’s partitioned this should be even faster.

  • blitzar 2 years ago

    > you do not want a solution (in any language/tech) that involves pulling an entire day of market data off disk, across the wire and over to your process for analysis.

    Personally I do this and just throw time / compute at the problem - mostly because I don't want to pay for KDB in $ or learning curve.

    If, however, one does it that way then the actual db is largely irrelevant - if the shop uses KDB, write a query once to pull symbol & timeframe and process locally.

  • sonthonaxOP 2 years ago

    > What does "move the code to the data" mean in practice?

    This grasps at why I’m finding KDB so hard to use. I’ve written a pricing and risk library in Rust. Historical data really needs to be processed in Rust rather than KDB.

oneplane 2 years ago

As powerful as KDB is, finding people to make use of it is almost not worth it. But as it tends to be entrenched (usually via poor reasoning), you usually are screwed when some company or project uses it. I personally would just quit and work somewhere else or on something else.

It's about as isolated as mainframe engineering; great on paper, great in closed off circles, but practically dead in the tech community at large.

succint 2 years ago

You definitely should be able to do these calculations in q near-data. In fact, porting your code from Rust to Q might even reveal bugs and/or sub-optimal code. This was many years ago but I ported over some non-trivial image processing code to Q just to learn the language. I was amazed by how everything fit into a page of code and how seeing all of it together revealed a subtle bug and a couple of ways to optimize better.

sonthonaxOP 2 years ago

I care about my job, enjoy my work, and I take pride in being able to deliver things, but I don't know how to deliver value for my traders here.

I'd really appreciate some perspective from seasoned data-engineers who might have seen this KDB as a data-lake pattern before, and what they did about it. Not just technically, but how they managed the organisational change of KDB for the KDB quants.

I also just don't really know where else to ask. There's not really an online KDB community, and you have a lot of KDB devs who are really good at KDB but know barely anything else, which makes me skeptical of their advice.

nextworddev 2 years ago

No. Don’t go with KDB. (Source: built multiple production backtesting systems in prop desks)

rbanffy 2 years ago

> Pardon the vague question, but KDB is very much institutional knowledge hidden from the outside world.

This answers the question for me: unless you are sure the performance lives up to your expectations AND it gives you a competitive advantage (which can be easily lost with the human in the loop), don't even think of it. Get the next best tech that's easy to use, and well documented, and remove the human in the loop to gain the edge.

steveBK123 2 years ago

Also you mention rust - https://docs.rs/kdb/latest/kdb/

With the old style of kdb integration you could compile a .so and load it in to extend the language. Now people use PyKX to load in Python modules. I have a guy on my team doing this to load a Python-wrapped Rust lib.

It looks like you have a few options with Rust as per the link above.

Note this allows you to do the "move the code to the data" trick I mentioned.

  • sonthonaxOP 2 years ago

    So I’ve actually contributed to this project.

    My concern with adding custom libraries into KDB, while better than writing duplicative Q code, is the maintenance nightmare of keeping them up to date in KDB.

    It’s still an investment, but I need to be aware of the risks and downsides.

    • steveBK123 2 years ago

      Loading your rust code into your existing KDB data lake and periodically updating it will be a significantly smaller lift than rewriting your data lake.

      It sounds like you are some sort of Quant Dev on a desk, and so it's really up to you what you want to do. If you push against the grain to do a data lake rewrite, you'll own the time/effort/outcome of a big Data Engineering project. So you better be very right and also very fast.

      If you are looking for solutions within your existing data lake, I've posted up a few sources / thoughts for you to get on and do your Quant Dev work.

      You sound very set on some sort of rewrite, so you should do what your heart desires. Just make sure you deliver value to your desk.

      • sonthonaxOP 2 years ago

        > Loading your rust code into your existing KDB data lake and periodically updating it will be a significantly smaller lift than rewriting your data lake.

        This is probably going to be what I do until KDB creaks over.

        > You sound very set on some sort of rewrite

        I vacillate between the two things. I'm personally used to data engineering with Parquet and Spark, which are widely used outside of finance and don't have expensive vendor lock-in.

        And then I realise that I'd have to own this stuff, and my job isn't data engineering; I'm a quant dev.

pyuser583 2 years ago

If you’re using KDB, use KDB. The decision has been made, the license paid.

Work to change your organization, not your technology.

Nuclear bombs used to be controlled by decades old systems that worked off floppy disks. Why? Because the systems were so important people worked around the tech.

You’re in a similar spot.

Even when using conventional languages and platforms, sometimes the decision has been made and you’re stuck with it.

KDB might not be the best fit for a datalake, but plenty of people will sleep better just knowing it’s KDB.

Change the people, not the tech.

  • steveBK123 2 years ago

    Right, what's left unsaid is: even if you are right, having your firm's entire data lake rewritten into a new tech stack is a multi-million dollar, multi-year project that will probably take longer than your tenure at the firm.

    So figure out how to make it work for you, as it's a powerful tool and more than adequate for the job; you aren't going to find something 10x better for time series market data.

    Have your team talk to some consulting firms for one-off projects or advisory assistance within your team (like Data Intellect) if the central org is not responsive.

    • blitzar 2 years ago

      >rewritten into a new tech stack is a multi-million dollar, multi-year project

      (In normal companies, i.e. non-tech but especially finance) I have seen many promising careers destroyed by these; your corporate life expectancy is about the same as a bomb defuser's in a war zone.

      12 months and $10mil to roll out the entire system will become $100mil and 4 years to roll out half the system with hacky interconnects to the legacy systems, and eventually everyone from IT to compliance will have a finger in the pie taking a cut.

sneakyavacado 2 years ago

I’ve intermittently worked with kdb for the past three years and feel broadly the same.

Can you deploy to a host that has a mount of the database and run a local q, or are you forced to query via IPC? Are you on cloud or on prem?

A fight for horizontal scaling and running a local q against the data might be an easier one to win than a full replacement of the database.

gigatexal 2 years ago

Totally out of the loop here, what's KDB?

mrj 2 years ago

> I get shot down with "we use KDB here"

Well, so fundamentally a decision has been made already. You got shot down. Unless there's some significant new data that might change that decision, it is what it is. The next step is for you to decide if you can get on board with that, or if it's time to start planning your next move. I don't know anything about KDB and you might be right. But it sounds like the powers that be don't want to make this change.

Sorry, but it makes no sense to swim against the tide in an organization where you don't have the rank to make the decisions.

> I keep on advocating for some Parquet style data-store...

The worst thing you can do is to continue spending your time advocating for change; they've heard you. If you stay on and the current tech does become untenable, you are unlikely to be the hero even in that case. They might just remember you as the detractor.

  • sonthonaxOP 2 years ago

    > Unless there's some significant new data that might change that decision, it is what it is.

    This is what I’m grasping at.

    Are the challenges of writing a KDB system for backtesting derivatives data, which needs to work in tandem with a Rust pricing library, substantially different from one engineered for backtesting spot data?

    One has complex and specific math and one doesn’t.

    • michaelg7x 2 years ago

      I'm not sure if anyone's yet suggested that you embed your code as a library for KDB, which it could load dynamically? There's some pointer-walking fun involved, which Rust may _hate_, but it's not that hard, and after that you'd be left with the numerical arrays you're interested in.

    • mrj 2 years ago

      It sounds like you're not being given the tools to do your job.

jmakov 2 years ago

What's wrong with delta lake?

michaelg7x 2 years ago

Hi, KDB is used for this kind of thing in probably all the Tier 1 banks, or has been at some point. I'm surprised that you seem to have been given so little help by the KDB guys, as it really matters how you store your data. That's informed by the data itself and the access patterns you're likely to use. When you say you're saving them as complex-ish models, it makes me think that it may not be optimal for KDB to process.

KDB is in some respects as dumb as a bag of rocks. There is no execution profiler nor explain plan, no query analysis at all. When running your query over tabular data it simply applies the where-clause constraints in-order, passing the boolean vector result from one to the next, which refines the rows still under active consideration. It's for this reason that newbies are always told to put the date-constraint first, or they'll try to load the entire history (typically VOD.L) into memory.
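Concretely (table name illustrative), the difference is just constraint order:

    / good: the date constraint prunes to one partition before sym is tested
    select from trade where date=2024.01.02, sym=`VOD.L

    / bad: tries to evaluate sym across every date partition in the HDB
    select from trade where sym=`VOD.L, date=2024.01.02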

KDB really is very fast at processing vector data. Writing nested vectors or dictionaries to individual cells could easily be slowing you down; I've heard of one approach which writes nested dictionaries into vectors with the addition of a column to contain the dictionary keys. Then you get KDB to go faster over the 1-D data, nicely laid out on disk. You really do need to write it down in a way that is sympathetic to the way you will eventually process it.
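As a made-up sketch of that flattened layout (the pkey/pval names are mine, not a standard), one row per (timestamp, parameter) pair with the dictionary keys pulled out into their own simple symbol column:

    / every column stays a flat vector that kdb can scan quickly
    surfLong:([] time:10#.z.p; pkey:10#`a`b`rho`m`sigma; pval:10?1f)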

You can create hashmap indices over column data but the typical way of writing down equity L1 data is to "partition by date" (write it into a date directory) and "apply the parted attribute" to the symbol column (group by symbol, sort by time ascending). Each of the remaining vectors (time, price, size, exchange, whatnot) is obviously sorted to match, and finding the next or previous trade for a given symbol is O(1) simplicity itself. I've never worked on options data and so can't opine on the problems it presents, but if you've been asked to write this down without any help, then it's pretty "rubbish" of the KDB guys in your firm. You have asked for help, right?
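(For reference, the partition-by-date plus parted-attribute write-down described above is usually done with .Q.dpft; the HDB path and table name below are illustrative:)

    / write the in-memory trade table as the 2024.01.02 date partition of the HDB,
    / with the parted (`p#) attribute applied to the sym column
    .Q.dpft[`:/data/hdb; 2024.01.02; `sym; `trade]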

I'm really going on a bit but just a few more things:

- KDB will compress IPC data — if it wants to. The data needs to exceed some size-threshold and you must, I think, be sending it between hosts. It won't bother compressing it to localhost, at least, according to some wisdom received from one of the guys at Kx, many moons ago. The IPC format itself is more or less a tag-length-value format, and good enough. It evolved to support vectors bigger than INT32_MAX a while ago but most IPC-interop libraries don't tend to advertise support for the later version that lets you send silly-big amounts of data around, so my guess is you may not want to load data out of KDB a day at a time. Try to do the processing in KDB!

You said you're scared to do queries that return a lot of data, and that it often freezes. Are you sure the problem is at the KDB end? This may sound glib but you wouldn't be the first person to have been given a VM to do your dev-work on that isn't quite up to the job. You can find out the size of the payload you're trying to read by running the same query with the "-22!" system call. It'll tell you how many bytes it's trying to send. Surely there's help to be had from the KDB guys if you reach out?
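(For concreteness, that check looks something like this; table and column names are illustrative:)

    / byte count of the serialised, uncompressed result the server would send
    -22! select from trade where date=2024.01.02, sym=`VOD.L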

- I'm confused by the use of the term "data lake": to me this includes unstructured data. I'm not sure I'd ever characterise a KDB HDB as such.

- If your firm has had KDB for ages there's a good chance it's big enough to be signed up to one of the research groups who maintain a set of test-suites they will run over a vendor's latest hardware offering, letting them claim the crown for the fastest Greeks or something. If your firm is a member you may be able to access the test-suites and look at how the data in the options tests is being written and read, and there are quite a few, I think.

- KDB can scale horizontally. It can employ a number (I forget whether it's bounded) of slave instances and farm-out work. I think I read that the latest version has a better work-stealing algo. It's often about the data, though: if the data for a particular symbol/date tuple is on that one server over there, then you're probably better off doing big historic-reads on that one server alone. I doubt very much you're compute-bound or you'd have told us that your KDB licence limited you to a single or N (rather than any number) of cores.
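(The in-process flavour of that is secondary threads plus peach; the query below is purely illustrative:)

    / start q with 8 secondary threads:  q /data/hdb -s 8
    / then fan a per-date computation out across them
    {select cnt:count i by sym from trade where date=x} peach 2024.01.01+til 5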

- Many years ago I was told never to run KDB on NFS. Except Solaris' NFS. I have no idea whether this is relevant ;)

Good luck, sonthonax

  • sonthonaxOP 2 years ago

    Thanks for the thorough response.

    But firstly:

    > If your firm has had KDB for ages there's a good chance it's big enough to be signed up to one of the research groups who maintain a set of test-suites they will run over a vendor's latest hardware offering, letting them claim the crown for the fastest Greeks or something. If your firm is a member you may be able to access the test-suites and look at how the data in the options tests is being written and read, and there are quite a few, I think.

    Unfortunately my firm isn't that big: ~150 in total and maybe ~40 developers, of which there are 2 full-time KDB devs whose job is mostly maintaining the ingestion and writing some quite basic functions like `as_of`. There's only two people who work on our options desk as developers, so there's a lack of resourcing for KDB. When I have these issues with KDB around performance, it's quite hard to get support within my firm from the two KDB devs (one of whom is very junior).

    > I've never worked on options data and so can't opine on the problems it presents

    The thing about options data is that it's generally lower frequency but a lot more complex. If spot data is 1 dimensional, and futures data is 2 dimensional, options are 3 dimensional. You also have a lot more parameterizations which leads me to the second point :)

    > you may not want to load data out of KDB a day at a time. Try to do the processing in KDB

    Just to give you a very specific example of the processing I need to do. I have a data structure in KDB like this (sort of typescript notation):

         row = mapping<datetime, { a: number, b: number, m: number, sigma: number, rho: number }>
    
    This is a vol surface. To convert that into implied volatility requires:

        f = log_moneyness - m;
        total_var = a + b * (rho * f + (f * f + sigma * sigma).sqrt())
        vol = (total_var / time).sqrt()
    
    Then in order to calculate the log_moneyness I need to calculate the forward price from an interest rate, which is comparatively trivial.

    Now I have a base in which I can start generating data like the delta, but this also requires a lot of math.

    I was pulling this stuff out of KDB because I already had my code in rust that does all of this.

    > You said you're scared to do queries that return a lot of data, and that it often freezes. Are you sure the problem is at the KDB end?

    Yeah, I'm pretty sure in my case. We have some functions, written by the KDB guys, designed for getting data. Even for functions that return 30-something rows, like an as_of query, it takes ~10s.

    • rak1507 2 years ago

      The volatility calculation looks like it should be doable in q/k. I'm not sure about the more complicated stuff, but at the end of the day it's a general-purpose language too, so anything is possible. KDB being columnar means thinking in terms of rows can often be slower. Sounds like you have a keyed table? If the KDB guys you have aren't that good/helpful you could check out some other forums. Could be useful for the future to be able to circumvent issues you have with the kdb devs.
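      For what it's worth, the total-variance step above really is a one-liner over vectors; a minimal sketch, assuming a/b/rho/m/sigma are stored as plain float columns (the parameter names are the OP's, the function names are made up):

          / SVI-style total variance, vectorised over log-moneyness k
          svi:{[a;b;rho;m;sigma;k] f:k-m; a+b*(rho*f)+sqrt[(f*f)+sigma*sigma]}
          / implied vol from total variance w and year fraction t
          impvol:{[w;t] sqrt w%t}

      Run server-side against a day's surface table, only the final vols need to cross the wire.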
