Show HN: s3-lambda – Lambda functions over S3 objects: each, map, reduce, filter (github.com)
It's weird how S3 seems to be the unwanted stepchild of AWS.
So many obvious innovations just aren't turning up.
For example, strangely, AWS introduced tagging for S3 resources, but you can't search/filter by tag, nor is the tag even returned when you list objects; you can only get the tags with a per-object request. The word "pointless" springs to mind.
In fact it's strange that there is NO useful filtering at all apart from the (admittedly very useful) folder/hierarchy/prefix filtering. Beyond that you can't do wildcard searches, date filters, or tag filters.
I'm building an application right now that needs to get a list of all the jpg files - the only way to do that is to list every single object in the bucket and manually filter out the unwanted ones. Feels like it's 1988 again.
It seems like it would also be valuable to have alternate interfaces to S3, such as the ability to send data via FTP, SFTP, SMTP, or whatever, but there are no such interfaces.
Hopefully Google will goad AWS into action on S3 innovation by implementing such features.
S3's API is so rudimentary that I prefer to think of it as a non-enumerable key/value store.
I learned this the hard way: we had an application where we made the mistake of storing about a billion files in a nearly flat structure — one level of nesting, probably 100m "folders" in the root. Then one day we needed to go through it to prune stuff that was no longer in use. Unfortunately, if you don't have a "shardable" prefix, list requests are impossible to parallelize efficiently (because you can't subdivide the work), and our scripts took weeks to run to completion. Hard-earned experience: if you're storing large quantities of stuff in S3, always pick a shardable prefix. The upload date is a good choice. A random string will also do.
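For illustration, here's roughly what the parallel listing looks like once you do have a shardable prefix (bucket and prefix names are made up; assumes the Node.js aws-sdk):

    // Rough sketch: list a bucket in parallel by fanning out over date prefixes.
    // Bucket and prefix names are hypothetical; assumes the Node.js aws-sdk (v2).
    const AWS = require('aws-sdk')
    const s3 = new AWS.S3()

    // List every key under one prefix, following pagination.
    async function listPrefix(bucket, prefix) {
      let keys = []
      let token
      do {
        const page = await s3.listObjectsV2({
          Bucket: bucket,
          Prefix: prefix,
          ContinuationToken: token
        }).promise()
        keys = keys.concat(page.Contents.map(o => o.Key))
        token = page.NextContinuationToken
      } while (token)
      return keys
    }

    // With date-sharded prefixes ("2016-01-01/", "2016-01-02/", ...),
    // each day lists independently, so the work parallelizes trivially.
    function listByDay(bucket, days) {
      return Promise.all(days.map(day => listPrefix(bucket, day + '/')))
        .then(results => [].concat(...results))
    }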
After this, my solution for any non-trivially-sized storage use case is to store an inventory of objects separately in a performant PostgreSQL database, and make sure all writes go through a service layer that shields the consumer from the details of S3. This has some benefits over a hypothetical centralized approach (but some downsides, like the possibility that things get out of sync if you sidestep the inventory). Overall, I wish S3 would store its metadata in something like BigQuery.
Anyone know if Google Cloud Platform's S3 equivalent, Cloud Storage, improves on these issues?
Replying to myself: disappointingly, it seems GCP's Cloud Storage is pretty much a carbon copy of S3 as far as the API is concerned, right down to the prefix/delimiter-based search.
I wonder if "bucket notifications" are reliable enough that one could keep such an index DB populated automatically?
Yes, just hook those up to a lambda function and write to dynamodb or something
I tried this, but if you want to query by tags, using an RDS database works much better. DynamoDB is not well suited to this particular problem.
I think SQS would be reliable enough here, yes.
Have you looked into the inventory functionality? It was just added last November. http://docs.aws.amazon.com/AmazonS3/latest/dev/storage-inven...
Wow, you get a CSV file of all the objects. That's a solution I did not expect.
Sounds a bit like something they cooked up in a hurry to avoid having to design a BigQuery-type service for querying arbitrary metadata; I bet they had some huge customer who needed a CSV file for a bucket and was willing to effectively bankroll the development of this feature.
But yes. That would sidestep the issue. You'd still have to turn on the feature and wait for the CSV file to build (apparently the best granularity is daily), of course, but it would help tremendously. Wish that had existed when we had our difficulties, about a year ago.
I did something similar, storing the information in PostgreSQL, but I made the inserts/updates/deletes based on S3 events. If an object was stored, it would insert a row into the database; if it was deleted, it would soft-delete the row. Worked out well for me.
As someone heading down a similar path (and I'm fairly sure I've got sensible prefixes), can you share an example of a prefix that caused you trouble? Is it something like
/path/to/big-dir/«lots-of-sequential-filenames» ?
Exactly. It works fine for most tasks, of course, but if you ever want to process the contents of the S3 bucket in bulk, nothing will ever be able to parallelize that one list request to /path/to/big-dir.
If you don't use the evenly-distributed-prefix trick, your only chance of speeding it up is knowing the file names beforehand. If they're all sequentially numbered, you might do that, of course.
The shardable prefix doesn't need to be at the top level. So you could also organize it like so, for example:
/secret/documents/2016-01-01/00000001.doc
Thanks! I've read the docs and blog posts, but it was interesting to see a real live antipattern.
I suppose it's like with regular file systems -- don't have too many files in a directory.
In your use case, consider `/path/to/big-dir/AA/AABB/AABBCC` or similar?
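To make that concrete, one (hypothetical) way to derive such a prefix is to hash the original filename and use the first few hex characters as shards:

    // Sketch: derive a shardable prefix by hashing the original filename.
    // Purely illustrative; the sharding depth/width is up to you.
    const crypto = require('crypto')

    function shardedKey(filename) {
      const h = crypto.createHash('md5').update(filename).digest('hex')
      // e.g. "3f/3fa2/3fa29c/report-000123.jpg"
      return `${h.slice(0, 2)}/${h.slice(0, 4)}/${h.slice(0, 6)}/${filename}`
    }

    console.log(shardedKey('report-000123.jpg'))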
Sorry, that example wasn't my data; it was just to illustrate the question.
While these are great points, I think it might then go beyond the "Simple" in the S3 name itself. Wasn't the original purpose of the service to be dumb storage, with metadata layered on top as required? I.e. storing indices separately with whatever functionality is needed (be it date/path filtering).
Perhaps true.
I'll never do that though because I'd have to use DynamoDB, which is a technology that is high on my list of "technologies that I am least enthused about".
Also, I really shouldn't have to go to all the work of creating and maintaining a metadata database and implementing a query API just because I want to do searches more powerful than "list all objects" - that's Amazon's job.
Moreover, even if you had gone to the trouble of building such an API, S3 still doesn't offer bulk operations, so you'd have to operate on each matching object... one object at a time.
This isn't such an issue, because you can update the DynamoDB index using an AWS Lambda function on every ObjectCreated or ObjectRemoved event.
It's still not something I want to do, mainly because I'd have to touch DynamoDB but secondly because, well, why the heck doesn't AWS do it?
What do you have against DynamoDB?
Surely some extra functionality would not obfuscate the inherent simplicity of S3.
An S3Query module would not, I think, make things harder for S3 users.
And frankly - it would be awesome.
I use S3 a lot, and I'm loath to switch to a DB if I can avoid it.
Some querying and indexing features would, I think, be taken up by a large number of devs.
This. We love S3. What we did is add a SQL tier for some of the data we are storing there, in case we want to do some more structured operations.
Yes, but are you sure your database matches the underlying data store?
The real problem with building a metadata index outside is that you then have to validate synchronization - yuck.
You can always do a full scan of your S3 namespace every week or so and synchronize the index. This gives your consumers low-latency access to the object store, since index lookups are extremely fast, and it minimizes the cost of lookups against S3.
This was just introduced last November: http://docs.aws.amazon.com/AmazonS3/latest/dev/storage-inven...
So my database is up to a week wrong? Errr.....
In that it keeps entries for deleted files until the weekly clean-up.
The DB is only incomplete for as long as it takes to commit to the SQL layer after an object is stored successfully.
First, I wanted to say, you bring up some very good points. S3 wasn't really designed to be a searchable key/value store, as you have to pay for lookups, and pagination kills your ability to effectively search anything greater than a few thousand objects in a hierarchy within a reasonable amount of time.
There are, however, ways to solve this: you could fire a Lambda function whenever an object is put into your S3 bucket that simply adds a single row to a DynamoDB table with the object name, along with any additional metadata you might like to capture to assure data provenance. Then, to search, you can simply query the DynamoDB table.
As always, there are many basic building blocks at AWS, but you have to connect them together (like legos) before they become useful for most applications.
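A minimal sketch of the glue described above (the table and attribute names are made up; assumes an s3:ObjectCreated trigger and the Node.js aws-sdk):

    // Sketch of a Lambda handler that mirrors S3 put events into DynamoDB.
    // The table name and attributes are hypothetical.
    const AWS = require('aws-sdk')
    const dynamo = new AWS.DynamoDB.DocumentClient()

    exports.handler = async (event) => {
      // S3 notifications can batch several records per invocation.
      for (const record of event.Records) {
        const { bucket, object } = record.s3
        await dynamo.put({
          TableName: 's3-object-index',
          Item: {
            key: decodeURIComponent(object.key.replace(/\+/g, ' ')),
            bucket: bucket.name,
            size: object.size,
            eventTime: record.eventTime
          }
        }).promise()
      }
    }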
As mentioned elsewhere in this thread, an external metadata database of S3 objects immediately introduces syncing and validity issues.
DOS is smarter than S3.
I believe it works fine.
DOS is 32 bit...
I've had the same frustration with S3, and the reasons you went over are what drove me to create this. It seems like they made S3 (10 years ago now?) and just forgot about it. There is no way to sort/filter or even list all the objects in a bucket without writing a recursive algorithm using one of their SDKs.
I just index S3 with SQL; that combination is plenty powerful, so it's not a huge worry, but it might be interesting to see more native support for that kind of thing.
Might make sense to rename this to avoid confusion with AWS Lambda (I immediately thought it was related). Otherwise, looks like an awesome library!
S3 also has built-in support for triggering a Lambda function on each upload, which makes it even more confusing.
I mean, the README is excellently written and makes clear what the project does, so it's not a big deal beyond the ambiguous name.
Ah yeah, just realizing this...what would you recommend?
I also came here thinking this was some sort of AWS Lambda triggered within the context of a certain S3 file. I would say anyone who's heard of AWS Lambda would think that way.
Maybe functional-s3?
Okay this seems like a good alternative. I just renamed the repo. Renaming it on npm...is a bit cumbersome :|
fp-4-s3?
s3-dataflow / s3-pipe
Smap for S3 map. :)
func-s3?
First impression: this is a brilliant piece of software design.
The ability to compose a map/filter chain and execute it in parallel against every object in an S3 bucket that matches a specific prefix - wow.
The set of problems that can be quickly and cheaply solved with this thing is enormous. My biggest problem with lambda functions is that they are a bit of a pain to actually write - for transforming data in S3 this looks like my ideal abstraction.
... Except it's not!
The "lambda" here isn't AWS Lambda. It's a locally executed function.
Now if this scheduled a bunch of real Lambdas to execute the work for each bucket then yes that'd be awesome.
It should be fairly easily doable with Gordon (https://github.com/jorgebastida/gordon), and scheduling via CloudWatch Events. Or Airflow.
Bah. My first impression was totally wrong in that case. Here's hoping someone builds a version of this that executes magically in the lambda cloud.
Well, you could run it on a large EC2 instance (x1.32xlarge?!:O) and it would be running the lambdas on the cloud, technically... ;-)
That would be glorious.
Yes, I concur! I am definitely trying this out. I have a couple use cases where I think lambda functions would be useful but I don't currently have the time to figure out how to write and execute them.
Thank you :)
Writing this was a necessity for me, being a 1-person data team coming from a Node.js background.
See also AWS Athena: https://aws.amazon.com/athena/
That seems cool but paying per query (per TB scanned) frightens me. I imagine having to fret about how efficient my queries are...
It's not that bad. You can compress the data on S3 in ORC or Parquet format, and you only pay for the compressed data you read, so 1TB can be 130GB after compression. Plus, these formats store summary data, so queries like SELECT COUNT don't have to do a full table scan - they can read just a few KB of summary data for the result.
But that's a lot of work... just to have sane costs for reads of your data.
It's actually just two commands:
1. hive
2. INSERT INTO parquet_table SELECT * FROM csv_table;
I did not know about this...looks like Amazon's version of BigQuery. Fantastic!
Somehow I'd overlooked this too: nice find
So... the client-side code iterates S3 objects matching a certain filter, and then schedules a lambda for each one of those objects. Is that right? Or is the iteration procedure itself a lambda? Also, when you chain several operators together, where does the chaining happen?
I'd like to understand where different parts of the code are being executed.
On a quick page-down through the code, I think this is not related to AWS Lambda; it's more 'local lambda', where the map etc. is run locally.
First, a list of keys is generated based on the set context (and modifier functions). "context" returns a Request object, allowing you to call a lambda function (each, forEach, map, reduce, filter). Each lambda function returns a Promise, allowing you to chain them together. They will operate over the same context, in sequence.
Edit: This is not related to aws lambda...sorry for the confusion
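To give a rough picture of the flow, something like this (a hypothetical sketch: the method names come from the description above, but the constructor and exact signatures are my assumptions, so the README is the real reference):

    // Hypothetical usage sketch only. context/map/filter/reduce come from the
    // description above; the constructor and option names are guesses.
    const S3Lambda = require('s3-lambda')

    const lambda = new S3Lambda({ /* AWS credentials, region, etc. */ })

    // Resolve every key under a prefix, then run a function over each object.
    // Each operation returns a Promise, so runs can be sequenced with .then().
    lambda
      .context('my-bucket', 'logs/2016-01-01')
      .map(body => body.toUpperCase())
      .then(() => console.log('done'))
      .catch(console.error)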
I thought this too. You could easily ship/eval the toString()s of the functions in individual Lambda functions. The name is definitely confusing, haha.
This is a nice project. For real-world use cases, we have good alternatives:
1. Migrate S3 => GCS and use BigQuery, which does support UDFs
2. Sign up for Databricks (I'm not affiliated)
3. (for the brave) Poke AWS support to implement UDFs on Athena
If anyone is interested in this same kind of architecture for multi-cloud file-system providers (no cloud lock-in), please check out this project: https://github.com/bigcompany/hook.io-vfs
Used in production, but it could use some contributors.
Getting an index of (millions of) files on S3 is very slow for us - like, days. Is there anything you do to work around this? It seems that since this is not an AWS Lambda project, the client first has to acquire an index from S3 before the concurrency benefits kick in?
This does not have to do with AWS Lambda, I'm thinking about renaming it to "functional-s3", or something similar.
To answer your question, there isn't really a workaround for this yet, although indexing should be much quicker than "days". All the keys are listed recursively before running the lambda expression locally. If you have a huge number of files, this can take several minutes, maybe hours depending on the scope.
A workaround I've been considering is using a generator function to list the keys; that way, the lambda expression can start immediately, generating keys as it needs them.
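For what it's worth, the generator idea might look roughly like this (bucket and prefix are placeholders; Node.js aws-sdk):

    // Sketch of lazy key listing with an async generator: keys stream out one
    // page at a time, so processing can start before the full listing finishes.
    const AWS = require('aws-sdk')
    const s3 = new AWS.S3()

    async function* keys(bucket, prefix) {
      let token
      do {
        const page = await s3.listObjectsV2({
          Bucket: bucket,
          Prefix: prefix,
          ContinuationToken: token
        }).promise()
        for (const obj of page.Contents) yield obj.Key
        token = page.NextContinuationToken
      } while (token)
    }

    // Start working on objects as soon as the first page arrives.
    async function main() {
      for await (const key of keys('my-bucket', 'logs/')) {
        console.log(key)
      }
    }

    main().catch(console.error)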
Is this susceptible to any of S3's eventual consistency constraints?
Anything you read from S3 is, so yes.
The best way to prevent eventual consistency issues in s3 is to use immutable files. Then you have consistency-now.
They don't have explicit SLAs on this, unfortunately, but I've heard it rumored that internal pagers start firing when consistency lags on the order of hours.
I'm not aware of S3's consistency constraints. What are those?
I think having the default be destructive for mapping is a strange design decision. That is going to bite someone one day soon.
Good point...I will make it so that destructive is opt-in
Really nice to have a generic functional interface to S3. Thanks.
Where can you actually use it? In which cases? Can you provide examples?
Sure. We have application logs that come into S3 and are stored by date prefix. I have cron jobs that run node scripts that do various counts/statistics.
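Roughly what one of those counting scripts might look like with this library (same caveat as the sketch further up: the constructor and exact signatures are assumptions, not copied from the README):

    // Hypothetical sketch: count log entries for one day's prefix with reduce.
    const S3Lambda = require('s3-lambda')

    const lambda = new S3Lambda({ /* AWS credentials, region, etc. */ })

    lambda
      .context('app-logs', 'logs/2016-01-01')
      .reduce((total, object) => total + object.split('\n').filter(Boolean).length, 0)
      .then(count => console.log(`${count} log lines on 2016-01-01`))
      .catch(console.error)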