Mountpoint – file client for S3 written in Rust, from AWS
This is really interesting and something I've been thinking about for a while now. The SEMANTICS[1] doc details what is and isn't supported from a POSIX filesystem API perspective, and this stands out:
> Write operations (write, writev, pwrite, pwritev) are not currently supported. In the future, Mountpoint for Amazon S3 will support sequential writes, but with some limitations:
> Writes will only be supported to new files, and must be done sequentially.
> Modifying existing files will not be supported.
> Truncation will not be supported.
The sequential requirement for writes is the part I've been mulling over: I'm not convinced it's actually required by S3. Last year I discovered that S3 can do transactional I/O via multipart upload[2] operations combined with the CopyObject[3] operation. This should, in theory, allow for out-of-order writes, reuse of existing partial objects, and file appends (rough sketch after the links below).
[1] https://github.com/awslabs/mountpoint-s3/blob/main/doc/SEMAN...
[2] https://docs.aws.amazon.com/AmazonS3/latest/userguide/mpuove...
[3] https://docs.aws.amazon.com/AmazonS3/latest/API/API_CopyObje...
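To make the append case concrete, here's a rough sketch using the AWS CLI; bucket, key, and file names are hypothetical, and the existing object has to be at least 5 MiB so that part 1 satisfies the minimum size for non-final parts:

```bash
BUCKET=my-bucket
KEY=my-object

# Start a multipart upload that will atomically replace KEY when completed.
UPLOAD_ID=$(aws s3api create-multipart-upload --bucket "$BUCKET" --key "$KEY" \
    --query UploadId --output text)

# Part 1: reuse the existing object's bytes server-side with a part copy,
# so nothing is downloaded or re-uploaded.
ETAG1=$(aws s3api upload-part-copy --bucket "$BUCKET" --key "$KEY" \
    --upload-id "$UPLOAD_ID" --part-number 1 --copy-source "$BUCKET/$KEY" \
    --query CopyPartResult.ETag --output text)

# Part 2: the new data to append, uploaded from a local file.
ETAG2=$(aws s3api upload-part --bucket "$BUCKET" --key "$KEY" \
    --upload-id "$UPLOAD_ID" --part-number 2 --body append.bin \
    --query ETag --output text)

# Complete: the old bytes plus the appended bytes become the new object.
# (The ETags returned above already include their surrounding quotes.)
aws s3api complete-multipart-upload --bucket "$BUCKET" --key "$KEY" \
    --upload-id "$UPLOAD_ID" --multipart-upload \
    "{\"Parts\":[{\"PartNumber\":1,\"ETag\":$ETAG1},{\"PartNumber\":2,\"ETag\":$ETAG2}]}"
```

Out-of-order writes would work the same way, since parts can be uploaded in any order and are only stitched together at completion time.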
I use a WebDAV server for storing backups (Fastmail Files). The server allows 10GB usage, but max file size is 250MB, and in any case WebDAV does not support partial writes. So writing a file requires reuploading it, which is the same situation as S3.
What I did is the following (a rough command sketch comes after the list):
1. Create 10000 files, each of 1MB size, so that the total usage is 10GB.
2. Mount each file as a loopback block device using `losetup`.
3. Create a RAID device over the 10000 loopback devices with `mdadm --build --level=linear`. This RAID device appears as a single 10GB block device. `--level=linear` means the RAID device is just a concatenation of the underlying devices. `--build` means that mdadm does not store metadata blocks in the devices, unlike `--create`, which does. Not only would metadata blocks use up a significant portion of the 1MB device size, but I don't need mdadm to "discover" this device automatically, and the metadata superblock doesn't support 10000 devices anyway (the max is 2000 IIRC).
4. From here the 10GB block device can be used like any other block device. In my case I created a LUKS device on top of it, then an XFS filesystem on top of the LUKS device, and that XFS filesystem is my backup directory.
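Roughly, the command sequence looks like this (a sketch only; device names and sizes are illustrative, and the real thing is scripted with more care):

```bash
DIR=/backup-blocks   # hypothetical local directory holding the block files
N=10000
SIZE=999936          # ~1MB per file, a multiple of the 512B sector size

# 1. Create the block files (sparse to start with).
mkdir -p "$DIR"
for i in $(seq 0 $((N - 1))); do
    truncate -s "$SIZE" "$DIR/block$i"
done

# 2. Attach each file to a loopback device (loop's major number is 7;
#    the device nodes may need to be created first).
for i in $(seq 0 $((N - 1))); do
    dev="/dev/loop$((10000 + i))"
    [ -e "$dev" ] || mknod "$dev" b 7 $((10000 + i))
    losetup "$dev" "$DIR/block$i"
done

# 3. Concatenate them into one 10GB device, with no md superblock.
mdadm --build /dev/md0 --level=linear --raid-devices="$N" /dev/loop1{0000..9999}

# 4. LUKS and XFS on top, as with any other block device.
cryptsetup luksFormat /dev/md0
cryptsetup open /dev/md0 backup
mkfs.xfs /dev/mapper/backup
```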
So any modification of files in the XFS layer eventually results in some of the 1MB blocks at the lowest layer being modified, and only those modified 1MB blocks need to be synced to the WebDAV server.
(Note: SI units. 1KB == 1000B, 1MB == 1000KB, 1GB == 1000MB.)
Of course, despite working on this for a week, I only now discovered this... dm_linear is an easier way than mdadm to concatenate the loopback devices into a single device. Setting up the table input to `dmsetup create`'s stdin is more complicated than just `mdadm --build ... /dev/loop1{0000..9999}`, but it's all scripted anyway so it doesn't matter. And `mdadm --stop` blocks for multiple minutes for some unexplained reason, whereas `dmsetup remove` is almost instantaneous.
One caveat is that my 1MB (actually 999936B) block devices have 1953 sectors (999936B / 512B) but mdadm had silently only used 1920 sectors from each. In my first attempt at replacing mdadm with dm_linear I used 1953 as the number of sectors, which led to garbage when decrypted with dm_crypt. I discovered mdadm's behavior by inspecting the first two loopback devices and the RAID device in xxd. Using 1920 as the number of sectors fixed that, though I'll probably just nuke the LUKS partition and rebuild it on top of dm_linear with 1953 sectors each.
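For reference, the dm_linear equivalent is just a table written to `dmsetup create`'s stdin; a sketch, assuming the same hypothetical loop device names as above and 1953 usable sectors per device:

```bash
SECTORS=1953   # 999936B / 512B per device; mdadm silently used only 1920 of these

# Each table line is: <start sector> <length> linear <device> <device offset>.
# The devices are laid out back to back to form one big linear device.
for i in $(seq 0 9999); do
    echo "$((i * SECTORS)) $SECTORS linear /dev/loop$((10000 + i)) 0"
done | dmsetup create backing
# The result shows up as /dev/mapper/backing.
```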
What a coincidence, I just recently did something similar.
Did you run into any problems with discard/zeroing/trim support?
This was a problem with sshfs — I can’t change the version/settings on the other side, and files seemed to simply grow and become more fragmented.
I suspected WebDAV and Samba might have been the solution but never looked into it since sshfs is so solid.
Upon reading this idea I created https://github.com/lrvl/PosixSyncFS - feel free to comment
I did create the block files as sparse originally (using `truncate`), but at some point in the process they became realized on disk. Don't know if it was the losetup or the mdadm or the cryptsetup. I didn't really worry about it, since the block files need to be synced to the WebDAV server in full anyway.
Ahh OK, I think I see -- since the block files are synced in full, you are always swapping blocks and doing ~1MB of writing no matter what.
> I use a WebDAV server for storing backups (Fastmail Files). The server allows 10GB usage, but max file size is 250MB, *and in any case WebDAV does not support partial writes*. So writing a file requires reuploading it, which is the same situation as S3.
This is the part I absolutely missed. I was wondering how you were ensuring 1MB writes -- whether it was at the XFS level or mdraid level...
I think another thing that is missing which I'm inferring (hopefully correctly) is that you've mounted your webdav server to disk. So your stack is:
- LUKS
- mdraid
- losetup
- webdav fs mount
Is that correct?
The stack is XFS inside cryptsetup inside mdraid on top of losetup. The directory containing the losetup block files could be `rclone mount`'d from the WebDAV server, but that would make the setup unavailable if I didn't have network access. So instead I chose to have the block files in a regular directory, and I make sure to `rclone sync` that directory to the WebDAV server when I make changes in the XFS layer. Manually syncing also lets me run `sync` and watch the `rclone sync` output, which gives me greater confidence that all the layers have synced successfully.
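The sync step itself is small; roughly (remote and directory names are made up):

```bash
# Flush XFS -> LUKS -> mdraid -> the loopback-backed block files on disk,
# then upload only the block files whose contents changed.
sync
rclone sync /backup-blocks fastmail:Backups --progress
```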
>Ahh OK, I think I see -- since the block files are synced in full, you are always swapping blocks and doing ~1MB of writing no matter what.
Right. Let's say I update two files in the XFS layer. Those writes eventually result in three blocks in the lowest layer being modified. So now the `rclone sync` will need to do a `PUT` request to replace those three blocks on the WebDAV server, which means it'll upload 3MB of data to the server.
Thanks for the explanation, this makes perfect sense now, didn't realize the syncing was manual/separate.
If they're using LUKS then I think trimming/discard won't be possible.
My immediate instinct was that LUKS could issue trim/discard.
It looks like there's some anecdotal evidence out there that LUKS can discard
https://superuser.com/questions/124310/does-luks-encryption-...
https://unix.stackexchange.com/questions/341442/luks-discard...
My question is more for the mdraid at the bottom of the stack than anything. I'm also a little curious about the performance of WebDAV vs. Samba vs. sshfs (sshfs usually wins out, and WebDAV does not strike me as particularly efficient).
Wouldn't the blocks all be cached locally for the most part? WebDAV is being used as a write behind log/backup. It should be as fast as local access through a file system created over mdraid loopback block devices ...
You're right (see the sibling comment chain). I didn't realize this was just being done on local disk with periodic backup; I thought WebDAV was below it all!
FWIW this is similar to Apple's "sparse image bundle" feature, where you can create a disk image that internally is stored in 1MB chunks (the chunk size is probably only customizable via the command line `hdiutil` not the UI). You can encrypt it and put a filesystem on top of it.
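For example, something like this should create an encrypted sparse bundle with ~1MB bands (a sketch; `sparse-band-size` counts 512-byte sectors, so 2048 is about 1MB):

```bash
# Prompts for a passphrase; the bands show up as ~1MB files inside the bundle.
hdiutil create -type SPARSEBUNDLE -size 10g -fs APFS \
    -encryption AES-256 -volname Backup \
    -imagekey sparse-band-size=2048 backup.sparsebundle
```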
Are you using davfs2 to mount the 1MB files from the WebDAV server?
I started out with davfs2 but it was a) very slow at uploading for some reason, b) there was no way to explicitly sync it so I had to either wait a minute for some internal timer to trigger the sync or to unmount it, and c) it implements writes by writing to a cache directory in /var/cache, which was just a redundant 10GB copy of the data I already have.
I use `rclone`. Currently rclone doesn't support the SHA1 checksums that Fastmail Files implements. I have a PR for that: https://github.com/rclone/rclone/pull/6839
Thanks for the response.
So you are using rclone sync to periodically push changes locally up to the webdav server?
This is a very nice solution.
I think you’re spot on: using multipart uploads, different sections of the ultimate object can be created out of order. Unfortunately, though, that’s subject to restrictions that require you to ensure all but the last part are sufficiently sized.
I'm a little disappointed that this library (which is supposed to be "read optimized") doesn't take advantage of S3 Range requests to optimize read after seek. The simple example is a zip file in S3 for which you want only the listing of files from the central directory record at the end. As far as I can tell this library reads the entire zip to get that. I have some experience with this[1][2] (a ranged-GET sketch follows the links).
[1] https://github.com/mlhpdx/seekable-s3-stream
[2] https://github.com/mlhpdx/s3-upload-stream
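For the zip case, the tail of the object is a single ranged GET away (a sketch; bucket and key are made up, and a larger central directory would need a bigger or second range):

```bash
# Fetch only the last 64KiB of the object, which contains the zip
# end-of-central-directory record and, for most archives, the listing itself.
aws s3api get-object --bucket my-bucket --key archive.zip \
    --range "bytes=-65536" /tmp/zip-tail.bin
```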
Wouldn’t you be maintaining your own list of what is in the zip offline at this point?
Forgive the question but I never quite understood the point of S3. It seems it’s a terrible protocol but it’s designed for bandwidth. Why couldn’t they have used something like, say, 9P or Ceph? Surely I’m missing something fundamental.
EDIT: In my personal experience with S3 it’s always been super slow.
Because you don't have to allocate any fixed amount up front, and it's pay as you go. At the time when the best storage options you could get were fixed-size hard drives from VPS providers, this was a big change, especially on both the "very small" and "very large" ends of the spectrum. It has always spoken HTTP with a relatively straightforward request-signing scheme for security, so integration at the basic levels is very easy -- you can have signed GET requests, written by hand, working in 20 minutes. The parallel throughput (on AWS, at least) is more than good enough for the vast, vast majority of apps assuming they actually design with it in mind a little. Latency could improve (especially externally) but realistically you can just put an HTTP caching layer of some sort in front to mitigate that and that's exactly what everybody does.
Ceph was also released many years after S3 was released. And I've never seen a highly performant 9P implementation come anywhere close to even third party S3 implementations. There was nothing for Amazon to copy. That's why everyone else copied Amazon, instead.
It's not the most insanely hyper-optimized thing from the user POV (HTTP, etc) and in the past some semantics were pretty underspecified e.g. before full consistency guarantees several years ago, you only got "read your writes" and that's it. But it's not that hard to see why it's popular, IMO, given the historical context and use cases. It's hard to beat in the average case for both ease of use and commitment.
Thanks, I see now. Essentially I lacked the original context. I got many excellent answers and can't reply to everyone.
When S3 was released the Internet was very different. Two of the things that stood out were:
1. It offered a resilient key/object store over HTTP.
2. By the standards of the day for bandwidth and storage it was (and to a certain extent still is) very inexpensive.
Since then much of AWS has been built on the foundation of S3 and so its importance has changed from merely being a tool to basically a pervasive dependency of the AWS stack. Also, it very much is designed for objects larger than 1KB and for applications that need durable storage of many, many large objects.
The key benefit, at least according to AWS marketing, is that you don't have to host it yourself.
- Simple API
- Absurdly cheap storage
- Extremely HA
- Absurdly durable
- Effectively unlimited bandwidth
- Effectively unbounded storage without reservation or other management
- Everything supports its API
It’s not a file system. It’s a blob store. It’s useful for spraying vast amounts of data into it and getting vast amounts of data out of it at any scale. It’s not low latency, it’s not a block store, but it is really cheap and the scaling of bandwidth and storage and concurrency make it possible to build stuff like snowflake that couldn’t be built on Ceph in any reasonable way.
The problem is S3 is just a lexicographically ordered key value store with (what I suspect is) key-range partitions[1] for the key part and Reed-Solomon encoded blobs for the value part. In other words, it’s a glorified NoSQL database with no semantics that you’d typically expect of a file system, and therefore repeated writes are slow because any modification to an object involves writing a new version of the key along with its new object.
[1] https://martinfowler.com/articles/patterns-of-distributed-sy...
These aren't really problems tho, just features.
These features may or may not be a problem for your application depending on your specific requirements.
It's clear that for many many applications S3 works just fine.
If you require file system semantics or interfaces (i.e. POSIX) or you update objects a lot or require non-sequential updates or.... then maybe it's not for you.
S3 is straight HTTP, the most widespread API. It can be directly used on the browser, has libraries in pretty much every language, and can reuse the mountain of available software and frameworks for load-balancing, redirections, auth, distributed storage etc
I think there's an interesting story in software ecosystems where there are two flavors of applications (which coexist) that prefer object stores over filesystems and vice versa. A good reference point, I think, is modern video transcoding infrastructure.
Using something like FSx [1] gives you a performant option for the use cases when the tooling involved prefers filesystem semantics.
Here are reasons I'm using S3 in some projects:
1. Cost. It might vary by vendor, but generally S3 is much cheaper than block storage, while still giving some welcome guarantees (like 3 copies).
2. Pay for what you use.
3. Very easy to hand off a URL to the client rather than building some kind of file server (e.g. a presigned URL; sketch after this list). Also works with uploads AFAIR.
4. Offloads traffic. Big files are often the main source of traffic on many websites. Using S3 removes that burden, and S3 is usually served by multiple servers, which further increases speed.
5. Provider-independent. I think that every mature cloud offers an S3 API.
I think that there are more reasons. Encryption, multi-region and so on. I didn't use those features. Of course you can implement everything with your own software, but reusing good implementation is a good idea for most projects. You don't rewrite postgres, so you don't rewrite S3.
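For point 3, a hedged example with the AWS CLI (bucket and key are made up); the SDKs can generate the equivalent presigned PUT for uploads:

```bash
# Time-limited GET URL that can be handed straight to a browser or client.
aws s3 presign s3://my-bucket/reports/report.pdf --expires-in 3600
```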
Thanks, I was unclear and meant only the S3 protocol, not the service, but I see now that as a KV store it makes sense.
> In my personal experience with S3 it’s always been super slow.
Numbers? I feel like it's been a while, but my experience was it is in the 50ms latency range. That's fast enough that you can do most things. Your page loads might not be instant, but 50ms is fast enough for a wide range of applications.
The big mistake I see, though, is a lack of connection pooling: I find code going through the entire TCP connection setup and TLS setup just for a single request, tearing it all down, and repeating. boto also encourages some code patterns which result in GET bucket or HEAD object requests that you don't need and can avoid; none of this gives you good latency.
S3 works over HTTP, which means that it is designed to work over the internet.
The other protocols you mentioned, including NFS, do not work well over the internet.
Some of them are exclusively designed to work within the same network, or very sensitive to network latency.
> Forgive the question but I never quite understood the point of S3.
S3 and DynamoDB are essentially a decoupled BigTable, in that both are KV databases: one is used for high-performance, small-object workloads; the other for high-throughput, large-object workloads.
They have NFS (called EFS), but it's about 10x more expensive.
I wouldn't give a number because the pricing models are fairly different and the real cost will depend on how you're using it and how easy it is to shift your access patterns. On my apps using EFS, that 10x is more like 0.8-1.1x, which is an easy call versus rewriting a bunch of code.
Good luck mounting EFS in Windows.
Apparently the NFS client in Windows only supports NFSv3, while EFS only supports v4. The closest I found was:
http://citi.umich.edu/projects/nfsv4/windows/readme.html
Seems odd that there are no commercial NFSv4 clients for Windows? Might it now be possible to mount via WSL?
I see ReactOS has an NFS client, but I could not figure out which version...
WinFsp (FUSE for Windows) has an NFS driver: https://github.com/winfsp/nfs-win
Do you mean EFS specifically, or you find that NFS doesn't work? Because it was my recollection that Windows included NFS machinery natively
EFS - this is what is being talked about here.
AWS also offers FSx for Windows File Server, and FSx for ONTAP if you need remote Windows file service.
We are talking about EFS here
S3 is slow but low cost; if you want fast, AWS has other, pricier alternatives.
This is misleading. S3 is also incredibly fast. The former when you’re sequentially writing (or reading) objects and the latter when concurrently writing (or reading) vast numbers of them.
That depends on what you consider "fast". EFS (the "serverless" NFS) has sub-millisecond operation latency. S3 is more in the 10-20ms range for most operations, with occasional spikes.
BTW, if you need a pure Go client for NFSv4 (including AWS EFS), feel free to check my: https://github.com/Cyberax/go-nfs-client
We can write vast numbers and volumes of objects to S3 per second using concurrent processes (spawn 1000 Lambda invocations and try it). As long as I have the network bandwidth, I can push stuff essentially as fast as I want. Is that true for EFS? Handle limits. Network interface limits. Protocol limits.
I’m not saying that S3 is perfect or even good for most workloads. However, it is most excellent when the workload fits.
Yea it kind of is! I've used EFS in real-world scenarios with more than 1,000 concurrent readers/writers. EFS's costs are just otherworldly compared to S3. If you need that interface though, it's a good (albeit expensive) choice.
At one point we had a ~560tb EFS disk that ran a variety of mixed workloads (large and small files). It was untenable - raw reading/writing IO is OK, but metadata IO hits a brick wall and destroys the performance of the whole disk for all connections (not just ones accessing a particular partition/tree/whatever).
In order to migrate off it and onto s3 I had to build a custom tool in rust that used libnfs directly to list the contents of the disk. We then launched a large number of lambdas to copy individual files to s3.
It was fun, but in my experience EFS is only good if you have a very homogenous workload and are able to carefully optimise metadata IO. I wouldn’t recommend it - s3 is just cheaper, faster and better.
EFS will handle 1000 readers/writers. We tested it as a data exchange medium for computational tasks. The meta-information APIs in EFS in my experience are faster than S3's (LIST in S3 is notorious). The overall amount of data we stored in EFS was pretty limited (single-digit terabytes), though.
I wouldn't use EFS to store petabytes of data, but if you need a resilient and scalable storage that you can easily integrate into your application, then EFS is great.
One thing that I loved, is the ease of use in local development. With EFS you can simply mount the shared volume into your Docker/K8s container in production, and a local directory when you're developing tasks locally on your laptop. You can even run tasks without a container and monitor their output by looking at the exchange directory. There are AWS API emulators (e.g. Localstack) but they are not as convenient.
fast is an overloaded word. Could mean throughput, or latency. S3 throughput is incredible.
Note: I worked at Amazon on S3 from 2015 to 2017.
I could have worded it better. I was just trying to understand the why of the protocol. My experience was probably irrelevant as we were just using it for storage and interfacing with an FS translation (rclone or similar). We have long stopped using AWS for cost reasons though.
Same with my experience. Not a fan
After teaching customers for years that S3 shouldn't be mounted as a filesystem because of its whole object-or-nothing semantics, and even offering a paid solution named "storage gateway" to prevent issues between FS and S3 semantics, it's rather interesting they'd release a product like this.
Amazon should really just fix the underlying issue of semantics by providing a PatchObjectPart API call that overwrites a particular multipart upload chunk with a new chunk uploaded from the client. CopyObjectPart+CompleteMultipartUpload still requires the client to issue CopyObjectPart calls for the entire object.
> it's rather interesting they'd release a product like this
Azure has a feature where you can mount a blob store storage container into a container/VM, is this possibly aiming to match that feature?
I definitely think people should stop trying to pretend S3 is a file system and embrace what’s it’s good at instead, but I have had many times when having an easy and fast read-only view into an S3 bucket would be insanely useful.
Eventually AWS always gives customers what they want even if it's a "bad idea".
Bad ideas are very relative.
Some bad ideas work extremely well if they fit your use case, you understand very well the tradeoffs and you’re building safeguards (disaster recovery).
Some other companies try to convince (force?) you into a workflow or into a specific solution. Aws just gives you the tools and some guidance on how to use them best.
Indeed.
Distributed patching becomes hell. You need transactional semantics and files are not laid out well to help you define invariants that should reject the transaction.
There is no reason why the descriptor of an object can't be updated with a new value that has all of the old chunks plus a new one. Since S3 doesn't do deduplication anyway, the other chunks could be resized internally by an asynchronous process that gets rid of the excess data corresponding to the now-overwritten chunk.
> This is an alpha release and not yet ready for production use. We're especially interested in early feedback on features, performance, and compatibility. Please send feedback by opening a GitHub issue. See Current status for more limitations.
JungleDisk was backup software I used ~2009 that allowed mounting S3. They were bought by Rackspace and the product wasn't updated. Seems to be called/part of Cyberfortress now.
Later I used Panic's Transmit Disk but they removed the feature.
Recently I'd been looking at s3fs-fuse to use with gocryptfs but haven't actually installed it yet!
We've used the s3fs-fuse library for a while at work as an SFTP/FTP server alternative (AWS wants you to pay $150+/server/month last I checked!) and it's worked like a dream. We scripted the setup of new users via a simple bash script, and the S3 CloudWatch events for file uploads are great. It's been pretty seamless and hasn't caused many headaches.
We've had to perform occasional maintenance, but it's operated for years with no major issues. 99% of issues are solved with a server restart plus a startup script to auto-re-mount s3fs-fuse in all the appropriate places (rough mount sketch below).
Give them a try, I recommend it!
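For reference, the mount side of that looks roughly like this (bucket, mountpoint, and options are illustrative; check the s3fs docs for your auth setup):

```bash
# One-off mount; credentials can come from an instance IAM role,
# ~/.aws/credentials, or an s3fs passwd file.
s3fs my-sftp-bucket /mnt/sftp -o iam_role=auto -o allow_other -o use_cache=/tmp/s3fs

# Or re-mount automatically at boot via /etc/fstab:
# my-sftp-bucket /mnt/sftp fuse.s3fs _netdev,allow_other,iam_role=auto 0 0
```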
> Later I used Panic's Transmit Disk but they removed the feature.
BTW, Panic seemingly intends to re-build Transmit Disk. Hopefully it'll be part of Transmit 6: https://help.panic.com/transmit/transmit5/transmit-disk/#tec...
A supported macOS option appears to be Mountain Duck: https://mountainduck.io/
ForkLift also lets you mount S3 as a drive. https://binarynights.com
There's a similar project under awslabs for using S3 as a FileSystem within the Java JVM: https://github.com/awslabs/aws-java-nio-spi-for-s3
There's some really confusing use of unsafe going on.
For example I'm not sure what they're doing here:
https://github.com/awslabs/mountpoint-s3/blob/main/mountpoin...
Something similar that I've been using for a while now for an S3 filesystem: Cyberduck[0]
In a similar vein, I've been using ExpanDrive [0] for a while. Though admittedly it's only suitable for infrequent access / long term storage type use.
I just wish cyberduck would get a more standard UI. It is so win95 vb. Otherwise works great!
For anyone looking to mount S3 as a file system, I suggest giving rclone a shot. It can mount, copy, and do all file operations not just on S3 but on a wide range of cloud providers. You can also declare a remote as encrypted so it does client-side encryption.
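A rough sketch of the encrypted-remote setup (remote names, bucket, and mountpoint are made up; `rclone config` generates this interactively):

```bash
# ~/.config/rclone/rclone.conf
#
#   [s3]
#   type = s3
#   provider = AWS
#   env_auth = true
#
#   [s3-crypt]
#   type = crypt
#   remote = s3:my-backup-bucket/encrypted
#   password = <output of `rclone obscure ...`>
#
# Mount the crypt remote; file contents (and, by default, file names) are
# encrypted client-side before they ever reach S3.
rclone mount s3-crypt: /mnt/backup --vfs-cache-mode writes
```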
I want a better client for Google Cloud Storage, too, while we’re at it. The Python gcloud / gsutil stuff is mediocre on the best of days.
In theory, you can just use this library since GC Storage supports the S3 protocol. But in practice, I'm not sure.
Or https://clone.org ?
Presumably you forgot the 'r': https://rclone.org/
Autocorrect strikes again! Yes, I meant rclone - thank you.
FUSE makes everything worse, not better. The Unix file API is awful in general, and a terrible mismatch for key-values storage systems.
`gsutil ...` was pretty bad (it is python like you say, and based on a very outdated fork of boto2).
I've had really good luck with `gcloud storage ...` though, which takes essentially the same CLI args. It's much faster and IIRC written in golang.
I'm just about to start using it. I'd love to know what issues you've encountered. Thanks!
I find the "written in Rust" qualifier in all these post titles to be a distraction. It doesn't feel like it's telling me anything.
Amazon is starting to invest in Rust internally and strategically, since some of the Rust leadership joined the company (https://aws.amazon.com/blogs/opensource/why-aws-loves-rust-a...).
It's considered a good replacement for C++ and, like Go, is really good for releasing tools: ones that work great when you can plop a single exe down as your install story, as opposed to, say, a Python install and app (like the AWS CLI).
But it's also still new. Releasing a tool like this is likely a big deal in the area, and they're likely quite proud of it given the effort of things like getting legal approval, marketing, etc., let alone the cool nerd factor of a filesystem. Who doesn't want to show off by having written a filesystem, or hell, a FUSE plugin... file system over DNS, anyone?
I would suggest changing the title to "Mountpoint-S3 - ..." as that's the project name to avoid confusion with mountpoint(1): https://man7.org/linux/man-pages/man1/mountpoint.1.html
It would be interesting to see how this compares to other solutions in this space, such as s3fs (the FUSE driver, not the Python package), goofys, and the rclone mount feature, among others. This certainly has fewer features (notably, mounts are read-only!).
If you are looking for something that supports atomic rename, you can check out Blobfuse2[0] + ADLS Gen2[1].
Disclaimer: work for MSFT
[0] https://github.com/Azure/azure-storage-fuse
[1] https://learn.microsoft.com/en-us/azure/storage/blobs/data-l...
They should benchmark it against rclone
Couldn't tell from the README, does this do any sort of cache management or LRU type thing? In other words, does it fetch the underlying S3 object in real time, and then eventually eject them from memory and/or the backing FS when they haven't been used for a while?
`catfs` is a FUSE FS that can do this for you. You'll need some changes to make it work well. I'll have a friend upstream them soon, but they're easy to make yourself.
Could replace `goofys` with this and then stick `catfs` in front.
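Roughly (a sketch; paths and bucket are made up, and I'm assuming catfs's source/cache/mountpoint argument order — check `catfs --help`):

```bash
mkdir -p /mnt/s3-raw /var/cache/catfs /mnt/s3

# Raw S3 view (goofys today; in principle mountpoint-s3 once it matures).
goofys my-bucket /mnt/s3-raw

# catfs keeps an on-disk cache of recently read files in /var/cache/catfs
# and serves them through /mnt/s3, evicting old entries to keep space free.
catfs /mnt/s3-raw /var/cache/catfs /mnt/s3
```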
> A simple, high-throughput file client for mounting an Amazon S3 bucket as a local file system.
This is exactly what I need. The current Python scripts are good enough, but a Rust utility would be preferable.
I don’t understand whether this is just a higher level abstraction of boto’s s3 client (à la s3fs)
It looks to be a completely different codebase from boto/s3fs.
Not having used s3fs, I'm going to guess that s3fs is limited by the limits of the underlying language, Python: namely, poor performance overall and a poor multi-threading story.
I'd imagine s3fs is useful for stuff like backing up personal projects, quickly sharing files between developers etc.
For operating at any kind of scale - in terms of concurrent requests, number or size of files etc - I'd guess that Mountpoint would be the only viable solution.
How does this compare to goofys?