3K, 60fps, 130ms: achieving it with Rust
blog.tonari.no
Aside from the Rust aspect (which is cool!), I can't believe we've come this far and still don't have low-latency video conferencing. Maybe I'm overly sensitive, but people talking over each other and the lack of conversational flow drives me crazy with things like Hangouts.
John Carmack always has an interesting point to make about latency: https://twitter.com/ID_AA_Carmack/status/193480622533120001
>I can send an IP packet to Europe faster than I can send a pixel to the screen. How f’d up is that?
and to relate to the other post about landlines: https://twitter.com/ID_AA_Carmack/status/992778768417722368
>I made a long internal post yesterday about audio latency, and it included “Many people reading this are too young to remember analog local phone calls, and how the lag from cell phones changed conversations.”
> Many people reading this are too young to remember analog local phone calls, and how the lag from cell phones changed conversations
Is there somewhere to read about the changes in question?
I'm old enough to remember extensive use of analog landlines, and can't really think of any difference to a cellphone other than audio quality.
In my world, using regular cell service (not VoLTE) seems nearly as instantaneous as I remember analog lines being. I remember how hard a satellite phone call was, and I never have that much latency in a call.
Isn't this mostly because actually showing a pixel requires a macroscopic change?
Cisco "telepresence" solved this 15 years ago. Standardized rooms on both sides with high quality cameras and low latencies. Polycom had a similar but worse setup at the time. The Cisco experience was very close to being in a shared meeting with the other people. It made meetings across continents work very well and was an actual competitor to flying everywhere. Between the hardware being too expensive and the link requirements being very high I only ever saw it implemented in multinational telecoms for whom it was an actual work tool but also something to impress their clients with.
Either Cisco needed to bring down the cost massively to expand access or someone needed to build it in major cities and bill by the hour to compete against flying. None of those happened so it stayed a niche. Compared to those experiences more than a decade ago the common VC is still very slowly catching up. Part of it is setup, like installing VC rooms with 2 smaller TVs side by side instead of one large one so you can see the document and the other people at decent sizes. But part of it is still the technology. Those "telepresences" were almost surely on a dedicated link running on the telecom core network that guaranteed quality instead of routing through the internet and randomly failing. I suspect getting really low latency will require that kind of telecom level QoS otherwise you'll be increasing buffer sizes to avoid freezes.
Cisco and HP Halo were incredible, but the biggest problems they had were 1) the requirement to build out an actual room for it and 2) the shitty software setup experience. The big corporates that could afford to dedicate real estate to VC also bogged the whole thing down in "enterpriseyness" that made it impossible to use.
About 10 years ago I got to go on a tour of the Taiwan HP office. One thing that stands out in my mind is the telepresence rooms. Absolutely fabulous: a large table, with screens across from it that showed a high fidelity, low latency image of whoever was sitting at a connected table.
Latency was still a huge issue with the HP Halo. I remember a specific meeting where they talked about upgrading the audio codec, which didn't seem to address things much. It was kind of a running joke that any applause or laughter would land with a huge, noticeable lag between locations.
I worked at a company that had a Cisco telepresence machine on wheels. You had to make sure it was plugged into a certain color Ethernet wall jack for it to work but every room had one. You could reserve it and then wheel it to the conference room you wanted.
That's nothing like a Cisco telepresence room. You have to have used one to understand. It's nothing too sci-fi -- not floor to ceiling curved displays or whatnot -- but just the multiple large TVs all in a curved setup on the other side of a curved table makes a huge difference.
And a standardized wall color and camera location, so that everyone that joins in from another telepresence room blends in as if they were really there.
Here's a picture:
https://en.wikipedia.org/wiki/Cisco_TelePresence#/media/File...
It would seem like they relaxed rules about what's in the background. But then, my knowledge is from a Telepresence room having been set up at a previous employer somewhere between 10 and 15 years ago (and I wasn't directly involved).
It would be interesting if a camera was on top of every tv, so that you have a 1-to-1 with every recipient.
That way, when you turn your head to the person on each tv, it would seem as if you were actually looking at them.
Getting off topic here, but this makes me think of what can be seen now in some Japanese programs because of social distancing measures. I don't know what kind of setup they have, but in some programs, from the spectator's perspective, you see people lined up behind a table, but some of them are actually on large monitors that make them appear at the right size. The interesting thing is that the ones on monitors act as if they were actually there, turning their head in the direction of the person speaking.
What Japanese programs?
ひるおび is one of them IIRC.
My first job out of school was doing product verification for the cameras that were used in those Cisco systems! It was pretty impressive, I think they managed to squeeze 1080p at 60fps over USB2. Had a lot of fun building jigs and test setups to measure the MTBF on a tight time frame
The biggest problem is that of the video codecs which ultimately boils down to using interframe compression. This technique requires that a certain # of video frames be received and buffered before a final image can be produced. This requirement imposes a baseline amount of latency that can never be overcome by any means. It is a hard trade-off in information theory.
Something to consider is that there are alternative techniques to interframe compression. Intraframe compression (e.g. JPEG) can bring your encoding latency per frame down to 0~10ms at the cost of a dramatic increase in bandwidth. Other benefits include the ability to instantly draw any frame the moment you receive it, because every single JPEG contains 100% of the data. With almost all video codecs, you must have some prior # of frames in many cases to reconstitute a complete frame.
For certain applications on modern networks, intraframe compression may not be as unbearable an idea as it once was. I've thrown together a prototype using LibJpegTurbo and I am able to get a C#/AspNetCore websocket to push a framebuffer drawn in safe C# to my browser window in ~5-10 milliseconds @ 1080p. Testing this approach at 60fps redraw with event feedback has proven that ideal localhost roundtrip latency is nearly indistinguishable from native desktop applications.
The ultimate point here is that you can build something that runs with better latency than any streaming offering on earth right now - if you are willing to make sacrifices on bandwidth efficiency. My three-weekend project arguably already runs much better than Google Stadia regarding both latency and quality, but the market for streaming game & video conference services which require 50~100 Mbps (depending on resolution & refresh rate) constant throughput is probably very limited for now. That said, it is also not entirely non-existent - think about corporate networks, e-sports events, very serious PC gamers on LAN, etc. Keep in mind that it is virtually impossible to cheat at video games delivered through these types of streaming platforms. I would very much like to keep the streaming gaming dream alive, even if it can't be fully realized until 10gbps+ LAN/internet is default everywhere.
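To put rough numbers on that bandwidth trade-off, here's a back-of-the-envelope sketch in Rust; the per-frame JPEG sizes are assumptions for "decent quality" frames, not measurements from any particular prototype:

    // Rough bandwidth for intraframe (JPEG-per-frame) streaming.
    // The bytes-per-frame figures are assumed, not measured.
    fn mbps(bytes_per_frame: f64, fps: f64) -> f64 {
        bytes_per_frame * 8.0 * fps / 1_000_000.0
    }

    fn main() {
        let fps = 60.0;
        println!("1080p @60fps, ~150 KB/frame JPEG: {:.0} Mbps", mbps(150_000.0, fps)); // ~72 Mbps
        println!("4K    @60fps, ~500 KB/frame JPEG: {:.0} Mbps", mbps(500_000.0, fps)); // ~240 Mbps
        // A typical interframe codec at the same resolution might sit around 5-20 Mbps,
        // which is the efficiency being traded away for per-frame latency.
    }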
Interframes are not a problem, as long as they only reference previous frames, not future ones.
I was able to get latency down to 50ms, streaming to a browser using MPEG1[1]. The latency is mostly the result of 1 frame (16ms) delay for a screen capture on the sender + 2-3 frames of latency to get through the OS stack to the screen at the receiving end. En- and decoding was about ~5ms. Plus of course the network latency, but I only tested this on a local wifi, so it didn't add much.
[1] https://phoboslab.org/log/2015/07/play-gta-v-in-your-browser...
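Just re-adding that budget to see how it stacks up (a throwaway sketch; the 60fps frame time and the 2-3 display frames come from the comment above, the 1 ms network figure is a guess):

    // Latency budget from the comment above, assuming 60fps (~16.7 ms frames).
    fn main() {
        let frame_ms = 1000.0 / 60.0;
        let capture = frame_ms;       // 1 frame to grab the screen on the sender
        let display = 2.0 * frame_ms; // 2-3 frames through the OS/display stack (2 used here)
        let codec = 5.0;              // encode + decode combined
        let network = 1.0;            // local wifi, roughly negligible
        // Prints ~56 ms with 2 display frames; with 3 it would be ~72 ms.
        println!("total ≈ {:.0} ms", capture + display + codec + network);
    }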
It's funny you mention MPEG1. That's where my journey with all of this began. For MPEG1 testing I was just piping my raw bitmap data to FFMPEG and piping the result to the client browser.
I was never satisfied with the lower latency bound for that approach and felt like I had to keep pushing into latency territory that was lower than my frame time.
That said, MPEG1 was probably the simplest way to get nearly-ideal latency conditions for an interframe approach.
Wouldn't you then hit issues where a single dropped packet can cause noticeable problems? In an intraframe solution if you lose a (part of a) frame, you just skip the frame and use the next one instead. But if you need that frame in order to render the next one, you either have to lag or display a corrupted image until your next keyframe.
I guess as long as keyframes are common and packet loss is low it'd work well enough.
Corrupted frames happen; they're not too bad. You can also use erasure coding.
Interesting. I guess I'll have to rewrite a lot of code if what you are saying is true.
You can also just configure your video encoder to not use B-frames. If you then make all consecutive frames P-frames, the size is very manageable. It gets trickier if your transport is lossy, since a dropped P-frame is a problem, but it's not an unsolvable problem if you use LTR frames intelligently.
All the benefits of efficient codecs, more manageable handling of the latency downsides.
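For what it's worth, here's a toy sketch of that LTR idea (pure bookkeeping, no real encoder API): when the receiver reports a lost P-frame, the next frame references the last acknowledged long-term reference instead of the previous frame, so the chain recovers without a full keyframe.

    // Toy model of P-frame-only streaming with long-term reference (LTR) recovery.
    // No real encoder is involved; frame "encoding" is just bookkeeping.
    #[derive(Debug)]
    enum Frame {
        Key { id: u64 },
        P { id: u64, reference: u64 },
    }

    struct Encoder {
        next_id: u64,
        last_acked_ltr: u64, // last frame the receiver confirmed it decoded
        last_sent: u64,
    }

    impl Encoder {
        fn new() -> Self {
            Encoder { next_id: 1, last_acked_ltr: 0, last_sent: 0 }
        }

        fn encode_next(&mut self, receiver_reported_loss: bool) -> Frame {
            let id = self.next_id;
            self.next_id += 1;
            // Normally predict from the previous frame; after a reported loss,
            // fall back to the last acknowledged LTR so the chain isn't broken.
            let reference = if receiver_reported_loss { self.last_acked_ltr } else { self.last_sent };
            self.last_sent = id;
            if id == 1 { Frame::Key { id } } else { Frame::P { id, reference } }
        }

        fn ack(&mut self, id: u64) {
            self.last_acked_ltr = id;
        }
    }

    fn main() {
        let mut enc = Encoder::new();
        println!("{:?}", enc.encode_next(false)); // Key { id: 1 }
        enc.ack(1);
        println!("{:?}", enc.encode_next(false)); // P { id: 2, reference: 1 }
        println!("{:?}", enc.encode_next(true));  // loss reported: P { id: 3, reference: 1 }
    }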
The challenge you'll run into instantly with JPEG is that the file size increase & encoding/decoding time at large resolutions outstrip any benefits you see in limited tests. For video game applications you have to figure out how you're going to pipeline your streaming more efficiently than transferring a small 10 kb image, as otherwise you're transferring each full uncompressed frame to the CPU, which is expensive. Doing JPEG compression on the GPU is probably tricky. Finally, decode is the other side of the problem. HW video decoders are embarrassingly fast & super common. Your JPEG decode is going to be significantly slower.
EDIT: For your weekend project, are you testing it with cloud servers or locally? I would be surprised if, under equivalent network conditions, you're outperforming Stadia, so be careful that you're not benchmarking local network performance against Stadia's production performance on public networks.
I tested: localhost (no network packets on copper), within my home network (to router and back), and across a very small WAN distance in the metro-local area (~75 Mbps link speed w/ 5-10 ms latency).
The only case that started to suck was the metro-local, and even then it was indistinguishable from the other cases until resolution or framerate were increased to the point of saturating the link.
One technique I did come up with to combat the exact concern raised above regarding encoding time relative to resolution is to subdivide the task into multiple tiles which are independently encoded in parallel across however many cores are available. When using this approach, it is possible to create the illusion that you are updating a full 1080/4k+ scene within the same time frame that a tile (e.g. 256x256) would take to encode+send+decode. This approach is something that I have started to seriously investigate for purposes of building universal 2d business applications, as in these types of use cases you only have to transmit the tiles which are impacted by UI events and at no particular frame rate.
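A minimal sketch of that tiling idea using only std threads; the per-tile encode is a placeholder (a real implementation would hand each tile to a JPEG or similar encoder, and would use a thread pool rather than spawning a thread per tile):

    // Split a frame into fixed-size tiles and "encode" them in parallel with std threads.
    use std::thread;

    const TILE: usize = 256;

    fn encode_tile(tile: &[u8]) -> Vec<u8> {
        // Stand-in: a real encoder would compress here. We just copy.
        tile.to_vec()
    }

    fn main() {
        let width = 1920usize;
        let height = 1080usize;
        let frame = vec![0u8; width * height * 4]; // RGBA framebuffer

        let tiles_x = (width + TILE - 1) / TILE;
        let tiles_y = (height + TILE - 1) / TILE;

        let encoded: Vec<Vec<u8>> = thread::scope(|s| {
            let mut handles = Vec::new();
            for ty in 0..tiles_y {
                for tx in 0..tiles_x {
                    let frame = &frame;
                    handles.push(s.spawn(move || {
                        // Copy this tile's rows out of the frame, then encode it.
                        let mut tile = Vec::new();
                        let x0 = tx * TILE;
                        let w = TILE.min(width - x0);
                        for y in ty * TILE..((ty + 1) * TILE).min(height) {
                            let row = (y * width + x0) * 4;
                            tile.extend_from_slice(&frame[row..row + w * 4]);
                        }
                        encode_tile(&tile)
                    }));
                }
            }
            handles.into_iter().map(|h| h.join().unwrap()).collect()
        });

        println!("encoded {} tiles", encoded.len());
    }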
Actually, there are commercial CUDA JPEG codecs (both directions) operating at gigapixels per second. It's not a question of speed, but rather the fact that you can at least afford to use H.264's I-frame-only codec for much lower bandwidth requirements.
JPEG is still going to be larger & lower quality than H264. I still fail to see the advantage.
~10x higher framerate?
Almost every hardware codec I've seen supports JPEG. MJPEG is certainly more rare than the more traditional video algorithms, but it certainly gets used.
You can also eliminate I-frames and have I-slices distributed among several P-frames, so that you don't have spikes in bandwidth (and possibly latency, if the encoder needs more time to process an I-frame).
I think a larger issue is the focus on video as opposed to audio. Audio may be less sexy but it is far and away more important for most interpersonal communication (I'm not discussing gaming or streaming or whatever, but teleconferencing). Most of us don't care that much if we get super crisp, uninterrupted views of our colleagues or clients, but audio problems really impede discussion.
Video is related to this though. If audio is synced to the video then a delayed video stream also means a delayed audio stream.
In my approach, these would be 2 completely independent streams. I haven't implemented audio yet, but hypothetically you can continuously adjust the sample buffer size of the audio stream to be within some safety margin of detected peak latency, and things should self-synchronize pretty well.
In terms of encoding the audio, I don't know that I would. For video, going from MPEG->JPEG brought the perfect trade-off. For reducing audio latency, I think you would just need to be sending raw PCM samples as soon as you generate them. Maybe in really small batches (in case you have a client super-close to the server and you want virtually 0 latency). If you use small batches of samples you could probably start thinking about MP3, but raw 44.1 kHz 16-bit stereo audio is only about 1.4 Mbps. Most cellphones wouldn't have a problem with that these days.
Edit: The fundamental difference in information theory regarding video and audio is the dimensionality. JPEG makes sense for video, because the smallest useful unit of presentation is the individual video frame. For audio, the smallest useful unit of presentation is the PCM sample, but the hazard is that these are fed in at a substantially higher rate (44k/s) than with video (60/s), so you need to buffer out enough samples to cover the latency rift.
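Rough numbers for that reasoning (the 30 ms peak latency and 20 ms safety margin below are arbitrary illustrative values, not recommendations):

    // Raw PCM bandwidth and a simple jitter-buffer size, per the reasoning above.
    fn main() {
        let sample_rate = 44_100u32;
        let channels = 2u32;
        let bits = 16u32;

        let bitrate = sample_rate * channels * bits; // bits per second
        println!("raw PCM: {:.2} Mbps", bitrate as f64 / 1_000_000.0); // ~1.41 Mbps

        // Buffer enough samples to cover peak network latency plus a margin.
        let peak_latency_ms = 30.0;
        let margin_ms = 20.0;
        let buffer_samples = (sample_rate as f64 * (peak_latency_ms + margin_ms) / 1000.0) as u32;
        println!("jitter buffer: {} samples per channel (~{} ms)",
                 buffer_samples, peak_latency_ms + margin_ms); // 2205 samples, ~50 ms
    }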
Discord does something like what you describe. It's kind of awful for music (e.g. if it's a channel with a music bot) as you'll hear it speed up and slow down in an oscillating pattern. The same effect also appears in games if a game loop always tries to catch up to an ideal framerate by issuing more updates to match an average: the resulting oscillation, as the game suddenly slows down and then jerks forward, is hugely disruptive, so it's not really done this way in practice.
Oscillations are the main issue with "catch-ups" in synchronization, and dropping frames once your buffer is too far behind is often a more pleasant artifact. It's not really a one-size-fits-all engineering problem.
Audio conferencing at low latency is already solved by things like Mumble (https://www.mumble.info/). I think adding a video feed in complete parallel (as in, use mumble as-is, do the video in another process) with no regard for latency would be a pretty good first step to see what can be achieved.
Early versions of Youtube nailed this. The video would frequently pause, degrade, or glitch due to buffering delays but the audio would continue to play. This made all the difference in user perception: youtube felt smooth. Other streaming services would pause both video and audio which did not feel smooth at all. Maybe they had some QoS code in their webapp to prioritize audio?
One technique that could be used (to get high compression rates on compression applied to each frame) is to train a compression "dictionary" on the first few seconds/minutes of a data stream, and then use the dictionary to compress/decompress each frame.
Well, all the effort is regularly defeated by poor hardware - you can have 40ms latency in the video call stack, but when people attach Bluetooth headphones which buffer everything for 300ms there's nothing really to be done.
(Be gentle on your coworkers and use cabled headphones.)
The LLAC/LHDC LL Bluetooth codec adds only 30ms.
The aptX Low Latency codec adds only 40ms max.
Just buy headphones with good low latency support. They aren't even expensive anymore.
Bluetooth audio is a mess of compromise. The default SBC codec is basically fine for low latency, but the parameters are all pretty terrible. Everyone uses the same few default parameters, which neither give particularly high quality (especially for two-way audio, which was designed to be compatible with phone quality) nor low latency (especially for the high-quality A2DP profile). One issue is that the designs/defaults haven't really been updated since about 2000, and the parameters are very hard to change; typically the OS's preference is hardcoded somewhere (also, whichever device initiates the connection gets to choose the parameters, so even if you configured your computer to choose "better" parameters, it would all be for naught if you let the headphones connect to the computer rather than the other way round). The other issue is that Bluetooth is quite severely bandwidth constrained, and higher bandwidth could theoretically give lower latency.
>LLAC/LHDC LL bluetooth codec adds only 30ms.
"only" is positive thinking.
I do play some rhythm games (LLSIF, deresute, mirishita) on Android. The difference between "only adds 30ms" and plugging my headphones directly to the headphone jack is the difference between unplayable and playable. The games do have a latency compensation setting (with a calibration procedure), but compensation is no substitute for the real thing: Low latency.
LLAC/aptX LL isn't well supported on host devices right now, especially on Apple devices.
And even with a 30ms delay, using it for both headphone and mic on both talkers means 30 x 2 x 2 = 120ms of delay.
Okay, but I want to wear wireless headphones.
Why can't I have both? Wifi doesn't seem to have this latency problem.
How do you know? :)
The latency doesn't come from bluetooth radio part itself (there ARE low latency BT headphones after all).
It comes from the fact that all audio is encoded (usually into SBC or AAC or AptX), transmitted and then decoded in the headphones. And each of those steps has buffers. And those buffers are configured by the manufacturer.
The bigger the buffer, the more stable the audio connection - there's less stuttering, less dropouts. But every buffer in the chain adds latency.
So why can't you have both? You sure can. You just need to somehow find headphones and a PC that doesn't add latency to bluetooth. Sadly that's not something that's usually documented in technical specs.
Or use wireless mics that don't use bluetooth and are dedicated to low latency wireless audio. Like the ones they use for theatre: https://www.adorama.com/alc/how-to-choose-a-wireless-microph...
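To put numbers on the buffer point above: SBC in its common high-quality configuration packs 128 PCM samples per frame (8 subbands x 16 blocks), so latency is roughly buffered-frames x 2.9 ms. The frame counts below are illustrative guesses, not measured values from any particular device:

    // How encode/transmit/decode buffering turns into latency.
    fn main() {
        let sample_rate = 44_100.0;
        let samples_per_frame = 128.0; // typical SBC: 8 subbands x 16 blocks
        let frame_ms = samples_per_frame / sample_rate * 1000.0; // ~2.9 ms

        for buffered_frames in [10u32, 50, 100] {
            println!("{:>3} frames buffered ≈ {:>4.0} ms of latency",
                     buffered_frames, buffered_frames as f64 * frame_ms);
        }
        // 10 frames ≈ 29 ms, 100 frames ≈ 290 ms -- roughly the spread between
        // "low latency" headphones and the ~300 ms ones complained about upthread.
    }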
Wifi's latency has a high dispersion. I've seen absolutely terrible wifi latency, and latency that is under 1ms. wifi degrades gracefully, which makes it really tough to work with.
But pretty much all serious gamers use an ethernet connection because wifi is a pain in the ass. In fact, the first thing a support representative for any game will tell you when complaining about excessive lag is to try a wired connection.
WiFi has terrible latency. Try playing a multiplayer FPS with wired networking and compare with WiFi. Or simply use remote desktop with WiFi.
Whatever wifi you're using is probably overloaded. You can easily have a one millisecond ping to your access point.
I have an under 2 ms ping to my AP, but WiFi has terrible buffer bloat, so ping latency doesn't mean much actually.
Any idea what the state of the art is for reducing buffer bloat on access points?
And you can mitigate that by not using tons of bandwidth in the background while gaming.
Terrible latency _and_ packet drop.
I only use wifi where I cannot attach a cable. I will run 15m ethernet cable on an apartment's floor if I have to, in order not to have to use wifi.
I believe RF based wireless headphones (like my Arctis 7 headphones) don't have this latency in them due to not being Bluetooth based.
There is some patented codec I think that does allow low latency bluetooth streaming (forgot the name) but that's not heavily implemented in my experience.
Old-school BT headsets are low-latency enough, afaik. But yeah, just blasting the Opus directly from the network to the headphones would solve it, even re-coding in low-latency configuration only adds 5ms.
You probably mean AptX Low-Latency. I haven't seen it a lot and it's basically just AptX with tweaked buffer sizes.
> Wifi doesn't seem to have this latency problem.
Wifi is one of the best things you can do to add unreliability and latency.
There are hard limits at play. No matter what you do, you can't go from New York to London in less than ~20ms; add video/audio encoding, packet switching, decoding, etc. and it's easy to see why any latency under the 100ms mark at that spatial scale in a scalable, mainstream product would be close to a miracle.
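For reference, the physics behind that ~20 ms figure (a back-of-the-envelope sketch; the 5,570 km great-circle distance is approximate):

    // Physical lower bounds on New York <-> London latency.
    fn main() {
        let distance_km = 5_570.0;      // great-circle NYC-London, roughly
        let c_vacuum = 299_792.0;       // km/s
        let c_fiber = c_vacuum * 0.67;  // light in glass travels at ~2/3 c

        println!("one-way, straight line in vacuum: {:.0} ms", distance_km / c_vacuum * 1000.0); // ~19 ms
        println!("one-way, ideal fiber:             {:.0} ms", distance_km / c_fiber * 1000.0);  // ~28 ms
        // Real paths are longer and add switching/queueing, so round trips of
        // 70+ ms are normal -- before any capture, encode, or decode time.
    }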
The thing is that when we talk in a room, sound will take <10ms to reach my ears from your mouth. This is what "enables" all of the human turn taking cues in conversation (eye contact, picking up whether a sentence is about to end/whether it's a good time to chime in/etc) - I've been looking for work from people who've tried to see at what point things start feeling really bad (is it 10ms, or 50ms?), but haven't found much so far. No matter what it is though, it's likely that long distance digital communications just cannot match it.
See also this interesting comment about the feeling of "closeness" from phone copper wires:
https://news.ycombinator.com/item?id=22931809
Landlines were so fast and so "direct" in their latency (where distance correlates very directly with time, due to a lack of "hops") that local phone calls were faster than the speed of sound across a table, and for a bit after they came out--before people generally got used to seemingly random latency--local calls felt "intimate", like as if you were talking to someone in bed with their head right next to you; I also have heard stories of negotiators who had gotten really tuned to analyzing people's wait times while thinking that long distance calls were confusing and threw them off their game.
> it's easy to see why any latency under the 100ms mark at that spatial scale in a scalable, mainstream product would be close to a miracle.
It seems normal phones are able to do it, though. At least it seems normal phones suffer less from latency problems.
In a way, simplicity in technology often means better performance.
Linux is ill-suited for realtime applications.
Google is well-aware of this, thus Fuchsia.
seL4 would make a good base for such a device.
The media lab has done a ton of research on this. I seem to remember people being able to notice visual latency at 30ms and audio latency at 80-120ms (this is because light is faster than sound).
>and audio latency at 80-120ms
Any rhythm game player will disagree.
Some games (e.g. LLSIF, for Android) have a "perfect" window sized to 16ms (one video frame). Even with latency compensation, these are unplayable on Bluetooth yet fine on the headphone jack. As the game has calibration, the resulting offset is seen to be at least 30ms worse on Bluetooth.
Interesting, would love to read more if specific papers/authors come to your mind. I suspect there's a big gap between e.g. "noticing the audio latency when audio is played as a result of pressing a button" vs "audio latency affecting the flow of a multiparty conversation".
It's probably the latter, because the former is about 5ms (which is equivalent to the question "how short a gap between sounds is still perceivable as separate", aka the lower frequency threshold of hearing). It's non-obvious that they're the same limit.
> The thing is that when we talk in a room, sound will take <10ms to reach my ears from your mouth. This is what "enables" all of the human turn taking cues in conversation (eye contact, picking up whether a sentence is about to end/whether it's a good time to chime in/etc) - I've been looking for work from people who've tried to see at what point things start feeling really bad (is it 10ms, or 50ms?), but haven't found much so far. No matter what it is though, it's likely that long distance digital communications just cannot match it.
Digital communication could cheat, though!
There's a lot of latency hiding you can do, if you can predict well enough what's coming next. Humans are fairly predictable most of the time.
Where does Tonari actually put the camera? The perspective on the displayed image makes it look like the camera is ceiling mounted, but that would make the eye contact problem much worse than even Zoom.
If I had to guess at a possible future, I can imagine edge computing servers that connect over 5G or fiber to your device. On these edge computing servers, they predict using AI/ML what you, as a participant, could do (video including facial and hand gestures, audio including Toastmaster type fillers like ahh, umm) in the next 50-60ms or longer and transmit their guess using rendered video frames and audio in time for the other videoconferencing participants to see “no latency” interaction. Done right, it would seem real. Done wrong, definite Max Max Head Headroom feel.
Nitpick: “audiophile-quality sound” it seems, is becoming the new “military-grade encryption.”
I don’t have many other comments to make other than I am surprised rust-analyzer was only mentioned in passing.
As far as I'm concerned "audiophile" has been synonymous with "overpriced placebo" basically forever.
Beyond that, I wish the article had explained a bit better why it chose these "better-than-std" crates. I'm actually using all the std variants in my projects, and I'm curious to know if I'm missing out or if I just happen not to hit their limitations.
> Beyond that, I wish the article had explained a bit better why it chose these "better-than-std" crates.
At least for parking_lot, its README has a long list with its advantages over std: https://github.com/Amanieu/parking_lot/blob/master/README.md
I saw that but I was interested to know if TFA had decided to go with it because it looks better on paper or if it's because they hit a roadblock using the std counterparts and migrated to using those.
That being said since they're drop-in replacements for the most part I suppose I could just try to rebuild my project with this crate and see if I notice a difference performance-wise.
In our case, it wasn't a matter of hitting a specific roadblock as much as it was past experience in performance-sensitive projects, and knowing that if we start with those crates we'll probably be fine, and if we don't, we'll probably switch to them eventually for some reason.
With crossbeam for example, you can hit roadblocks with std since their channels are MPSC, whereas crossbeam supports MPMC channels (and is faster than std in every meaningful measurement last I checked).
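For anyone curious what the MPMC difference looks like in practice, a small example using the crossbeam-channel crate (std::sync::mpsc can't do this, since its Receiver isn't Clone):

    // Multiple producers AND multiple consumers on one channel.
    use crossbeam_channel::bounded;
    use std::thread;

    fn main() {
        let (tx, rx) = bounded::<u32>(16);

        let producers: Vec<_> = (0..2)
            .map(|p| {
                let tx = tx.clone();
                thread::spawn(move || {
                    for i in 0..5 {
                        tx.send(p * 100 + i).unwrap();
                    }
                })
            })
            .collect();

        let consumers: Vec<_> = (0..2)
            .map(|_| {
                let rx = rx.clone(); // cloning the receiver is the MPMC part
                thread::spawn(move || {
                    while let Ok(v) = rx.recv() {
                        println!("got {}", v);
                    }
                })
            })
            .collect();

        for p in producers {
            p.join().unwrap();
        }
        drop(tx); // close the channel so the consumers' recv() errors and they exit
        for c in consumers {
            c.join().unwrap();
        }
    }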
That's great to know, thanks!
Reading the description it almost seemed too good to be true but if it's indeed objectively better in basically every situation I should probably give it a try.
Might as well pick another language for your project if the current one has such shit standard libraries and you're "learning it on the job".
What's wrong with learning a new technology on the job?
Rust leaves a lot of improvements to its standard library to the community, so these improvements start off as separate libraries for faster iteration. The most recent example I remember is the hashbrown crate replacing the standard HashMap.
Does that mean the other language's standard library is better? That it has better third party libraries?
You're right, that sounds way too fluffy.
To clarify, we're targeting "transparent" sounding audio, not "FLACs or bust" audio. Right now we send stereo 48kHz 96kb/s Opus (CELT, not SILK) that we found hit the voice transparency sweet-spot compared to the lossless audio source. We had used higher bitrates in the past, and could easily go back to them, but quality plateaued at around 96k in our experimentation.
More than choosing sane transparent-sounding encoding parameters, the biggest difference in fidelity by far was choosing the correct microphones and speakers for accurate reproduction of voices.
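For scale, a quick comparison of that 96 kb/s stream against the raw PCM it encodes (assuming the common 20 ms Opus frame duration):

    // How small a 96 kb/s Opus stream is next to raw 48 kHz stereo 16-bit PCM.
    fn main() {
        let raw_bps = 48_000 * 2 * 16; // 1,536,000 b/s of source PCM
        let opus_bps = 96_000;
        let frame_ms = 20.0; // common Opus frame duration

        println!("raw PCM: {:.2} Mbps", raw_bps as f64 / 1e6);
        println!("Opus:    {:.3} Mbps ({}x smaller)", opus_bps as f64 / 1e6, raw_bps / opus_bps);
        println!("per 20 ms frame: {} bytes", (opus_bps as f64 * frame_ms / 1000.0 / 8.0) as u32); // 240
    }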
Voice does not extend above 22.05kHz, so using sampling rates above 44.1kHz is entirely, objectively wasteful and useless, unless your codec only works at 48kHz input or something.
Are you using 48khz for a specific reason?
Please read the official Opus FAQ regarding sampling rates: https://wiki.xiph.org/OpusFAQ#But_won.27t_the_resampler_hurt...
44.1 kHz is essentially deprecated on the hardware level since it's annoying to deal with the extra clock. It's a few cents for an extra crystal, way too expensive ;). 44100 also makes for very poor multipliers/dividers to other clocks since it includes 3²×5²×7² as factors. 48000 is much nicer with 3×5³.
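You can check those factorisations quickly (a throwaway sketch):

    // Prime factorisations behind the "44100 divides awkwardly" point above.
    fn factorize(mut n: u64) -> Vec<(u64, u32)> {
        let mut out = Vec::new();
        let mut p = 2;
        while p * p <= n {
            let mut k = 0;
            while n % p == 0 {
                n /= p;
                k += 1;
            }
            if k > 0 {
                out.push((p, k));
            }
            p += 1;
        }
        if n > 1 {
            out.push((n, 1));
        }
        out
    }

    fn main() {
        println!("44100 = {:?}", factorize(44_100)); // [(2, 2), (3, 2), (5, 2), (7, 2)]
        println!("48000 = {:?}", factorize(48_000)); // [(2, 7), (3, 1), (5, 3)]
    }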
The issue with 'military-grade' is that anyone in the military will attest it translates to: Cheapest possible thing that gets the job done.
Audiophile grade at least has roots in high fidelity.
> Audiophile grade at least has roots in high fidelity.
Does it though? Audiophiles generally seem to eschew fidelity in favour of something that sounds subjectively nice, including the psychoacoustic effects of spending a lot of money.
Eg. they seem very fond of "warmth". If you asked me to make something sound "warm", I'd be applying some soft clipping and dampening the top end, not eliminating sources of distortion.
Edit: If you actually wanted high fidelity, you'd use studio headphones / monitors, which are designed to be "unflattering", so you can be confident you'll hear any issues when mixing / mastering. People don't normally listen for pleasure with those, because they become fatiguing after a few hours.
Choosing equipment because you like the sound is a very reasonable thing to do, but it's not the same as pursuing fidelity.
There's all sorts of audiophiles out there. Some hold beliefs rooted in pseudoscience.
And some are all about accuracy and measurements.
For instance, I use Sennheiser HD600[0], which I strongly recommend, attached to Topping DX3 Pro (old model)[1], which I cannot recommend, as the v2 model shipping now is garbage[2], a consequence of a redesign to work around high fault rates. Mine is fine as problem units fail within weeks, and I've had it for years.
[0]: https://reference-audio-analyzer.pro/en/report/hp/sennheiser...
[1]: https://www.audiosciencereview.com/forum/index.php?threads/r...
[2]: https://www.audiosciencereview.com/forum/index.php?threads/m...
Our ears are incredibly sensitive sensors and I think attributing warmth to soft clipping and dampening the top end is not a complete picture.
Also warmth is just a single quality. I have a pair of very accurate “cold” headphones that I prefer for music and a pair of “warm” headphones for electronic music and gaming.
Past the headphones, it is not so much warmth as it is space in the sound for me. My headphone amplifier sounds effortless and that’s the best way I can describe the quality of what I hear.
But those characteristics are based on objective facts of sound reproduction that can be quantified.
The characteristic of warmth is related to amplification of certain harmonics as well as equalization in the signal. This is fairly well understood by now.
The audiophile definition of a "warm" sound signature has nothing to do with distortion, and audiophiles do not "eschew fidelity" for different sound signatures.
> The audiophile definition of a "warm" sound signature
I don't really know what, if anything, that means. But if we're talking about fidelity, surely the ideal would be no sound signature? If a particular "sound signature" makes it sound "warm", surely it's decreasing the fidelity?
Your lack of knowledge of this matter is very evident, and your skepticism and confusion would be very easily cleared up if you made an actual honest exploration into hi-fi audio.
I'm an EE and have made an honest exploration into this topic many times, and yet still have no explanation of "warmth" beyond the addition of distortion resulting in even-ordered harmonics. Which is precisely a decrease in the SNR from input-to-output.
That might sound good! But it's a less-than-perfect reproduction of the source signal.
If there's a better explanation than what I've come across every time I've search for this, I'm all ears and honestly open to being corrected.
You've never listened to audiophile equipment, have you? If you apply "some soft clipping" it will sound bad, I guarantee you; no audiophile would like it.
> You've never listened to audiophile equipment have you?
You're saying that I ought to judge the merits of audiophile equipment by the subjective measure of whether I like the sound of it. Which is the metric I said audiophiles would favour.
> If you apply "some soft clipping" it will sound bad
Soft clipping often sounds nice, which is why it's very commonly applied to music. You're saying that eg. the sound of a classic Vox amp is bad, which I guess you're free to believe if that's what your ears tell you, but it's certainly not an objective truth.
Because what you are describing is a simplistic picture, describing a whole class of people as stupid simpletons who cannot tell low-THD and low-IMD audio from "soft clipping which sounds nice". If you are referring to vacuum tube amps, soft clipping is only partially the reason why they sound the way they do; in fact most of the time amps are not clipping and are outputting close to 1% of their total power. The reasons why tube equipment sounds better/different from solid state amps are a lot more complex than the "common wisdom" of soft clipping.
Similarly "medical grade" = "single use" in many actual medical contexts.
3DES is still military grade.
No it's not. It stopped being approved for usage by NIST a few years ago.
Really? 3DES still appears here, https://csrc.nist.gov/projects/block-cipher-techniques, with DES and Skipjack being called out as deprecated.
That page says that 3DES is prohibited from usage in new applications and is prohibited for encrypting more than 1 GB of data, since 2017.
The attached documents have additional information on implementation and (non) usage, including the deadline to migrate legacy military systems. It's sadly quite cumbersome to go through the tens of PDFs to find the relevant information.
112-bit keys are still allowed precisely because of 3DES.
It can join the ranks of meaningless phrases like "aircraft-grade aluminum", "chef-grade cookware", and "contractor-grade tools".
> Nitpick: “audiophile-quality sound” it seems, is becoming the new “military-grade encryption.”
It's too bad they didn't explain it. I expected they meant allowance for "full bandwidth" audio (possibly including music you can listen to).
Video conferencing systems generally use voice-only codecs compressed to shit, full of artifacts in the voice range and utterly dead outside of it.
To me, "military grade encryption" means following industry standard. "Audiophile quality" means higher quality than you need, care about, or can even tell apart from lower quality.
No, "military grade encryption" means nothing. If it referenced a standard, than that might mean something. I've worked on products for the military that still used single pass DES encryption. So that was military grade. It might as well have been ROT13.
especially because all VoIP codecs sound like shit. It's intelligible, but the bar isn't high for fidelity.
Whose ears and which military? :D
Yeah, audiophile can be so many things. To me it means 24 bits or more.
More than 21-bits is meaningless. It's all hype beyond 24-bits.
Why 21-bits specifically?
https://web.archive.org/web/20200310174634/https://people.xi...
I believe that's the full dynamic range that human hearing can possibly process, from a really tiny signal that a human can actually hear with noise underneath up to a really loud signal that is basically pain. Most humans don't have that range. Note that the issue is that the quiet signal needs to be above the noise--so whatever your signal is, the noise floor needs to be below the threshold of hearing given that signal (I believe that while for "normal" signals the noise floor needs to be more than 50 to 60 dB down, for very quiet signals the threshold of detection is only another 20 dB further down).
The trick is that our hearing systems are logarithmic (we can't hear a quiet sound next to a loud sound--that's what compression relies on), so they map to floating point numbers better (ie. 16-bit floating point is way more than enough).
24-bit is effectively for recording engineers, so they have lots of headroom and don't have to worry about clipping basically at all (6 dB per bit implies about 18 dB of extra headroom, which is a LOT).
However, when you calculate non-linear audio effects, you want extra bit depth (generally floating point) because cancellation and multiplication in your intermediate results can really move your noise floor up into bits that humans can actually hear.
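The arithmetic behind the "6 dB per bit" rule of thumb, for reference (dither and noise shaping shift these numbers a little in practice):

    // Dynamic range of an N-bit linear PCM signal: 20 * log10(2^N) ≈ 6.02 * N dB.
    fn dynamic_range_db(bits: u32) -> f64 {
        20.0 * ((1u64 << bits) as f64).log10()
    }

    fn main() {
        for bits in [16u32, 21, 24] {
            println!("{:>2}-bit: {:>6.1} dB", bits, dynamic_range_db(bits));
        }
        // 16-bit ≈ 96.3 dB, 21-bit ≈ 126.4 dB, 24-bit ≈ 144.5 dB
    }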
While I can't argue for 21 specifically, I definitely don't trust everything to use careful dithering and guarantee full quality in 16 bits. So in practice that's 24 bits at forty-something kilohertz.
You might be right; however, 16-bit sounds really harsh to my ears, and 24-bit is the only widely used standard better than 16-bit.
Do you mean “the expression ‘16-bit’ sounds harsh to my ears”, or do you mean that you can hear the difference between 16 and 24 bits per sample?
The effect of bit depth has little to do with how you perceive the sound; what adding more bits does is allowing for more dynamic range, i.e. more difference between the loudest possible and the quietest possible sound. More bits brings down the noise floor. This means that for example the final part of a fade-out retains more detail at 24 bits than at 16, but this difference is not something that you would be able to observe in normal listening conditions.
If you'd like to learn more about the effects of bit depth, I would recommend "Digital Show & Tell" by Monty of Xiph.Org at https://www.xiph.org/video/.
Is there any difference between those two expressions? Overall, yes, you are right: 24-bit does sound better. Loss of detail, and its replacement with digital (aggressive, non-random, correlated) noise, indeed sounds harsh.
> 16-bit sounds really harsh to my ears, and 24-bit is the only widely used standard better than 16-bit.
That really doesn't make any sense. The bit depth provides for a dynamic range, meaning the difference between the loudest and quietest sounds which can be encoded. 16 bits is enough to go from "mosquito in the room" to "jackhammer right in your ear". Congratulations, 24 bits lets you go up to "head in the output nozzle of a rocket taking off" with room to spare; that's… not very useful?
Now what might make sense — aside from plain placebo — is a difference in mastering. For instance lots of SACD comparisons at the time were really comparing differences in mastering, with the SACD converted to regular CDDA turning out way superior to the CD version because the mastering of the CD was so much worse.
The "Loudness Wars" is an especially bad period of horrible mastering, and it went from the mid 90s to the early-mid 2010s (which doesn't mean that regular-CD has gone back to "super awesome", just that you're unlikely to have clipping throughout a piece these days).
What I said actually does make sense. First of all, if you are digitally lowering the loudness of audio (say 4 times), you actually are losing precision, and if you later amplify again you will never get these bits back. This is what is called headroom. So your typical multiply-by-a-floating-point volume control actually kills the dynamic range of the sound. I, for example, never run my OS volume control and player's volume knob at 100% (which would preserve the range), because the gain of my amp is simply too high, and even the slightest movement of the amp knob causes a dramatic change in loudness. Therefore, I keep the digital volume controls at 25% (losing 2 bits on the way, but the audio is recorded at 16-bit, so losing nothing), and then amplify with my amp. Voila, nothing lost in the process.

Secondly, empirically, every time I switch sound cards to 24 bits it sounds better. I have noticeably less fatigue. Of course, someone may want to gaslight me (not deliberately, of course) into thinking it is a placebo, but I tried with many people, and all of them noted the difference.
So what you're saying "actually does make sense" so long as it's a completely different subject than what one would normally assume in context, without you having mentioned such.
When people talk about 24 bits (and >48kHz) in the context of "audiophilia", it's generally about the data at rest and "HD audio" (aka 24 bit music files and downloads). Not about the bit depth of the processing pipeline for which it's generally acknowledged that yes, >16 bit depth does make sense for the audio processing pipeline (as well as the original recording).
Extra bits won't hurt anyway. DACs are not ideal, and there is a comment above about the dynamic range. If your recording is not loud, you lose dynamic range (say, if your loudest sound is only 40% of the full amplitude, you've already lost more than 1 bit), which can be partially recovered by a higher precision DAC. So it is true in both senses.
> Extra bits won't hurt anyway.
Nobody said it would hurt so I’m not sure why you’re pointing out the consensus like it’s some sort of profound statement.
> If your recording is not loud, you lose your dynamic range
If your sound engineer is wasting your dynamic range, maybe get a better sound engineer? And if they manage to fuck up something at the core of their job, there’s no reason they wouldn’t fuck up just as much with 24 bits to waste.
> So it is true in both senses.
In no meaning of “true” and “both” in common use.
> but I tried with many people, and all of them noted the difference.
Unless this was a double-blind study and the audio levels were exactly the same between runs, this is useless data. Even a 0.1 dB SPL difference between runs is noticeable (people gravitate to louder sounds as better).
> every time I switch sound cards to 24 bits
This may be related to the sound card. I use an external DAC, not a soundcard, as most soundcards that come with computers are not up to par.
Changing 16 bits to 24 bits should not change the audio in a way that is discernible to the human ear.
I have actually never seen any proof that a double-blind study is the best way to do audio comparisons. I mean, yeah, the placebo effect does exist, but knowing what to look for in a certain type of equipment makes it a lot easier to find the phenomenon. A double-blind study, IMO, has to be applied only after an extensive amount of non-blind tests. Yes, the final verdict has to be produced after the blind test, but people need to know what to look for. In any case, I was referring to long-term listening fatigue, which has very little relation to loudness, and I'd argue louder sounds should make you tired quicker.
> This may be related to the sound card. I use an external DAC, not a soundcard, as most soundcards that come with computers are not up to par.
For simplicity, I did not talk about them separately. BTW, following your logic there is no point in buying a DAC unless there was a double-blind study comparing these DACs to cheaper sound cards. Both are 16-bit/48000, aren't they?
> Changing 16 bits to 24 bits should not change the audio in a way that is discernible to the human ear.
This is a bold statement, which begs for proof itself.
> This a bold statement, which begs a proof itself.
Only if one doesn't understand what those bits mean or what they correspond to.
These bits are important for quantization, which is the process of converting analog sound into digital numbers. On a graph, X = time and Y = amplitude. The more bits, the higher the resolution.
A 16bit recording has 2^16 steps (discrete values) available for amplitude (65,536) and a 24bit recording is 2^24 or 16,777,216 steps.
So why is this important? Well, a 24-bit recording can more finely record differences in amplitude. Given that 1 bit ≈ 6 dB: a regular 16-bit recording already has a dynamic range of 96 dB, and a 24-bit recording has a dynamic range of >144 dB. Permanent hearing loss begins at around 125-130 dB SPL.
You do not hear the difference because if you were listening to a 24-bit recording on a 24-bit capable system at sound levels loud enough to actually discern a difference, you would have permanently damaged your ears. Actually, I believe that applies to 20-bit, let alone 24-bit.
So why do 24-bit or higher recordings even exist? They are useful for people mixing and working with the raw audio, before it gets processed down to 16bit audio for distribution. At 24-bit resolution you have a larger amount of headroom before you start clipping, so it's easier to work with considering you have X amount of bits that are just part of the noise floor.
This is also assuming your input files are actually 24-bit to begin with. The vast majority of files are 16-bit because there is literally no point as a consumer to have larger file sizes for no humanly audible benefit.
44.1kHz 16-bit files are all that you need as a human consumer of audio. 48kHz has to do with video and is not better than 44.1kHz, because you (a human) cannot hear the difference. 44.1kHz is 22.05kHz x 2. Humans hear sound from 20Hz to 20kHz -at best-. This is assuming perfect hearing with no degradation. We sample at 44.1kHz due to the Nyquist-Shannon sampling theorem, and the margin above 20kHz gives us just a bit of headroom to apply filters to avoid aliasing. [2]
So I reiterate my initial assumption: flicking a switch to change from 16bit to 24bit should not magically change the quality of audio (in a humanly discernible manner). Assuming the file being played is 24bit lossless audio in the first place.
> BTW, following your logic there is no point in bying DAC
We're talking about dedicated external equipment vs an onboard soundcard+amp which are generally neglected. Not -all- onboard cards suck of course, the Realtek ALC1220 chip on my mobo seems to be comparable or better than entry level DACs from the specs I'm seeing. This is assuming no interference is happening, which is more likely to happen around unshielded electrical components. If you don't believe this is a thing, ask why the audio industry uses thick XLR [shielded AND grounded] cables as standard.
Certain headphones require equipment that can drive them properly, whether it's an onboard soundcard+amp or a DAC+amp. For example, my sennheiser hd600s are 300Ω but some models go up to 600Ω. And yes, the quality of the amp/preamp does make a huge difference.
If one can prove that a component is unable to drive a given load, or is sub-par mathematically, one doesn't exactly need double-blind ABX trials. Those are for tests like "Monster says their $200 cable is better than <X> standard cable?", or "Is a McIntosh amp better than a $<amount> competitor?".
I don't need to do a double ABX study to realize that Beats headphones are drastically worse in performance than Sennheiser HD600s: [3], [4], [5]
[0]: https://www.mojo-audio.com/blog/the-24bit-delusion/
[1]: https://web.archive.org/web/20200202124704/https://people.xi...
[2]: https://en.wikipedia.org/wiki/44,100_Hz#Origin
[3]: https://reference-audio-analyzer.pro/en/report/hp/monster-be...
[4]: https://reference-audio-analyzer.pro/en/report/hp/sennheiser...
[5]: https://reference-audio-analyzer.pro/en/report/hp/audio-tech...
>16 bit enough
It is so believed (although there's a lack of supporting evidence, and knowledge that human hearing has excellent dynamic range), but only as long as the mastering work was well done. 24-bit allows for much less destructive human error and is very welcome. Much more so than absurdly high sample rates (96kHz, reproducing sounds up to 48kHz as per Nyquist), which are of dubious value.
>my sennheiser hd600s are 300Ω
At some frequencies. At some others, it's more like 600Ω. Impedance is seldom stable across the frequency range in headphones.
Amplifier design should account for this and still provide enough power[0].
Output impedance of headphone jacks should be low enough (1:10 is commonly cited, which means <2Ω in practice as 20-30Ω headphones are very common) relative to the low end of the headphone impedance range, in order to prevent the impairment of frequency response.
>Not -all- onboard cards suck of course
But most do. The design of audio circuitry in motherboards doesn't get that much attention. None of my motherboards have good sound. Flaws vary. Some are lowpassed (greedy anti-aliasing filter). Some are noisy. Most have excessive output impedance (typically more than 6Ω, and at times higher than 15Ω). None can output enough power[0] for hd600 (my favourite pair).
> That really doesn't make any sense. The bit depth provides for a dynamic range ... 16 bits is enough to go from "mosquito in the room" to "jackhammer right in your ear".
Dynamic range is not the loudest sound / quietest sound ratio (as one would expect), but the loudest sound / noise level ratio. Otherwise you would need to count additional bits to encode the quietest sound with low enough quantization noise.
The threshold of hearing could be as low as -9 dB SPL, so one would want the noise level below that. Therefore, with the 96 dB dynamic range from 16 bits, the loudest representable sound would be, say, 86 dB SPL. But symphonic orchestra music may have peaks way above 100 dB.
I think the bigger issue is likely to be a trash computer mic, a trash preamp/adc, trash dac, trash speakers, trash room. I don't care if at some point you're sampling and sending that signal at 1000-bit or whatever, it's still trash, just very accurately sampled trash.
I disagree; I do not own trash equipment. Every time I install Linux, I switch the PulseAudio settings from 16-bit to 24-bit; the difference is immediate, although subtle. Everyone I know who tried this noted that listening fatigue is a lot lower with the new settings.
In my direct experience, everyone who claims this to me, so far, is unable to distinguish 16 bit and 24 bit recordings in an ABX.
The audiophile world would do well to adopt the concept of double-blind study.
I am talking about listening fatigue, first of all, which is a long-term effect. Second, I think double-blind tests are worthless if they are done in isolation; first you need to run non-blind tests, letting people play with audio equipment as much as they want, in any combination, completely open; only after that, when people have figured out what to look for, do you run the double-blind test. Forcing unprepared people through a very subtle test surely won't give useful results.
If you can’t distinguish A from B reliably, none of the rest matters at all. The idea that you have to “figure out what to look for” is nonsense if you cannot distinguish the two reliably.
“Listening fatigue” when you know which is which is simply placebo.
You should probably read more attentively. I did not say you do not need a double-blind test. I said that only once you learn what to look for is there a point in doing the blind test.
I wish I read more things like this on hn. "We wanted to know and understand every line of code being run on our hardware, and it should be designed for the exact hardware we wanted"
"porting [webrtc-audio-processing] to Rust in the near-term is not likely (it's around 80k lines of C and C++ code)."
That's just one of their dependencies. It's possible to know every line without rewriting. And it's possible to rewrite and still not know every line.
They seem to strike a reasonable balance.
But that statement seems at odds with a dependency on the enormous WebRTC AudioProcessing C++ module. But then they also say they don't use WebRTC so maybe I misunderstand what's going on.
My understanding is that the quoted statement was explaining why they moved away from WebRTC.
We moved away from WebRTC completely for video, networking, and some audio. We still use webrtc-audio-processing for acoustic-echo-cancellation and some other niceties. Here is our Rust wrapper for that library:
I think it's unlikely they'd release and maintain a wrapper around something they stopped using.
As long as you're also comfortable reading the "over-engineering made our product inflexible, late to market, and too expensive" blog posts later.
I mean, you could also have the "we used commodity everything, were first to market, but the next folks did it better and cheaper because they could" posts - hindsight is 20/20.
I bet IBM didn't expect using off the shelf components would mean that the IBM PC was the standard for the next 30 years and it wouldn't be theirs.
-----
As someone who works at a company that relies heavily on video conferencing (half the devs offshore): every single major solution absolutely sucks. They are flaky and unreliable, sound quality is poor, video rate is poor (and this is with fat pipes at both ends), and worst of all is the latency; latency when trying to have a round-table conversation with remote people is horrific. It is good to see someone pushing the limits. Skype et al haven't gotten much better in the last decade, yet my internet connection at home/work is 50x faster and even mid-range business laptops have much improved graphics grunt.
If this actually works, I am desperately keen to get my hands on it. If you have the capacity for high bandwidth, why not use it? Zoom’s model must work on whatever crappy broadband people have in their home office. If you have gigabit, it doesn’t seem to make use of that extra capacity to improve video quality.
As for sound, I don’t think audiophile quality is necessary...
> As for sound, I don't think audiophile quality is necessary...
Given you'll need about 10Mbps upstream for 60fps 3K video, it seems a little unreasonable not to add on a 320kbps (or more) audio stream.
It could make this useful for things like streaming music concerts.
Semi-related note: there's work being done at Stanford to make it possible for remote musicians to play together in an ensemble at low latencies.
JackTrip is the resulting software -- not end-user friendly, but apparently it works.
https://ccrma.stanford.edu/groups/soundwire/software/jacktri...
(Some basic numbers: sound takes about 1 ms to travel a foot, so every ms of latency is like a foot of separation between musicians; 30ms of latency = 30 ft of separation = roughly the max for jamming. So 130ms is not low enough.)
Also, audio quality seems to be more important for the subjective experience than video quality, even in regular video content.
If you only need a P2P video stream https://github.com/CESNET/UltraGrid/wiki is amazing and lower latency
Let's turn that statement around and instead of thinking about audio bitrates, focus on experience. A great "audiophile" setup can make the performers sound there in the room with you. No matter how much BS the hobby spews, when you hear a really great setup, that guitar truly sounds 6 feet away from you.
Zoom calls do not sound there in the room with you. Microphones are terrible, there's compression artifacts, latency, packet loss, background noise, and tiny speakers. No one could possibly close their eyes and forget that the other person is not there in the room with them, on any POTS or VOIP technology that exists. But what if you could create an audio communications system with an actual illusion of auditory presence. Sounds amazing!
And given that this company is trying to create wall-screen, life-size ultra-HD video conferences, I'm pretty sure that "audiophile" is exactly what they're going for. Personally, as a remote worker, I would absolutely swoon for this.
I love Rust, but them deciding to redesign/reimplement WebRTC after a week of frustration seems like a prime candidate for not-invented-here syndrome, with Rust being the justification. There is a reason WebRTC is as big as it is: it's a complex problem to solve.
Regarding the premise of high latency in WebRTC: Google Stadia has ~160ms round-trip latency at 4K from my MacBook to a data center, so it's not like that's unachievable.
Google is colocating in your basement.
After reading it, I'm still not entirely sure what's being done.
Is it live streaming or is it the transport?
Are they doing video encoding (the audio encoding seems to be done by that webrtc-audio thing)?
Have they chosen a progressive encoding format that compresses frames and pumps them out to the wire as soon as they're done?
Is TCP or UDP involved or a new Layer 3 protocol entirely?
Have I just missed all of those parts or were they really missing amid all the Rust celebration?
> After reading it, I'm still not entirely sure what's being done.
> Is it live streaming or is it the transport?
tonari is the entire stack, similar in "feature scope" to WebRTC but with different goals and target environments.
> Are they doing video encoding (the audio encoding seems to be done by that webrtc-audio thing)?
Yep, this includes video encoding and transport. We don't use the WebRTC audio library for encoding or transport, just for echo cancellation and other helpful acoustic processing.
> Have they chosen a progressive encoding format that compresses frames and pumps them out to the wire as soon as they're done?
Yep, basically. If by that you mean that we don't use B-frames or other codec features that would require buffering multiple video frames before a compressed stream comes out, then yes: we're able to send out encoded frames as they arrive.
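(As an illustration of that frame-in, packet-out shape, here's a minimal Rust sketch with hypothetical Frame/Packet/Encoder types; this is not tonari's code, just what a zero-lookahead pipeline looks like.)

    // Illustration only: hypothetical Frame/Packet/Encoder types, not tonari's API.
    struct Frame(Vec<u8>);
    struct Packet(Vec<u8>);

    // A zero-lookahead encoder: no B-frames, so each input frame yields output
    // immediately instead of waiting on frames that haven't been captured yet.
    trait Encoder {
        fn encode(&mut self, frame: Frame) -> Packet;
    }

    struct PassthroughEncoder; // stand-in for a real codec
    impl Encoder for PassthroughEncoder {
        fn encode(&mut self, frame: Frame) -> Packet {
            Packet(frame.0)
        }
    }

    fn send(packet: Packet) {
        println!("sent {} bytes", packet.0.len()); // stand-in for the transport
    }

    fn main() {
        let mut encoder = PassthroughEncoder;
        let frames = (0..3).map(|i| Frame(vec![i as u8; 16])); // fake capture loop
        for frame in frames {
            let packet = encoder.encode(frame); // one frame in...
            send(packet);                       // ...one packet out, right away
        }
    }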
> Is TCP or UDP involved or a new Layer 3 protocol entirely?
We encapsulate our protocol in UDP since we operate on normal internet - a new protocol is out of the question without a huge lobbying force and 15 years of patience on your side.
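(For anyone curious what "encapsulating in UDP" typically looks like, here's a minimal Rust sketch with a made-up 4-byte sequence header and a hypothetical peer address; this is not tonari's actual wire format.)

    use std::net::UdpSocket;

    // Sketch: prepend a made-up 4-byte sequence number to each chunk of encoded
    // frame data and ship it as a UDP datagram. A real protocol would also carry
    // timestamps, FEC, retransmission hints, and so on.
    fn send_frame(
        socket: &UdpSocket,
        peer: &str,
        first_seq: u32,
        payload: &[u8],
    ) -> std::io::Result<()> {
        // Keep each datagram comfortably under a typical ~1500-byte MTU.
        for (i, chunk) in payload.chunks(1200).enumerate() {
            let mut datagram = Vec::with_capacity(4 + chunk.len());
            datagram.extend_from_slice(&first_seq.wrapping_add(i as u32).to_be_bytes());
            datagram.extend_from_slice(chunk);
            socket.send_to(&datagram, peer)?;
        }
        Ok(())
    }

    fn main() -> std::io::Result<()> {
        let socket = UdpSocket::bind("0.0.0.0:0")?;   // any local port
        let fake_encoded_frame = vec![0u8; 5000];     // stand-in for codec output
        send_frame(&socket, "127.0.0.1:9000", 0, &fake_encoded_frame) // hypothetical peer
    }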
> Have I just missed all of those parts or were they really missing amid all the Rust celebration?
We intentionally didn't get into the protocol details because we are saving that for a dedicated post (and code to back it up).
Thank you very much for the answers. Glad I wasn't too far off.
Looking forward to the technical post. If you're planning on releasing all of this royalty-free and opensource, you'd be quite a boon to the free and open internet. Getting this picked up by the likes of Mozilla and getting it into a browser would be amazing.
If anybody is looking for a low-latency, high-bandwidth P2P video streaming solution, check out https://github.com/CESNET/UltraGrid/wiki. It can do less than 80ms of latency.
How do you use this over the network? I have installed it, but it's very unclear to me what I need to do in order to call my colleague in another city.
It seems that it can only connect to publicly visible hosts? Overall it looks like somebody should develop an application on top of this.
This is cool, thanks for the link. Is this Nvidia GPU only? Might give it a try at some point.
You don't need the GPU; it just depends on the type of compression you want. It supports Intel VA-API as well as NVIDIA VDPAU.
Gotta love a writeup with this line in it:
like Brian's 1970s-era MacBook Pro
That's a writer (or writers) who knows what it's like to read long (aka thorough) technical articles and how not to bore readers to death.
Great article!
> We just enforce rustfmt.
After interacting with both rustfmt and go fmt, I have concluded that .editorconfig is solving a problem that really shouldn't need to be solved. We went through the ordeal of defining our C# coding standards where I work and, let me tell you, people (myself included) care very deeply about their way of structuring code. And it's a bloody waste of their time.
Having the language designers say, "here is how our language should be structured" is a breath of fresh air.
Woah this portal thing into another place seems super exciting if they can really pull it off and maintain low latency in the real world.
My WebRTC projects haven't suffered that much from latency. The biggest source of delay for me is usually video encoding. I've had to limit streams to 720p and 25fps to reduce the time spent CPU-encoding a VP8 stream. There are also bandwidth considerations (real-time encoding = significantly less compression), but the end result is slightly less than 200ms one-way latency (including input lag from the mouse, 15ms network latency, and display lag) without any special settings.

All I'm doing is feeding an ffmpeg stream to Kurento and letting it broadcast via WebRTC. This is not a web conferencing application and it is also not using WebRTC via P2P. It's closer to conventional live streaming with a sane amount of latency (compared to the up to 30s of latency you commonly see on Twitch).

Of course I personally would prefer it if the latency could be brought down even further. 100ms or lower is the holy grail for me and only appears to be doable with codecs that aren't supported by WebRTC. However, people don't want to install apps just for my little service, and I certainly won't encode every stream with several codecs just for the tiny minority of the user base that would actually end up using the app.
Very cool from a tech standpoint.
From a product point of view, I find it interesting that the illustrations/concept videos for these things always show people interacting very close to the wall - e.g. playing chess, sitting around a table, etc.
https://tonari.no/static/media/family.48218197.svg
But in practice, people tend to keep their distance from it. E.g. the pictures of this setup tend to show people clustered in their own group on each side of the wall, keeping a solid 2-3 meters from the wall.
https://blog.tonari.no/images/ea56c74d-a55d-4183-9a7b-d69795...
It makes sense, it's awkward to be close to a large solid (emissive) surface, and humans instinctively get closer to their in group when faced with an out group. I wonder how the system could be designed to encourage participants being closer, if there is an advantage to that.
A practical problem to solve there: where do you put the cameras? I would actually prefer putting them behind the screen if possible - a few small pinholes wouldn't be that noticeable. If you could put multiple wide-angle cameras in multiple places, you could stitch them together in software and create a real feeling of closeness.
I'd sit closer but the picture then distorts and I am distorted for my conversation partner.
Why exactly do existing video streaming solutions use such small amounts of bandwidth and have terrible quality as a result? Does anyone have a deep dive into why this is the case? It seems that it would be a killer feature to make better utilization of bandwidth.
Even over Wi-Fi, a speed test shows 4 ms ping and 100 Mbps down / 100 Mbps up on my internet connection, but Zoom, FaceTime, and others never use more than about 0.8 Mbit/s for a video stream, and the resulting quality of audio and video is... understandably poor.
Latency, too, totally feels like a software problem, perhaps with too many layers of abstraction. (60fps -> 16ms for the camera, ~10ms for encoding with NVENC or equivalents, 35ms measured one-way latency from my laptop to my parents 4000km away, ~10ms decode, 16ms frame delay = 87ms one way.) Maybe I'm asking for too much from non-realtime systems (I'm used to RTOS, extensive use of DMA, zero-copy network drivers, etc.), but it seems that there is a lot of room to improve.
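Summing that budget out explicitly (the same numbers as above, just written as a quick Rust sketch):

    fn main() {
        // The one-way latency budget from the comment above, in milliseconds.
        let budget = [
            ("camera (one 60fps frame)", 16.0),
            ("encode (NVENC or similar)", 10.0),
            ("network, laptop to parents", 35.0),
            ("decode", 10.0),
            ("display (one frame of delay)", 16.0),
        ];
        let total: f64 = budget.iter().map(|(_, ms)| *ms).sum();
        println!("total ~= {total:.0} ms one way"); // matches the ~87 ms above
    }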
It's worth mentioning that in our case, a significant chunk of the latency in our 130ms measurement is just the input lag of our display that we currently use. We were surprised by how slow they can be.
OnLive "solved" encoder latency 15 years ago. You don't wait 16ms for the next frame; instead you progressively start encoding after receiving the first few tens of lines. This way your encoded video stream lags just a couple of milliseconds behind, and the same goes for decoding. You could crudely emulate this by dividing the screen into 4 rows and sending 4 concurrent video streams, for an instant 1/4 latency drop.
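A crude Rust sketch of that strip-slicing idea (encode_slice is a hypothetical stand-in for a real per-strip encoder, and threads stand in for whatever parallelism the codec actually offers):

    use std::thread;

    // Hypothetical stand-in for a per-strip encoder; a real system would hand
    // each strip to its own codec instance. This is only a sketch.
    fn encode_slice(strip: &[u8]) -> Vec<u8> {
        strip.to_vec() // pretend this is compressed output
    }

    fn main() {
        let height = 1080;
        let row_bytes = 1920 * 4;                   // RGBA row size for a fake frame
        let frame = vec![0u8; height * row_bytes];  // stand-in for a captured frame
        let strips = 4;
        let rows_per_strip = height / strips;

        // Encode the 4 horizontal strips in parallel; each strip could hit the
        // wire as soon as it finishes instead of waiting for the whole frame.
        let encoded: Vec<Vec<u8>> = thread::scope(|s| {
            let handles: Vec<_> = frame
                .chunks(rows_per_strip * row_bytes)
                .map(|strip| s.spawn(move || encode_slice(strip)))
                .collect();
            handles.into_iter().map(|h| h.join().unwrap()).collect()
        });
        println!("encoded {} strips", encoded.len());
    }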
Sure, many of the operations in the list can be pipelined as you mention. Something like G-Sync would also allow you to sync the destination display to the arrival of the (start of) frame.
The bottleneck is not on the CPU. I'm afraid this company may have wasted their time trying to reinvent WebRTC. If you really want to get realtime video, I think the best approach is a custom codec on CUDA or better yet custom hardware (FPGA). You can only go so far on general purpose hardware before you hit a wall and get Zoom/WebEx quality.
Can you recommend some resources for the current state of the art for low latency video? Somebody else in the comments posted https://github.com/CESNET/UltraGrid/wiki, but I’m curious to learn more.
Is or is not? I'm confused: if the bottleneck is not on the CPU, what does CUDA solve?
The bottleneck is the video encoding/decoding/rendering, which is done on GPUs to begin with. Of course if it were done on CPU instead, then it would be significantly worse, but that's not where we're starting from. Improving stuff on the host side by, say, rewriting WebRTC in Rust won't improve the latency of your video by much or at all.
This is welcome news.
I have been itching to convert a small headshot video stream (think under 100x100px) to audio, stream it over Mumble, and then convert it back to video, just to see what the latency is like. It would obviously be a big undertaking, but not as big as this, methinks.
"We wanted to know and understand every line of code being run on our hardware, and it should be designed for the exact hardware we wanted."
This rings very true for every high-performance thing I've ever worked on, from games to trading systems.
Any suggestions on a group video conferencing tool for use on a local network (Ethernet) that's effective? Either self-hosted or online, just for personal usage to talk with others?
"A week of struggling with WebRTC’s nearly 750,000 LoC behemoth of a codebase revealed just how painful a single small change could be — how hard it was to test, and feel truly safe, with the code you were dealing with."
I totally feel you. It's impressive what the WebRTC implementation has achieved, but it's just not pleasant at all to work with it.
130ms is a world better than 500ms and a very welcome improvement, but it is still terrible.
Latency happens throughout the whole stack; unfortunately, much would need to be fixed outside this project to achieve any further significant improvement.
Operating System, firmware, blackbox hardware are some other non-negligible sources of latency. Everything adds up.
@dang - Suggest altering the title to say what it is: "Achieving 3K, 60fps, 130ms Video Conferencing with Rust".
This is amazing! The first thing that popped into my mind seeing the life-sized "portal" was the farcaster portals from the sci-fi novel Hyperion.
Sounds impressive, but I'm dying to know: what video codec are they using?
I wonder how it compares with Apple FaceTime on two new MacBooks with Ethernet connections on both sides.
They actually work on reducing latency and pushing high-res video if your connection supports it.
That's a great idea, I've always preferred FaceTime, at least for the video quality. We'll do a latency test sometime, I suspect it'll be quite good!
for crate in $(ls */Cargo.toml | xargs dirname); do
cargo build
Why do this instead of cargo build --workspace?
Is it so you can time the individual crates?
Yeah, it looks like they wanted to know how long each crate took to build individually.
But as long as we're nitpicking, nobody should just pipe `ls` into `xargs` like this, since it fails if anything has spaces in it.
Instead, do:
    for cargo_toml in */Cargo.toml; do
        crate="$(dirname "${cargo_toml}")"
        pushd "$crate"
        # ...
        popd
    done

Don't be that person who writes a script which won't tolerate spaces in filenames!
Alternatively: Don't be that person who clones the repo at a path with spaces in!
Not having spaces in your directory names is certainly a good idea, but I'll be damned if I let any of my code have issues with them. Just because something's a good idea doesn't mean it should be a requirement :)
(The main reason for the advice of "Don't put spaces in paths" is really only because it breaks lots of poorly-written software... but that's not an excuse for your software to be poorly-written!)
Yep!
What's the codec stack for this? x264 --tune zerolatency + opus with opus_delay=20ms?
20ms is wasteful. Use minimum latency where SILK still works, afaik that's 7.5ms.
This assumes that video encoding latency is lower than the audio latency.
Shouldn't be more than a single frame, which is 16 2/3 ms for 60 fps. And for e.g. JPEG it can be even shorter, especially with a rolling shutter.
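For reference, the frame durations being compared in this subthread, as a quick back-of-the-envelope Rust sketch (the numbers are assumptions taken from the comments above, not measurements from the article):

    fn main() {
        let video_frame_ms = 1000.0 / 60.0; // one frame at 60 fps
        let opus_frame_ms = 20.0;           // the Opus frame size discussed above
        println!("video frame: {video_frame_ms:.1} ms, audio frame: {opus_frame_ms:.1} ms");
        // A 20 ms audio frame is already longer than the ~16.7 ms a single
        // 60 fps video frame occupies, which is the trade-off being debated here.
    }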
"we truly don't believe we could have achieved these numbers with this level of stability without Rust"
Oh please. This is just Rust sensationalism. People don't truly believe Rust is faster than C, do they?
In some problems Rust is the fastest:
https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
https://benchmarksgame-team.pages.debian.net/benchmarksgame/...
Every single one of those Rust implementations is using unsafe {}, thus defeating the purpose of using Rust in the first place. Run the same benchmark without unsafe {}.
>defeating the purpose of using rust in the first place
I don't think this is true; the whole point of Rust is that unsafe operations are explicit, not that you never do so.
Also, I looked at the first one, and it's only using unsafe on what are basically op code calls; I don't think it is realistic to complain about that.
Not true.
Of those 4 tasks, the rust programs for task 3 and task 4 do not use the keyword "unsafe".
For task 2, the spectral-norm Rust #6 program does use "unsafe" but #5 does not and it's almost as fast.
Developing stable complex software in C takes a hell of a lot more effort and skill than it does in Rust IMO.
I don't believe Rust is faster than C, but I would argue it's faster to develop new products in Rust vs. C, and easier to produce programs which don't have data races or invalid memory accesses.
Sure, if you wrap everything in unsafe and/or import third party libraries (with the assumption they are also safe).
I wonder how much bandwidth this uses. The less bandwidth it uses, the higher the latency, because of compression. It's much easier to get low-latency video when you have large (Gbit+) links.
Are they still using WebRTC, just their own implementation? Or have they switched to something else on the wire?
There's a section in the article about it: "In the beginning (or: why we're not WebRTC)"
I'm interested in what they are using if not WebRTC - there's several good options in this space (SRT would be my go-to choice), so it'd be really interesting to see if they rolled their own wire protocol or used something else.
They built it from scratch
Their blog post suggests they wrote something from scratch, but gives no clue as to what; whether they considered building on a more modern protocol specification than RTP (which is a couple of decades old at this point); what they've taken from other, more modern protocol specs if they didn't use one directly; or anything, really, aside from the fact that they wrote some code.
Would love to see a demo.
Awesome post
But does it have middle out compression?