MM1: Methods, Analysis and Insights from Multimodal LLM Pre-training
arxiv.org
This is an awesome paper, and the somewhat negative sentiment in the discussion here is surprising.
The ablation studies are well done, comprehensive and expensive to do. People will be using the conclusions from this for years, and that is much more impactful than if an upcoming Siri product outperforms the GPT model at that same point in time.
A few really interesting points:
Synthetic datasets substantially (1%+) increase performance for Image Encoder Pre-training
Architecture of the Visual<->Language model connector doesn't seem to matter.
Interleaving text and image data improves few shot performance, but image captioning data improves zero-shot numbers.
The ideal mix of data types is 5:5:1 for Interleaved:Captions:Plain Text (!)
Synthetic captioning data helps substantially at this point too (up to 4% gain)
The appendices are amazing: lots of details about learning rates tried, batch sizes.
The "explain these figures" examples are really, really good. See page 37.
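For anyone wondering what a 5:5:1 mix looks like in practice, here is a minimal sketch. It assumes the ratio is applied as per-example sampling weights, which is my reading of the summary above and not something stated there. The source labels and function names are hypothetical.

```python
import random
from collections import Counter

# Ratio quoted above (Interleaved:Captions:Plain Text = 5:5:1).
# The keys are hypothetical labels, not dataset names from the paper.
MIX_WEIGHTS = {"interleaved": 5, "captions": 5, "plain_text": 1}


def mix_fractions(weights=MIX_WEIGHTS):
    """Convert integer ratio weights into fractions that sum to 1."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


def sample_sources(n, weights=MIX_WEIGHTS, seed=0):
    """Pick a data source for each of n examples, in proportion to the weights."""
    rng = random.Random(seed)
    names = list(weights)
    return rng.choices(names, weights=[weights[k] for k in names], k=n)


if __name__ == "__main__":
    print(mix_fractions())
    print(Counter(sample_sources(11_000)))
```

Under that assumption, plain text ends up as roughly 1 in 11 examples, with the rest split evenly between interleaved and caption data.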
The paper explores different design choices for various parts of the model and draws conclusions about the relative importance of optimizing each area (image encoder very important, vision-language connector less so).
The actual set of models produced (up to 30B parameters) seems secondary to the intent of the paper, and is more validation of the best design choices in each area.
This looks competitive against CLIP, and surprisingly great at VQA style prompts, but it doesn't seem like the paper supports comparing it to GPT-4. We don't see any tests for coding performance, math homework, legal document review, or any of the myriad other things that people use GPT-4 for on a daily basis.
Besides homework, all of these things seem to be professional uses of GPT-4. If they’re trying to bake this into a consumer platform like Siri, I don’t see why they’d need to focus on those use cases. The exception is MDM/Enterprise; I’m curious whether they’ll try to attack that market or stick to their army of consumer devices.
They are going to have to focus on the use cases that most of their customers use LLMs for, regardless of whether it falls in the consumer or professional category or somewhere in between.
If all it does is improve Siri a bit without massively expanding the range of applications and APIs it will be a big disappointment.
I think what Apple presents in June will decide whether on-device AI will be seen as a viable alternative to cloud APIs.
Many users of Siri would be thrilled if all this did was make it decent at understanding what’s being asked of it and give it the ability to ask clarifying questions, especially if it does so while staying fully local.
I'd love to turn my lights on and not have my garage door open or thermostat suddenly change to 90 and my furnace fire up.
It's a really low bar right now :D
Good insight. My comment was based on the headline that says "...Competing with ChatGPT".
MM1 is a research paper, not a release of a competing product. I'm sure the paper is interesting and am looking forward to reading an analysis of it by someone who understands these things better than I do, but this is not that analysis, it's an extremely low-effort puff piece that is more interested in getting attention than in accurately describing a research paper.
I don't usually say this, but TFA frankly feels like it was written by AI:
> The release of MM1 by Apple contributes significantly to the artificial intelligence domain, offering a detailed roadmap for the development of future MLLMs. By sharing the insights and design principles gleaned from MM1, Apple not only challenges the current capabilities of models like ChatGPT but also invites the broader AI community to build upon their findings, potentially leading to more sophisticated and capable AI systems.
I believe most run-of-the-mill marketing language will sound like it was written by AI. The easiest thing to do for technology writing is to write the complete, factual article, then ask an LLM to dumb it down to whatever level you need for communication.
No, I agree this really does seem autogenerated, or at the very least written by somebody who doesn’t understand the topic at all and is going through the motions of padding things out to hit a hype / word count. It’s got that weird summary focusing on the wrong things and wild speculations dressed up as serious predictions vibe, like there are words saying things in places because there are supposed to be words there and not because it’s actually imparting useful information.
Out of curiosity, where are you seeing this? It's not in the abstract or the paper.
Some of these comments were originally made in response to this spammy submission:
Oh, thank you! I didn't know we'd been moved.
Ah! Makes sense now, thank you.
Biggest model is 30b MoE trained on 100b tokens, max sequence length 4096. A bit underwhelming compared to recent announcements like the open source Large World Model [1].
Absolutely no benchmarks against GPT4 present in the paper.
Notably they used instruction response pairs generated from GPT4 for supervised fine tuning. Which has always felt like an experimental hack to me, but that’s how many folks are bootstrapping smaller models these days, and the effectiveness is hard to argue with.
Apple’s axlearn framework was used which leverages JAX and XLA [2].
You seem to be missing what this submission is about. It's not an Apple press release about a competing model, it's a research paper that discusses different tradeoffs in architecture and data and how each part affects the results of the trained model. In an era where training a large model can be cost prohibitive, this insight is key — it tells you where to optimize and where to cut corners to get the most bang for your buck.
> Absolutely no benchmarks against GPT4 present in the paper.
Table 4 on page 14 shows comparisons to GPT4V
This has an unfortunate naming collision with the M/M/1 queue, a common stochastic model for the study of queueing theory.
The paper lists "first authors", "core authors", and "senior authors".
My dream is to one day be listed on a seminal paper as "secondary forum reply author".
Speaking as someone working in the field, I find it amusing how much researchers working on automating human work care about human credit assignment.
extremely underrated comment. nice one!
Similarly, I’d like the movie credit Second Assistant to the Second Second Assistant Director.
"Junior Assistant Vice-Dean" (or variants thereof) in academia. Those mostly exist to give a pay boost to administrators who've otherwise maxed out on pay.
I recall that my undergrad institution once invented a new deanship out of whole cloth for a coach who'd maxed out on the "professor" pay scale.
Even worse, the bastard didn't even win games!
In that case, I highly recommend watching the movie Synecdoche, New York (2008).
PS Can I be your hairdresser?
I'll second that recommendation, and in that same sort of vibe I'd also recommend Birdman or (The Unexpected Virtue of Ignorance) (2014) and Station Eleven (2021-2022). They all have aspects of stories within stories, which is a trope that I particularly enjoy.
Holy inferiority complex batman!
You can aspire higher and just use one of these LLMs to be a "first author" in a published peer reviewed paper.
I wonder if this has anything to do with their acquisition of DarwinAI. After a decade of mediocrity, I'd love to see Siri get smarter. Any improvement would be welcome at this point.
Honest question: what do you (in the general sense, not specifically asking the parent) use Siri for? I think my main (only?) use case is setting a timer.
Maybe I find conversational UIs awkward, or maybe I just got jaded REALLY quickly from Siri’s lacking capabilities early on, but I have hardly used it in the decade or whatever that it’s been around.
I use it almost daily for something that is simple but underappreciated; I don’t know why it’s not in every marketing video: “Siri, remind me tomorrow at 10am to do X”
I outsource so much of my memory to the phone via Siri ALL THE TIME. It’s so useful. Even for things in 20m. I’ll easily forget if I don’t do this, and it’s reliable so it gives me confidence. It also keeps the notification present until I actually do the thing, so I have a kind of string around my finger until the task is accomplished. I can also snooze that notification as needed to rebring it up at the right time.
Every time I do this around non-tech people they go “wow I didn’t know you could do that.” I swear it’s literally life changing, particularly for anyone over 30.
Especially with Shortcuts, Siri can have some pretty useful functionality. The big improvement I'd personally like to see is being better able to tap into those actions without having to set things up in advance.
In addition to Shortcuts, being an Apple thing, Siri naturally has native HomeKit integration which is powerful when combined with HomeAssistant.
I’m really hoping for something like that.
A year or so ago I remember someone pointing out in a podcast how LLMs are great at taking something like general language and turning it into a series of predefined commands (the stuff available to shortcuts). It would instantly make Siri much more useful.
I think Federico Viticci rigged up something similar or at least a powerful demo using Siri + Shortcuts + ChatGPT to be able to answer all sort of questions better than native Siri.
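A rough sketch of the "general language in, predefined commands out" idea, for what it's worth. It assumes the model is asked to reply with a JSON list of steps, and it only shows the validation side: nothing here calls a real LLM, Siri, or Shortcuts, and every action name is made up for illustration.

```python
import json

# Hypothetical action catalogue: action name -> required argument names.
# These are illustrative only, not real Shortcuts or Siri identifiers.
ALLOWED_ACTIONS = {
    "set_timer": {"minutes"},
    "create_reminder": {"text", "when"},
    "set_lights": {"room", "state"},
}


def parse_plan(model_output):
    """Validate a JSON list of steps against the catalogue.

    Returns a list of (action, args) tuples, or raises ValueError if the
    text is not valid JSON, names an unknown action, or has the wrong args.
    """
    try:
        steps = json.loads(model_output)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(steps, list):
        raise ValueError("plan must be a JSON list of steps")

    plan = []
    for step in steps:
        if not isinstance(step, dict):
            raise ValueError("each step must be a JSON object")
        action = step.get("action")
        args = step.get("args", {})
        if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
            raise ValueError(f"unknown action: {action!r}")
        if not isinstance(args, dict) or set(args) != ALLOWED_ACTIONS[action]:
            raise ValueError(f"wrong arguments for {action}")
        plan.append((action, args))
    return plan


if __name__ == "__main__":
    # Stand-in for what a model might return for "turn off the kitchen lights".
    reply = '[{"action": "set_lights", "args": {"room": "kitchen", "state": "off"}}]'
    print(parse_plan(reply))
```

The point of the allow-list is that the model only ever picks from actions that already exist, so anything it invents gets rejected instead of executed.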
Yep. Reminders is #1 by far, followed by sending texts, turning lights on/off with HomeKit and timers which are similar.
I can’t imagine reminders w/o Siri because that’s how I add 90%+ of them. Grocery items, things to do at time X, or when I get to (or leave) work/home are the big ones.
Raising blinds, turning on/off lights, and unlocking the front door. It is convenient since I can do all those things with one command (raise all the blinds and turn off all the lights, or raise all the north blinds and lower the south ones), it would be a hard problem to create physical buttons to do what we needed without running around the room to hit various switches.
Google can also do this. Alexa has lots of problems, but it can raise a blind in a pinch. We also spent a ton on Lutron shades because we discovered that we were just managing them too much manually (Siri then is great for controlling that).
You can also ask Siri the weather in the morning, useful in figuring out how to dress the kid.
If Siri could do the following reliably (meaning not having to ask again, not having to repeat, having it work 99% of the time) it would be golden:
1. Find my phone via Siri on homepod
2. Set a simple timer
3. Add to a list
4. Send a text message to one of a few contacts
It can and sometimes does do all of those things, but horribly unreliably.
For me it really is extremely close to 100% for timers; I barely remember it being wrong and I use it several times per day. Finding my phone via the HomePod also works pretty much every time, maybe 90% for me, but it doesn’t recognize my wife so for her it basically never works. The others I don’t use enough. But timers and reminders work really well for me and they’re also what I need most from an assistant.
I’ve seen similar. They really don’t have two person houses down pat - timers work great for me (as long as I never have to ask how much remaining; I’d die for a “count down from 30 seconds”) - but for the wife; nothing.
Since they removed “hey” and I got the latest phone, I’ve noticed many little situations where it’s faster to speak to the device than tap your way around. E.G. when it’s locked you can say, “Siri, open Spotify” and look at it for face unlock, boom. Random stuff. Also Alexa has surprised me lately, like a rational response to, “how many sandwiches is too many?”
Personally, I don't want Siri to be 'smarter', if smarter means it becomes an open-ended and unpredictable way to have an LLM guess what I meant. I'd like Siri to be more powerful, yes.
I like that I can model Siri as a decision tree with voice-activated input. Being able to configure it to do more things (for example, to put reminders in Things rather than Reminders), that would be useful. More discoverability would also be great (but this is Apple we're talking about, so good luck there). But for me personally, the most important feature is that Siri is predictable: once I figure out how to do something with it, asking again in mostly the same way will get the same result. If I want to talk to an LLM, I have ChatGPT on my phone.
I agree. The whole push to have Siri work on device was a noble one, but I’d rather have the option for a dumber on device Siri or a smarter in the cloud Siri.
Mediocrity is far too positive a word for the dumpster fire that is Siri.
I hear that a lot, and I have no desire to tell you your opinion’s wrong, but it doesn’t match my experience. Siri’s… fine, I guess, for what I ask of it like setting timers and reminders and such.
It’s not perfect, for sure:
Me: Hey Siri, turn off the kitchen lights.
Siri: I can’t process multiple requests.
Me: Hey Siri, turn off the kitchen lights.
Siri: OK.
But it works reliably enough that I use it all the time for the reminder and timer actions. Is it vastly worse for other people, and in what ways?
The characterization of "mediocre" is fair, but we're transitioning a household to Siri from Alexa (because Alexa doesn't work locally, and because of Amazon's track record on privacy), and it's not noticeably worse.
The feeling I’ve heard from people is Alexa was way better than Siri at first.
Over time Siri got better. Not great but better. Alexa had mostly stayed the same or perhaps gotten a touch worse except for adding ads and other annoyances.
I’ve never used anything but Siri. It works decently, definitely has its moods/dumb-as-a-post moments. But I’ve learned what works well and for that it’s proven very useful.
Same transition some years ago. Siri is noticeably much, much worse to me. Borderline hopeless on a mixture of HomePod + HomePod Mini hardware.
If it’s going to take artificial general intelligence to get a voice assistant that can remember not one, but two entirely separate cooking timers, then so be it. Imagine the GPUs required!
I’m still baffled at Siri and Google assistant. Virtually zero innovation in a decade. I just want to be able to turn on BBC radio while my hands are wet, is that really so hard?!
> that can remember not one, but two entirely separate cooking timers
You're in luck! Siri will do that right now. Just tried it. Works.
OMG, 2 cooking timers?! Pinnacle tech right there.
Knowing Apple, I was expecting one base timer, with every other timer being a $200 upgrade.
https://www.tomsguide.com/how-to/how-to-set-up-and-manage-mu...
“How many timers can you have going at one time? […] …I had 26 timers going at once, and the only reason I didn't have more running was because I got bored.”
I wonder if the maximum number of timers is an 8 bit, 16 bit or 32 bit int.
Only one horrifically boring way to find out.
That’s not really Apple’s style. More along the lines of “HomePod mini 2 features double the RAM, allowing for exciting new features like multiple kitchen timers. Pre-orders start Friday.”
You should be able to do this with Siri. You can use a shortcut if it doesn't work out of the box.
It works out of the box as of iOS 16 or 17.
“Hey Siri set an egg timer for 4 minutes”
The interface for switching between multiple timers sucks on the watch, the whole app does now. I don’t know how it’s handled on HomePods, though you can see them somewhere in the home app (yeah that’s discoverable).
But it works fine. And the interface is good on the phone.
Google Assistant is pretty decent. But as someone who is pretty much locked into the Apple ecosystem, Siri needs a reboot from scratch.
It's been reportedly rewritten from scratch like five times, during which time people have not stopped posting claims that it's exactly the same as it was in 2010.
You mostly think there's no innovation because you speak English.
Trainig
Yes, the submitted title was "Apple announces MM1: Multimodal LLM Pre-trainig Report". We've reverted it now. But the greater problem wasn't the typo, it was the editorializing (from https://news.ycombinator.com/newsguidelines.html: "Please use the original title, unless it is misleading or linkbait; don't editorialize.")