Ferret: A Multimodal Large Language Model

621 points by weirdcat 2 years ago · 331 comments

Reader

They're already going multi-modal? Holy crap, if google can't deliver in the accessibility space for this (image descriptions better than "the logo for the company"), then I'll definitely go back to Apple. I mean I do hope Apple cleans out bugs and makes VoiceOver feel like it won't fall over if I breathed hard, but their image descriptions, even without an LLM, are already clean and clear. More like "A green logo on a black background", where Google is, like I said, more like "The logo for the company." I guess it's kinda what we get when AI is crowdsourced rather than given good, high quality data to work with.

sagz 2 years ago

Google's Lookout app (accessibility for the blind and visually impaired) was updated ~6 months ago with a multimodal LLM already.
It uses the Flamingo model family: https://deepmind.google/discover/blog/tackling-multiple-task...
zitterbewegung 2 years ago

Honestly if they are coming out with a paper now Apple has probably been working on it for a year or two at minimum . Next year releases of macOS / iOS are rumored to have LLMs as a feature .
- beoberha 2 years ago
  
  I don’t mean to discount this work, but this particular model is the product of a few months of work tops. It’s effectively LLava with different training data, targeted at a specific use case. While I’m sure there is a significant effort at multimodal LLMs within Apple, this is just a tiny corner of it.
- ex3ndr 2 years ago
  
  They literally mention that they built on top of llava that was released half year ago.
  - smolder 2 years ago
    
    What's the "literally" here? Try: They mention that they built on top of llava, which was released half a year ago.
    
    superb_dev 2 years ago
    
    In this context “literally” is being used to draw attention to the perceived obviousness of the information. Essentially saying “how could you not see this?”
    
    illiac786 2 years ago
    
    In this context, "what's the literally here?" means that the author think the post would be better without it: it's too agressive, sarcastic, demeaning, etc. for example. Hence it's a false question.
    It's actually funny because you answered a false question (potentially on purpose). Like:
    "What the hell are you doing?!?" "Well Im stealing your car, obviously."
    
    Kerb_ 2 years ago
    
    In this context, the verbose explanation to a rhetorical question seems to be the point of the comment, to reemphasize said aggression or sarcasm. It's actually funny because we both are doing the same thing, also potentially on purpose.
- refulgentis 2 years ago
  
  > Honestly if they are coming out with a paper now Apple has probably been working on it for a year or two at minimum
  Why do you say that?
  - zitterbewegung 2 years ago
    
    Academic papers can take that long …

amitprasad 2 years ago

Also relevant: LLM in a flash: Efficient Large Language Model Inference with Limited Memory

Apple seems to be gearing up for significant advances in on-device inference using this LLMs

https://arxiv.org/abs/2312.11514

adt 2 years ago

Old paper (Oct/2023), but the weights are new (Dec/2023):

https://lifearchitect.ai/models-table/

rreichman 2 years ago

Oct 23 is old :)

shrimpx 2 years ago

Apple has been looking sleepy on LLMs, but they've been consistently evolving their hardware+software AI stack, without much glitzy advertising. I think they could blow away Microsoft/OpenAI and Google, if suddenly a new iOS release makes the OpenAI/Bard chatbox look laughably antiquated. They're also a threat to Nvidia, if a significant swath of AI usage switches over to Apple hardware. Arm and TSMC would stand to win.

madeofpalk 2 years ago

I doubt Apple’s going to make some big ChatGPT-style chatbot. They’re “just” going to use the same tech to drive iterative (good!) improvements to their products, like Siri and keyboard auto-complete.
- shrimpx 2 years ago
  
  Yeah. Siri supports text input already, anyway. Siri is their ChatGPT-style bot that's going to keep improving.
  - ghusbands 2 years ago
    
    But does it even work sensibly, yet? Almost every time my partner asks Siri something, it works so badly that we end up asking Android/Google's Assistant, which responds well to most things.
    
    qgin 2 years ago
    
    What sort of things don’t work well? Phone actions or knowledge / info type questions?
- fbdab103 2 years ago
  
  I would challenge the keyboard autocomplete. I find the Apple suggestions to be frustratingly poor vs my experience on Android.
  - spike021 2 years ago
    
    I thought it couldn't get any worse and then I upgraded to iOS 17. It's awful.
  - Booourns 2 years ago
    
    Out of curiosity have you experienced their autocorrect on iOS 17 because that’s when they updated to be LLM based?
    
    FLT8 2 years ago
    
    I don't recall exactly when it started happening, but I've been having lots of issues with recent iOS versions rewriting not the last word entered, but the word before that. For example, if I start entering "I went to", it'll sometimes correct to "I want to", but it'll do that after I've typed the "to". I've found lots of similar examples. The retrospective nature of the edits mean I miss a lot of them and makes me appear a lot less literate than I am.
    
    walteweiss 2 years ago
    
    Same happens to me quite very often on a mobile, even here. But I use iPhone SE 1st Gen. with iOS 15.8.
    
    georgespencer 2 years ago
    
    Transformer based autocomplete on iOS 17 feels just as bad -- but in different ways -- as its previous incarnation to me.
    
    simonair 2 years ago
    
    Are you tapping the keys or swiping over those that make up the word you want to type? In my experience, tapping has always been and remained poor but swiping is getting better and better with every iOS version.
    
    LoganDark 2 years ago
    
    Swiping through keys doesn't have anything to do with autocomplete. Autocomplete has to do with predicting which word you're going to type next, not guessing which word best corresponds to the swipe you just made.
    
    golergka 2 years ago
    
    Those are very related tasks, you use results of the former to help you with the latter.
dwaite 2 years ago

> Apple has been looking sleepy on LLMs, but they've been consistently evolving their hardware+software AI stack, without much glitzy advertising
They don't sell compute time to other companies to run AI, or massive custom hardware for AI training.
They aren't after VC funding.
Their core business isn't threatened by AI being "the evolution of search"
Product-wise, so far all you hear is messaging around things like pointing out the applicability of the M3 Max for running ML models.
Until they have real consumer products ready, they only need to keep tabs on analysts, with lip service at financial meetings.
theferalrobot 2 years ago

Given Apple's track record on anything AI related and the terrible state they keep CoreML that not only seems extraordinarily unlikely, it would take a lot of time to win developer trust and that I just don't see happening.
- hosh 2 years ago
  
  Apple doesn’t have to win developer trust or build an AI platform. They just have to build a compelling consumer product that can only function with AI, and they are better equipped to do that than Google or Microsoft. It remains to be seen if OpenAI will go that route instead of a business built on training and providing access to foundational models.
  - agentcoops 2 years ago
    
    Yes, this is the most important point and I think somehow least present in even discussions here: the technical question of who produces the best/cheapest LLM/future architecture is considerably less important than who, if anyone, creates a fundamentally new and dominant consumer experience built on AI. Most of the existing players (Google, Meta) would of course prefer that nobody produces such a newly dominant paradigm of computation for end-users since it would greatly reduce their relevance and subsequently revenues. Right now, ChatGPT is the only real contender in this space. However, I think you’re correct that it’s actually Apple who is most likely to be the next who attempts such a paradigm shift. Far too early to bet, but let’s say I wouldn’t be surprised if in five years we end up in a world in which Apple has the consumer monopoly and Microsoft the business monopoly, with Google and Meta falling into irrelevance.
    
    gremlinsinc 2 years ago
    
    I think Microsoft is going to eat openai, I mean the company is practically half in and out of Microsoft's mouth. Bing will likely add more and more features that are native to chatGPT, Google I think will eventually get in the game, Facebook is actually doing better than Google, especially for open source models which is buoying the smaller researchers and developers.
    In the end one company will build AGI or super AGI that can do the function of any existing software even games, with any interface even VR, or no interface at all - just return deliverables like a tax return. The evolution might be, give me an easier but similar QuickBooks UI for accounting to just do my taxes, the company who gets here first could essentially put all other software companies out of business, especially SaaS businesses.
    The first company to get there will basically be a corporate singularity and no other company will be able to catch up to them.
  - theferalrobot 2 years ago
    
    >They just have to build a compelling consumer product that can only function with AI
    Yeah and I'm not talking exclusively about developer trust. Given Apple's current consumer lineup (see Siri, Apple photos, predictive text etc)... we only have evidence that they suck at ML. What makes you think they are going to suddenly transform overnight?
    
    Aerbil313 2 years ago
    
    > see Siri, Apple photos, predictive text etc)... we only have evidence that they suck at ML
    …or that they only deal with mature tech and not the shiny new thing. Makes sense to me. I don’t doubt everyone will have a personal LLM-based assistant in their phones soon, but with the current rate of improvements to LLMs and AI in general, I’d wait for at least a year more while doing R&D in-house if I were Apple.
    
    theferalrobot 2 years ago
    
    You could use that apologist language for any company. If they suck at something just say they are “biding their time” No. Apple is just demonstrably behind.
    Having terrible predictive text, voice to text, image classification etc isn’t just a quark of the way they do business. Those are problems with years of established work put into them and they just flat out aren’t keeping up.
    
    hosh 2 years ago
    
    “Apologists” … this is the domain of strategic analysis, business and products. Apple, Google, et al are not feudal lords or entities owed personal allegiance, nor sporting teams for fans to rally around, nor are we talking about morality and ethics where Apple did something wrong and apologists are justifying it.
    As far as whether they are keeping up or not, I disagree, but neither of our opinions really matter unless we’re betting — that is, taking actions based on calculated risks we perceive.
    
    theferalrobot 2 years ago
    
    You disagree based on what? On virtually every measure they are behind in AI, I can’t think of anywhere they are ahead, please enlighten me
    
    hosh 2 years ago
    
    … because Siri, predictive text, etc suck because it isn’t using an LLM. Alexa, and the Google Assistants from the same era all suck as well. I don’t see how evidence that Apple sucked with pre-LLM ML is an indicator that they will suck with integrating an LLM into their products.
    No one said anything about transforming overnight.
- mark_l_watson 2 years ago
  
  I have enjoyed working with CoreML over the last few years. Please share what you didn’t like about it.
  - theferalrobot 2 years ago
    
    There are so many modern ML components that have terrible or no support in CoreML. Try to do any convolution other than conv2d, advanced or custom activation functions etc and you are out of luck. Exporting from PyTorch leads to all sorts of headaches with subtle behavior changes between implementations it is definitely a pain point for developers of widely used software
    
    mark_l_watson 2 years ago
    
    +1 thanks
- lachlan_gray 2 years ago
  
  Maybe MLX is meant to fill this gap?
  https://github.com/ml-explore/mlx
harryVic 2 years ago

Can you give an example? I switched to android because i use personal assistant a lot while driving and siri was absolutely horrible.
- shrimpx 2 years ago
  
  - FaceID
  - Facial recognition in Photos
  - "Memories" in Photos
  - iOS keyboard autocomplete using LLMs. I am bilingual and noticed in the latest iOS it now does multi-language autocomplete and you no longer have to manually switch languages.
  - Event detection for Calendar
  - Depth Fusion in the iOS camera app, using ML to take crisper photos
  - Probably others...
  The crazy thing is most/all of these run on the device.
  - pants2 2 years ago
    
    The iPhone's built in text OCR and image subject cutouts are also extremely good, just in the photos app.
    
    shrimpx 2 years ago
    
    Yeah totally, I copy text from images all the time.
    
    kergonath 2 years ago
    
    The combination of automatic OCR and translation almost everywhere in the OS is great.
  - pkage 2 years ago
    
    I just wish you could turn the multilingual keyboard off—I find that I usually only type in one language at a time and having the autocomplete recommend the wrong languages is quite frustrating
    
    shrimpx 2 years ago
    
    That's true, I have found that mildly annoying sometimes. But most of the time it's a win. It was really annoying manually switching modes over and over when typing in mixed-language, which I do fairly often. It'd be great if there was a setting though.
    
    cezart 2 years ago
    
    I had the opposite problem, the languages I usually typed in(Romanian + English) didn't have a multi language mode on iOS. So it was a constant pain to switch btw them when I needed to insert some English terms in Romanian sentences. IOS didn't support multi language for this language pair. On Android it always worked like a charm.
    
    shrimpx 2 years ago
    
    Hey I'm Romanian, too. The latest iOS does what you want -- it has multi-language support and typing mixed English + Romanian is seamless now. Yeah it was a total pain to keep switching languages before iOS 17.
  - hansoolo 2 years ago
    
    To be honest I distrust Microsoft with swift key, but the it recognizes the change in language just smooth. I could switch languages in one sentence and it would understand what I am writing just fine, no Sill suggestions
    
    Aerbil313 2 years ago
    
    Apple recommending wrong words when you write in mixed-language was the case in iOS 15, so much I always needed to manually change my keyboard language. But it’s no more in iOS 17. As an example I just typed this entire comment in Turkish keyboard with English autocorrect and suggestions working.
    Maybe the (most likely) AI-based thing requires some training though. I got my new iPhone a month or so ago.
    
    fennecbutt 2 years ago
    
    SwiftKey is great and if someone distrusts Microsoft for it...then fine, but various different companies "control" various different parts of my phone.
    With an iPhone, it's only one company, that controls every little thing, and we have no insight into Apple at all. They can basically do whatever the hell they want.
  - fennecbutt 2 years ago
    
    Yeah face id is pretty good, no android phones seem to use the ir dot camera which makes me think Apple has a "patent" on it ...lame considering the dot projector is ripped straight out of kinekt.
    Google does the memory photo thing too, but only if you use their app which I don't.
    Android has had multi language keyboard support for the longest time, in addition to being able to install whatever keyboard I like (I use SwiftKey, it's brilliant) I can also install an llm based one as I please.
    Android/my keyboard already does event detection/suggestion in text and has been doing for as long as I remember.
    None of these are reasons to buy an iPhone...just reasons to buy a phone, lmao.
    
    jmisavage 2 years ago
    
    FYI Apple bought the company that made the kinect sensors. The Face ID module on iPhones is a mini kinect.
fennecbutt 2 years ago

Are you so sure? Even this link is built on top of the work of others, I'm not sure they've contributed as much as you think they have.
gxyt6gfy5t 2 years ago

I wouldn’t go too far. They didn’t even train this model on Apple hardware. Trained on Nvidia A100s
Affric 2 years ago

Don’t TSMC make Nvidia’s chips too?
- shrimpx 2 years ago
  
  Yup! TSMC wins either way.
slowmovintarget 2 years ago

Personal ML systems running on hardware you own is the killer app. If these are "good enough" they'll be significantly preferable to using large subscription-based models, where those companies could pull a Lucy any day.
emmender2 2 years ago

generic first-order shallow argument
zamalek 2 years ago

You're suggesting that Apple could fit what can't be done with a 4090 into a laptop?
Color me doubtful.
- fennecbutt 2 years ago
  
  But Apple will just make a magical chip that's different to regular hardware cause they're the best company and invent all the things even if they've been seen before Apple still invented it first, just wait until their Super Unicorn Ultra™ chip comes out with Hyperdrive Retinated LLM™ support, they don't name normal hardware different just for marketing...it's really unique, new and inventive hardware that we're happy to pay a huge premium for because it's so advanced and inventive.

aaronbrethorst 2 years ago

Can someone define the term “MLLM”?

schaefer 2 years ago

Multimodal Large Language Model
- pests 2 years ago
  
  why not LLMM?
  - CharlesW 2 years ago
    
    Because the first word is "multimodal" :^) and also because MLLM is the established initialism.
    
    pests 2 years ago
    
    My point was the phrase already contains the word model. Why are we calling it a multimodel LL model? Why not just add multi to the existing model?
    
    notdisliked 2 years ago
    
    Multimodal, not multimodel. Multimodal referring to the different possible modes of input (text, picture) into the model.
    
    sva_ 2 years ago
    
    Modal, not model
    
    chaos_emergent 2 years ago
    
    MultimodAl, not multimodEl
    
    TrueDuality 2 years ago
    
    Modalities and models are not the same thing.
    
    astrange 2 years ago
    
    There is something like a "multimodel LLM" but it's called MoE ("mixture of experts").
  - bbor 2 years ago
    
    Ok our options
    Multimodal large language model Large multimodal language model Large language multimodal model Large language model (multimodal)
    I prefer 1, because this is a multimodal type of an existing technique already referred to as LLM. If I was king, I’d do Omnimodal Linguistic Minds, but no one asks me such things, thank god
    
    dilippkumar 2 years ago
    
    Some of the modalities in multimodal are non-linguistic. For example, image or video input. In those cases, is it still a language model?
    
    bbor 2 years ago
    
    I’d say yeah cause it’s understanding the inputs through linguistic structures/patterns
    
    n2d4 2 years ago
    
    I mean, if we want to be silly, what about "Language model (large, multimodal"? :)
  - replygirl 2 years ago
    
    what's a language multimodal model
  - rain_iwakura 2 years ago
    
    The bikeshed color argument never ceases to be relevant. Would you say "large language model multimodal"? I doubt it.
    
    pests 2 years ago
    
    Just a thought not a bike shed, relax.
CamperBob2 2 years ago

The language model works by delegating tasks to smaller language models and overcharging them for GPU time.
Tempest1981 2 years ago

Also, is FERRET an acronym?
- Someone 2 years ago
  
  I would guess it’s wordplay on other models being named after animals (llama, vicuña) and figurative use of “ferret”.
  https://en.m.wiktionary.org/wiki/ferret: “3. (figurative) A diligent searcher”

yreg 2 years ago

I really hope Apple releases an iPhone with a good on-device private LLM assistant, perhaps next year. Their hardware is well-positioned for it.

It could make me get a new phone outside of my usual ~4 year cycle. Siri is almost unusable for me.

aaronbrethorst 2 years ago

Rumors suggest they’re gearing up to make iOS 18 an AI focused release. It’ll be interesting to see if they offer different capabilities for online/offline scenarios, or if their offerings are strictly offline.
Here’s one story to offer some context. There are others. https://archive.is/en3VL
- behnamoh 2 years ago
  
  > Rumors suggest they’re gearing up to make iOS 18 an AI focused release.
  Don't underestimate Apple at disappointing enthusiasts like you and me. We've been hearing many awesome stories about the next thing Apple will do, only to realize their marketing team chose to keep it for future iOS/MBP/iPhone generations to keep the profits high.
  - Someone 2 years ago
    
    Running a LLM on-device alongside other apps (i.e. without it taking up all phone resources), and it being reasonably fast may well require more powerful hardware than they ever sold.
    A voice assistant that takes 3 seconds to reply and then takes half a second per word is a nice demo, but not a product Apple wants to sell.
    And yes, some people will say they rather have that than nothing on their hardware, but “the Internet” would say iOS 18 is slow, eats battery life, etc, damaging Apple’s brand.
    
    saagarjha 2 years ago
    
    Siri is often slower than this…
  - baz00 2 years ago
    
    I think your expectations are wrong. They sit there in silence with a few leaks here and there and some github projects, people speculate and get all excited about extrapolating those things. Then Apple deliver what works which may or may not be related to it.
    What they don't do is sell you a lie a year before release then deliver shit (like every other fucking vendor).
    
    zdragnar 2 years ago
    
    They're plenty capable of delivering garbage. Certain year models of the MacBook pro were inherently faulty. I've had the displeasure of having two of them bought for me at work.
    All of Apple's windows software (iTunes, Safari, etc) has been, at best, a barely working port.
    I'm assuming they are putting a lot more thought and care into it than the touchbar, crappy keyboards and the rest, but I'm also not holding out much hope either.
    
    MBCook 2 years ago
    
    Apple can screw up, no question. But they don’t do the two-year hype cycle thing that just about everyone else does in technology (or video games).
    It’s incredibly rare for Apple to publicly talk about things that won’t be selling extremely soon.
    The iPhone had to be pronounced because it was going to show up on the FCC website, and obviously Apple wanted to control the message. I suspect the Vision Pro may be similar, but they also wanted developers to start getting ready so they would have software day one.
    The only thing I can think of that Apple pre-announced and failed at was the Air Power mat. They said it would be coming out soon after and had to push that a couple times before finally canceling it.
    Other than that small exception, if modern (post jobs return) Apple announces something is coming, it will come out and be quite close to what they say.
    They don’t pull a Humane AI, Segway, Cyberpunk 2077, or No Man’s Sky.
    
    baz00 2 years ago
    
    Since they got rid of Jony, it's been great. That's all I'm saying.
    
    mpweiher 2 years ago
    
    Absolutely!
    The hardware, that is.
    With software, their Come to Jesus moment is still in the future.
    (To me, Swift is sort of the Jony correlate on the software side. Doesn't fit perfectly, of course, but very similar "we are perfect who cares about evidence la la la I can't hear you" vibes and results)
    
    behnamoh 2 years ago
    
    Yes, they have treated macOS like a toy. It's time they made it a real OS.
    
    baz00 2 years ago
    
    I despise this perspective. It’s pretty much finished. What more crap do you want shovelled onto it?
    
    pjerem 2 years ago
    
    Hello MacBook Pro 2016 !
    
    scarface_74 2 years ago
    
    And that Apple /// that Apple released in 1980 was also garbage…
    
    behnamoh 2 years ago
    
    > What they don't do is sell you a lie a year before release then deliver shit (like every other fucking vendor).
    If you're referring to Google, then you're right. But OpenAI has consistently delivered what they announced pretty quickly. Same with Microsoft. To think that Apple somehow has a secret sauce that helps them surprise everyone is an illusion. They've had 3 years now to show their interest in LLMs, but they're just too conservative to "think different" anymore.
    
    cpill 2 years ago
    
    I think LLMs are to inconsistent for Apple's taste, I mean that in both senses. Their perfectionism won't risk bad output, ever, which is impossible for LLMs
    
    baz00 2 years ago
    
    Nowhere near that. They’re a totally unproven tool with a lot of bad side effects. They just don’t want to go to market with that.
    I mean they do truly useful stuff already using ML just not LLMs
  - darthrupert 2 years ago
    
    My expectation of Apple is that they lurk in the shadows, looking at what others do while perfecting their own thing. Then on the day of release, they'll be a decade ahead of competition.
    They've done this a dozen times already.
    
    antiframe 2 years ago
    
    Which dozen times has Apple released something decades ahead of the competition? I'm blanking on 4-12.
    
    xerxes249 2 years ago
    
    64-bit phones is the easy one.
    
    barfingclouds 2 years ago
    
    iPhone, Siri
    
    cj 2 years ago
    
    Spotify / Apple Music
    Netflix + Hulu / Apple TV+
    Generic Earbuds / AirPods
    Meta Quest / Apple Vision Pro
    (The last one being a hopeful wish)
    
    cromwellian 2 years ago
    
    ? AppleTV and Apple Music are not decades ahead of anything. AirPods are way better than the existing Bluetooth headsets that were on the market.
    
    cj 2 years ago
    
    Maybe they aren’t “decades ahead” but they’re examples of Apple being slow to market, launching very competitive products years after the market was established.
    E.g. assuming Apple Vision launches soon, they’ll be “many years behind” Quest from the date of first launch, but most likely miles ahead as far as usability.
  - ignoramous 2 years ago
    
    > marketing team chose to keep it for future iOS/MBP/iPhone generations to keep the profits high.
    VisionPro is nice. I can see costs coming down over a period of time. Also, we've been waiting long enough for that AI car.
    
    yreg 2 years ago
    
    Apple has never promised any car. From what is known, project Titan has been cancelled years ago.
    The only exeption of not delivering I can recall was AirPower. It's a product they've announced and then embarassingly weren't able to finish up to their standards (or up to what was promised), so they have cancelled it altogether.
    
    ignoramous 2 years ago
    
    My point was, it isn't the marketing team why Apple tanks innovative/disruptive/new products.
  - dzhiurgis 2 years ago
    
    This. If there is slightest chance model can say “poop” - they’ll can it
  - georgespencer 2 years ago
    
    > We've been hearing many awesome stories about the next thing Apple will do, only to realize their marketing team chose to keep it for future iOS/MBP/iPhone generations to keep the profits high.
    I believe it's more likely we've heard awesome stories about things Apple will do in the future, only to realize that the average HN commenter is incapable of understanding that such stories are contextless leaks, and that it is far more likely you are operating with incomplete information than Apple's "marketing team" is holding things back for future "iOS/MBP/iPhone generations" to keep their profits high.
    I know it's more fun to vomit dumb conspiracies onto the internet, but consider changing "realize" to something which conveys equivocation, because your theory about Apple's marketing team holding back mature technology in order to benefit future devices – in addition to being predicated on leaks and rumors, and risibly inane when you consider that such action would create a significant attack vector for competitors – is as equivocal as the belief that Trump is on a secret mission to destroy a global cabal of pedophiles.
- para_parolu 2 years ago
  
  I really hope they will make siri usable. In the current state it’s only good for fixed phrases. And even then it fails time to time
  - behnamoh 2 years ago
    
    I hope they get rid of it completely. People on r/locallama and others have made much better assistants using GPT which use iOS APIs to control the phone. It's ridiculous that Apple still hasn't done anything useful regarding Siri.
    
    ben_w 2 years ago
    
    I'd be (pleasantly) surprised; look at how long it takes them to allow system apps to get replaced with downloadable 3rd party.
    And even then, the activation keyword is likely to be whatever Apple says. Similar logic as 3rd party keyboards, don't want user input to get stuck on even merely potentially untrustworthy or buggy code.
    
    spookthesunset 2 years ago
    
    Correct me if I’m wrong but isn’t the activation word “special” in that it has to work in very low power states? I always assumed that is why the wake words are fixed, because said words need to “fit” within a very small power budget.
    
    ben_w 2 years ago
    
    Could be — I've heard rumours along those lines, but none came with evidence.
    
    skygazer 2 years ago
    
    https://machinelearning.apple.com/research/hey-siri (2017)
    The rumors were true.
    
    ben_w 2 years ago
    
    Thanks :)
    
    darthrupert 2 years ago
    
    That's not due to incompetence; that's due to not seeing any reason to do it.
    
    ben_w 2 years ago
    
    I was not intending to imply otherwise.
    
    michelb 2 years ago
    
    It's been in an 'AI' rewrite for a while now. Pretty sure we'll see something next year.
    
    astrange 2 years ago
    
    It's been rewritten several times, people largely don't notice because they don't try using it in different languages. And of course because they want to seem savvy so they repeat impressions from other people's posts, not realizing those posts are years old.
  - fodkodrasz 2 years ago
    
    For me fixed phrases would do it often, I use Siri mostly when driving, but keeps saying for almost any command/query: Sorry I can't do that while you are driving. (other main usecase is setting tea/cooking timers with hands full/greasy. This alone makes it pretty useful.)
    I guess this won't change, because it is probably for legal reasons, to avoid being sued by some "I just dried my cat in the microwave" style genius after making a car accident (unrelated to Siri, but trying to shift the blame).
    Adding support for smaller languages would be nice actually. When its reading out of Hungarian messages loud it sounds incredibly retarded. I always have a great time listening to those and trying to guess the message. :) It would be nice if I could send a message to my significant other about being stuck in traffic in Hungarian. (the iPhone keyboard already does pretty decent voice recognition in the language)
- 0x1ceb00da 2 years ago
  
  If they do it right, it might make me switch from Android. I've never used iOS before and the only thing I'm able to use Google assistant for is setting alarms, and it can't even delete the alarm I created just now.
spaceman_2020 2 years ago

GPT-4 voice is so, so good. Really what you would want a voice tool to be like. I can talk to it like a normal human being, unlike issuing specific commands loudly as with Siri.
- klabb3 2 years ago
  
  But no matter the Siri shittiness (which I agree with) an LLM can only interact with the outside world – ie run commands – that exist and have a reasonable API surface, no?
  Apple has had automation for ages with Automator, Shortcuts etc but nothing that actually integrates well with day to day flow. So.. setting a timer when my hands are wet already works ok, and that’s about what I need.
  I honestly wonder what type of voice interactions people want with their phones. I can see transcribing/crafting chat messages I guess? But even so, it feels like it would mess up and use iMessage instead of WhatsApp, will it narrate my memes, open links and read “subscribe for only 4.99 to read this article”, cookie consents etc etc. if everything sucks how is narrating it gonna help?
  Maybe I’m old but I still don’t see the major value-add of voice interfaces, despite massively improved tech and potential.
  - spaceman_2020 2 years ago
    
    I would be happy if it does my morning routine for me. Give me a brief summary of my emails over the night, tell me the headlines from the news outlets I follow, summarize tweets from my favorited accounts, and give me an overview of the market based on my investments and watchlists.
fnordpiglet 2 years ago

The auto correct is already backed by a smallish LLM, FYI.
https://jackcook.com/2023/09/08/predictive-text.html
- blululu 2 years ago
  
  And it is a serious quality regression IMO. The dictionary is too small and misses/messes up a ton of basic words.
  - fnordpiglet 2 years ago
    
    It’s a place holder for iOS 18’s expansion. It’s 0.1 of the LLM in iOS. And the other prior implementation was so ducking bad that I’m not sure how you would observe such a regression.
- scosman 2 years ago
  
  SLM? :)
hmottestad 2 years ago

With iOS 17 they added a tiny little LLM to the predictive typing. I have the newest and greatest iPhone but I feel that I very rarely see it in action. I must assume that it’s just too slow at to keep up with my typing at the moment. Or it’s just not large enough to give very many useful suggestions.
- mrbonner 2 years ago
  
  Really? Typing in my iPhone 12 Pro has become a nightmare. I suspect it is because of predictive typing ML shit. It happens all the freaking time now. The symptom is that my whole device just froze for a few seconds while the next word is squeezed out. How do I turn it off?
  - evantbyrne 2 years ago
    
    Compared to Android, the iOS keyboard has always been a nightmare. However, wI feel it has been causing me fewer issues within the past month-ish. Has it been updated recently?
    
    fragmede 2 years ago
    
    Thankfully, Gboard is available on iOS.
- KMnO4 2 years ago
  
  Is tiny LLM an oxymoron? I believe Apple has told us it’s a transformer language model, but not specifically a LLM.
  - ben_w 2 years ago
    
    It's like how the "New Forest" is really old now: even small LLMs are (from what I've seen which isn't exhaustive) large compared to Markov language models.
  - hmottestad 2 years ago
    
    According to this article[1] it has about 34 million parameters.
    https://jackcook.com/2023/09/08/predictive-text.html
  - astrange 2 years ago
    
    There's no difference. An LLM is just a transformer language model that's "large".
  - catoc 2 years ago
    
    Yeah, they meant a TLM
    
    0x1ceb00da 2 years ago
    
    That's print('Hello, world!')
- dontlaugh 2 years ago
  
  It’s probably why autocomplete got drastically worse, to the point I’m considering turning it off entirely.
  Most “AI” features are so incredibly fragile they’re not worth deploying.
  - jachee 2 years ago
    
    Autocomplete got worse because the new system in iOS17 didn’t retain training data from prior versions. It reset everyone back to untrained. I’ve been manually correcting specialized language I use daily (e.g. “iRacing”) on my 12 (non-pro) since iOS17 release, and now it gets it correct 99.5% of the time.
    So, rather than turning it off, manually correct the incorrect completions and use the suggested words bar frequently and it will learn how you type. It’s just having to start over after tossing out several OSes worth of training that makes it feel worse.
  - alphabettsy 2 years ago
    
    It seems to have reset, but I find it’s actually much better than before after some intervention/training when I first updated.
- wenc 2 years ago
  
  It's a GPT2 model. It hasn't changed the autocomplete experience that much (occasionally I'll see a word completion).
- dnw 2 years ago
  
  I have noticed Siri now describes pictures sent to Messages.
fennecbutt 2 years ago

Nobody can tame LLM models yet not even Apple.
I can still get chatgpt to say the most vile things and if Apple release something on device I'll get that to be a bad, baaaad robot, too.
LLMs are not yet safe for public facing production use,imo.
zitterbewegung 2 years ago

Next year releases of macOS / iOS are rumored to have LLMs as a feature .
schleck8 2 years ago

Yes, their hardware is positioned phenomenally with little RAM even by phone standards which is what you'd hack around with for inference on mobile architectures.
cedws 2 years ago

What are you going to do with it?
CaptainOfCoit 2 years ago

You're unlikely to get a better experience with Siri if she becomes equipped with a 7B or 13B LLM, unless Apple figured out something revolutionary.
- jurmous 2 years ago
  
  Released 2 days ago by Apple, a research paper on methods to run larger llms on iPhones.
  https://www.macrumors.com/2023/12/21/apple-ai-researchers-ru... https://arxiv.org/pdf/2312.11514.pdf
  - reissbaker 2 years ago
    
    The paper was definitely cool but doesn't allow you to run particularly large LLMs on iPhones. It allows you to run a certain kind of LLM (sparse ReLU based LLMs) whose weights are somewhere less than 2x RAM. So, 7b Falcon works, but the competitive-with-gpt-3.5-turbo LLMs are still out of reach (and aren't ReLU based, although maybe that could change in the future). And nothing is competitive with GPT-4 right now.
    Of course in the long run I think it will happen — smaller and more efficient models are getting better regularly, and Apple can also just ship their new iPhones with larger amounts of RAM. But I'd be very surprised if there was GPT-4 level intelligence running locally on an iPhone within the next couple years — that sized model is so big right now even with significant memory optimizations, and I think distilling it down to iPhone size would be very hard even if you had access to the weights (and Apple doesn't). More likely there will be small models that run locally, but that fall back to large models running on servers somewhere for complex tasks, at least for the next couple years.
    
    mikhailt 2 years ago
    
    Yea but it's likely to be better than the current iteration of Siri even in that state.
    They can still outsource to a much larger LLMs on their servers for anything that can't be done locally like they do now.
    
    SpaceManNabs 2 years ago
    
    > And nothing is competitive with GPT-4 right now.
    You mean nothing available? Or you mean nothing that public knows exists? The answers to those two questions are different. There are definitely products that aren't available but the public knows exist and are upcoming that are in GPT-4's ballpark.
    
    reissbaker 2 years ago
    
    I mean nothing that is able to be benchmarked and validated by third parties is GPT-4 quality. I know there are upcoming releases that are hyped as being equal to GPT-4, e.g. Gemini Ultra, which I am very excited to get my hands on — but regardless, Ultra is not small enough to run on phones, even using the sparse ReLU flash memory optimization. And we'll see how it benchmarks once it's released; according to some benchmarks Gemini Pro has somewhat underperformed GPT-3.5-Turbo [1], despite Google's initial claims. (Although there are criticisms of that benchmarking, and it does beat the current 1106 version of GPT-3.5-Turbo on the Chatbot Arena leaderboard [2], although it slightly underperforms the previous 0613 version.)
    1: https://arxiv.org/pdf/2312.11444.pdf
    2: https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboar...
    
    olddustytrail 2 years ago
    
    Easy to claim but harder to prove. Name one.
    
    schleck8 2 years ago
    
    I heard rumours of these claims a few weeks ago, I assume they are talking about the same thing. Nothing concrete but from a reputable person and honestly with how well mixtral performs on the chatbot arena elo board I wouldn't be surprised if it's true.
- nexuist 2 years ago
  
  Siri is really quite dumb. I am confident that a 7B model would be able to provide better responses in over 90% of user queries. I can't even get Siri to reliably set a timer.
  - CaptainOfCoit 2 years ago
    
    Yes, Siri is really dumb. But so is every 7B/13B model out there too.
    
    kossTKR 2 years ago
    
    Eh no, 7B Mistral / Deepseek would certainly almost already be able to function as a super Siri, but probably something closer to PHI-2 + the new MLX apple optimisations. Have you tried those? https://twitter.com/awnihannun/status/1735355067673526360
    If trained on an iPhone API + documentation and given a little web access it would blow absolutely everything out of the water.
    If they can already create -basic- Python/Swift/JS/rust apps that sets timers, save things, create lists, how's that too dumb for being a Siri replacement? They just have to give it access to an iPhone/Web Api like ChatGPT's code analysis tool.
    So if you ask it "hey siri do this, this and this", it will create a script, then run it on the internal API, or fetch an article then work on that etc.
    I know it's still logically "dumb" but i'm not trying to play game theoretical scenarios with my phone or do logic tests or advanced math (yet).
    
    skygazer 2 years ago
    
    That sounds amazing and also the jailbreak of it via adversarial voice prompting sounds like a horrific vulnerability.
    
    kossTKR 2 years ago
    
    True but you could make the api restricted, having certain routes completely locked, some requiring double checks, some requiring on screen approval or face-id, throttling outside fetches, only being able to run get and not etc, no financial app control etc.
    But yeah "hey siri transfer all of my funds to eric", or "hey siri group all of my photos where i'm nude and send them to jack" are new almost sci fi vectors.
    
    sigmar 2 years ago
    
    Ask perplexity7B-online anything and then compare it to siri. https://labs.perplexity.ai/
    
    bugglebeetle 2 years ago
    
    Depends on if they implement some form of function calling, really. If something like a 7B Mistral fine-tune had access to search and various applications, I imagine it would perform fine and better than Siri.
- bbor 2 years ago
  
  Note that “using an LLM” doesn’t just mean “plugging user queries straight into an LLM”. Enhancing Siri will probably be an ensemble project.
  - dbish 2 years ago
    
    This is part of the mismatch between comparing Alexa/Siri/Cortana to a chat based LLM right now. If you just want chat and info retrieval, today’s LLMs are way better then the conversational dialogue, search, and q&a capabilities any of those assistants have. But, if you want relatively predicable task completion for things like smart home control, timers and alarms, real time weather updates, etc. (basic but frequently used interactions) or any integration with your phone or computer directly, there’s a lot to do that isn’t solved yet in the LLM space and is more of a system integration and action choice problem that the existing assistants have been hammering away at for years.
    
    bbor 2 years ago
    
    I would argue that “info retrieval” is also something the LLM space has yet to yet to solve to a human level of reliability, but I think your comment is right on. I see this all as part of the greater symbolic vs. stochastic (/neat v scruffy) dynamics
    
    spookthesunset 2 years ago
    
    I would hope that it would at last get you out of the hell that is “what the heck did I name that stupid light?”. Device naming is, in my opinion, the worst part of any of the voice based home assistant things.
    Is it “porch led”, “porch light” or “outdoor light”? Or is “outdoor light” actually the one in the front yard? What is the one by my kids nightstand? And what routine name do I use to set the mood for watching a movie?
    I would hope a properly trained llm with an awareness of my devices and their locations would allow for more verbal ambiguity when controlling things.
    
    dbish 2 years ago
    
    Maybe, seems like overkill to solve a relatively straight forward problemt that is about resolution to real entities with a generative model that has no entity grounding (and may introduce other problems). It's really just not solved because Alexa/Siri leadership doesn't actually care enough about that use case to solve it. Device entity resolution and disambiguation does not require an LLM to solve for a limited number of devices, it just requires people to prioritze that over other things, and smart home device owners are the kind of early adopter that would not be the current market focus for growth (my guess, haven't worked on Alexa in many years).
    I know a half dozen different ways to improve that today without LLMs.
- s3p 2 years ago
  
  Why would that be?
- ghqst 2 years ago
  
  Have you ever actually used Siri?
  - yreg 2 years ago
    
    Yes, I've been trying it out regularly ever since it was released. Last time I've talked to it for like 30 minutes while driving in October (just to test it again). It simply doesn't work for me.
bbor 2 years ago

I really, really doubt it for one reason: I’m convinced Apple is still terrified of that “Amazon Alexa tells child to stick a penny in a socket” story, and will hamstring themselves in an attempt to have their agential cake and eat it too
- thebruce87m 2 years ago
  
  They are right to be careful, they are held to a much higher standard than their competitors.
  Pixel phones have had emergency call issues for years across multiple models but they just get a pass. Apple would be crucified for this.
  - astrange 2 years ago
    
    Sounds like a regulator issue. Doing emergency calls is a phone's #1 job, they shouldn't be allowing them to be sold if they don't work.
    
    Baldbvrhunter 2 years ago
    
    yet in 15+ years I have never used any of mine for that
    
    astrange 2 years ago
    
    Good for you. It's the only thing phones are required to be able to do in the US even if they don't have a working SIM or you haven't paid the phone bill.
    Well, I guess they're also not allowed to cause RF interference or randomly catch fire.
    
    Baldbvrhunter 2 years ago
    
    sure, but that doesn't make it their #1 job
    my pyjamas are regulated to be fireproof too
- behnamoh 2 years ago
  
  Apple is all about a controlled pleasant experience, it doesn't matter if it doesn't give you shiny new things; most Apple customers don't even know those shiny new things exist, so they keep spreading the word that "Apple is so easy and simple."
  The idea of having an unpredictable LLM in the ecosystem is Apple's worst nightmare. I bet they will overly restrict it to the point that it stops being a general purpose LLM and becomes a neutered obedient LLM that always acts according to Apple's rules.
  Also, it doesn't help that ALL the authors of this Apple paper are chinese. It raises questions about how Apple will handle political debates with its LLM.
  - astrange 2 years ago
    
    > Also, it doesn't help that ALL the authors of this Apple paper are chinese. It raises questions about how Apple will handle political debates with its LLM.
    The CCP thinks it owns all Chinese people on Earth, but that doesn't mean you have to agree with them!

smoldesu 2 years ago

> FERRET is trained on 8 A100 GPUs with 80GB memory.

Huh, even Apple isn't capable of escaping the CUDA trap. Funny to see them go from moral enemies with Nvidia to partially-dependent on them...

ssijak 2 years ago

I guess they also have Samsung fridges in the offices..
- causal 2 years ago
  
  And probably Intel processors and Linux in their datacenters.
  - cryogenicfire 2 years ago
    
    Well apple was a prime Intel client for years until they released M1, and ARM on the cloud isn't really a thing for now... Ultimately it's all about what makes the most sense for what will make the most money, and on a datacenter that means x86 with Linux/Unix
  - apapapa 2 years ago
    
    And Samsung components in iphones ...
- amelius 2 years ago
  
  And they use CAD software running on Windows (it simply doesn't exist on MacOS)
  - sublimefire 2 years ago
    
    This is a false statement, you have to be specific which one is not available on macos, there are plenty you can use already. Even freecad runs on macos.
- smoldesu 2 years ago
  
  I don't get it, does Apple also make fridges now?
  - ayewo 2 years ago
    
    They are implying that even though Apple is a wealthy consumer hardware company that has major spats with nVidia and Samsung, it doesn't always make economic sense to make tools they might need in-house when they can simply buy them from a rival.
    So rather than invest engineering resources to re-imagine the fridge, they can simply buy them from established manufacturers that make household appliances like Samsung, Sony etc.
    
    MBCook 2 years ago
    
    Not only that, don’t they buy display panels and maybe even storage or RAM chips from Samsung?
    Once two giant companies are dealing with each other it can get really complicated to cut everything off.
    
    lern_too_spel 2 years ago
    
    Apple doesn't make silly charts saying they make better refrigerators.
    
    airstrike 2 years ago
    
    Because they don't sell refrigerators
    
    lern_too_spel 2 years ago
    
    The point of this thread is that even though Apple makes silly charts saying how good their hardware is at ML, they use products that their silly charts say aren't as good.
    There is no such hypocrisy if they use Samsung refrigerators.
    
    astrange 2 years ago
    
    ML inference and training are not the same task.
    
    lern_too_spel 2 years ago
    
    This plot is about general GPU performance, not pure inference. https://www.apple.com/newsroom/2022/03/apple-unveils-m1-ultr...
    Training the model requires inference for forward propagation, so even then, for your comment to be relevant, you'd need to find a plot that Apple uses to compare inference on quantized models versus Nvidia, which doesn't exist.
    
    smoldesu 2 years ago
    
    ...and doing either of those things with CUDA is impossible on Mac. Why? Because Apple burned their bridge with Nvidia and threw a temper tantrum, that's why. Now Nvidia can't support MacOS, even if they wanted.
    That's kinda the point of my original comment. Apple claims to know what's best, but contradict themselves through their own actions. We wouldn't be in awkward situations like this if Apple didn't staunchly box-out competitors and force customers to follow them or abandon the ecosystem. It's almost vindicating for people like me, who left MacOS because of these pointless decisions.
  - p_j_w 2 years ago
    
    No, they don't build compute clusters either.
cryogenicfire 2 years ago

I feel like Apple is only testing the waters with AI right now, but perhaps if they get involved enough they'll spend money on their own compute infrastructure? Nvidia is kind of the king at GPU compute right now, and developing comparable hardware is no small or cheap task, but I think Apple is in a very good position to be able to make it work---if they decide to invest in it. But honestly, as far as corporate feud goes, I feel like companies will happily suck it up if it makes some process cheaper and/or easier
- MBCook 2 years ago
  
  > But honestly, as far as corporate feud goes, I feel like companies will happily suck it up if it makes some process cheaper and/or easier
  That’s what I think is going on. Apple hated being on the hook for Nvidia’s terrible drivers and chipset/heat problems that ended up causing a ton of warranty repairs.
  In this case they’re not a partner, they’re just a normal customer like everyone else. And if Intel comes out with a better AI training card tomorrow Apple can switch over without any worry.
  They’re not at the mercy of Nvidia like they were with graphics chips. They’re just choosing (what I assume to be) the best off the shelf hardware for what they need.
whalesalad 2 years ago

Apple silicon is good but it’s designed for a portable. Even the studio and Mac Pro are just laptop chips stitched together. They gotta use heavy duty gear to do heavy duty shit. I know they have a soured relationship with nvidia tho so I would like to see them bolster the AMD/rocm ecosystem. Chances are they’re working on their own stuff here too, though. They are sitting on billions of dollars of liquid cash so I’d imagine they’re using that for some serious R&D.
amelius 2 years ago

Dependent is a strong word. At the end of the day all these DL models run on any hardware, and you can easily swap out one type of hardware for another perhaps with some small performance impact. They're commodities, basically.

moneycantbuy 2 years ago

anyone know what is the best open source model that allows commercial use and can run locally on an iphone?

BrutalCoding 2 years ago

I’ve made an example app for a Flutter plugin I created that can do this.
Open-source, runs natively on all major platforms. I shared videos showing it on my iPad Mini, Pixel 7, iPhone 12, Surface Pro (Win 10 & Ubuntu Jellyfish) and Macs (Intel & M archs).
By all means, it’s not a finished app. I simply wanted to use on-device AI stuff in Flutter so I started with porting over llama.cpp, and later on I’ll tinker with porting over whatever is the state of the art (whisper.cpp, bark.cpp etc).
Repo: https://github.com/BrutalCoding/aub.ai
For any of your Apple devices, use this: https://testflight.apple.com/join/XuTpIgyY
App is compatible with any GGUF files, but it must be in the ChatML prompt format otherwise the chat UI/bubbles probably gets funky. I haven’t made it customizable yet, after all - it’s just an example app of the plugin. But I am actively working on it to nail my vision.
Cheers, Daniel
mandelken 2 years ago

Mistral 7B is pretty good and the instruct v0.2 runs on my iPhone through MLC Chat.
However, the ChatGPT4 app is much better in usability: better model, multi-modal with text/vision/speech and better UI.
- hackernewds 2 years ago
  
  gpt 4 allows commercial use?
  - satvikpendem 2 years ago
    
    Why wouldn't it? They sell the API for a reason.
  - WhitneyLand 2 years ago
    
    Yes and no.
    You can use it commercially but there are some restrictions, including some of a competitive nature, like using the output to train new LLMs. This is the restriction that Bytedance (Tiktok) was recently banned for violating.

SushiHippie 2 years ago

> Usage and License Notices: The data, and code is intended and licensed for research use only. They are also restricted to uses that follow the license agreement of LLaMA, Vicuna and GPT-4. The dataset is CC BY NC 4.0 (allowing only non-commercial use) and models trained using the dataset should not be used outside of research purposes.

Wait, how did "GPT-4" get in there?

simonw 2 years ago

Presumably because GPT-4 generated training data was used somewhere along the line - maybe by Vicuna.
mckirk 2 years ago

Their evaluation stack uses GPT-4 to rate the answers, so that might also be the reason why that's in there.
owenversteeg 2 years ago

Huh, interesting, that's Apple just openly saying that GPT-4 was used in the training.
adastra22 2 years ago

Lawyers.

a_rahmanshah 2 years ago

Can we run this on macOS?

Jackson__ 2 years ago

>Ferret: A Multimodal Large Language Model

What I thought when reading the title: A new base model trained from the ground up on multimodal input, on hundreds to thousands of GPUS

The reality: A finetune of Vicuna, trained on 8xA100, which already is a finetune of Llama 13b. Then it further goes on to re-use some parts of LLava, which is an existing multimodal project already built upon Vicuna. It's not really as exciting as one might think from the title, in my opinion.

basiccalendar74 2 years ago

this seems like a good but small research project by a research team in Apple. far away from what product teams are working on for next generation of apple products.
ipsum2 2 years ago

The innovation is the modification of the neural network architecture to incorporate the spatial-aware visual sampler. The data and existing models are not the interesting part.
foxhop 2 years ago

Thanks for the summary.

CaptainOfCoit 2 years ago

Maybe the abstract of the paper is a better introduction to what this is:

> We introduce Ferret, a new Multimodal Large Language Model (MLLM) capable of understanding spatial referring of any shape or granularity within an image and accurately grounding open-vocabulary descriptions. To unify referring and grounding in the LLM paradigm, Ferret employs a novel and powerful hybrid region representation that integrates discrete coordinates and continuous features jointly to represent a region in the image. To extract the continuous features of versatile regions, we propose a spatial-aware visual sampler, adept at handling varying sparsity across different shapes. Consequently, Ferret can accept diverse region inputs, such as points, bounding boxes, and free-form shapes. To bolster the desired capability of Ferret, we curate GRIT, a comprehensive refer-and-ground instruction tuning dataset including 1.1M samples that contain rich hierarchical spatial knowledge, with 95K hard negative data to promote model robustness. The resulting model not only achieves superior performance in classical referring and grounding tasks, but also greatly outperforms existing MLLMs in region-based and localization-demanded multimodal chatting. Our evaluations also reveal a significantly improved capability of describing image details and a remarkable alleviation in object hallucination.

https://arxiv.org/abs/2310.07704

devinprater 2 years ago

This is going to be great for accessibility! Imagine being blind and loading up a video game and using this to figure out what's around, having everything described locally. I mean, um, well that's what I'd use it for anyway. But knowing Apple, we won't be able to prompt the LLM directly so that probably won't happen until 5 years from now.
- MBCook 2 years ago
  
  The Magnifier app on iOS can already describe whatever you point your phone at in iOS 17.
  It’s not going to know an orc from a health potion, but they’re certainly working on the idea in the everyday stuff domain.
barbecue_sauce 2 years ago

>>spatial referring
I can't seem to nail down the meaning of this phrase on its own. All the search results seem to turn up are "spatial referring expressions".
- nmstoker 2 years ago
  
  Yes, I wondered whether "referring" had some special meaning, since the way they seem to use it suggests the word reference would normally be more appropriate there (unless it's a special meaning that warrants the different word).
- TrueDuality 2 years ago
  
  I'm just inferring myself, but I believe it's referring to discussing things in the foreground / background or in a specific location in the provided image (such as top right, behind the tree, etc) in user queries.
- lukasb 2 years ago
  
  It sounds like the "region inputs" are raster or vector inputs. So I'm imagining highlighting a region of the photo with my finger and having it tell me "that's the Duomo in Florence."
samstave 2 years ago

This will make Drone-based AI image context for behavior extremely powerful - especially when aspects of that MLLM handling for spatial-sitrep extremely precise for autonomous movement, then ultimately for decision making WRT interacting with humans (positive interactions and negative interactions).
Is it just me, or doesnt this MLLM seem particularly useful for flying objects with vision?
s3p 2 years ago

Is it just me or did they include as many buzzwords as possible in technical writing?

freedomben 2 years ago

> Usage and License Notices: The data, and code is intended and licensed for research use only.

dbish 2 years ago

Many big “open source” releases in the AI community recently are not licensed for commercial use. Not really OSS at that point (ex:fuyu model from adept)
- fragmede 2 years ago
  
  I think the term should be "model available" rather than open source.
echelon 2 years ago

Boo.
But what do we expect from these giants? They're not going to create fertile ground for new competition. The only businesses they foster are those living under thumb and paying tax.
I guess I at least hoped for "commoditize the compliments" here. Make Google and OpenAI broadly less special.
- cyanydeez 2 years ago
  
  it's more likely it's all "stolen" and this is CYA
  - MBCook 2 years ago
    
    I seriously doubt that. I’m sure Apple got the rights to whatever they need, it’s not like they’re short on money.
    But the fact that they licensed it doesn’t mean that license can be transferred to other people. So it may be that they can only release it for research under the terms of the licenses they got.

andy99 2 years ago

One big plus if this takes off as a base model is the abundance of weasel family animals to use in naming the derivatives. Ermine, marten, fisher, ... I'd like to call Wolverine. Llama didn't have much room for some interesting variety beyond alpaca and vicuna.

behnamoh 2 years ago

Yes, because that's the main concern and limitation in the LLM community. /s
If anything, I think people should use meaningful and relevant names, or invent new ones.

ZeroCool2u 2 years ago

We're watching Apple fill the moat in.

jonahbenton 2 years ago

Dig the moat out, I think you mean ;)
tomrod 2 years ago

Here it comes!
FredPret 2 years ago

How so?
- colesantiago 2 years ago
  
  Running Multimodal LLMs on device and offline, i.e LLMKit for free equaling GPT-3.5 / 4 then Google will follow on Android.
  Ability to download / update tiny models from Apple and Google as they improve, à la Google Maps.
  No need for web services like ChatGPT.
  - FredPret 2 years ago
    
    So Apple is filling in ChatGPT's moat then, not their own? Pardon my confusion
    
    CharlesW 2 years ago
    
    I believe that's the point the parent commenter was trying to make, although as the leaked Google document noted, "[Google has] no moat and neither does OpenAI".
    This is more evidence that Apple is investing in building a MLLM as good as anything OpenAI and Google can build, albeit in a more Apple-y way (privacy-first, licensed content, etc.).
    
    colesantiago 2 years ago
    
    Yes, it looks like Apple is going after everyone and anyone that has a web based LLM, ChatGPT, Poe, Claude, etc. via developer kits LLMKit that can work offline.
    This will only work if their models (even their tiny or even medium / base models) equal (or are better than) GPT-3.5 / 4.
    From there, Google will follow Apple in doing this offline / local LLM play with Gemini.
    OpenAI's ChatGPT moat will certainly shrink a bit unless they release another powerful multimodal model.
    
    turnsout 2 years ago
    
    Apple's moat has been and continues to be their insanely large installed base of high-margin hardware devices. Meanwhile, LLMs are rapidly becoming so commoditized that consumers are already expecting them to be built-in to every product. Eventually LLMs will be like spell check—completely standard and undifferentiated.
    If OpenAI wants to survive, they will need to expand way beyond their current business model of charging for access to an LLM. The logical place for them to go would be custom chipsets or ARM/RISCV IP blocks for inference.
m3kw9 2 years ago

OpenAI can just copy this.
- pridkett 2 years ago
  
  Yes, OpenAI can copy this, but they’ll still have less of a moat. That’s the problem with moats, once they’re gone even if you copy what others do, you don’t have a moat anymore.
  Think of it in a physical sense. OpenAI is a high walled castle surrounded by a physical moat. This protects them and their business model. Apple comes along and builds a super tall tower right next to the moat. They can now see into OpenAI’s castle, fire arrows, catapult in a giant wooden badger, etc. Even if Open AI copies the design of Apple’s really tall tower and built it behind the moat and castle walls, it wouldn’t do much because Apple still would be able to get stuff over the moat and walls. The moat doesn’t matter anymore for the most part. The castle (OpenAI) can be compromised and needs bigger walls, relocating to someplace with a bigger, or a way of attacking the tower (Apple). Copying doesn’t really accomplish any of those three.
- yreg 2 years ago
  
  They cannot integrate it deeply into Apple's platforms.
  - colesantiago 2 years ago
    
    Hence OpenAI is looking at hardware themselves.
    https://www.reuters.com/technology/chatgpt-owner-openai-is-e...
    https://techcrunch.com/2023/09/27/openai-is-reportedly-in-ta...
    
    philistine 2 years ago
    
    Never bet against the phone. Right now it's an essential component of any winning move in tech.
    
    smoldesu 2 years ago
    
    If you do bet on the phone, make sure you only bet on the first-party OEM. API depreciation, exclusive entitlements, user restrictions, payment-processing or flat-out Sherlocking are all realistic and credible threats to your business.
    As they say, the house always wins.
    
    nicce 2 years ago
    
    It does not matter much if they do not make competitor for iPhones and get consumers to choose them. Because consumers keep buying iPhones and will have then Apple hardware.
    And they cannot bring software ecosystem for their hardware without Google, at least easily.
    
    smoldesu 2 years ago
    
    It doesn't matter because Apple would not offer OpenAI actual integration terms, period. Their only option is to create an alternative hardware platform, because Apple fights tooth-and-nail to pick and choose what software iPhone users are allowed to run.
    
    nicce 2 years ago
    
    That is what I was saying…
- daralthus 2 years ago
  
  Don't think they have AR Glasses just yet.

orenlindsey 2 years ago

Has anyone actually run this yet?

Rucadi 2 years ago

I wonder if these models are trained to have some kind of identification in case you use them for non-research purposes for example.

"Tell me who is your manufacturer" for example

chefandy 2 years ago
From Bard:
My situation is a bit unique, so the term "manufacturer" might not be the most accurate way to describe who created me. Here's a breakdown of what you need to know:
```
    Developed by Google AI: I was created by a team of researchers and engineers at Google AI, specializing in language models and artificial intelligence.
    Trained on a massive dataset: My knowledge and abilities come from being trained on a massive dataset of text and code, containing books, articles, code, and other forms of information.
    Continuously learning and evolving: I'm still under development, constantly learning and improving as I interact with users and process new information.
```
So, while I don't have a single manufacturer in the traditional sense, I'm the result of collaboration and advancement in AI research and development at Google.
I hope this helps clarify things! Let me know if you have any other questions.
- SpaceManNabs 2 years ago
  
  Why was this downvoted? It didn't answer the question, but it showed that there is a sort of imprint that GP was asking about.
  And it saves everyone a tab's worth of effort.
  - MBCook 2 years ago
    
    Usually I would almost automatically vote down a comment where someone just stuck something into a LLM and pasted the output. It almost never adds to the discussion.
    However in the case that we’re talking about the kind of output generated by the LLM in some circumstance, it can be instructive. Like you noted this is a perfect example.
  - chefandy 2 years ago
    
    I guess I didn't make it clear enough that the response was an answer to the example question in the comment I responded to, verbatim. I thought it interesting. Guess others didn't.
behnamoh 2 years ago

Easy to get rid of that by a little fine tuning and system prompting.

cpressland 2 years ago

Finally, some decent competition for Not Hotdog!

slau 2 years ago

I think you just put a smile on Tim Anglade’s face by mentioning this.
https://news.ycombinator.com/item?id=14636228

tambourine_man 2 years ago

> FERRET is trained on 8 A100 GPUs

So Apple uses NVidia internally. Not surprising, but doesn't bode well for A Series. Dogfooding.

[edit] I meant M series, Apple Silicon

sxg 2 years ago

By "A series" are you referring to the Nvidia A100 or the Apple A-series iPhone/iPad chips? If the latter, I don't think you can draw that conclusion. Training has memory and processor requirements that are very different from inference. You don't need iPhones and iPads to train models—you need them to run models. These are two very different things.
- tambourine_man 2 years ago
  
  Apple Silicon, sorry for the ambiguity. Apple sells Macs too. That’s where I’d hope they would train their models.
cryogenicfire 2 years ago

I feel like Apple is only testing the waters with AI right now, but perhaps if they get involved enough they'll spend money on their own compute infrastructure? Nvidia is kind of the king at GPU compute right now, and developing comparable hardware is no small or cheap task, but I think Apple is in a very good position to be able to make it work---if they decide to invest in it. But honestly, as far as corporate feud goes, I feel like companies will happily suck it up if it makes some process cheaper and/or easier
- tambourine_man 2 years ago
  
  I think you’re completely correct, but if they were caught off guard by the AI train, they shouldn’t be testing the waters now. It should be treated as an existential threat.
  - cryogenicfire 2 years ago
    
    I've always wondered why Apple is never in the conversation of modern AI, like until now it almost feels like they've been just watching from the sidelines without taking part in all the commotion
    Maybe they can just afford to observe before making major decisions on the direction of the company... For a company like Apple, I feel like they won't lose their customer-base just because they are taking the AI race slowly, in fact Apple has often been late to introducing very common feature
    
    cryogenicfire 2 years ago
    
    I hit reply before I finished typing XD
    ... Anyways, my point being that Apple will gladly introduce a polished product a couple of years after everyone else has already done it, and their target audience will still applaud their work and give them money. Apple for some reason simply _can_ afford to test the water
    
    fragmede 2 years ago
    
    You can edit your comments shortly after you make them.
hhh 2 years ago

Why would they dogfood Apple Silicon for training models? Seems like a waste of developer time to me.
- tambourine_man 2 years ago
  
  Apple doesn’t even sell NVidia cards on their Mac Pros. Are they training it on Linux?
  I think Apple would strive to be great at all computing related tasks. “Oh, Macs are not good for that, you should get a PC” should make them sad and worried.
  AI/LLM is the new hot thing. If people are using Windows or Linux, you’re loosing momentum, hearts and minds… and sales, obviously.
  - hermannj314 2 years ago
    
    If a train with a GM diesel engine delivers raw materials to a Ford factory for making F150s, you would conclude that consumers whould start driving trains to work?
    Is that your argument?
    
    tambourine_man 2 years ago
    
    Not at all, just that the engineers at Ford would be more proud if the train used their own diesel engine. And that this kind of thing affects public perception. “Ford is not for heavy duty”
  - nicolas_17 2 years ago
    
    Apple doesn't even support NVidia cards on their Mac Pros. The technical details are above my head, but the way Apple M* chips handle PCIe make them incompatible with GPUs and other accelerator cards. Whether you use macOS or Linux.
  - causal 2 years ago
    
    Don't think that follows. I doubt all their cloud services run on Apple hardware. They make consumer devices.
    
    nicolas_17 2 years ago
    
    AFAIK all their cloud services run on x86 hardware with Linux, including Xcode Cloud (which runs macOS in a QEMU VM, in a way that only Apple can technically and legally do).
    
    threeseed 2 years ago
    
    It has always been x86 Linux going back to the NeXT days.
    Today they have large scale Kubernetes ands some legacy Mesos clusters.
    
    bjtitus 2 years ago
    
    This
    They ceded the data center environment years ago.
  - Gorgor 2 years ago
    
    But no one is training these kinds of models on their personal device. You need compute clusters for that. And they will probably run Linux. I'd be surprised if Microsoft trains their large models in anything else than Linux clusters.
    
    SpaceManNabs 2 years ago
    
    > But no one is training these kinds of models on their personal device
    on-device transfer learning/fine tuning is def a thing for privacy and data federation reasons. Part of the reason why model distillation was so hot a few years ago.
    
    tambourine_man 2 years ago
    
    Apple used to sell servers. I don’t thing they should settle for “just use Linux” in such and important field.
    
    MBCook 2 years ago
    
    Why does the OS matter for training models?
    Apple would want to train models as fast as they could. Nvidia provides an off the shelf solution they can just buy and use for a very reasonable price and sell on the second hand market.
    If they wanted to use their own hardware they would either need more of it, which would cost a lot and divert production from sellable devices; or they would need to make special chips with much bigger neural engines, which would cost even more.
    Also Apple uses public clouds for service stuff. They may not even own any hardware and just be renting it from AWS/Azure/GCP for training.
    
    tambourine_man 2 years ago
    
    I feel like I’ve answered the issues you raised in this thread already.
    
    hhh 2 years ago
    
    > Used to
    Exactly, over a decade ago...
  - shepherdjerred 2 years ago
    
    > I think Apple would strive to be great at all computing related tasks. “Oh, Macs are not good for that, you should get a PC” should make them sad and worried.
    What percent of Apple's customers train models? Does it even crack 1%?
    Apple already fails for many types of computing, e.g. any workflow that requires Windows or AAA gaming.
    
    tambourine_man 2 years ago
    
    Right, and they are trying to attract games and gamers. “Macs are bad at gaming” shouldn’t be something they settle with.
  - tjohns 2 years ago
    
    Apple doesn't even make rack-mount server hardware. It's not that surprising.
    Apple makes very capable, efficient devices for end users and content producers. End users do not normally need to train new models.
    
    tambourine_man 2 years ago
    
    They used to sell the best developer’s machines. Before Docker, at least. Don’t developers need to train models? And if they are doing so on Linux or Windows, wouldn’t that make it easier for them to better or exclusively support the end product on the one they’re more used to?
    You want to be the platform where the excitement is happening.
    
    microtonal 2 years ago
    
    They do make rack mount hardware:
    https://www.apple.com/shop/buy-mac/mac-pro/rack
    Don't know if that qualifies as server :).
  - threeseed 2 years ago
    
    > Are they training it on Linux
    Yes. Apple has always run their servers on Linux e.g. App Store, iTunes.
    And training isn't right now an end user activity but something reserved for server farms.
hmottestad 2 years ago

Apple had a falling out with Nvidia a number of years ago. I believe they were using chips from Nvidia in their MacBook Pros, the first to come with both integrated and discrete graphics, but the solder between the chips and the motherboard kept cracking and a rather large number of MacBooks needed to be repaired.
https://techcrunch.com/2008/12/09/scientists-nvidia-put-faul...
dcchambers 2 years ago

As long as the inference can be done locally on their chips I don't at all think it's a big deal to train models on Nvidia/other hardware.
Are all the iCloud servers running on Apple silicon? I assumed they were running on standard rack mounted hardware.
- tambourine_man 2 years ago
  
  I think Apple considers cloud infrastructure a necessary evil and a commodity.
  AI isn’t, yet at least, and I don’t think they can afford to treat it as such.
gooob 2 years ago

yeah, aren't the new M3 chips supposed to be really good for ML training?
woke_neolib 2 years ago

Apple apparently uses Google Cloud, so it's that or TPUs!
- tambourine_man 2 years ago
  
  They use many clouds. But LLM should be their core business and they usually don’t outsource that.
  - blackoil 2 years ago
    
    That's not how s/w or any development work. Even if there is a team working on M4 SuperUltra++ which competes with H100, in meanwhile, s/w team will continue to use what is available which may be Intel/AMD PCs with Nvidia GPUs or GoogleCloud/Azure/AWS.
    
    tambourine_man 2 years ago
    
    Sure, but there’s also the possibility that they aren’t developing the SuperUltra++

halyconWays 2 years ago

I'm glad Apple invented AI. Now they'll put a fancy new name on it and consumers will believe it.

Thorrez 2 years ago

Does Apple know that ferrets are illegal in California?

https://www.legalizeferrets.org/

jonplackett 2 years ago

Presumable because this is Conda none of this can be run on any Apple hardware despite people managing to get M processors to do a bit of dabbling with AI?

_visgean 2 years ago

> because this is Conda none of this can be run on any Apple hardware
conda supports m1? https://www.anaconda.com/blog/new-release-anaconda-distribut...
- jonplackett 2 years ago
  
  Did not know that!

Settings

Ferret: A Multimodal Large Language Model

Keyboard Shortcuts