Releasing 3B and 7B RedPajama

together.xyz

363 points by antimatter15 3 years ago · 114 comments

sphars 3 years ago

Slightly off-topic, but as the parent of a toddler, I got a bit of a chuckle out of the name. It's based on the children's book series "Llama Llama Red Pajama".

  • petesergeant 3 years ago

    It had put me in mind of the Ogden Nash poem:

        The one-l lama,
        He's a priest.
        The two-l llama,
        He's a beast.
        And I will bet
        A silk pajama
        There isn't any
        Three-l lllama.
    • dllthomas 3 years ago

      "*The author's attention has been called to a type of conflagration known as a three-alarmer. Pooh."

  • elkos 3 years ago

    Thanks.

    As a non-native English speaker (though a parent of a toddler too) I wasn't familiar with the book series.

    • AuryGlenz 3 years ago

      As the father of an 18 month old daughter that likes the book, I have it memorized.

  • dllthomas 3 years ago

    I'm holding out for the MadAtMama model.

  • blurbleblurble 3 years ago

    Not off topic at all

  • innagadadavida 3 years ago

    Founder, ex-Apple Siri search. Had a baby a couple of years ago. Not too surprising to me :)

rawrmaan 3 years ago

There was a lot of detail and data in here, but it's not very useful to me because all of the comparisons are to things I have no experience with.

There's really only one thing I care about: How does this compare to GPT-4?

I have no use for models that aren't at that level. Even though this almost definitely isn't at that level, it's hard to know how close or far it is from the data presented.

  • Joeri 3 years ago

    None of the 3B and 7B models are at ChatGPT’s level, let alone GPT-4. The 13B models start doing really interesting things, but you don’t get near ChatGPT results until you move up to the best 30B and 65B models, which require beefier hardware. Nothing out there right now approximates GPT-4.

    The big story here for me is that the difference in training set is what makes the difference in quality. There is no secret sauce: the open source architectures do well, provided you give them a large and diverse enough training set. That would mean it is just a matter of pooling resources to train really capable open source models. That makes what RedPajama is doing, compiling the best open dataset, very important for the future of high-quality open source LLMs.

    If you want to play around with this yourself you can install oobabooga and figure out what model fits your hardware from the locallama reddit wiki. The llama.cpp 7B and 13B models can be run on CPU if you have enough RAM. I’ve had lots of fun talking to 7B and 13B alpaca and vicuna models running locally.

    https://www.reddit.com/r/LocalLLaMA/wiki/models/
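
    Since this is about running things locally: a rough sketch of what CPU inference looks like through the llama-cpp-python bindings, assuming you've already downloaded a 4-bit ggml checkpoint (the path and parameters here are illustrative, not specific to any one model):

        from llama_cpp import Llama  # pip install llama-cpp-python

        # Point this at whatever ggml model file you actually have.
        llm = Llama(model_path="models/7B/ggml-model-q4_0.bin",
                    n_ctx=512,    # context window
                    n_threads=8)  # CPU threads

        out = llm("Q: What are small local LLMs useful for? A:", max_tokens=128)
        print(out["choices"][0]["text"])

    oobabooga wraps roughly the same thing behind a web UI, so you don't need to write any of this to get started.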

    • nullsense 3 years ago

      LLaVA 13B is a great multimodal model that has first class support in oobabooga too.

      It's really fun to enable both the whisper extension and the TTS extension and have two-way voice chats with your computer while being able to send it pictures as well. Truly mind bending.

      Quantized 30B models run at acceptable speeds on decent hardware and are pretty capable. It's my understanding that the open source community is iterating extremely fast on small model sizes getting the most out of them by pushing the data quality higher and higher, and then they plan to scale up to at least 30B parameter models.

      I really can't wait to see the results of that process. In the end you're going to have a 30B model that's totally uncensored and is a mix of Wizard + Vicuna. It's going to be a veryyyy capable model.

      • stavros 3 years ago

        I usually even prefer GPT-3.5, as it's faster and much cheaper. GPT-4 is great for the hardcore logical reasoning, but when I want something that knows to turn my lights on and turn the TV to a channel, it's overkill.

    • Semaphor 3 years ago

      > The llama.cpp 7B and 13B models can be run on CPU if you have enough RAM.

      Bigger ones as well, you just have to wait longer. Nothing for real time usage, but if you can wait 10-20 minutes, you can use them on CPU.

      • int_19h 3 years ago

        It's not even that bad. Core i7-12700K with DDR5 gives me ~1 word per second on llama-30b - that is fast enough for real-time chat, with some patience. And things are even better on M1/M2 Macs.

        • Joeri 3 years ago

          The critical factor seems to be the ability to fit the whole model in RAM (--mlock option in oobabooga). With Apple's RAM prices most M1/M2 owners probably don't have the 32 GB RAM required to fit a 4bit 30B model.
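
          Back-of-envelope, and assuming roughly 4 bits per weight plus a little overhead for the quantization scales:

              params = 30e9            # llama-30b, approximately
              bits_per_weight = 4.25   # 4-bit weights + per-group scales/offsets
              weights_gb = params * bits_per_weight / 8 / 1e9
              print(f"{weights_gb:.1f} GB just for the weights")  # ~16 GB

          Add the KV cache and general runtime overhead on top of that and 16 GB machines are out, which is why 32 GB is roughly the floor for a 4-bit 30B model.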

          • Semaphor 3 years ago

            I have 64 GB RAM, but only a Ryzen 5 3600, and the larger models are very slow ;)

    • azinman2 3 years ago

      Do these red pajama models work with llama.cpp?

  • quickthrower2 3 years ago

    The bit I liked best was the response examples. Look at those. Clearly not as good as GPT-4, but good enough, I feel, that in a scenario where you care about privacy or data provenance this would be a contender.

    For example: a therapist, a search bot for your diary, a company intranet help bot. Anything where the prompt contains something you don’t want to send to a third party.

    • rawrmaan 3 years ago

      That's a great point, I definitely overlooked these. They look pretty good, too, and I agree with your use cases.

      Thanks!

  • blihp 3 years ago

    Then you probably don't care about this (yet)

    Assume a truly competitive model in the Open Source world is still a ways off. These teams and their infrastructure are still in their early days while OpenAI is more at the fine-tuning and polishing stage. The fact that these open teams are able to have something in the same universe in terms of functionality this fast is pretty amazing... but it will take time before there's an artifact that will be a strong competitor.

    • nullsense 3 years ago

      The pace of the progress the open source models are making is pretty astonishing. The smaller model sizes are cheap to train so there is a lot of iteration by many different teams. People are also combining proven approaches together. Then they're going to nail it and scale it. Will be very interesting to see where we are in 3 months time.

  • noman-land 3 years ago

    There's a nice chart in the leaked Google memos that compares some of the open models against ChatGPT and Bard, so you can get a sense of where these models land.

    https://twitter.com/jelleprins/status/1654197282311491592

  • atleastoptimal 3 years ago

    > How does this compare to GPT-4?

    I'll give you the answer for every open source model over the next 2 years: It's far worse

    • MacsHeadroom 3 years ago

      If you'd said that about OpenAI's DALL-E 2 you'd have been wrong.

      I suspect Open Source LLMs will outpace the release version of GPT-4 before the end of this year.

      It's less likely they will outpace whatever version of GPT-4 is shipped later this year, but still very much possible.

      • int_19h 3 years ago

        Open source LLMs might do that, but I very much doubt that those models will be small enough to run even on high-end consumer hardware (like say RTX 3090 or 4090).

        • regularfry 3 years ago

          The way they'll do it, if they do it at all, is to find a way to squeeze the capability into smaller models and get much faster at executing them. That's where the market forces are.

          That's exactly the core of the email that leaked out of Google: it's proving far better to be able to have lots of people iterating quickly (which necessarily means broad access to the necessary hardware) than to rely on massive models and bespoke hardware.

          I'd anticipate something along the lines of a breakthrough in guided model shrinking, or some trick in partial model application that lets you radically reduce the number of calculations needed. Otherwise whatever happens isn't as likely to come out of the open source LLM community.

          • visarga 3 years ago

            > it's proving far better to be able to have lots of people iterating quickly (which necessarily means broad access to the necessary hardware) than to rely on massive models and bespoke hardware

            Very true, but can't Google just wait and take from the open-source-LLM community the findings, then quickly update their models on their huge clusters? It's not like they will lose the top position, already done that.

            • regularfry 3 years ago

              Yes and no. Some of the optimisation techniques that are being researched at the moment use the output of larger models to fine-tune smaller ones, and that sort of improvement can obviously only be one-way. Same with quantising a model beyond the point where the network is trainable. But anything that helps smaller models run faster without appealing to properties of a bigger model that has to already exist? Absolutely yes.

    • detrites 3 years ago

      That seems way off the mark.

      Open source models can already approximate GPT-3.5 for most tasks on common home hardware, right now.

    • fortyseven 3 years ago

      Okay, so "ignore my out of touch opinion of language models". Got it.

andy_xor_andrew 3 years ago

This is beyond exciting. Welcome to the new reality!

On one hand, the resources required to run these models continues falling dramatically, thanks to the techniques discovered by researchers: GPTQ quantizing down to 4, 3, 2, even 1 bits! model pruning! hybrid vram offloading! better, more efficient architectures! 1-click finetuning on consumer hardware! Of course, the free lunches won't last forever, and this will level off, but it's still incredible.

And on the other side of the coin, the power of all computing devices continues its ever-upward exponential growth.

So you have a continuous lowering of requirements, combined with a continuous increase in available power... surely these two trends will collide, and I can only imagine what this stuff will be like at that intersection.

  • quickthrower2 3 years ago

    I would love to see an article on why quantising to low bits works. It seems counterintuitive to me. For example, do that with a CD and it will sound awful. It took smarts to come up with the mp3 format rather than just reducing the number of bits.

    • int_19h 3 years ago

      A very broad answer is that large NNs are surprisingly resilient to inaccuracies, and it seems to be more pronounced as size grows larger. This is readily observable with LLaMA, where 4-bit quantization affects 7B worst of all.

      Furthermore, model size is still the most significant contributor to output quality. E.g. vanilla llama-30b at 4-bit has better perplexity than any llama-13b finetune at 8-bit. Thus, if 4-bit lets you fit a larger model into available (V)RAM, you're still better off.

      This is also why analog computing is seriously considered as a hardware architecture for LLMs: if you don't actually need bit-perfect matmul for things to work well, it can be done much simpler as an analog circuit, and then you can cram a lot more of them on the same chip. Any resulting quality loss would presumably be minor, and in any case would be more than compensated by the much larger model sizes allowed by such architecture.

    • magicalhippo 3 years ago

      Note, I'm not into ML though I've dabbled with NNs as a teen (before deep learning and all that).

      The weights scale the output values from the previous layer, and the weighted values are summed. So it seems to me, instead of having a high-precision weight scale a single output, if you cloned the node in the previous layer M times, you could still have sqrt(M) bits of precision with 1-bit weights (or M bits, my brain is in weekend mode).

      Thus a larger network with lower-precision weights should be able to achieve approximately the same precision as a smaller network with high-precision weights.

      The larger network has more interconnects though, so seems like it could allow for more interesting space to explore during training, leading to better results.

      Then again, I could be entirely wrong.

    • deepsquirrelnet 3 years ago

      A CD doesn’t work as an analogy. Think about it this way — if you build a model and don’t train it at all, it will still have the same number of parameters and take up the same amount of disk space.

      We’re finding out that many models are undertrained for their sizes, and a good option is to post process them into smaller models by teaching a smaller model to mimic their output. Quantization effectively cuts down the model size as well. No loss in quality means that the model has not been trained enough to take advantage of the depth of precision that is available.
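
      For anyone curious what "teaching a smaller model to mimic the output" looks like mechanically, the usual trick is a distillation loss that pushes the student's token distribution toward the teacher's. A minimal sketch (the function name and temperature are illustrative):

          import torch.nn.functional as F

          def distillation_loss(student_logits, teacher_logits, temperature=2.0):
              # Soften both distributions, then measure how far the student's
              # predicted token distribution is from the teacher's.
              t = temperature
              teacher_probs = F.softmax(teacher_logits / t, dim=-1)
              student_log_probs = F.log_softmax(student_logits / t, dim=-1)
              return F.kl_div(student_log_probs, teacher_probs,
                              reduction="batchmean") * t * t

      In practice this term is usually mixed with the ordinary next-token cross-entropy loss on the training data.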

    • specproc 3 years ago

      The analogy I'm currently favouring when talking to semi technical people is that LLMs are a map. We map words and phrases to a coordinate space.

      We can use GPS to locate anything down to a sliding scale of decimal precision. There are only so many digits you need to locate a city or even a house.

    • Der_Einzige 3 years ago

      I think a lot of it is that they are intentionally not measuring the "degradation" in quality experienced. I've noticed that 8-bit quantization of a model like Dolly is significantly worse than the 32-bit version of it. I've seen similar results using quantization with Stable Diffusion - the images really are worse, it's just that the loss at half precision is small enough to be worth the trade-off.

      • readyplayeremma 3 years ago

        What size model are you quantizing and comparing? The interesting thing about quantization is how the larger the number of parameters, the less of a difference it makes to quantize the weights, even to an extreme degree when working with the largest parameter models. For small models it can be a disaster though.

      • quickthrower2 3 years ago

        Thanks. I am more surprised it works at all.

        So do they use the weights that are, say, 32-bit floats and just round them to the nearest something, putting them in a range of 0-255? I guess I can see how it could work if weights are all close to zero, so -1 to 1 is mapped to 0-255.

        But I would have thought the model relied on the higher accuracy during training. So losing that would screw it up.

        • MacsHeadroom 3 years ago

          That commenter is just wrong. We have empirical tests of quality loss due to quantization and even down to 4bits the loss is so negligible no human would ever be able to detect it. The loss only even registers on the benchmarks after generating tens of thousands of full context generations.

          >So do they use the weights that are say 32 bit floats and just round them to the nearest

          That's how they used to do it, and still how 8bit quantization works. That's called "Round to Nearest" or RTN quantization. That's not how it works anymore though.

          The current algorithms (GPTQ, RTPQ, etc.) are more complex, including things like lining up the weights in order of least to greatest, placing them in bins (typically 32 or 128 weights per bin), and then computing an offset for each bin which is added to the RTN value. In some cases bins are identical and redundant and can be re-used without saving the same identical bin twice. These are just a few of the space saving measures which go into effective low-bit quantization without sacrificing quality.

          It's very similar to state of the art video codecs or image compression algorithms. A raw photograph taken by my digital camera is 60MB, but a PNG of the same photo is 30x smaller at 2MB without a single artifact. It should be no surprise that we can reduce models by 4x, 8x, or even more without sacrificing quality.
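
            To make the "bins with an offset" idea concrete, here is a toy per-group round-to-nearest quantizer in numpy. This is only an illustration of the general shape of the technique, not the actual GPTQ algorithm (which also reorders weights and corrects for the error it introduces):

                import numpy as np

                def quantize_groupwise(w, bits=4, group_size=32):
                    levels = 2 ** bits - 1
                    w = w.reshape(-1, group_size)
                    lo = w.min(axis=1, keepdims=True)              # per-group offset
                    scale = (w.max(axis=1, keepdims=True) - lo) / levels
                    scale = np.maximum(scale, 1e-12)               # avoid /0 for constant groups
                    q = np.round((w - lo) / scale)                 # ints in [0, levels]
                    return (q * scale + lo).reshape(-1)            # dequantized copy

                w = np.random.randn(4096).astype(np.float32)
                err = np.abs(quantize_groupwise(w) - w).mean()
                print(f"mean abs error at 4 bits: {err:.4f}")

            Only the small integers plus one scale and one offset per group of 32 weights need to be stored, which is where the bulk of the size reduction relative to fp16 comes from.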

          • Der_Einzige 3 years ago

            I am not wrong, you are wrong. The fact is that NLP and other fields are FULL of people using automated benchmarks to claim that they are "state of the art". They are incentivized to downplay or trivialize any quality losses. Scores like ROUGE and BLEU are terrible and the whole community knows it, but they're still used because we have nothing "better".

            I can actually see jpg artifacts on the jpg variants of the png files that I generate in Stable Diffusion, and the impacts from quantization down to 3,2, even 1 bit are FAR more than the impacts of switching from png to jpg.

            Also, I actually have published peer reviewed research on LLMs and spend a majority of my time on this earth thinking about and coding for them. I know what I'm talking about and you shouldn't try to dismiss my criticisms so quickly.

            Even the coomers at civitai have done polls where their own users find dreambooth models better than lora models on average, likely because the likeness of a person can be more properly trained when heavier/stronger methods are utilized. Same dynamic here with quantization.

            Yes, as a model scales up in size, quantization hurts it less. The claims that extreme quantization is not noticeable at all when the model is super large are just pathetically wrong.

        • alpaca128 3 years ago

          > But I would have though the model relied on the higher accuracy during training. So losing that would screw it up.

          Yes, during training, where you need to make tiny adjustments to weights. But as far as I understand it inference can still work well because of the sheer number of weights. Give a black-and-white image a high resolution and you can represent any shade of gray if you zoom out a bit.

  • visarga 3 years ago

    At that intersection is the "Good Enough Model" that can solve 95% of our needs in full privacy and with complete customisability. The key point is being easy to run on every device. We'll still use proprietary, expensive models for the remaining 5%.

  • wcunning 3 years ago

    Do you have reading links on the consumer hardware fine tuning? I can’t find much from that description…

knaik94 3 years ago

I have been really impressed with the uncensored WizardLM I was playing with. Having a truly open, uncensored model to work with is a really important research tool. Censoring the training data and results in such a heavy-handed way is not really possible without lowering the quality of all output.

As the resources required to train and fine-tune these models become consumer-hardware friendly, I think we'll see a shift towards a bunch of smaller models. Open models like these also mean the results of security and capability research are publicly available. Models like this one and the Replit code model will become the new base all open source models are based on. I am really looking forward to the gptj 4-bit, CUDA-optimized 7B models; the others I have tested run fast on a 2070 Max-Q and 16 GB RAM, where I was getting ~7 tokens/second. LoRA can work directly with 4-bit quantized models. While ggml CPU models are very strong, I don't believe we'll move away from GPU-accelerated training and fine-tuning anytime soon.
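
For what it's worth, attaching a LoRA adapter with the peft library looks roughly like the sketch below. The model id and target module names are assumptions on my part (RedPajama-INCITE is GPT-NeoX-style, where the fused attention projection is typically called "query_key_value"); adjust for whatever base you're actually fine-tuning:

    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    # Illustrative model id; swap in your own base checkpoint.
    base = AutoModelForCausalLM.from_pretrained(
        "togethercomputer/RedPajama-INCITE-Instruct-7B-v0.1")

    config = LoraConfig(
        r=8, lora_alpha=16, lora_dropout=0.05,
        target_modules=["query_key_value"],  # assumed GPT-NeoX-style naming
        task_type="CAUSAL_LM")

    model = get_peft_model(base, config)
    model.print_trainable_parameters()  # only the small adapter matrices train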

  • regularfry 3 years ago

    The thing is that anything that benefits the bottom end also should reflect up and help the top end too, if they're paying attention.

practice9 3 years ago

Models replicating LLaMA are cool, but they are all missing proper multilingual support, which GPT-3.5 is quite good at.

  • mirekrusin 3 years ago

    IMHO multilingual support would just pollute the precious available real estate in those models. Why not use it in English and use another model for translation?

    • viraptor 3 years ago

      That would work if all information is available in English as the primary language. That's not the case though. You may be missing out on interesting information if you're skipping other languages.

    • espadrine 3 years ago

      It depends on your use.

      LLaMA’s main issue is that its license prevents commercial use.

      If you want to use a LLM inside of a product, you may need to internationalize it at some point, so multilingual support matters.

  • tyfon 3 years ago

    LLaMA 65B is actually quite decent in other languages. I can just barely fit it in memory though, with my 128 GB of RAM. Usually I run the 8-bit quantized version that uses 80, but even the 4- and 3-bit ones are OK compared to the fp16 30B version.

ftxbro 3 years ago

With this one and MosaicML, we've now got so many of these consumer-GPU-sized models!

wtarreau 3 years ago

That's very interesting for performing basic tasks at reasonable speeds or for running on smaller systems. Unfortunately it's one of the many based on Python and transformers, so all the resources gained from the compact model are wasted by the heavy engine and ecosystem, and even a 4 GB machine with 4 GB of swap goes OOM because the loaded data gets duplicated in memory using read() and malloc() :-(

Let's wait for someone to port it to a cheaper and more powerful C-based engine like llama.cpp.

nico 3 years ago

idea: linked parameters / models tree

build a model that can change the number of parameters in the vicinity of some meaning, effectively increasing the local resolution around that meaning

so parameter space becomes linked-parameter space, between models

links could be pruned based on activation frequency

another way of seeing the concept is a tree of models/llms

and one additional model/llm that all it does is manage the tree (ie. build it as it goes, use it to infer, prune it, etc)

Or is what I’m saying too dumb?

ftxbro 3 years ago

So I tried RedPajama-INCITE-Instruct-7B-v0.1 and the AutoModelForCausalLM.from_pretrained(...) call takes two minutes every time. My GPU is big enough. I don't know why it's so slow. I feel like it's somehow precomputing stuff that can be used across queries, and I had hoped that this stuff would have already been precomputed on the disk and I could just load it up.
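
For what it's worth, two things that often cut that load time down (both hedged - whether they help depends on your setup): asking transformers to load fp16 weights directly instead of materializing fp32 first, and saving a local copy so later runs only read local files. Roughly:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "togethercomputer/RedPajama-INCITE-Instruct-7B-v0.1"  # as in the comment above
    model = AutoModelForCausalLM.from_pretrained(
        name, torch_dtype=torch.float16, low_cpu_mem_usage=True).cuda()
    tok = AutoTokenizer.from_pretrained(name)

    # Save a plain local copy once; subsequent from_pretrained("./rp-7b-fp16")
    # calls only have to read these files.
    model.save_pretrained("./rp-7b-fp16")
    tok.save_pretrained("./rp-7b-fp16")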

born-jre 3 years ago

I also wonder how powerful the 3B model will be. Can it act as a prompt router that makes API calls to ChatGPT or another specified model for the actual processing? It's probably possible to do this with LangChain, but I have not tried it yet.

ibitto 3 years ago

I am really interested in knowing what people are using these smaller models for. I have seen a lot of projects on top of GPT-3.5 / GPT-4, but I have yet to see any using these smaller models.

mirker 3 years ago

Does anyone have experience using these open source models in production?

  • flatiron 3 years ago

    Doubtful since they were released yesterday. That being said I will be deploying something to our lab this week to play with.

acapybara 3 years ago

I've been following the RedPajama project closely and I must say, it's quite an impressive undertaking. The fact that it's all open-source, and the collaboration between various institutions, is nothing short of amazing. This shows the power of the open-source community in action, with a bunch of smart people coming together to build something truly remarkable.

The 3B model, being super fast and accessible, is a game changer for a lot of us who may not have the latest hardware. I mean, running on an RTX 2070 that was released 5 years ago? That's pretty cool.

As for the 7B model, it's great to see that it's already outperforming the Pythia 7B. The bigger dataset definitely seems to be making a difference here. I'm eager to see how far this project goes, and what kinda improvements we can expect in the coming weeks with the new RedPajama dataset they're working on.

One thing I found interesting is the mention of differences between the LLaMA 7B and their replication. I'd love to learn more about those differences, as it could shed light on what's working well and what could be improved further.

  • SeanAnderson 3 years ago

    Sorry, excuse my ignorance, but why is having access to a 3B model a gamechanger?

    I played with a pirated 7B model a while back. My computer runs a 1080 TI - so it used to be good but now it's pretty old. The model ran with a reasonable number of tokens/sec, but the quality was just trash compared to what I'd grown used to with ChatGPT. It was a novelty I interacted with for just a single evening.

    I truly don't understand the use case for a 3B model with our current technologies.

    What are you going to use it for?

    • examplary_cable 3 years ago

      You can ultra fine-tune those models ... look at Vicuna 13B: if you know how to prompt it well, you can get it to work as """"well"""" as ChatGPT, running on local hardware .... I just got Vicuna 13B on Gradio[1] to act as a Japanese kanji personal trainer, and I've only used a simple prompt: "I want you to act as a Japanese Kanji quiz machine. Each time I ask you for the next question, you are to provide one random Japanese kanji from JLPT N5 kanji list and ask for its meaning. You will generate four options, one correct, three wrong. The options will be labeled from A to D. I will reply to you with one letter, corresponding to one of these labels. You will evaluate my each answer based on your last question and tell me if I chose the right option. If I chose the right label, you will congratulate me. Otherwise you will tell me the right answer. Then you will ask me the next question. Avoid simple kanjis, let's go."

      [1] https://chat.lmsys.org/

      • wongarsu 3 years ago

        Sure, a 13B model can be fine-tuned to be pretty decent, which is quite remarkable compared to GPT-3's 175B parameters. But a 3B model has 1/4th as many parameters as Vicuna-13B, or about twice as many as GPT-2. Can you really fine-tune that to do anything useful that wouldn't be better handled by a more specialized open-source model?

      • cced 3 years ago

        How can someone get into using these models? How does ‘tuning’ work? How might I go about using these models for doing things like say summarizing news articles or video transcriptions? When someone tunes a model for a task, what exactly are they doing and how does this ‘change’ the model?

        • examplary_cable 3 years ago

          (I'm not an expert)

          > How can someone get into using these models

          You can use Gradio (online) or download the weights manually at https://huggingface.co/lmsys/vicuna-13b-delta-v1.1/tree/main (git will not download them, they're too big), and then load the model in PyTorch and try inference (text generation). But you'll need either a lot of RAM (16 GB, 32 GB+) or VRAM (a GPU).

          > How might I go about using these models for doing things like say summarizing news articles or video transcriptions

          Again, you might try it online, or set up a Python/bash/PowerShell script to load the model for you so you can use it. If you can pay, I would recommend RunPod for the shared GPUs.

          > When someone tunes a model for a task, what exactly are they doing and how does this ‘change’ the model?

          From my view ... not much ... "fine-tuning" means training (tuning) on a specific dataset (fine, as in fine-grained). As I believe (I'm not sure), they just run more epochs on the model with the new data you have provided until they reach a good loss (the model works); that's why quality data is important.

          You might try https://github.com/oobabooga/text-generation-webui - they have a pretty easy setup config. Again, you'll need a lot of RAM and a good CPU for inference on CPU, or a GPU.

          https://huggingface.co/lmsys/vicuna-13b-delta-v1.1/tree/main
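
          To make the "load the model and try inference" step a bit more concrete, a rough transformers sketch (the path and prompt are placeholders - note the Vicuna "delta" weights above have to be merged with the base LLaMA weights before anything like this will work):

              import torch
              from transformers import AutoModelForCausalLM, AutoTokenizer

              path = "./my-merged-model"  # placeholder for a complete local checkpoint
              tok = AutoTokenizer.from_pretrained(path)
              model = AutoModelForCausalLM.from_pretrained(
                  path, torch_dtype=torch.float16, device_map="auto")

              prompt = "Summarize the following article in three sentences:\n..."
              inputs = tok(prompt, return_tensors="pt").to(model.device)
              out = model.generate(**inputs, max_new_tokens=200)
              print(tok.decode(out[0], skip_special_tokens=True))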

        • chaxor 3 years ago

          A newer but much better approach actually reduces the model size while reducing the functionality of the system - similar to training an NN for a very specific task (as was typical several years ago), but now it can happen with far less data. https://arxiv.org/pdf/2305.02301.pdf This paper is quite fantastic, and generating this kind of training data will likely shape up to be quite an important glue task for LLMs.

      • ym555 3 years ago

        While I recognize that this is only one example of what you can do, you can just ask ChatGPT to write you a traditional program that does something like this and not have to run a (pretty big/power-intensive/slow on most hardware) 3B/7B parameter model for simple tasks like these.

        Yeah, it wouldn't be as flexible as an LLM (for example, synonyms won't work), but I doubt that for this particular task it'll be that big of a problem, and you can ask it to tweak the program in various ways (for example introducing crude spaced repetition), making it arguably better than the AI solution, which takes some time to prompt engineer and will never be "perfect".

        I don't really know how much better fine-tuning makes these models, so I can't think of anything they can actually be used for where they aren't worse than traditional programs. Maybe as an AI in games? For example, making them role-play as a historical figure in Civilization 6.

        • examplary_cable 3 years ago

          My example here was silly, I admit. But the point was that this simple task can become more "nuanced" (aside from ChatRWKV-Raven, no other model quite "works" like Vicuna or "tuned LLaMA"): it can, given the correct prompt, act as someone in a fictional work, which might help you learn the language better by increasing conversational time (the most important metric, I'm talking comprehensible input here) by virtue of being more enjoyable.

          Overall I like the progress: LLaMA releases -> LLaMA fine-tuned on larger models gets similar performance to ChatGPT with fewer parameters (more efficient) -> people can replicate LLaMA's model without anything special, effectively making LLMs a "Commodity" -> You are Here.

    • ttt3ts 3 years ago

      Fine-tuning, which can easily be done on consumer hardware and can give these models a lot more power for specific applications.

      Also, ChatGPT just can't do a lot of things because of their "rules". I was doing question answering about products on Amazon with ChatGPT and it refused to answer any questions about underwear, certain books/videos, etc.

    • elorant 3 years ago

      Depends on what you want it for. Chatting isn't the only application. For text summarization a model like Vicuna-13b has similar performance to ChatGPT 3.5. Fine-tuned models like the one in this thread might perform way better than the initial ones that leaked from Meta. The important thing is that there's constant progress in this area from the Open Source community and we're about to see amazing things in the future.

    • barbariangrunge 3 years ago

      I'm in the market for a laptop. If I was crazy and wanted to run or train models like these, what kind of resources would I need?

      Would the way the M2 MacBooks share memory be an advantage, or would the lack of CUDA support be a killer? Can you do anything with 16 GB, or do you need 128 GB or something like that? How large are the datasets?

      I've only used scikit-learn and pandas so far, I'm not very familiar with neural networks yet

      • zamnos 3 years ago

        It's not crazy to want to train or run models like these, it's actually quite popular right now! :) The question for you to answer is how handy you are with scikit-learn and pandas, and how much you want to be on the bleeding edge of things. Most stuff is coming out for CUDA first, since that's what the industrial-grade GPUs (A100s) use, so with Apple Arm you either have to wait for someone to port it, or port it yourself.

        On the other hand, getting > 8 GiB VRAM on a laptop GPU is rare; you're definitely not getting 128 GiB VRAM, so Apple Arm, with 32 or 64 GiB of RAM (get 128 if you can afford it), is going to get you more gigabytes of usable RAM for training/inference.

        • barbariangrunge 3 years ago

          Yeah. It seems to me that it's really hard to get more than 10-14 GB of VRAM without using some sort of hyper-expensive cluster. What would it cost if you wanted to do it with Nvidia? Being able to share ordinary RAM with the GPU in a Mac could maybe be a unique value proposition.

          • int_19h 3 years ago

            An RTX 3090 or 4090 gets you 24 GB of VRAM, which is enough to run llama-30b (quantized to 4-bit with a group size of 1024 or higher) at speeds comparable to ChatGPT. You can also get two and run the model split across them, although pumping data back and forth slows things down.

            A brand new RTX A6000 (48 GB VRAM) is probably the largest you can get in a single card that can run in a regular PC. It can be had for $4-5k and is sufficient for llama-65b.

            Beyond that, yeah, you're looking at dedicated multi-GPU server hardware.

          • dragonwriter 3 years ago

            > It seems to me that it’s really hard to get more than 10-14 GB of VRAM without using some sort of hyper expensive cluster.

            Both consumer and workstation (the latter may be cheaper per GB of RAM, but with fewer shaders) 16-24 GB GPUs (RTX 3080Ti/3090/4090/A4000/A4500/A5000), including in laptops, are not hard to find (pricey, but not “hyperexpensive clusters”), and it’s not until you jump above a single 48 GB RTX A6000 that you need a “cluster”.

    • youssefabdelm 3 years ago

      Completely agree. Perhaps they were planning to fine-tune it for something though.

    • acapybara 3 years ago

      Hey SeanAnderson, good question! While parameter count is certainly an important factor in model performance, it's not the only one. The RedPajama project is taking a more nuanced approach to understanding what makes a model perform well, and their focus on smaller models like the 3B is a big part of that.

      Sure, you may have played with a 7B model in the past, but that doesn't mean there's no use case for a smaller model like the 3B. In fact, having a performant, smaller model is a game changer for a lot of applications that don't require the massive scale of the larger models. Plus, smaller models are generally faster and more accessible, which is always a plus.

      • wokwokwok 3 years ago

        > In fact, having a performant, smaller model is a game changer for a lot of applications that don't require the massive scale of the larger models.

        So we are all in agreement here that a 3B model is fundamentally inferior to a larger model?

        Not that it doesn’t have uses; not that there’s no value in research in small models.

        Just, honestly, that these smaller models don’t have the capabilities of the larger models.

        It’d be good to see a direct acknowledgment of that, because it seems like you’re going out of your way to promote the “it’s fine to have a small model” view; and it is fine, roughly speaking. Parameter count isn’t everything. Small models are accessible, you can easily fine-tune them. They are interesting.

        …but, they are not as good, as far as I’m aware, in terms of output, in terms of general purpose function, as larger models.

        • tomrod 3 years ago

          For your first point where you are attempting to impose agreement, I believe the other commentator is saying that tradeoffs are non-negligible between the two.

          Sounds like the difference between edge and centralized ML scoring.

        • deepsquirrelnet 3 years ago

          There is no “one size fits all” here. A bigger model is just a bigger hammer, that in many uses is too bulky and slow to be a proper solution.

          At my job, I can’t casually fire up 8xA100 80gb instances. And if I could, the performance wouldn’t have the throughput I require to be useful. Big models are operationally much more expensive.

          The smallest/fastest model that is accurate enough for your use case is ideal.

          • wokwokwok 3 years ago

            > The smallest/fastest model that is accurate enough for your use case is ideal.

            Sure.

            …but it’s also fair to say that the smallest model that can fit your use case will be bounded by the parameter count.

            No amount of training data can make a 100-parameter model do text summarisation.

            If you have a 3B param model and you want a ChatGPT-style assistant to embed in your app, do you think it’ll do?

            I don’t.

            The output is not at that quality level, because it’s too small.

            Not everyone needs that; but these 3B / 7B models don’t have the capability to do everything.

        • chaxor 3 years ago

          If the goal is to use it to access a large knowledge base (like Google, but with better semantic searching), then it doesn't matter as much. There are some cases where it still matters due to not making some connections (for example, you may want an answer to something and not realize it due to your ignorance - a smaller model will get that a few percent less often).

          But ultimately small models are very good for most things, and much preferable (to run at home to organize your digital life, with a small SBC or old computer).

      • robertlagrant 3 years ago

        > Hey SeanAnderson, good question! While parameter count is certainly an important factor in model performance, it's not the only one. The RedPajama project is taking a more nuanced approach to understanding what makes a model perform well, and their focus on smaller models like the 3B is a big part of that.

        > Sure, you may have played with a 7B model in the past, but that doesn't mean there's no use case for a smaller model like the 3B. In fact, having a performant, smaller model is a game changer for a lot of applications that don't require the massive scale of the larger models. Plus, smaller models are generally faster and more accessible, which is always a plus.

        It's hard to pick out the actual answer: what is the application that this is good at? What has their "more nuanced" approach to understanding performance increased this model's performance at doing?

      • hhh 3 years ago

        is this comment generated by an LLM?

  • Sunhold 3 years ago

    Took me a bit to realize this comment was written by an LLM.

    • awegio 3 years ago

      How did you realize it here? This user has multiple comments in this thread but this one actually sounds more normal than the others.

      I find it very uncanny to see comments like this that sound like ChatGPT but are surprisingly relevant to the discussion.

      • fphhotchips 3 years ago

        I didn't realise it was written by an LLM but it did come off as weird to me because it borrows phrases (most obviously the bit about a "2070 released 5 years ago") from the press release itself.

      • zvolsky 3 years ago

        It is the vacuous word fluff. My best guess is that it is a genuine human comment rephrased using a language model.
