Kimi Released Kimi K2.5, Open-Source Visual SOTA-Agentic Model

502 points by nekofneko 5 months ago · 250 comments

Reader

Tepix 5 months ago

Huggingface Link: https://huggingface.co/moonshotai/Kimi-K2.5

1T parameters, 32b active parameters.

License: MIT with the following modification:

Our only modification part is that, if the Software (or any derivative works thereof) is used for any of your commercial products or services that have more than 100 million monthly active users, or more than 20 million US dollars (or equivalent in other currencies) in monthly revenue, you shall prominently display "Kimi K2.5" on the user interface of such product or service.

endymi0n 5 months ago

One. Trillion. Even on native int4 that’s… half a terabyte of vram?!
Technical awe at this marvel aside that cracks the 50th percentile of HLE, the snarky part of me says there’s only half the danger in giving something away nobody can run at home anyway…
- johndough 5 months ago
  
  The model absolutely can be run at home. There even is a big community around running large models locally: https://www.reddit.com/r/LocalLLaMA/
  The cheapest way is to stream it from a fast SSD, but it will be quite slow (one token every few seconds).
  The next step up is an old server with lots of RAM and many memory channels with maybe a GPU thrown in for faster prompt processing (low two digits tokens/second).
  At the high end, there are servers with multiple GPUs with lots of VRAM or multiple chained Macs or Strix Halo mini PCs.
  The key enabler here is that the models are MoE (Mixture of Experts), which means that only a small(ish) part of the model is required to compute the next token. In this case, there are 32B active parameters, which is about 16GB at 4 bit per parameter. This only leaves the question of how to get those 16GB to the processor as fast as possible.
  - WhitneyLand 5 months ago
    
    Its often pointed out in the first sentence of a comment how a model can be run at home, then (maybe) towards the end of the comment it’s mentioned how it’s quantized.
    Back when 4k movies needed expensive hardware, no one was saying they could play 4k on a home system, then later mentioning they actually scaled down the resolution to make it possible.
    The degree of quality loss is not often characterized. Which makes sense because it’s not easy to fully quantify quality loss with a few simple benchmarks.
    By the time it’s quantized to 4 bits, 2 bits or whatever, does anyone really have an idea of how much they’ve gained vs just running a model that is sized more appropriately for their hardware, but not lobotomized?
    
    zozbot234 5 months ago
    
    > ...Back when 4k movies needed expensive hardware, no one was saying they could play 4k on a home system, then later mentioning they actually scaled down the resolution to make it possible. ...
    int4 quantization is the original release in this case; it's not been quantized after the fact. It's a bit of a nuisance when running on hardware that doesn't natively support the format (might waste some fraction of memory throughput on padding, specifically on NPU hw that can't do the unpacking on its own) but no one here is reducing quality to make the model fit.
    
    WhitneyLand 5 months ago
    
    Good point thanks for the clarification.
    The broader point remains though which is, “you can run this model as home…” when actually the caveats are potentially substantial.
    It would be so incredibly slow…
    
    FuckButtons 5 months ago
    
    From my own usage, the former is almost always better than the latter. Because it’s less like a lobotomy and more like a hangover, though I have run some quantized models that seem still drunk.
    Any model that I can run in 128 gb in full precision is far inferior to the models that I can just barely get to run after reap + quantization for actually useful work.
    I also read a paper a while back about improvements to model performance in contrastive learning when quantization was included during training as a form of perturbation, to try to force the model to reach a smoother loss landscape, it made me wonder if something similar might work for llms, which I think might be what the people over at minimax are doing with m2.1 since they released it in fp8.
    In principle, if the model has been effective during its learning at separating and compressing concepts into approximately orthogonal subspaces (and assuming the white box transformer architecture approximates what typical transformers do), quantization should really only impact outliers which are not well characterized during learning.
    
    WhitneyLand 5 months ago
    
    Interesting.
    If this were the case however, why would labs go through the trouble of distilling their smaller models rather than releasing quantized versions of the flagships?
    
    petu 5 months ago
    
    You can't quantize 1T model down to "flash" model speed/token price. 4bpw is about the limit of reasonable quantization, so 2-4x (fp8/16 -> 4bpw) weight size reduction. Easier to serve, sure, but maybe not offer as free tier cheap.
    With distillation you're training new model, so size of it is arbitrary, say 1T -> 20B (50x) reduction which also can be quantized. AFAIK distillation is also simply faster/cheaper than training from scratch.
    
    dabockster 5 months ago
    
    Hanlon's razor.
    "Never attribute to malice that which is adequately explained by stupidity."
    Yes, I'm calling labs that don't distill smaller sized models stupid for not doing so.
    
    codexon 5 months ago
    
    Didn't this paper demonstrate that you only need 1.58 bits to be equivalent to 16 bits in performance?
    https://arxiv.org/abs/2402.17764
    
    Ey7NFZ3P0nzAe 5 months ago
    
    This technique showed that there are ways during training to optimize weights to neatly quantize while remaining performant. This isn't a post training quantization like int4.
    
    WhitneyLand 5 months ago
    
    For Kimi quantization is part of the training also. Specifically they say they use QAT, quantization aware training.
    That doesn't mean training with all integer math, but certain tricks are used to specifically plan for the end weight size. I.e. fake quantization nodes are inserted to simulate int4.
    
    WhitneyLand 5 months ago
    
    Iirc the paper was solid, but it still hasn’t been adopted/proven out at large scale. Harder to adapt hardware and code kernels to something like this compared to int4.
    
    RandomTeaParty 4 months ago
    
    just call it one trit
    
    Gracana 5 months ago
    
    The level of deceit you're describing is kind of ridiculous. Anybody talking about their specific setup is going to be happy to tell you the model and quant they're running and the speeds they're getting, and if you want to understand the effects of quantization on model quality, it's really easy to spin up a GPU server instance and play around.
    
    jasonjmcghee 5 months ago
    
    > if you want to understand the effects of quantization on model quality, it's really easy to spin up a GPU server instance and play around
    Fwiw, not necessarily. I've noticed quantized models have strange and surprising failure modes where everything seems to be working well and then does a death spiral repeating a specific word or completely failing on one task of a handful of similar tasks.
    8-bit vs 4-bit can be almost imperceptible or night and day.
    This isn't something you'd necessarily see playing around, but when trying to do something specific
    
    selfhoster11 5 months ago
    
    Except the parent comment said you can stream the weights from an SSD. The full weights, uncompressed. It takes a little longer (a lot longer), but the model at least works without lossy pre-processing.
  - 1dom 5 months ago
    
    > The model absolutely can be run at home. There even is a big community around running large models locally
    IMO 1tln parameters and 32bln active seems like a different scale to what most are talking about when they say localLLMs IMO. Totally agree there will be people messing with this, but the real value in localLLMs is that you can actually use them and get value from them with standard consumer hardware. I don't think that's really possible with this model.
    
    zamadatix 5 months ago
    
    Local LLMs are just LLMs people run locally. It's not a definition of size, feature set, or what's most popular. What the "real" value is for local LLMs will depend on each person you ask. The person who runs small local LLMs will tell you the real value is in small models, the person who runs large local LLMs will tell you it's large ones, those who use cloud will say the value is in shared compute, and those who don't like AI will say there is no value in any.
    LLMs which the weights aren't available are an example of when it's not local LLMs, not when the model happens to be large.
    
    1dom 5 months ago
    
    > LLMs which the weights aren't available are an example of when it's not local LLMs, not when the model happens to be large.
    I agree. My point was that most aren't thinking of models this large when they're talking about local LLMs. That's what I said, right? This is supported by the download counts on hf: the most downloaded local models are significantly smaller than 1tln, normally 1 - 12bln.
    I'm not sure I understand what point you're trying to make here?
    
    zamadatix 5 months ago
    
    Mostly a "We know local LLMs as being this, and all of the mentioned variants of this can provide real value regardless of which is most commonly referenced" point. I.e. large local LLMs aren't only something people mess with, they often provide a lot of value for a relative few people rather than a little value for a relative lot of people as small local LLMs do. Who thinks which modality and type brings the most value is largely a matter of opinion of the user getting the value, not just the option which runs on consumer hardware or etc alone.
    You're of course accurate that smaller LLMs are more commonly deployed, it's just not the part I was really responding to.
    
    zozbot234 5 months ago
    
    32B active is nothing special, there's local setups that will easily support that. 1T total parameters ultimately requires keeping the bulk of them on SSD. This need not be an issue if there's enough locality in expert choice for any given workload; the "hot" experts will simply be cached in available spare RAM.
    
    spmurrayzzz 5 months ago
    
    When I've measured this myself, I've never seen a medium-to-long task horizon that would have expert locality such that you wouldn't be hitting the SSD constantly to swap layers (not to say it doesn't exist, just that in the literature and in my own empirics, it doesn't seem to be observed in a way you could rely on it for cache performance).
    Over any task that has enough prefill input diversity and a decode phase thats more than a few tokens, its at least intuitive that experts activate nearly uniformly in the aggregate, since they're activated per token. This is why when you do something more than bs=1, you see forward passes light up the whole network.
    
    zozbot234 5 months ago
    
    > hitting the SSD constantly to swap layers
    Thing is, people in the local llm community are already doing that to run the largest MoE models, using mmap such that spare-RAM-as-cache is managed automatically by the OS. It's a drag on performance to be sure but still somewhat usable, if you're willing to wait for results. And it unlocks these larger models on what's effectively semi-pro if not true consumer hardware. On the enterprise side, high bandwidth NAND Flash is just around the corner and perfectly suited for storing these large read-only model parameters (no wear and tear issues with the NAND storage) while preserving RAM-like throughput.
    
    spmurrayzzz 5 months ago
    
    I've tested this myself often (as an aside: I'm in said community, I run 2x RTX Pro 6000 locally, 4x 3090 before that), and I think what you said re: "willing to wait" is probably the difference maker for me.
    I can run Minimax 2.1 in 5bpw at 200k context fully offloaded to GPU. The 30-40 tk/s feels like a lifetime for long horizon tasks, especially with subagent delegation etc, but it's still fast enough to be a daily driver.
    But that's more or less my cutoff. Whenever I've tested other setups that dip into the single and sub-single digit throughput rates, it becomes maddening and entirely unusable (for me).
    
    RandomTeaParty 4 months ago
    
    What is bpw?
    
    spmurrayzzz 4 months ago
    
    Bits per weight, its an average precision across all the weights. When you quantize these models, they don't just used a fixed precision size across all model layers/weights. There's a mix and it varies per quant method. This is why you can get bit precision that arent "real" in a strict computing sense.
    e.g. A 4-bit quant can have half the attention and feed forward tensors in Q6, and the rest in Q4. Due to how block-scaling works, those k-quant dtypes (specifically for llama.cpp/gguf) have larger bpw than they suggest in their name. Q4 is around ~4.5 bpw, and Q6 is ~6.5.
    
    1dom 5 months ago
    
    I never said it was special.
    I was trying to correct the record that a lot of people will be using models of this size locally because of the local LLM community.
    The most commonly downloaded local LLMs are normally <30b (e.g. https://huggingface.co/unsloth/models?sort=downloads). The things you're saying, especially when combined together, make it not usable by a lot of people in the local LLM community at the moment.
    
    GeorgeOldfield 5 months ago
    
    do you guys understand that different experts are loaded PER TOKEN?
  - dabockster 5 months ago
    
    You can run AI models on unified/shared memory specifically on Windows, not Linux (unfortunately). It uses the same memory sharing system that Microsoft originally had built for gaming when a game would run out of vram. If you:
    - have an i5 or better or equivalent manufactured within the last 5-7 years
    - have an nvidia consumer gaming GPU (RTX 3000 series or better) with at least 8 GB vram
    - have at least 32 GB system ram (tested with DDR4 on my end)
    - build llama-cpp yourself with every compiler optimization flag possible
    - pair it with a MoE model compatible with your unified memory amount
    - and configure MoE offload to the CPU to reduce memory pressure on the GPU
    then you can honestly get to about 85-90% of cloud AI capability totally on-device, depending on what program you interface with the model.
    And here's the shocking idea: those system specs can be met by an off the shelf gaming computer from, for example, Best Buy or Costco today and right now. You can literally buy a CyberPower or iBuyPower model, again for example, download the source, run the compilation, and have that level of AI inference available to you.
    Now, the reason why it won't work on Linux is that the Linux kernel and Linux distros both leave that unified memory capability up to the GPU driver to implement. Which Nvidia hasn't done yet. You can code it somewhat into source code, but it's still super unstable and flaky from what I've read.
    (In fact, that lack of unified memory tech on Linux is probably why everyone feels the need to build all these data centers everywhere.)
    
    alimbada 5 months ago
    
    > Now, the reason why it won't work on Linux is that the Linux kernel and Linux distros both leave that unified memory capability up to the GPU driver to implement. Which Nvidia hasn't done yet. You can code it somewhat into source code, but it's still super unstable and flaky from what I've read.
    So it should work with an AMD GPU?
    
    dabockster 4 months ago
    
    > the Linux kernel and Linux distros both leave that unified memory capability up to the GPU driver to implement
    Depends on if AMD (or Intel, since Arc drivers are supposedly OSS as well) took the time to implement that. Or if a Linux based OS/distro implements a Linux equivalent to the Windows Display Driver Model (needs code outside of the kernel and specific to the developed OS/distro to do).
    So far, though, it seems like people are more interested in pointing fingers and sucking up the water of small town America than actually building efficient AI/graphics tech.
  - PlatoIsADisease 5 months ago
    
    >The model absolutely can be run at home.
    There is a huge difference between "look I got it to answer the prompt: '1+1='"
    and actually using it for anything of value.
    I remember early on people bought Macs (or some marketing team was shoveling it), and proposing people could reasonably run the 70B+ models on it.
    They were talking about 'look it gave an answer', not 'look this is useful'.
    While it was a bit obvious that 'integrated GPU' is not Nvidia VRAM, we did have 1 mac laptop at work that validated this.
    Its cool these models are out in the open, but its going to be a decade before people are running them at a useful level locally.
    
    esafak 5 months ago
    
    Hear, hear. Even if the model fits, a few tokens per second make no sense. Time is money too.
    
    hex4def6 5 months ago
    
    If I can start an agent and be able to walk away for 8 hours, and be confident it's 'smart' enough to complete a task unattended, that's still useful.
    At 3 tk/s, that's still 100-150 pages of a book, give or take.
    
    esafak 5 months ago
    
    True, that's still faster than a human, but they're not nearly that reliable yet.
    
    tempoponet 5 months ago
    
    Maybe for a coding agent, but a daily/weekly report on sensitive info?
    If it were 2016 and this technology existed but only in 1 t/s, every company would find a way to extract the most leverage out of it.
    
    michaellee8 5 months ago
    
    If they figured out it can be this useful in 2016 running 1 t/s, they would make it run at least 20 t/s by 2019
    
    esafak 5 months ago
    
    But it's 2026 and 'secure' (by executive standards) hosted options exist.
    
    dabockster 5 months ago
    
    > 'secure' (by executive standards)
    "Secure" in the sense that they can sue someone after the fact, instead of preventing data from leaking in the first place.
  - side_up_down 5 months ago
    
    I'd take "running at home" to mean running on reasonably available consumer hardware, which your setup is not. You can obviously build custom, but who's actually going to do that? OP's point is valid
  - dev_l1x_be 5 months ago
    
    How do you split the model between multiple GPUs?
    
    evilduck 5 months ago
    
    With "only" 32B active params, you don't necessarily need to. We're straying from common home users to serious enthusiasts and professionals but this seems like it would run ok on a workstation with a half terabyte of RAM and a single RTX6000.
    But to answer your question directly, tensor parallelism. https://github.com/ggml-org/llama.cpp/discussions/8735 https://docs.vllm.ai/en/latest/configuration/conserving_memo...
- wongarsu 5 months ago
  
  Which conveniently fits on one 8xH100 machine. With 100-200 GB left over for overhead, kv-cache, etc.
  - storystarling 5 months ago
    
    The unit economics seem pretty rough though. You're locking up 8xH100s for the compute of ~32B active parameters. I guess memory is the bottleneck but hard to see how the margins work on that.
    
    kristianp 4 months ago
    
    Yes, it only makes sense economically if you have batching over many users.
- mrinterweb 5 months ago
  
  VRAM is the new moat, and controlling pricing and access to VRAM is part of it. There will be very few hobbyists who can run models of this size. I appreciate the spirit of making the weights open, but realistically, it is impractical for >99.999% of users to run locally.
- segmondy 5 months ago
  
  I run KimiK2 at home, Most of it on system ram with a few layers offloaded to old 3090s. This is a cheap budget build.
  Kimi-K2-Thinking-UD-Q3_K_XL-00001-of-00010.gguf Generation - 5,231 tokens 604.63s 8.65 tokens/s
  - mapkkk 5 months ago
    
    Could I trouble you for the specifics of your build? I'd love to see if it would be a viable upgrade for me.
    I currently have a 3970x with a bunch of 3090s.
    
    segmondy 5 months ago
    
    4 3090s, epyc MB with 8 channel memory, 7352 cpu, slow 2400mhz ddr4 rams.
- Davidzheng 5 months ago
  
  that's what intelligence takes. Most of intelligence is just compute
- the_sleaze_ 5 months ago
  
  3,998.99 for 500gb of RAM on amazon
  "Good Luck" - Kimi <Taken voice>
redox99 5 months ago

Cursor devs, who go out of their way to not mention their Composer model is based on GLM, are not going to like that.
- msp26 5 months ago
  
  Source? I've heard this rumour twice but never seen proof. I assume it would be based on tokeniser quirks?
Imustaskforhelp 5 months ago

Hey have they open sourced all Kimi k2.5 (thinking,instruct,agent,agent swarm [beta])?
Because I feel like they mentioned that agent swarm is available their api and that made me feel as if it wasn't open (weights)*? Please let me know if all are open source or not?
- XenophileJKO 5 months ago
  
  I'm assuming the swarm part is all harness. Well I mean a harness and way of thinking that the weights have just been fine tuned to use.
  - mccoyb 5 months ago
    
    It's not in the harness today, it's a special RL technique they discuss in https://www.kimi.com/blog/kimi-k2-5.html (see "2. Agent Swarm")
    I looked through the harness and all I could find is a `Task` tool.
dheera 5 months ago

> or more than 20 million US dollars (or equivalent in other currencies) in monthly revenue, you shall prominently display "Kimi K2.5" on the user interface of such product or service.
Why not just say "you shall pay us 1 million dollars"?
- vessenes 5 months ago
  
  ? They prefer the branding. The license just says you have to say it was them if you make > $250mm a year on the model.
- viraptor 5 months ago
  
  Companies with $20M revenue will not normally have spare $1M available. They'd get more money by charging reasonable subscriptions than by using lawyers to chase sudden company-ending fees.
  - laurentb 5 months ago
    
    it's monthly :) $240M revenue companies will absolutely find a way to fork $1M if they need to. Kimi most likely sees the eyeballs of free advertising as more profitable in the grander scheme of things
- clayhacks 5 months ago
  
  I assume this allows them to sue for different amounts. And not discourage too many people from using it.

bertili 5 months ago

The "Deepseek moment" is just one year ago today!

Coincidence or not, let's just marvel for a second over this amount of magic/technology that's being given away for free... and how liberating and different this is than OpenAI and others that were closed to "protect us all".

segmondy 5 months ago

There's been so many moments that folks not really heavy into LLM have missed, DeepSeekR1 was great, but so was all the "incremental" improvements, v3-0324, v3.1, v3.1-terminus, and now v3.2-speciale. With that this is the 3rd great Kimi model, then GLM has been awesome, since 4.5, with 4.5, 4.5-air, 4.6, 4.7 and now 4.7 flash. Minimax-M2 has also been making waves lately. ... and i'm just talking about the Chinese model without adding the 10+ Qwen models. Outside of Chinese models, mistral-small/devstral, gemma-27b-it, gpt-oss-120b, seed-os have been great, and I'm still talking about just LLM, not image, audio or special domain models like deepseek-prover and deepseek-math. It's really a marvel what we have at home. I cancelled OpenAI and Anthropic subscription 2 years ago once they started calling for regulation of open models and I haven't missed them one bit.
- igravious 4 months ago
  
  What's your hardware/software setup?
jimmydoe 5 months ago

It’s not coincidence. Chinese companies tend to do big releases before Chinese new year. So expect more to come before Feb 17.
motoboi 5 months ago

What amazes me is why would someone spend millions to train this model and give it away for free. What is the business here?
- whizzter 5 months ago
  
  Chinese state that maybe sees open collaboration as the way to nullify any US lead in the field, concurrently if the next "search-winner" is built upon their model the Chinese worldview that Taiwan belongs to China and Tiamen Square massacre never happened.
  Also their license says that if you have a big product you need to promote them, remember how Google "gave away" site searche widgets and that was perhaps one of the major ways they gained recognition for being the search leader.
  OpenAI/NVidia is the Pets.com/Sun of our generation, insane valuations, stupid spend, expensive options, expensive hardware and so on.
  Sun hardware bought for 50k USD to run websites in 2000 are less capable than perhaps 5 dollar/month VPS's today?
  "Scaling to AGI/ASI" was always a fools errand, best case OpenAI should've squirreled away money to have a solid engineering department that could focus on algorithmic innovations but considering that Antrophic, Google and Chinese firms have caught up or surpassed them it seems they didn't.
  Once things blows up, those closed options that had somewhat sane/solid model research that handles things better will be left and a ton of new competitors running modern/cheaper hardware and just using models are building blocks.
  - zozbot234 5 months ago
    
    > "Scaling to AGI/ASI" was always a fools errand
    Scaling depends on hardware, so cheaper hardware on a compute-per-watt basis only makes scaling easier. There is no clear definition of AGI/ASI but AI has already scaled to be quite useful.
    
    greazy 5 months ago
    
    > There is no clear definition of AGI
    ? Yes there's a clear defintion
    https://en.wikipedia.org/wiki/Artificial_general_intelligenc...
    I think you mean there is no agreed upon definition? Even that is dubious statement.
  - dev_l1x_be 5 months ago
    
    > Taiwan belongs to China
    So they are on the same page as the UN and US?
    The One China policy refers to a United States policy of strategic ambiguity regarding Taiwan.[1] In a 1972 joint communiqué with the PRC, the United States "acknowledges that all Chinese on either side of the Taiwan Strait maintain there is but one China and that Taiwan is a part of China" and "does not challenge that position."
    https://en.wikipedia.org/wiki/One_China https://en.wikipedia.org/wiki/Taiwan_and_the_United_Nations
    
    9cb14c1ec0 5 months ago
    
    The One China policy is a fiction of foreign policy statecraft, designed to sideline the issue without having to actually deal with it. It is quite clear that apart from the official fiction there is a real policy that is not One China. This is made clear by the weapons sales to Taiwan that specifically calibrated to make a Chinese military action harder.
    
    igravious 4 months ago
    
    If it's a fiction https://www.congress.gov/crs-product/IF12503 then it's a very deceitful and and elaborate one.
    
    pqtyw 5 months ago
    
    Existence of an independent and effectively sovereign state on the island of Taiwan (however one calls it) is a fact. Whatever doublespeak governments of other countries or international organizations engage in due to political reasons does not change that.
    
    igravious 4 months ago
    
    No, you are misinformed. The facts are that both the Chinese mainland authority (the CPC) and the government of Chinese Taipei each lay claim to the entirety of the mainland+island landmass. Get that? They both regard themselves and themselves alone as being sovereign over the unified territory.
    That is why Chinese Taipei is not independent and why Beijing has said that if they threaten independence (that is to say, that if Chinese Taipei relinquishes its claim over the mainland -- that it in a way secedes from the union state) that the mainland would move to take full political control over the island.
    It's a highly unusual situation -- there are countries which have governments in exile, this is not that. You can think of this like one country with two separate competing governments which irreconcilable claims. Clearly given the mainland's economic (and consequent military rise) Chinese Taipei cannot take over the mainland and again considering the power differential they would be insane to declare full independence. Why is why this bizarre situation persists.
    Furthermore third-party countries which recognise Chinese Taipei's independence is shrinking and shrinking (and so it should be) because properly speaking the sane thing to do is recognise that there is only one China (because both the the authority of mainland China (the CPC) and the government of Taiwan assert that there is but one China) but deal with each authority separately. Economically speaking and politically speaking this is how things play out.
    Because the situation is so unusual I think it is fair to say that most people just assume Taiwan is a breakaway province. And it's not surprising that they do because that's what happens normally in these situations. That would be the case if Taiwan amended it's constitution to explicitly relinquish its claim over the mainland but they are not going to do that unless they are prodded to by an outside third-party or if they suddenly get a wave of nationalism or if they suddenly become collectively suicidal. So this highly unusual and bizarre status quo persists.
    This is my interaction with ChatGPT (obviously it would be pointless talking to DeepSeek, etc about this)
    ===
    ME: does chinese taipei lay claim to mainland china?
    ChatGPT: Yes — but it depends on *which “Chinese Taipei” you mean*:
    ### \*Chinese Taipei as the Republic of China (Taiwan)\* \* The government based in Taiwan is officially the \*Republic of China (ROC)\*. Under its \*old constitution\* (dating back to when the ROC ruled all of China), it *theoretically* still contains language implying sovereignty over all of China — including the mainland. In practice, modern Taiwanese governments (especially under the Democratic Progressive Party) do \*not actively assert a real claim to govern mainland China\* and focus on governing Taiwan and nearby islands. ([House of Commons Library][1])
    * So while there is *historical constitutional language* about the ROC’s territorial scope that could be interpreted to include mainland China, *Taipei today does not pursue actual governance or enforcement of such a claim* in international relations.
    ### \*People’s Republic of China (PRC) claims Taiwan\* \* The \*PRC\* (mainland China) claims that Taiwan is part of China and should eventually be “reunified,” insisting on a \*One-China principle\* that includes Taiwan under its sovereignty. This is the basis of most countries’ diplomatic relations with Beijing instead of Taipei. ([House of Commons Library][1]) ### Key nuance \* Taiwan (the ROC government) effectively functions as a \*separate political entity with its own government, military, and democratic system\* and does \*not control or claim to govern mainland China’s territory in practice\* even if older constitutional wording could be interpreted that way. ([House of Commons Library][1])
    * Meanwhile, the PRC claims Taiwan as part of its territory, a position Taiwan strongly disputes. ([taipeitimes.com][2])
    [1]: https://commonslibrary.parliament.uk/research-briefings/cdp-... "International status of Taiwan - House of Commons Library"
    [2]: https://www.taipeitimes.com/News/front/archives/2025/03/08/2... "Taiwan rejects China’s UN claim - Taipei Times"
    ME: What does "it theoretically still contains language implying" mean? It either does or it doesn't.
    ChatGPT: You’re right to push back — that phrasing was imprecise. Let’s be exact.
    ### Short answer \*Yes, the Republic of China (ROC) constitution *does* still claim all of China, including the mainland.\*
    That is a matter of constitutional fact, not theory.
    ### What the constitution actually says \* The \*ROC Constitution (1947)\* was written when the ROC claimed to be the sole government of *all* China.
    * It *defines the state’s territory as China*, without listing boundaries or excluding the mainland. * *There has never been a constitutional amendment redefining the ROC’s territory as limited to Taiwan, Penghu, Kinmen, and Matsu.*
    So legally speaking: > \*Mainland China remains part of ROC territory under the constitution.\* ### Where the confusion comes from The confusion arises because of \*later constitutional add-ons\*, not because the claim disappeared. Since the 1990s, Taiwan adopted the \*Additional Articles of the Constitution\*, which: \* Suspend application of many provisions *with respect to the mainland*
    * Define elections, offices, and governance as applying only to the “*free area of the Republic of China*”
    Crucially: \* These articles \*do not renounce the mainland\*
    * They merely acknowledge that the ROC *does not currently govern it*
    So the situation is: | Aspect | Status | | ---------------------------------- | ------------------------------------- | | Constitutional territorial claim | \*All of China (including mainland)\* | | Practical governance | \*Taiwan and nearby islands only\* | | Active effort to reclaim mainland | \*No\* | | Formal constitutional renunciation | \*No\* | ### Why people hedge with language People (especially diplomats and scholars) often say things like *“theoretically implies”* to avoid the blunt conclusion that: \* \*Taiwan’s constitution still claims mainland China\*, which is politically awkward given Taiwan’s current identity and international position. But legally, there’s no ambiguity: \* \*It either does or it doesn’t — and it does.\* ### One-sentence summary \*Chinese Taipei (the ROC) constitutionally claims mainland China, but politically and practically does not pursue or assert that claim.\* If you want, I can also explain \*why Taiwan has never amended the constitution to drop the claim\*, because that’s a fascinating mix of law, identity, and geopolitics.
    ===
    (Sorry for the dreadful formatting). I am more than happy to be corrected on this.
    This dark side of all this is that the original inhabitants of the island are not "Chinese" (Han) and the island has had both Japanese and Chinese rule if I'm not mistaken: https://en.wikipedia.org/wiki/Taiwanese_indigenous_peoples
    
    pqtyw 4 months ago
    
    None of this changes the fact that there is an independent country on the island of Taiwan. It really doesn't matter who you or someone else is calling it, that's just an observable fact.
  - two_tasty 5 months ago
    
    I love how Tiananmen square is always brought up as some unique and tragic example of disinformation that could never occur in the west, as though western governments don't do the exact same thing with our worldview. Your veneer of cynicism scarcely hides the structure of naivety behind.
    
    igneo676 5 months ago
    
    The difference is that, in the west, there's an acceptable counter narrative. I can tell you that Ruby Ridge and Waco never should've happened and were examples of government overreach and massacre of it's own citizens. Or <insert pet issue with the government here>
    You can't with Tiananmen square in China
    
    mannanj 5 months ago
    
    I still see/hear cynicism with a hidden structure of naivety behind.
    
    echion 4 months ago
    
    > Tiananmen square
    The Tiananmen Square killings happened independent of one's "worldview".
    What's an example of "the exact same thing" done to OpenAI models by western governments?
- Balinares 5 months ago
  
  Speculating: there are two connected businesses here, creating the models, and serving the models. Outside of a few moneyed outliers, no one is going to run this at home. So at worst opening this model allows mid-sized competitors to serve it to customers from their own infra -- which helps Kimi gain mindshare, particularly against the large incumbents who are definitely not going to be serving Kimi and so don't benefit from its openness.
  Given the shallowness of moats in the LLM market, optimizing for mindshare would not be the worst move.
- tokioyoyo 5 months ago
  
  Moonshot’s (Kimi’s owner) investors are Alibaba/Tencent et al. Chinese market is stupidly competitive, and there’s a general attitude of “household name will take it all”. However getting there requires having a WeChat-esque user base, through one way or another. If it’s paid, there’ll be friction and it won’t work. Plus, it undermines a lot of other companies, which is a win for a lot of people.
- ggdG 5 months ago
  
  I think this fits into some "Commoditize The Complement" strategy.
  https://gwern.net/complement
- deskamess 5 months ago
  
  I think there is a book (Chip War) about how the USSR did not effectively participate in staying at the edge of the semiconductor revolution. And they have suffered for it.
  China has decided they are going to participate in the LLM/AGI/etc revolution at any cost. So it is a sunk cost, and the models are just an end product and any revenue is validation and great, but not essential. The cheaper price points keep their models used and relevant. It challenges the other (US, EU) models to innovate and keep ahead to justify their higher valuations (both monthly plan, and investor). Once those advances are made, it can be bought back to their own models. In effect, the currently leading models are running from a second place candidate who never gets tired and eventually does what they do at a lower price point.
  - kaibee 5 months ago
    
    In some way, the US won the cold war by spending so much on military that the USSR, in trying to keep up, collapsed. I don't see any parallels between that and China providing infinite free compute to their AI labs, why do you ask?
- culi 5 months ago
  
  All economically transformative technologies have done similar. If it's privatized, it's not gonna be transformative across the industry. The GPS, the internet, touchscreens, AI voice assistants, microchips, LCDs, etc were all publicly funded (or made by Bell Labs which had a state-mandated monopoly that forced them to open up their patents).
  The economist Mariana Mazzucato wrote a great book about this called The Entrepreneurial State: Debunking Public vs. Private Sector Myths
- overfeed 5 months ago
  
  > What amazes me is why would someone spend millions to train this model and give it away for free. What is the business here?
  How many millions did Google spend on Android (acquisition and salaries), only to give it away for free?
  Usually, companies do this to break into a monopolized market (or one that's at risk of becoming one), with openness as a sweetener. IBM with Linux to break UNIX-on-big-iron domination, Google with Android vs. iPhone, Sun with OpenSolaris vs. Linux-on-x86.
- YetAnotherNick 5 months ago
  
  Hosting the model is cheaper per token, the more batched token you get. So they have big advantage here.
- testfrequency 5 months ago
  
  Curious to hear what “OpenAI” thinks the answer to this is
- WarmWash 5 months ago
  
  It's another state project funded at the discretion of the party.
  If you look at past state projects, profitability wasn't really considered much. They are notorious for a "Money hose until a diamond is found in the mountains of waste"
PlatoIsADisease 5 months ago

I am convinced that was mostly just marketing. No one uses deepseek as far as I can tell. People are not running it locally. People choose GPT/Gemini/Claude/Grok if you are giving your data away anyway.
My biggest source of my conspiracy is that I made a reddit thread asking a question: "Why all the deepseek hype" or something like that. And to this day, I get odd, 'pro deepseek' comments from accounts only used every few months. Its not like this was some highly upvoted topic that is in the 'Top'.
I'd put that deepseek marketing on-par with an Apple marketing campaign.
- logicprog 5 months ago
  
  I don't use DeepSeek, but I prefer Kimi and GLM to closed models for most of my work.
- mekpro 5 months ago
  
  Except that, In OpenRouter, Deepseek always maintain in Top 10 Ranking. Although I did not use it personally, i believe that their main advantage over other model is price/performance.
  - culi 5 months ago
    
    Fifth in market share in fact!
    https://openrouter.ai/rankings
    There are a lot of applications where you really just want a cheap and efficient model that's still somewhat competitive and that's exactly the niche DeepSeek fulfills the best.
catigula 5 months ago

I mean, there are credible safety issues here. A Kimi fine-tune will absolutely be able to help people do cybersecurity related attacks - very good ones.
In a few years, or less, biological attacks and other sorts of attacks will be plausible with the help of these agents.
Chinese companies aren't humanitarian endeavors.

jumploops 5 months ago

> For complex tasks, Kimi K2.5 can self-direct an agent swarm with up to 100 sub-agents, executing parallel workflows across up to 1,500 tool calls.

> K2.5 Agent Swarm improves performance on complex tasks through parallel, specialized execution [..] leads to an 80% reduction in end-to-end runtime

Not just RL on tool calling, but RL on agent orchestration, neat!

storystarling 5 months ago

1,500 tool calls per task sounds like a nightmare for unit economics though. I've been optimizing my own agent workflows and even a few dozen steps makes it hard to keep margins positive, so I'm not sure how this is viable for anyone not burning VC cash.
- zozbot234 5 months ago
  
  "tool call" is just a reference to any elementary interaction with the outside system. It's not calling third-party APIs or anything like that.
  - storystarling 5 months ago
    
    True, but that's still 1,500 inference cycles. Even without external API fees, the latency and compute burden seems huge. I don't see how the economics work there without significant subsidies.
    
    darrinm 5 months ago
    
    FWIW many tool calls can be and often are made in one inference cycle.
XCSme 5 months ago

> Kimi K2.5 can self-direct an agent swarm
Is this within the model? Or within the IDE/service that runs the model?
Because tool calling is mostly just the agent outputting "call tool X", and the IDE does it and returns the data back to AI's context
- mzl 5 months ago
  
  An LLM model only outputs tokens, so this could be seen as an extension of tool calling where it has trained on the knowledge and use-cases for "tool-calling" itself as a sub-agent.
  - XCSme 5 months ago
    
    Ok, so agent swarm = tool calling where the tool is a LLM call and the argument is the prompt
    
    IanCal 5 months ago
    
    Yes largely, although they’ve trained a model specifically for this task rather than using the base model and a bit of prompting.
    
    dcre 5 months ago
    
    Sort of. It’s not necessarily a single call. In the general case it would be spinning up a long-running agent with various kinds of configuration — prompts, but also coding environment and which tools are available to it — like subagents in Claude Code.
mohsen1 5 months ago

Parallel agents are such a simple, yet powerful hack. Using it in Claude Code with TeammateTool and getting lots of good results!
- esperent 5 months ago
  
  > TeammateTool
  What is this?
  - frimmy 5 months ago
    
    https://x.com/kieranklaassen/status/2014830266515382693 - agent swarms tool shipping w/ cc soon..
  - jlu 5 months ago
    
    claude code hidden feaure currently under a feature flag:
    https://github.com/mikekelly/claude-sneakpeek

culi 5 months ago

I posted this elsewhere but thought I'd repost here:

* https://lmarena.ai/leaderboard — crowd-sourced head-to-head battles between models using ELO

* https://dashboard.safe.ai/ — CAIS' incredible dashboard

* https://clocks.brianmoore.com/ — a visual comparison of how well models can draw a clock. A new clock is drawn every minute

* https://eqbench.com/ — emotional intelligence benchmarks for LLMs

* https://www.ocrarena.ai/battle — OCR battles, ELO

* https://mafia-arena.com/ — LLMs playing the social deduction game Mafia

* https://openrouter.ai/rankings — marketshare based on OpenRouter

vinhnx 5 months ago

One thing caught my eyes is that besides K2.5 model, Moonshot AI also launched Kimi Code (https://www.kimi.com/code), evolved from Kimi CLI. It is a terminal coding agent, I've been used it last month with Kimi subscription, it is capable agent with stable harness.

GitHub: https://github.com/MoonshotAI/kimi-cli

forgotpwd16 5 months ago

>Kimi Code CLI is not only a coding agent, but also a shell.
That's cool. It also has a zsh hook, allowing you to switch to agent mode wherever you're.
- vinhnx 5 months ago
  
  It is, Kimi Code CLI supports Zed' Agent Client Protocol (http://agentclientprotocol.com/), so it can acts as an external agent that could run in any ACP-compatible client, eg: Zed, Jetbrain, Toad CLI, Minano Notebook. Also, it supports Agent Skills. Moonshot AI developers are actively update the agent and every active. I really like their CLI.
esafak 5 months ago

Does it support the swarm feature? Does Opencode?
- canadiantim 4 months ago
  
  https://github.com/code-yeongyu/oh-my-opencode
Imanari 5 months ago

How does it fare against CC?
- vuldin 5 months ago
  
  Anecdotally, I've cancelled my Claude Code subscription after using Kimi K2.5 and Kimi CLI for the last few days. It's handled everything I've thrown at it. It is slower at the moment, but I expect that will improve.

Alifatisk 5 months ago

Have you all noted that the latest releases (Qwen3 max thinking, now Kimi k2.5) from Chinese companies are benching against Claude opus now and not Sonnet? They are truly catching up, almost at the same pace?

conception 5 months ago

https://clocks.brianmoore.com
K2 is one of the only models to nail the clock face test as well. It’s a great model.
- culi 5 months ago
  
  Kimi 2 is remarkably consistently the best. I wonder if it's somehow been trained specifically on tasks like these. It seems too consistent to be coincidence
  Also shocking is how the most common runner up I've seen is DeepSeek
- michaelcampbell 4 months ago
  
  It's better than most, but not 100%. As I see this the clock hands are all correct, but the numbers only go 1-8.
- DJBunnies 5 months ago
  
  Cool comparison, but none of them get both the face and the time correct when I look at it.
  - conception 5 months ago
    
    Refresh. It’s not every time but k2 hits a perfect clock for me about 7/10 or so.
WarmWash 5 months ago

They distill the major western models, so anytime a new SOTA model drops, you can expect the Chinese labs to update their models within a few months.
- zozbot234 5 months ago
  
  This is just a conspiracy theory/urban legend. How do you "distill" a proprietary model with no access to the original weights? Just doing the equivalent of training on chat/API logs has terrible effectiveness (you're trying to drink from a giant firehose through a tiny straw) and gives you no underlying improvements.
- Alifatisk 5 months ago
  
  Yes, they do distill. But just saying all they do is distill is not correct and actually kind of unfair. These Chinese labs have done lots of research in this field and publish it to the public, some of not majority contribute with open-weight models making a future of local llm possible! Deepseek, Moonshot, Minimax, Z.a, Alibabai (Qwen).
  They are not just leeching here, they took this innovation, refined it and improved it further. This is what the Chinese is good at.
- Balinares 5 months ago
  
  Source?
esafak 5 months ago

They are, in benchmarks. In practice Anthropic's models are ahead of where their benchmarks suggest.
- HNisCIS 5 months ago
  
  Bear in mind that lead may be, in large part, from the tooling rather than the model
zozbot234 5 months ago

The benching is sus, it's way more important to look at real usage scenarios.

Reubend 5 months ago

I've read several people say that Kimi K2 has a better "emotional intelligence" than other models. I'll be interested to see whether K2.5 continues or even improves on that.

flexagoon 5 months ago

I love the Kimi response style. It's much more concise, without all the unnecessary "great question!"s and other annoying AI stuff
Alifatisk 5 months ago

Yup, I experience the same. I don't know what they do to achieve this but it gives them this edge, really curious to learn more about what makes it so good at it.
- in-silico 5 months ago
  
  A lot of people point to the Muon optimizer that Moonshot (the creators of Kimi) pioneered. Compared to the standard optimizer AdamW, Muon amplifies low-magnitude gradient directions which makes the model learn faster (and maybe gives Kimi its unique qualities).
  Muon paper: https://arxiv.org/abs/2502.16982
  - Alifatisk 5 months ago
    
    Wow! Thank you
storystarling 5 months ago

yes, though this is highly subjective - it 'feels' like that to me as well (comapred to Gemini 3, GPT 5.2, Opus 4.5).
mohsen1 5 months ago

I'll test it out on mafia-arena.com once it is available on Open Router

2001zhaozhao 5 months ago

The directionally interesting part is that according to the announcement, K2.5 seems to be trained specifically to create sub-agents and work in an agent swarm usefully. The key part is that you don't need to manually create or prompt sub-agents, K2.5 creates them automatically, so from the looks of things it's similar to Claude Code dynamic sub-agents except the model is trained to scale to many more agents autonomously.

I wonder whether Claude is doing the same kind of training and it's coming with the next model, and that's why the agent swarm mode in Claude Code is hidden for now. We might be getting very very good agent orchestrators/swarms very soon.

zmmmmm 5 months ago

Curious what would be the most minimal reasonable hardware one would need to deploy this locally?

NitpickLawyer 5 months ago

I parsed "reasonable" as in having reasonable speed to actually use this as intended (in agentic setups). In that case, it's a minimum of 70-100k for hardware (8x 6000 PRO + all the other pieces to make it work). The model comes with native INT4 quant, so ~600GB for the weights alone. An 8x 96GB setup would give you ~160GB for kv caching.
You can of course "run" this on cheaper hardware, but the speeds will not be suitable for actual use (i.e. minutes for a simple prompt, tens of minutes for high context sessions per turn).
simonw 5 months ago

Models of this size can usually be run using MLX on a pair of 512GB Mac Studio M3 Ultras, which are about $10,000 each so $20,000 for the pair.
- PlatoIsADisease 5 months ago
  
  You might want to clarify that this is more of a "Look it technically works"
  Not a "I actually use this"
  The difference between waiting 20 minutes to answer the prompt '1+1='
  and actually using it for something useful is massive here. I wonder where this idea of running AI on CPU comes from. Was it Apple astroturfing? Was it Apple fanboys? I don't see people wasting time on non-Apple CPUs. (Although, I did do this for a 7B model)
  - mholm 5 months ago
    
    The reason Macs get recommended is the unified memory, which is usable as VRAM for the GPU. People are similarly using the AMD Strix Halo for AI which also has a similar memory architecture. Time to first token for something like '1+1=' would be seconds, and then you'd be getting ~20 tokens per second, which is absolutely plenty fast for regular use. Token/s slows down at the higher end of context, but it's absolutely still practical for a lot of usecases. Though I agree that agentic coding, especially over large projects, would likely get too slow to be practical.
    
    PlatoIsADisease 5 months ago
    
    We are getting into a debate between particulars and universals. To call the 'unified memory' VRAM is quite a generalization. Whatever the case, we can tell from stock prices that whatever this VRAM is, its nothing compared to NVIDIA.
    Anyway, we were trying to run a 70B model on a macbook(can't remember which M model) at a fortune 20 company, it never became practical. We were trying to compare strings of character length ~200. It was like 400-ish characters plus a pre-prompt.
    I can't imagine this being reasonable on a 1T model, let alone the 400B models of deepseek and LLAMA.
    
    Gracana 5 months ago
    
    With 32B active parameters, Kimi K2.5 will run faster than your 70B model.
    
    simonw 5 months ago
    
    Here's a video of a previous 1T K2 model running using MLX on a a pair of Mac Studios: https://twitter.com/awnihannun/status/1943723599971443134 - performance isn't terrible.
    
    PlatoIsADisease 5 months ago
    
    Is there a catch? I was not getting anything like this on a 70B model.
    EDIT: oh its a marketing account and the program never finished... who knows the validity.
    
    simonw 5 months ago
    
    I don't think Awni should be dismissed as a "marketing account" - they're an engineer at Apple who's been driving the MLX project for a couple of years now, they've earned a lot of respect from me.
    
    PlatoIsADisease 5 months ago
    
    Given how secretive Apple is, oh my, its super duper marketing account.
    
    mholm 5 months ago
    
    Jeff Geerling and a few others also got access to similarly specced mac clusters. They replicated this performance.
    The tooling involved has improved significantly over the past year.
    
    zozbot234 5 months ago
    
    Not too slow if you just let it run overnight/in the background. But the biggest draw would be no rate limits whatsoever compared to the big proprietary APIs, especially Claude's. No risk of sudden rugpulls either, and the model will have very consistent performance.
  - simonw 5 months ago
    
    MLX uses the GPU.
    That said, I wouldn't necessarily recommend spending $20,000 on a pair of Mac Studios to run models like this. The performance won't be nearly as good as the server-class GPU hardware that hosted models run on.
  - tucnak 5 months ago
    
    Mac studio way is not "AI on CPU," as M2/M4 are complex SoC, that includes a GPU with unified memory access.
    
    PlatoIsADisease 5 months ago
    
    If it worked IRL for anything useful, I'd be more interested in the technical differences. But it was a mere toy for a few tests at my fortune 20 company.
    Language is full of issues of particulars vs universals, and you could debate if its just an integrated GPU with different marketing.
    Whatever the case, we couldn't use it in production, and NVIDIAs stock price reflects the reality on the ground.
    
    tucnak 5 months ago
    
    Well, I've been using a fine-tuned variant of Gemma 3 model since it came out, and some embedding models, on a laptop. It's not "useless" by any means, in fact it still beats the latest Claude for my use-case in Ukrainian. Not to mention that if you travel by train a lot, you will find it quite useful. I own a Mac studio M2 Max (96 GB) variant at home, and I'm routinely using the larger models for the kind of stuff I don't wish to share with model providers.
    My 2 cents
tosh 5 months ago

I think you can put a bunch of apple silicon macs with enough ram together
e.g. in an office or coworking space
800-1000 gb ram perhaps?

Topfi 5 months ago

K2 0905 and K2 Thinking shortly after that have done impressively well in my personal use cases and was severely slept on. Faster, more accurate, less expensive, more flexible in terms of hosting and available months before Gemini 3 Flash, I really struggle to understand why Flash got such positive attention at launch.

Interested in the dedicated Agent and Agent Swarm releases, especially in how that could affect third party hosting of the models.

msp26 5 months ago

K2 thinking didn't have vision which was a big drawback for my projects.

throwaw12 5 months ago

Congratulations, great work Kimi team.

Why is that Claude still at the top in coding, are they heavily focused on training for coding or is it their general training is so good that it performs well in coding?

Someone please beat the Opus 4.5 in coding, I want to replace it.

pokot0 5 months ago

I don't think that kind of difference in benchmarks has any meaning at all. Your agentic coding tool and the task you are working on introduce a lot more "noise" than that small delta.
Also consider they are all overfitting on the benchmark itself so there might be that as well (which can go in either directions)
I consider the top models practically identical for coding applications (just personal experience with heavy use of both GPT5.2 and Opus 4.5).
Excited to see how this model compares in real applications. It's 1/5th of the price of top models!!
Balinares 5 months ago

I replaced Opus with Gemini Pro and it's just plain a better coder IMO. It'll restructure code to enable support for new requirements where Opus seems to just pile on more indirection layers by default, when it doesn't outright hardcode special cases inside existing functions, or drop the cases it's failing to support from the requirements while smugly informing you you don't need that anyway.
MattRix 5 months ago

Opus 4.5 only came out two months ago, and yes Anthropic spends a lot of effort making it particularly good at coding.
symisc_devel 5 months ago

Gemini 3 pro is way better than Opus especially for large codebases.
- rubslopes 5 months ago
  
  Do you use it only for code editing, or also for running bash commands? My experience is that it is very bad at the latter.
- redox99 5 months ago
  
  My experience is the total opposite.

spaceman_2020 5 months ago

Kimi was already one of the best writing models. Excited to try this one out

Alifatisk 5 months ago

To me, Kimi has been the best with writing and conversing, its way more human like!

simonw 5 months ago

Pretty cute pelican https://tools.simonwillison.net/svg-render#%3Csvg%20viewBox%...

simonw 5 months ago

Oops, here's a working link: https://gist.github.com/simonw/32a85e337fbc6ee935d10d89726c0...
mythz 5 months ago

doesn't work, looks like the link or SVG was cropped.
bavell 5 months ago

No pelican for me :(

hmate9 5 months ago

About 600GB needed for weights alone, so on AWS you need an p5.48xlarge (8× H100) which costs $55/hour.

Barathkanna 5 months ago

A realistic setup for this would be a 16× H100 80GB with NVLink. That comfortably handles the active 32B experts plus KV cache without extreme quantization. Cost-wise we are looking at roughly $500k–$700k upfront or $40–60/hr on-demand, which makes it clear this model is aimed at serious infra teams, not casual single-GPU deployments. I’m curious how API providers will price tokens on top of that hardware reality.

wongarsu 5 months ago

The weights are int4, so you'd only need 8xH100
a2128 5 months ago

You don't need to wait and see, Kimi K2 has the same hardware requirements and has several providers on OpenRouter:
https://openrouter.ai/moonshotai/kimi-k2-thinking https://openrouter.ai/moonshotai/kimi-k2-0905 https://openrouter.ai/moonshotai/kimi-k2-0905:exacto https://openrouter.ai/moonshotai/kimi-k2
Generally it seems to be in the neighborhood of $0.50/1M for input and $2.50/1M for output
reissbaker 5 months ago

Generally speaking, 8xH200s will be a lot cheaper than 16xH100s, and faster too. But both should technically work.
- pama 5 months ago
  
  You can do it and may be ok for single user with idle waiting times, but performance/throughput will be roughly halved (closer to 2/3) and free context will be more limited with 8xH200 vs 16xH100 (assuming decent interconnect). Depending a bit on usecase and workload 16xH100 (or 16xB200) may be a better config for cost optimization. Often there is a huge economy of scale with such large mixture of expert models so that it would even be cheaper to use 96 GPU instead of just 8 or 16. The reasons are complicatet and involve better prefill cache, less memory transfer per node.
bertili 5 months ago

The other realistic setup is $20k, for a small company that needs a private AI for coding or other internal agentic use with two Mac Studios connected over thunderbolt 5 RMDA.
- Barathkanna 5 months ago
  
  That won’t realistically work for this model. Even with only ~32B active params, a 1T-scale MoE still needs the full expert set available for fast routing, which means hundreds of GB to TBs of weights resident. Mac Studios don’t share unified memory across machines, Thunderbolt isn’t remotely comparable to NVLink for expert exchange, and bandwidth becomes the bottleneck immediately. You could maybe load fragments experimentally, but inference would be impractically slow and brittle. It’s a very different class of workload than private coding models.
  - bertili 5 months ago
    
    People are running the previous Kimi K2 on 2 Mac Studios at 21tokens/s or 4 Macs at 30tokens/s. Its still premature, but not a completely crazy proposition for the near future, giving the rate of progress.
    
    NitpickLawyer 5 months ago
    
    > 2 Mac Studios at 21tokens/s or 4 Macs at 30tokens/s
    Keep in mind that most people posting speed benchmarks try them with basically 0 context. Those speeds will not hold at 32/64/128k context length.
  - zozbot234 5 months ago
    
    If "fast" routing is per-token, the experts can just reside on SSD's. the performance is good enough these days. You don't need to globally share unified memory across the nodes, you'd just run distributed inference.
    Anyway, in the future your local model setups will just be downloading experts on the fly from experts-exchange. That site will become as important to AI as downloadmoreram.com.
  - YetAnotherNick 5 months ago
    
    Depends on if you are using tensor parallelism or pipeline parallelism, in the second case you don't need any sharing.
  - omneity 5 months ago
    
    RDMA over Thunderbolt is a thing now.
- embedding-shape 5 months ago
  
  I'd love to see the prompt processing speed difference between 16× H100 and 2× Mac Studio.
  - zozbot234 5 months ago
    
    Prompt processing/prefill can even get some speedup from local NPU use most likely: when you're ultimately limited by thermal/power limit throttling, having more efficient compute available means more headroom.
  - Barathkanna 5 months ago
    
    I asked GPT for a rough estimate to benchmark prompt prefill on an 8,192 token input. • 16× H100: 8,192 / (20k to 80k tokens/sec) ≈ 0.10 to 0.41s • 2× Mac Studio (M3 Max): 8,192 / (150 to 700 tokens/sec) ≈ 12 to 55s
    These are order-of-magnitude numbers, but the takeaway is that multi H100 boxes are plausibly ~100× faster than workstation Macs for this class of model, especially for long-context prefill.
    
    ffsm8 5 months ago
    
    You do realize that's entirely made up, right?
    Could be true, could be fake - the only thing we can be sure of is that it's made up with no basis in reality.
    This is not how you use llms effectively, that's how you give everyone that's using them a bad name from association
- zozbot234 5 months ago
  
  That's great for affordable local use but it'll be slow: even with the proper multi-node inference setup, the thunderbolt link will be a comparative bottleneck.

Jackson__ 5 months ago

As your local vision nut, their claims about "SOTA" vision are absolutely BS in my tests.

Sure it's SOTA at standard vision benchmarks. But on tasks that require proper image understanding, see for example BabyVision[0] it appears very much lacking compared to Gemini 3 Pro.

[0] https://arxiv.org/html/2601.06521v1

nostrebored 5 months ago

Gemini remains the only usable vision fm :(

striking 5 months ago

https://archive.is/P98JR

pu_pe 5 months ago

I don't get this "agent swarm" concept. You set up a task and they boot up 100 LLMs to try to do it in parallel, and then one "LLM judge" puts it all together? Is there anywhere I can read more about it?

vessenes 5 months ago

You can read about this basically everywhere - the term of art is agent orchestration. Gas town, Claude’s secret swarm mode, or people who like to use phrases like “Wiggum loop” will get you there.
If you’re really lazy - the quick summary is that you can benefit from the sweet spot of context length and reduce instruction overload while getting some parallelism benefits from farming tasks out to LLMs with different instructions. The way this is generally implemented today is through tool calling, although Claude also has a skills interface it has been trained against.
So the idea would be for software development, why not have a project/product manager spin out tasks to a bunch of agents that are primed to be good at different things? E.g. an architect, a designer, and so on. Then you just need something that can rectify GitHub PRs and bob’s your uncle.
Gas town takes a different approach and parallelizes on coding tasks of any sort at the base layer, and uses the orchestration infrastructure to keep those coders working constantly, optimizing for minimal human input.
- IanCal 5 months ago
  
  I'm not sure whether there are parts of this done for claude but those other ones are layers on top of the usual LLMs we see. This seems to be a bit different, in that there's a different model trained specifically for splitting up and managing the workload.
Rebuff5007 5 months ago

I've also been quite skeptical, and I became even more skeptical after hearing a tech talk from a startup in this space [1].
I think the best way to think about it is that its an engineering hack to deal with a shortcoming of LLMs: for complex queries LLMs are unable to directly compute a SOLUTION given a PROMPT, but are instead able to break down the prompt to intermediate solutions and eventually solve the original prompt. These "orchestrator" / "swarm" agents add some formalism to this and allow you to distribute compute, and then also use specialized models for some of the sub problems.
[1] https://www.deepflow.com/
rvnx 5 months ago

You have a team lead that establishes a list of tasks that are needed to achieve your mission
then it creates a list of employees, each of them is specialized for a task, and they work in parallel.
Essentially hiring a team of people who get specialized on one problem.
Do one thing and do it well.
- XCSme 5 months ago
  
  But in the end, isn't this the same idea with the MoE?
  Where we have more specialized "jobs", which the model is actually trained for.
  I think the main difference with agents swarm is the ability to run them in parallel. I don't see how this adds much compared to simply sending multiple API calls in parallel with your desired tasks. I guess the only difference is that you let the AI decide how to split those requests and what each task should be.
  - zozbot234 5 months ago
    
    Nope. MoE is strictly about model parameter sparsity. Agents are about running multiple small-scale tasks in parallel and aggregating the results for further processing - it saves a lot of context length compared to having it all in a single session, and context length has quadratic compute overhead so this matters. You can have both.
    One positive side effect of this is that if subagent tasks can be dispatched to cheaper and more efficient edge-inference hardware that can be deployed at scale (think nVidia Jetsons or even Apple Macs or AMD APU's) even though it might be highly limited in what can fit on the single node, then complex coding tasks ultimately become a lot cheaper per token than generic chat.
    
    XCSme 5 months ago
    
    Yes, I know you can have both.
    My point was that this is just a different way of creating specialised task solvers, the same as with MoE.
    And, as you said, with MoE it's about the model itself, and it's done at training level so that's not something we can easily do ourselves.
    But with agent swarm, isn't it simply splitting a task in multiple sub-tasks and sending each one in a different API call? So this can be done with any of the previous models too, only that the user has to manually define those tasks/contexts for each query.
    Or is this at a much more granular level than this, which would not be feasible to be done by hand?
    I was already doing this in n8n, creating different agents with different system prompts for different tasks. I am not sure if automating this (with swarm) would work well in my most cases, I don't see how this fully complements Tools or Skills
    
    zozbot234 5 months ago
    
    MoE has nothing whatsoever to do with specialized task solvers. It always operates per token within a single task, you can think of it perhaps as a kind of learned "attention" for model parameters as opposed to context data.
    
    XCSme 5 months ago
    
    Yes, specific weights/parameters have be trained to solve specific tasks (trained on different data).
    Or did I misunderstand the concept of MoE, and it's not about having specific parts of the model (parameters) do better on specific input contexts?
jonkoops 5 months ago

The datacenters yearn for the chips.

erichocean 5 months ago

Running on Apple Silicon: https://x.com/awnihannun/status/2016221496084205965

enricoros 5 months ago

CCP-bench has gotten WAY better on K2.5!

https://big-agi.com/static/kimi-k2.5-less-censored.jpg

teiferer 5 months ago

Can we please stop calling those models "open source"? Yes the weights are open. So, "open weight" maybe. But the source isn't open, the thing that allows to re-create it. That's what "open source" used to mean. (Together with a license that allows you to use that source for various things.)

Onavo 5 months ago

No major AI lab will admit to training on proprietary or copyrighted data so what you are asking is an impossibility. You can make a pretty good LLM if you train on Anna's Archive but it will either be released anonymously, or with a research only non commercial license.
There aren't enough public domain data to create good LLMs, especially once you get into the newer benchmarks that expect PhD level of domain expertise in various niche verticals.
It's also a logical impossibility to create a zero knowledge proof that will allow you to attribute to specific training data without admitting to usage.
I can think of a few technical options but none would hold water legally.
You can use a Σ-protocol OR-composition to prove that it was trained either on a copyrighted dataset or a non copyrighted dataset without admitting to which one (technically interesting, legally unsound).
You can prove that a model trained on copywrited data is statistically indistinguishable from one trained on non-copywrited data (an information theoretic impossibility unless there exist as much public domain data as copywrited data, in similar distributions).
You can prove a public domain and copywrited dataset are equivalent if the model performance produced is indistinguishable from each other.
All the proofs fail irl, ignoring the legal implications, because there's less public domain information, so given the lemma that more training data == improved model performance, all the above are close to impossible.

DeathArrow 5 months ago

Those are some impressive benchmark results. I wonder how well it does in real life.

Maybe we can get away with something cheaper than Claude for coding.

oneneptune 5 months ago

I'm curious about the "cheaper" claim -- I checked Kimi pricing, and it's a $200/mo subscription too?
- NitpickLawyer 5 months ago
  
  On openrouter 2.5 is at 0.60/3$ per Mtok. That's haiku pricing.
  - storystarling 5 months ago
    
    The unit economics seem tough at that price for a 1T parameter model. Even with MoE sparsity you are still VRAM bound just keeping the weights resident, which is a much higher baseline cost than serving a smaller model like Haiku.
- mrklol 5 months ago
  
  They also have a $20 and $40 tier.
  - esafak 5 months ago
    
    https://www.kimi.com/code
  - Alifatisk 5 months ago
    
    If you bargain with their bot Kimmmmy (not joking), you can even get lower pricing.
    
    mohsen1 5 months ago
    
    tell me more...
    
    Alifatisk 5 months ago
    
    Go to kimi chat, there will come up multiple suggestions of use cases. One of them will be the bargain robot. If you download their mobile app, the challenge to bargain will probably popup too!
    Depending on how well you bargain with the robot, you can go as low as 0,99$ (difficult). Either way, their moderate plan doesn’t have to be 20$. The agent wants a good reason for why it should lower the price for you.
    Here’s the direct link to Kimmmmy:
    https://www.kimi.com/kimiplus/sale
    I’ll send an invite link too if you don’t mind:
    https://www.kimi.com/kimiplus/sale?activity_enter_method=h5_...
    
    mohsen1 5 months ago
    
    omg this is so funny!

stopachka 5 months ago

Is there a startup that takes models like this, and effectively gives you a secure setup, where you have (a) a mobile app that (b) talks to some giant machine that only you have access too.

If a 10K computer could run this, it may be worth it to have a "fully on prem" version of ChatGPT running for you.

dev_l1x_be 5 months ago

I had these weird situations like some models are refusing to use SSH as a tool. Not sure if it was the coding tool limitation or it is baked into in some of the models.

monkeydust 5 months ago

Is this actually good or just optimized heavily for benchmarks? I am hopefully its the former based on the writeup but need to put it through its paces.

kurtis_reed 5 months ago

Quite good in my testing

jdeng 5 months ago

Glad to to see open source models are catching up and treat vision as first-class citizen (a.k.a native multimodal agentic model). GLM and Qwen models takes different approach, by having a base model and a vision variant (glm-4.6 vs glm-4.6v).

I guess after Kimi K2.5, other vendors are going to the same route?

Can't wait to see how this model performs on computer automation use cases like VITA AI Coworker.

https://www.vita-ai.net/

pplonski86 5 months ago

There are so many models, is there any website with list of all of them and comparison of performance on different tasks?

Reubend 5 months ago

The post actually has great benchmark tables inside of it. They might be outdated in a few months, but for now, it gives you a great summary. Seems like Gemini wins on image and video perf, Claude is the best at coding, ChatGPT is the best for general knowledge.
But ultimately, you need to try them yourself on the tasks you care about and just see. My personal experience is that right now, Gemini Pro performs the best at everything I throw at it. I think it's superior to Claude and all of the OSS models by a small margin, even for things like coding.
- Imustaskforhelp 5 months ago
  
  I like Gemini Pro's UI over Claude so much but honestly I might start using Kimi K2.5 if its open source & just +/- Gemini Pro/Chatgpt/Claude because at that point I feel like the results are negligible and we are getting SOTA open source models again.
  - wobfan 5 months ago
    
    > honestly I might start using Kimi K2.5 if its open source & just +/- Gemini Pro/Chatgpt/Claude because at that point I feel like the results are negligible and we are getting SOTA open source models again.
    Me too!
    > I like Gemini Pro's UI over Claude so much
    This I don't understand. I mean, I don't see a lot of difference in both UIs. Quite the opposite, apart from some animations, round corners and color gradings, they seem to look very alike, no?
    
    Imustaskforhelp 5 months ago
    
    Y'know I ended up buying Kimi's moderato plan which is 19$ but they had this unique idea where you can talk to a bot and they could reduce the price
    I made it reduce the price of first month to 1.49$ (It could go to 0.99$ and my frugal mind wanted it haha but I just couldn't have it do that lol)
    Anyways, afterwards for privacy purposes/( I am a minor so don't have a card), ended up going to g2a to get a 10$ Visa gift card essentially and used it. (I had to pay a 1$ extra but sure)
    Installed kimi code on my mac and trying it out. Honestly, I am kind of liking it.
    My internal benchmark is creating pomodoro apps in golang web... Gemini 3 pro has nailed it, I just tried the kimi version and it does have some bugs but it feels like it added more features.
    Gonna have to try it out for a month.
    I mean I just wish it was this cheap for the whole year :< (As I could then move from, say using the completely free models)
    Gonna have to try it out more!
coffeeri 5 months ago

There is https://artificialanalysis.ai
- XCSme 5 months ago
  
  There are many lists, but I find all of them outdated or containing wrong information or missing the actual benchmarks I'm looking for.
  I was thinking, that maybe it's better to make my own benchmarks with the questions/things I'm interested in, and whenever a new model comes out run those tests with that model using open-router.
- pplonski86 5 months ago
  
  Thank you! Exactly what I was looking for

mangolie 5 months ago

they cooked

lrvick 5 months ago

Actually open source, or yet another public model, which is the equivalent of a binary?

URL is down so cannot tell.

Tepix 5 months ago

It's open weights, not open source.
typ 5 months ago

The label 'open source' has become a reputation reaping and marketing vehicle rather than an informative term since the Hugging Face benchmark race started. With the weights only, we cannot actually audit that if a model is a) contaminated by benchmarks, b) built with deliberate biases, or c) trained on copyrighted/privacy data, let alone allowing other vendors to replicate the results. Anyways, people still love free stuff.
- Der_Einzige 5 months ago
  
  Just accept that IP laws don't matter and the old "free software" paradigm is dead. Aaron Swartz died so that GenAI may live. RMS and his model of "copyleft" are so Web 1.0 (not even 2.0). No one in GenAI cares AT ALL about the true definition of open source. Good.
  - duskdozer 5 months ago
    
    Good?

billyellow 5 months ago

Cool

rvz 5 months ago

The chefs at Moonshot have cooked once again.

Settings

Kimi Released Kimi K2.5, Open-Source Visual SOTA-Agentic Model

Keyboard Shortcuts