I was able to convert the real 40B model with my change here, which reduces memory use during HF conversion by loading only a single checkpoint part into RAM at a time: jploski#1
It required some work to get inference to actually run. I had to increase ctx_size:
ctx_size += ((size_t)3) * 1024 * 1024 * 1024; // add 3 GiB of headroom to the ggml context
Also, uhh... GGML_MAX_NODES at 4096 didn't quite cut it, and neither did 65535. I eventually just set it to 262144, which was enough to run the model. That's this define in ggml.h:
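#define GGML_MAX_NODES 262144 // default is 4096; even 65535 was too small for this 60-layer graph

Unfortunately, the output didn't make much sense: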
main: seed = 1686733539
falcon_model_load: loading model from '/home/nope/personal/ai/models/falc40b.ggml' - please wait ...
falcon_model_load: n_vocab = 65024
falcon_model_load: n_embd = 8192
falcon_model_load: n_head = 128
falcon_model_load: n_head_kv = 8
falcon_model_load: n_layer = 60
falcon_model_load: ftype = 2008
falcon_model_load: qntvr = 2
falcon_model_load: ggml ctx size = 28175.96 MB
falcon_model_load: memory_size = 480.00 MB, n_mem = 122880
falcon_model_load: ............................................................ done
falcon_model_load: model size = 27436.06 MB / num tensors = 484
extract_tests_from_file : No test file found.
test_gpt_tokenizer : 0 tests failed out of 0 tests.
main: number of tokens in prompt = 10
main: token[0] = 7107, Once
main: token[1] = 2918, upon
main: token[2] = 241, a
main: token[3] = 601, time
main: token[4] = 23, ,
main: token[5] = 629, there
main: token[6] = 398, was
main: token[7] = 241, a
main: token[8] = 1278, little
main: token[9] = 27224, fox
Once upon a time, there was a little fox and, I’re
' to’ ' .
it that,. is
,, of . for.' '- you,. we the- the
1 of a
. the
Although the output was broken, inference wasn't really that slow even with the crazy number of nodes: about the same speed as a 65B Q4_K_M LLaMA model under llama.cpp.
The mini-Shakespeare model seems fine:
main: seed = 1686733831
falcon_model_load: loading model from '/home/nope/personal/ai/models/falcsp.ggml' - please wait ...
falcon_model_load: n_vocab = 65024
falcon_model_load: n_embd = 256
falcon_model_load: n_head = 4
falcon_model_load: n_head_kv = 2
falcon_model_load: n_layer = 4
falcon_model_load: ftype = 2009
falcon_model_load: qntvr = 2
falcon_model_load: ggml ctx size = 3105.91 MB
falcon_model_load: memory_size = 8.00 MB, n_mem = 8192
falcon_model_load: .... done
falcon_model_load: model size = 25.89 MB / num tensors = 36
extract_tests_from_file : No test file found.
test_gpt_tokenizer : 0 tests failed out of 0 tests.
main: number of tokens in prompt = 1
main: token[0] = 4031, Now
Now, Clarence, my lord, I am a
the great men: I will to do this day you
In time they may live in men of tears are
Shall be not what we have fought in. What is this
and come to you? I have not made mine eyes,
Which now sent for, or I am so fast?
Your friends shall be revenged on thee, hoar!
And that you must, sirs, that you must do,
My friend to thee that news, with your love,
My father's wife and love for this day.
You are not hot, lords, and what I am not?
To take this, good sweet friend, I am not my life,
I warrant, as I, to have a little thing, my lord,
What you can stay with this good night do you all your tongue?
O, if not my fair soul to my brother, how well,
Where is
main: mem per token = 290292 bytes
main: load time = 266.82 ms
main: sample time = 64.16 ms
main: predict time = 240.96 ms / 1.20 ms per token
main: total time = 576.08 ms
Both models were quantized to Q5_0.
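As an aside, the ftype/qntvr lines in the load output use the packed encoding from the ggml examples: the quantization format version is folded into ftype via GGML_QNT_VERSION_FACTOR (1000 in ggml.h), and the loader unpacks it when reading the hparams. Here's a minimal standalone decode of the 40B file's value (the real code lives in each example's model_load; the constant is copied from ggml.h):

#include <stdint.h>
#include <stdio.h>

#define GGML_QNT_VERSION_FACTOR 1000 // same constant as in ggml.h

int main(void) {
    int32_t ftype = 2008;                                  // as printed by falcon_model_load above
    const int32_t qntvr = ftype / GGML_QNT_VERSION_FACTOR; // -> 2, matching "qntvr = 2"
    ftype %= GGML_QNT_VERSION_FACTOR;                      // -> 8, i.e. GGML_FTYPE_MOSTLY_Q5_0
    printf("qntvr = %d, base ftype = %d\n", qntvr, ftype);
    return 0;
}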