@allozaur Currently there is no truncation support (generation just stops when we reach the maximum context), and we are sending the request as text. I now think it would make more sense to always tokenize the notebook content and send tokens instead of text, then use the llama detokenize method afterwards. This is how text-generation-webui does it, and it makes truncating the prompt much easier.
For the truncation strategy, I'm thinking about having two different parameters, truncate_to and truncate_from, which could accept either a token count or a percentage. By truncating well below the threshold rather than right at it, we could limit the number of times the entire prompt gets reprocessed. If the context_size and max_size parameters are available, truncate_from would default to context_size - max_size, and truncate_to would default to something like 75%.
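To make the idea concrete, here is a minimal sketch of what I have in mind (function and parameter names are hypothetical, not existing code; it assumes truncate_from is a token count and truncate_to is a fraction of it):

```typescript
// Hypothetical sketch of the truncate_from / truncate_to strategy.
// tokens: the tokenized notebook content.
// truncateFrom: token count at which truncation triggers
//   (would default to context_size - max_size).
// truncateToPct: fraction of truncateFrom to keep after truncating
//   (would default to 0.75), so we don't re-truncate — and thus
//   reprocess the whole prompt — on every generation.
function truncateTokens(
  tokens: number[],
  truncateFrom: number,
  truncateToPct: number = 0.75
): number[] {
  // Below the threshold: nothing to do, prompt cache stays valid.
  if (tokens.length <= truncateFrom) return tokens;
  // Over the threshold: cut back further than strictly needed,
  // keeping the most recent tokens.
  const keep = Math.floor(truncateFrom * truncateToPct);
  return tokens.slice(tokens.length - keep);
}
```

The extra headroom between truncate_to and truncate_from is the whole point: truncating exactly at the limit would invalidate the cached prompt on every single generation.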
This would require adding a new category for the notebook in the settings, which could also hold other options, like a stop_text parameter, which could be interesting as well.
But this PR is already quite big, so it might make sense to add those features in a follow-up PR. What do you think? Should I add them here (I could work on it from Tuesday), or is it better to do it in another PR?