
Kalosm v0.2.0: Tasks, Evaluation, Prompt Auto-Tuning, Regex Validation, Surreal Database Integration, RAG Improvements, Performance Improvements, and More!

We're excited to announce the release of Kalosm v0.2.0! This release includes a number of new features, improvements, and bug fixes including:

  • Tasks and Agents
  • Task Evaluation
  • Prompt Auto-Tuning
  • Regex Validation
  • Surreal Database Integration
  • RAG improvements
  • Performance Improvements
  • New Models

Tasks and Agents!

Kalosm now includes utilities for running, evaluating, and improving tasks and agents.

Let's build a simple task and agent to demonstrate the new functionality.
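Here is a minimal sketch of a task. The constructor and method signatures below are illustrative of the v0.2 API shape rather than a verbatim copy of it, so check the Kalosm docs for the precise names:

```rust
use kalosm::language::*;

#[tokio::main]
async fn main() {
    // Load a local chat-tuned Llama model.
    let mut llm = Llama::new_chat();

    // A task pairs a fixed description (a system prompt) with a model so the
    // same instructions can be run against many inputs. Illustrative API.
    let task = Task::new("You are an assistant who answers math questions concisely.");

    // Run the task on one input and stream the answer to stdout.
    task.run("What is 2 + 2?", &mut llm).to_std_out().await.unwrap();
}
```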

Or you can use more complex constraints to parse structured output from the task.

Tasks can efficiently reuse the session between runs, which can significantly speed up the process:
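Continuing the sketch above (same `task` and `llm`), each additional run reuses the cached session for the task's fixed prompt, so only the new tokens need to be processed:

```rust
use std::time::Instant;

// The session (the model's KV cache for the task's fixed prompt) is built on
// the first run and reused afterwards, so later runs skip the shared prefix.
for question in ["What is 2 + 2?", "What is 3 + 3?", "What is 123 * 456?"] {
    let start = Instant::now();
    task.run(question, &mut llm).to_std_out().await.unwrap();
    println!("\nanswered in {:?}", start.elapsed());
}
```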

The third question required a more complex calculation and took more tokens to solve, but it was still answered significantly faster than the first question. The session from the first question was reused for the second and third questions, which made them run faster.

  • Evaluation Abstraction: Introducing an evaluation abstraction, providing enhanced functionality. (#113)
  • Prompt Auto-Tuning: Prompts can now be automatically tuned for better performance. (#132)

Let's take a look at the new prompt auto-tuning feature with an example. As part of the RAG improvements, Kalosm includes a task that generates hypothetical questions about a section of text; an embedding model can then embed those questions to find similar documents based on the meaning of the text. We can tune that task to find the best examples for its prompt with a PromptAnnealer:
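Here is a rough sketch of setting that up. PromptAnnealer is the new type from this release, but the builder call and the shape of the example data below are assumptions rather than confirmed v0.2 signatures:

```rust
use kalosm::language::*;

// Candidate (input, output) pairs the annealer may include as examples in the
// prompt. Hypothetical data for illustration.
let examples = [
    (
        "Rust's borrow checker enforces memory safety at compile time.",
        "How does Rust guarantee memory safety?",
    ),
    (
        "HTTP/3 runs over QUIC, a UDP-based transport with built-in encryption.",
        "What transport protocol does HTTP/3 use?",
    ),
];

// The annealer searches over subsets and orderings of the examples, scoring
// each candidate prompt on held-out cases, and returns the best set it found.
let best_examples = PromptAnnealer::builder(&mut llm, examples).build().await;
println!("{best_examples:?}");
```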

Here is the best set of examples that the prompt annealer found for the task:

Input: While traditional databases rely on a fixed schema, NoSQL databases like MongoDB offer a flexible structure, allowing you to store and retrieve data in a more dynamic way. This flexibility is particularly beneficial for applications with evolving data requirements.
Output: How does MongoDB differ from traditional databases?

Input: Blockchain technology, beyond cryptocurrencies, is being explored for applications like smart contracts. Smart contracts are self-executing contracts with the terms of the agreement directly written into code.
Output: How is blockchain technology utilized in the concept of smart contracts?

Feeding those two examples into the task achieves a similarity score of 0.71 across all of the other examples, compared to only 0.62 when two random examples are chosen from the set instead.

Regex Validation

Some feedback we got after the initial release of Kalosm was that constraints for constrained generation were too complex. Constraints in Kalosm serve two purposes:

  1. Validation: Constraints can be used to validate the output of the model. The model will only output text that can be parsed by the constraints. This lets you ensure that the output of the model is in the format you expect.

For example, you may want to force the model response to always start with a prefix that guides the model:
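As a sketch, the constraint below pins the start of the response to an exact string and then lets the model generate freely until a newline. `LiteralParser` is Kalosm's exact-string parser; `StopOn` and the `with_constraints` builder method are assumptions about the surrounding API:

```rust
use kalosm::language::*;

// Every valid output must begin with this exact prefix; the model cannot
// deviate from it. After the prefix, StopOn generates freely until "\n".
let constraints = LiteralParser::new("Sure! Here is a one-sentence answer: ")
    .then(StopOn::new("\n"));

// Attach the constraints to a task (builder method name is illustrative).
let task = Task::builder("You are a helpful assistant.")
    .with_constraints(constraints)
    .build();
```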

  2. Parsing: Constraints can be used to parse the output of the model. This can be extremely useful when you want to generate a specific structure from an LLM without writing separate logic for validation and parsing.

For example, you may want to generate a list of 10 numbers:
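A sketch of that constraint in the parser-combinator style, where a single parser both restricts generation and returns the parsed values. The combinator names below are assumptions standing in for the v0.2 API:

```rust
use kalosm::language::*;

// One number between 0 and 100, followed by ", ", repeated ten times.
// Combinator names are illustrative.
let number = IntParser::new(0..=100);
let ten_numbers = number.then(LiteralParser::new(", ")).repeat(10);

// Running a task constrained by `ten_numbers` yields the parsed Vec of
// integers directly, with no separate validation or parsing step.
```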

If you only need to validate the output of the model, the existing constraints can be more complex than what you need. In this release, we've added support for regex validation, which makes it easier to validate the output of the model without handling parsing:
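With regex validation, the same kind of restriction becomes a one-liner. A minimal sketch, assuming `RegexParser` is the new validator type (the constructor may be fallible, in which case unwrap its result):

```rust
use kalosm::language::*;

// Only text matching the pattern can be generated: exactly ten
// comma-separated numbers. No parsed value comes back, just validated text.
let constraints = RegexParser::new(r"\d+(, \d+){9}");
```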

Surreal Database Integration

Vector databases can be very useful when combined with LLMs. They can be used to store and retrieve similar documents based on the meaning of the text, not just the words used. However, vector databases only handle a very limited number of use cases. In this release, we've added support for Surreal DB for more traditional database use cases. Surreal DB can be embedded into your application and used to store and retrieve data locally as well as over the network.

Kalosm 0.2 allows you to create tables within Surreal DB that are indexed by vectors. You can then insert documents (or other embeddings) into the table and query the table for similar documents based on the meaning of the text.

Let's take a look at how you can use the Surreal DB integration to store and retrieve similar documents based on the meaning of the text:
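A sketch of the flow, assuming the embedded RocksDB engine and Kalosm's document-table extension. The method names (`document_table_builder`, `add_context`, `select_nearest`) follow the release's description but may not match the exact v0.2 signatures:

```rust
use kalosm::language::*;
use surrealdb::{engine::local::RocksDb, Surreal};

#[tokio::main]
async fn main() {
    // Open an embedded Surreal DB instance stored on disk.
    let db = Surreal::new::<RocksDb>("./db").await.unwrap();
    db.use_ns("test").use_db("test").await.unwrap();

    // Create a table whose rows are indexed by embeddings of their text.
    let table = db
        .document_table_builder("documents")
        .build()
        .await
        .unwrap();

    // Insert documents; they are chunked and embedded on the way in.
    table
        .add_context(["Rust is a systems language focused on memory safety."])
        .await
        .unwrap();

    // Query by meaning rather than by keywords.
    let similar = table
        .select_nearest("which languages prevent memory bugs?", 1)
        .await
        .unwrap();
    println!("{similar:?}");
}
```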

RAG improvements

RAG (Retrieval-Augmented Generation) is a powerful tool for generating text with up-to-date or proprietary information. Retrieval-augmented generation generally follows these steps:

  1. Gather context from some local files, your database, or web data. In Kalosm, you can retrieve data from any source that implements IntoDocument or IntoDocuments. You can gather your sources from local documents, a search term, a specific web page, an RSS feed, or even a custom web crawler.
  2. Insert that context into a searchable database. Kalosm includes a vector database that can be used to store and retrieve similar documents.

Vector databases use an embedding model which generates a vector for a chunk of text (typically smaller than the entire document). The vector is then stored in the database. When you want to retrieve similar documents, you can embed a query and search for similar vectors in the database.

The vectors represent the meaning of the text, so you can search for similar documents based on the meaning of the text, not just the words used.

  3. Use the context to generate text. You can find text similar to the question (or to a search query generated by the LLM) and then generate a response based on that context.

In this release, we've made several improvements to RAG! (#126)

Improved Chunking Strategies

When you insert a document into a vector database, it needs to be split into smaller chunks before the text is embedded. The chunks you choose can have a significant impact on the quality of the results you get from the vector database. In this release, we've added two new chunking strategies to the vector database:

  • Hypothetical Questions

Instead of generating embeddings based on the content of the document, this chunking strategy generates embeddings based on hypothetical questions generated about the document. This can be extremely useful when building a chatbot that needs to find context that is relevant to a question.

For example, if you have a document about the history of the United States, you can generate hypothetical questions like "What is the capital of the United States?" and "Who was the first president of the United States?" and then generate embeddings based on those questions.

Then if you query the vector database with a question like "Who was the leader of the US?" you can find the document about the history of the United States.

Notice that the question "Who was the leader of the US?" doesn't contain many of the same words as the hypothetical questions, but it does convey a similar meaning, so the vector database can still find the relevant document.

  • Summaries

This chunking strategy generates embeddings based on the summary of the document. This can be useful when you have a large document and you want to find similar documents based on the main points of the document.

Embeddings generated from a summary can capture more information about the document as a whole than embeddings that only cover one small chunk of the document.
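As a sketch, picking one of the new strategies when building a document table might look like the following; the `Hypothetical` chunker type and the `with_chunker` builder method are assumptions based on the feature description:

```rust
// Embed LLM-generated hypothetical questions about each chunk instead of the
// chunk text itself. `llm` is a chat model and `db` a Surreal DB connection,
// as in the earlier sketch; names here are illustrative.
let table = db
    .document_table_builder("documents")
    .with_chunker(Hypothetical::new(&mut llm))
    .build()
    .await
    .unwrap();
```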

Incremental Indexing

In addition to the new chunking strategies, we've also added support for incremental indexing. This means you can add new documents to the vector database without having to recreate the entire database. This can be extremely useful when you have a large database or you have constantly updating context you want to provide to your LLM.

Kalosm's vector database is now backed by arroy, a space-efficient, incrementally indexed vector store developed by the Meilisearch team!

Performance Improvements

The llama implementation has been rewritten and optimized for better performance and modularity. The new implementation is now 7-25% faster than the previous version. In future releases, we plan to add support for fine-tuning models and training new heads for existing models. (#122)

Language models like Llama and Phi output probabilities for each token in the vocabulary. To generate text, you need to sample from that probability distribution, which can be slow, especially with large vocabularies. Kalosm 0.2 uses an optimization introduced in llm-samplers to sample from only the top 512 tokens. This optimization can make sampling up to 2x faster. (#123)
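The idea behind that optimization, as a standalone sketch (not the llm-samplers implementation itself): partition out the k most probable tokens in O(n) instead of sorting the whole vocabulary, then sample only among those:

```rust
/// Return the indices of the `k` highest logits without fully sorting the
/// vocabulary. `select_nth_unstable_by` partitions in O(n), which is the
/// source of the speedup over an O(n log n) sort of a 32k+ entry vocabulary.
fn top_k_indices(logits: &[f32], k: usize) -> Vec<usize> {
    let mut indices: Vec<usize> = (0..logits.len()).collect();
    let k = k.min(indices.len());
    if k == 0 {
        return Vec::new();
    }
    indices.select_nth_unstable_by(k - 1, |&a, &b| {
        // Descending order by logit; assumes the logits contain no NaNs.
        logits[b].partial_cmp(&logits[a]).unwrap()
    });
    indices.truncate(k);
    indices
}
```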

Large sections of text that are static within a constraint in structured generation are now loaded in a batch, which can significantly speed up the process.

For example, if you have the constraints:
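Something like the sketch below, where the title prefix is a fixed literal (`LiteralParser` plus an assumed `StopOn` free-text parser, as in the earlier constraint sketch):

```rust
use kalosm::language::*;

// The prefix is static: every valid output starts with exactly this text, so
// all of its tokens can be fed to the model in one batch. Only the generated
// title after it needs token-by-token sampling.
let constraints = LiteralParser::new("The title of the book is ")
    .then(StopOn::new("\n"));
```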

The text "The title of the book is " will be loaded in a batch instead of one token at a time. Batched loading has been restored in constrained generation. (#131)

New Models

Kalosm 0.2 adds support for several new models, including:

  • Dolphin Phi v2: A tiny chat model
  • Solar-11b Models: A set of models for chat, text, and code generation
  • Tiny Llama 1.0: A tiny set of models for chat and text generation

Full Changelog

For a detailed list of changes between v0.1.0 and v0.2.0, please see the full changelog.

I hope you enjoy using Kalosm v0.2.0! Your feedback is invaluable to us, so please don't hesitate to share your thoughts and report any issues you encounter.

What's next?

In the next release, we plan to add support for fine-tuning models and training new heads for existing models. We also plan to continue improving the performance of the language models and adding support for more models.

If any of those features sound interesting or you want to propose a new feature, consider contributing on GitHub.

If you are interested in building an application with Kalosm, join the Discord and get involved with the community!