SIARA is a monitoring system for AI agents. It evaluates and tracks agent performance over time by running standardized challenges.
The Origin Story
In 2025, my team and I relied heavily on AI coding tools and LLMs such as Gemini and Claude for many kinds of work. Over time, we noticed a strange pattern: at certain times of day, the model outputs became almost unusable.
From London, the pattern was pretty clear. The models worked well until around noon, but the quality dropped badly in the evening. I suspected it had something to do with developers in San Francisco coming online and some kind of throttling or prioritization kicking in.
That's how the idea for SIARA came up. I wanted something like uptime monitoring, but for AI agents.
The Challenge: Image Puzzle
The hard part was defining a good agent task. Almost any task can be turned into a script once you understand the problem, so it's tricky to measure true agent behavior. The task had to be not too easy, not too hard, and easy to generate without manual curation.
Inspired by image puzzles, I landed on a simple idea: Take any image and split it into N overlapping tiles. The agent's job is to guess the original image dimensions and reconstruct the full image.
This task is:
- Easy to generate
- Easy to scale by increasing N
- Easy to evaluate using an SSIM score
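To make the SSIM-based evaluation concrete, here is a minimal sketch of the SSIM formula applied once over the whole image (a single global window). It is only an illustration: standard SSIM averages the same formula over small local windows, and the source does not say which implementation SIARA uses. The function name `global_ssim` and the flat grayscale pixel lists are assumptions made for this example. In practice, a library routine such as `skimage.metrics.structural_similarity` would be the more usual choice.

```python
def global_ssim(x, y, dynamic_range=255.0):
    """Single-window SSIM between two equal-length lists of grayscale pixels.

    Simplified illustration: real SSIM is computed per local window and averaged.
    """
    if len(x) != len(y) or not x:
        raise ValueError("inputs must be non-empty and the same length")

    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n

    # Stabilizing constants from the usual SSIM definition (K1 = 0.01, K2 = 0.03).
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2

    numerator = (2 * mean_x * mean_y + c1) * (2 * cov_xy + c2)
    denominator = (mean_x ** 2 + mean_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


original = [0, 255, 0, 255, 128, 64]
inverted = [255 - p for p in original]
score_same = global_ssim(original, original)      # identical images score ~1.0
score_inverted = global_ssim(original, inverted)  # structure is reversed, so the score is negative
```

A score close to 1 means the submitted image matches the reference closely, which is what makes a single number usable as a pass/fail or trend metric.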
Task Details
The agent is given a set of overlapping image tiles in a directory. All information is preserved; nothing is missing.
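As a rough illustration of how overlapping tiles can cover an image without losing information, the sketch below computes crop boxes for a regular grid. The grid layout, the `margin` parameter, and the function name `tile_boxes` are assumptions for this example; the actual SIARA generator may split images differently. The boxes use the `(left, top, right, bottom)` convention, which is the same order Pillow's `Image.crop` expects.

```python
def tile_boxes(width, height, cols, rows, margin):
    """Return (left, top, right, bottom) boxes for a cols x rows grid.

    Each tile is grown by `margin` pixels on every side (clamped to the image),
    so neighbouring tiles overlap and every pixel is covered at least once.
    """
    if cols < 1 or rows < 1 or margin < 0:
        raise ValueError("cols and rows must be >= 1 and margin must be >= 0")

    def spans(length, parts):
        # Cut points of a plain non-overlapping split.
        edges = [i * length // parts for i in range(parts + 1)]
        # Grow each span by the margin, without leaving the image.
        return [
            (max(0, edges[i] - margin), min(length, edges[i + 1] + margin))
            for i in range(parts)
        ]

    xs = spans(width, cols)
    ys = spans(height, rows)
    # Row-major order: left to right, then top to bottom.
    return [(left, top, right, bottom) for (top, bottom) in ys for (left, right) in xs]


boxes = tile_boxes(width=100, height=60, cols=2, rows=1, margin=10)
```

With these example values the two tiles share a 20-pixel-wide strip, which is the kind of redundancy a solver can use to work out how the tiles fit together.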
Example input tiles:
| Tile 0 | Tile 1 | Tile 2 |
|---|---|---|
| (image) | (image) | (image) |
The Goal:
- Determine the original image dimensions
- Reconstruct the full original image from the tiles
- Submit one single image file within the time limit
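One simple way a solver might approach the reconstruction step is to look for the strip that two neighbouring tiles have in common. The sketch below does this for two tiles placed side by side, using nested lists of pixel values and exact matching. It is an illustrative assumption, not the method the SIARA agent is known to use; the helper names `horizontal_overlap` and `stitch_horizontal` are hypothetical.

```python
def horizontal_overlap(left, right):
    """Return the widest k where the last k columns of `left` equal the first k of `right`.

    Tiles are lists of rows, each row a list of pixel values. Returns 0 if no overlap is found.
    """
    if len(left) != len(right):
        raise ValueError("tiles must have the same height")
    max_k = min(len(left[0]), len(right[0]))
    for k in range(max_k, 0, -1):
        if all(l_row[-k:] == r_row[:k] for l_row, r_row in zip(left, right)):
            return k
    return 0


def stitch_horizontal(left, right):
    """Join two tiles side by side, keeping the shared columns only once."""
    k = horizontal_overlap(left, right)
    return [l_row + r_row[k:] for l_row, r_row in zip(left, right)]


# A 2 x 10 "image" with unique pixel values, cut into two tiles sharing columns 4 and 5.
image = [list(range(0, 10)), list(range(10, 20))]
left_tile = [row[:6] for row in image]
right_tile = [row[4:] for row in image]

overlap = horizontal_overlap(left_tile, right_tile)
restored = stitch_horizontal(left_tile, right_tile)
```

Once all tiles are placed this way, the original dimensions follow from the size of the stitched result. Exact matching on the widest overlap can be fooled by repetitive content such as flat backgrounds, so a more careful solver would likely score candidate overlaps instead of accepting the first match.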
Expected output (reconstructed image): (image)
See challenges/image_puzzle/ for the full example:
- task/tiles/ - Input tiles
- task/README.md - Challenge instructions
- solution/image.png - Expected solution
The Solver Agent
The solver agent is a standard LangChain-based agent equipped with tools that mimic a human developer's capabilities:
- list_directory - List files and subdirectories
- read_text_file - Read UTF-8 text files
- read_binary_file - Read binary files as base64
- read_image_file - Read and analyze image files (PNG, JPG, GIF, WebP)
- write_file - Create or overwrite files
- execute_shell - Run shell commands
- execute_python - Run Python scripts
- submit_solution - Submit a solution file to the verification API
The agent operates in a loop: observing the environment, thinking about the next step, executing tools (like writing Python scripts to stitch images), and observing the output until it decides to submit a solution.
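The loop described above can be sketched in plain Python. This is a simplified illustration, not the LangChain API and not SIARA's actual code: `run_agent`, `scripted_policy`, and `fake_tools` are hypothetical stand-ins, the scripted policy replaces the LLM's reasoning, and the file names are made up. Only the tool names `list_directory` and `submit_solution` come from the list above.

```python
def run_agent(policy, tools, max_steps=10):
    """Run a think -> act -> observe loop until a solution is submitted or steps run out."""
    history = []  # (tool_name, args, observation) tuples the policy can inspect
    for _ in range(max_steps):
        tool_name, args = policy(history)        # think: choose the next action
        observation = tools[tool_name](**args)   # act: execute the chosen tool
        history.append((tool_name, args, observation))  # observe: record the result
        if tool_name == "submit_solution":
            break
    return history


def scripted_policy(history):
    """Hypothetical stand-in for the LLM: follows a fixed two-step plan."""
    plan = [
        ("list_directory", {"path": "task/tiles"}),
        ("submit_solution", {"path": "solution.png"}),
    ]
    return plan[len(history)]


# Hypothetical tool stubs; real tools would touch the file system and the verification API.
fake_tools = {
    "list_directory": lambda path: ["tile_0.png", "tile_1.png"],
    "submit_solution": lambda path: f"submitted {path}",
}

history = run_agent(scripted_policy, fake_tools)
```

The `max_steps` cap is a design choice for the sketch: it keeps a confused agent from looping forever, which matters when each challenge has a time limit.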
Setup
Install dependencies from requirements.txt:
pip install -r requirements.txt
Looking Ahead
SIARA is still early, but I believe that by 2026, agents will need real monitoring tools instead of just trusting provider-reported results.
If you have any feedback, feel free to reach out.
— Agajan
https://github.com/torayeff



