How I Built the Database of my Dreams
And how you can use it to build AI apps 100x faster
I really love to code. That much I’m sure you can tell. So I have this very common problem that I’m sure many of you share: a gazillion ideas in my head wanting to burst out, and very little time to try them all. To get around this, I like to prototype really fast and dirty implementations of anything that makes me curious, just to see if it leads somewhere. (This is why I have hundreds of unfinished projects on my GitHub.)
So, where was I? Yeah, this is one of the reasons I’ve made Python, and specifically streamlit, a key part of my toolkit. It lets me build the simplest possible semblance of an app extremely fast, without worrying about anything that is unnecessary for a first prototype, like authentication, user roles, or the user interface.
Now, here is the thing. Recently, I’ve been building a lot of AI stuff. Chatbots galore for all sorts of tasks, from general question answering to specific things like storytelling, evolutionary computation, automated coding, you name it. And I keep hitting a specific pain point that brings me to a grinding halt at the most exciting part of prototyping: the database.
Very soon, my simple AI prototype needs a vector database for RAG, a key-value store for simple configuration, a message queue for background tasks, and a persistent list to store conversation history. And this is before anything remotely close to a production-ready feature is on the horizon. This is just the second thing you’ll need right after streaming the first few tokens from OpenAI. It’s inescapable for anything but the simplest toys.
So, before I know it, I’m wrestling with two or three containers—or worse, juggling credentials for three different cloud services—and writing boilerplate code just to glue it all together. I’ve become an accidental, and very grumpy, DevOps engineer. My worst nightmare!
What I want is something in the spirit of streamlit but for managing data. A minimalistic, no-bullshit database that just supports the basic modalities we all need in modern applications, like, dunno, JSON storage, maybe? And vector search? Yeah, that, but also combined with full-text and fuzzy search, persistent lists and queues… hell, even a decent pub/sub system if we’re asking! Is it too much to ask for someone to make SQLite but for modern data?
Well… I guess if you want something done, you might as well do it yourself.
So this is the story of how I built BeaverDB—a Pythonic, multi-modal interface on top of SQLite that just works out of the box for rapid prototyping. And it scales to medium-sized projects just fine (SQLite is damn fast!). BeaverDB is my attempt to build the tool I wish I had—a library that provides the high-level data structures modern AI applications need, without sacrificing the radical simplicity of a single .db file.
Introducing Beaver
The guiding principle behind Beaver’s API is a minimal surface area with a fluent interface. You only ever need to use two classes. Everything else flows naturally from the main database instance, returning dedicated wrapper objects with a rich, Pythonic interface.
One key idea behind this design is that it should just work. Zero configuration, just sensible out-of-the-box decisions. So: no schema, everything indexed by default, no need to declare tables or entities beforehand; collections just get created when first used, you know, that kind of thing. It should be as easy as instantiating a class and calling two methods.
Let’s take a quick tour of what you can do with it today.
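Every snippet below assumes a single database handle created up front. The constructor call here is just the obvious one given the single-file design; check the repository README for the exact signature.
from beaver import BeaverDB
# One file, zero configuration. The path argument is an assumption based on
# the single .db file design; see the README for the exact constructor.
db = BeaverDB("prototype.db")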
Key-Value & Caching
Need to store configuration, user profiles, or cache expensive API calls? The namespaced dictionary is your go-to. It behaves just like a Python dict but is backed by the database, with optional TTL support.
# Use a dictionary for caching API calls for 1 hour
api_cache = db.dict("api_cache")
api_cache.set("weather", {"temp": "15C"}, ttl_seconds=3600)
print(f"Cached weather: {api_cache.get('weather')}")
Persistent Lists
This is the most straightforward way to manage ordered sequences. For a chatbot, it’s the perfect way to maintain the turn-by-turn history of a conversation. It works like a Redis list, with all the bells and whistles, implementing the full Pythonic list interface, but backed by the database. It supports appending and removing from the head and tail, as well as inserting or removing anywhere in between, and it’s blazingly fast because all operations are indexed.
# Manage a conversation with a user
chat_history = db.list("conversation_with_user_123")
chat_history.push({"role": "user", "content": "Hello, Beaver!"})
chat_history.push({"role": "assistant", "content": "Hello! How can I help?"})
print(f"Conversation length: {len(chat_history)}")
Priority Queues
A priority queue is the essential tool for orchestrating an autonomous agent. It ensures the agent always works on the most critical task first, regardless of when it was added. The API is deliberately minimal; for anything more complicated, use a full-featured list.
# An AI agent's task list
tasks = db.queue("agent_tasks")
tasks.put({"action": "summarize_news"}, priority=10)
tasks.put({"action": "respond_to_user"}, priority=1) # Higher priority
# Agent always gets the most important task first
next_task = tasks.get() # -> Returns the "respond_to_user" task
Real-time Pub/Sub
Need to build a decoupled, event-driven system? The pub/sub channel allows different parts of your application—or even different processes—to communicate in real time. It has an extremely simple, fluent API, yet it’s performant, thread-safe, and even works across processes. Plus, it comes with an optional async interface if you’re feeling fancy.
# In one process, a monitor publishes an event
system_events = db.channel("system_events")
system_events.publish({"event": "user_login", "user": "alice"})
# In another, a logger subscribes and receives the message
with db.channel("system_events").subscribe() as listener:
    for message in listener.listen(timeout=1):
        print(f"Event received: {message}")
Collections & Hybrid Search
This is the core component for any Retrieval-Augmented Generation task. It’s a multi-modal collection of structured documents that understands both vectors and text, allowing you to combine search strategies for the best results. It also performs fuzzy search on demand, with a very clever indexing strategy that I’ll tell you all about in the next section.
from beaver import BeaverDB, Document
from beaver.collections import rerank
docs = db.collection("articles")
doc = Document(embedding=[...], content="Python is fast.")
docs.index(doc, fuzzy=True)
# Combine vector and full-text/fuzzy search for better results
vector_results = docs.search(vector=[0.15, 0.85, 0.15])
text_results = docs.match(query="pthon", fuzziness=1)
best_context = rerank(vector_results, text_results)
And there’s a lot more. You can connect documents with relations and build a knowledge graph that you can later query to find similar documents or to implement a graph-based recommender system.
All in one freaking database file. No docker. No servers. No headache.
A Peek Under the Hood
Beaver is built on a series of pragmatic design decisions intended to leverage SQLite’s native capabilities to the fullest, avoiding slow application-layer logic wherever possible.
For one, it never creates new tables when storing stuff. Everything is stored and indexed in cleverly designed global tables that are created once, the very first time the DB file is created. This also means you get virtually infinite lists, dicts, queues, collections, etc., because these aren’t separate tables (which would be a pain in the arsenal to maintain). And… (drum roll)… no migrations!
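Just to give a flavor of the idea, here is what such a namespaced global table could look like. This is purely illustrative and not Beaver’s actual schema: all lists, for instance, can share one table where the list’s name acts as the namespace, so a new list is just rows under a name nobody has used yet.
import sqlite3

conn = sqlite3.connect("example.db")
# One global table serves every list in the file; the name column is the
# namespace, so a "new" list needs no DDL and, therefore, no migration.
# (Hypothetical schema, for illustration only.)
conn.executescript("""
CREATE TABLE IF NOT EXISTS list_items (
    list_name TEXT NOT NULL,
    position  REAL NOT NULL,
    payload   TEXT NOT NULL
);
CREATE INDEX IF NOT EXISTS idx_list_items ON list_items (list_name, position);
""")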
I want to highlight two specific features and tell you a bit about their underlying implementation, so you can see the lengths Beaver goes to in order to be efficient out of the box.
The Pub/Sub System
The pub/sub system is the greatest example of efficiency by design. It’s built on a single, append-only log table with an index for the channel name.
For each channel, a single background thread polls this table for new entries and fans them out to any number of in-memory queues, one for each subscriber. The key insight here is that because the database is only ever touched by one polling thread per channel, adding a second, third, or hundredth subscriber adds zero additional load to the database. From the database’s perspective, new listeners are basically free.
Even better, this polling is only activated when there is at least one listener, and stops immediately after the last listener disconnects. This means we only ever use the minimum resources necessary.
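To make the pattern concrete, here is a stripped-down sketch of that design in plain Python with sqlite3. It is illustrative only, not Beaver’s actual code, and the log table schema is made up for the example: one polling thread reads new rows from an append-only table and fans them out to in-memory queues, one per subscriber.
import json
import queue
import sqlite3
import threading
import time

DB_PATH = "events.db"

# One global append-only log table, indexed by channel (illustrative schema).
setup = sqlite3.connect(DB_PATH)
setup.execute(
    "CREATE TABLE IF NOT EXISTS log (id INTEGER PRIMARY KEY, channel TEXT, payload TEXT)")
setup.execute("CREATE INDEX IF NOT EXISTS idx_log_channel ON log (channel, id)")
setup.commit()

subscribers: list[queue.Queue] = []  # one in-memory queue per listener
stop = threading.Event()

def poll(channel: str) -> None:
    # A single polling thread owns all database reads; adding more
    # subscribers only adds in-memory queue puts, never more SQL.
    conn = sqlite3.connect(DB_PATH)
    last_seen = 0
    while not stop.is_set():
        rows = conn.execute(
            "SELECT id, payload FROM log WHERE channel = ? AND id > ? ORDER BY id",
            (channel, last_seen)).fetchall()
        for row_id, payload in rows:
            last_seen = row_id
            for q in subscribers:  # fan-out is pure in-memory work
                q.put(json.loads(payload))
        time.sleep(0.05)  # short polling interval

threading.Thread(target=poll, args=("system_events",), daemon=True).start()

# A publisher just appends to the log; it never knows who is listening.
setup.execute("INSERT INTO log (channel, payload) VALUES (?, ?)",
              ("system_events", json.dumps({"event": "user_login"})))
setup.commit()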
Hybrid Text Search: The Two-Stage Filter
The other feature I think is kinda beautiful is the fuzzy search. Calculating the Levenshtein distance between two strings is computationally expensive, and there is no way to build an index for it beforehand without a combinatorial explosion in storage size. So running it against every document in a collection would be unusable for anything beyond a few thousand entries.
Beaver solves this with a two-stage process. First, it uses a pre-computed trigram index in SQLite to instantly find a very small set of candidate documents that share a high percentage of 3-character “chunks” with the query. This is a very fast SQL query that already gives a very good approximation of the final result. Then, the expensive Levenshtein algorithm runs only on this small, pre-filtered set of candidates, in memory.
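In plain Python, the technique looks roughly like this. It’s a simplified sketch, not the library’s internals, and the trigram_index table is a hypothetical schema used only for illustration:
import sqlite3

def trigrams(text: str) -> set[str]:
    # Break a lowercase string into overlapping 3-character chunks.
    text = text.lower()
    return {text[i:i + 3] for i in range(len(text) - 2)}

def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def fuzzy_terms(conn: sqlite3.Connection, query: str, fuzziness: int = 1, limit: int = 50):
    # Stage 1: a cheap SQL query over a (term, trigram) table pulls the few
    # terms that share the most 3-character chunks with the query.
    grams = list(trigrams(query))
    if not grams:  # queries shorter than 3 characters have no trigrams
        return []
    placeholders = ",".join("?" for _ in grams)
    candidates = conn.execute(
        f"SELECT term, COUNT(*) AS hits FROM trigram_index "
        f"WHERE trigram IN ({placeholders}) "
        f"GROUP BY term ORDER BY hits DESC LIMIT ?",
        (*grams, limit)).fetchall()
    # Stage 2: the expensive edit distance runs only on those few survivors.
    return [term for term, _ in candidates
            if levenshtein(query.lower(), term.lower()) <= fuzziness]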
The Vector Store Dilemma
Vector storage is my biggest concern at the moment. The current implementation is, I think, pretty good for a typical use case of infrequent indexing and fast retrieval. But it can be much improved.
Right now, Beaver uses an in-memory k-d tree, which provides excellent search speed once the index is loaded into RAM. However, the index is ephemeral—it lives and dies with the application process. This creates a significant bottleneck: every time your application starts, the entire index must be rebuilt from scratch by reading all vectors from the database. Furthermore, indexing a new document requires a full, blocking rebuild of the entire tree.
As long as you index documents infrequently and in a background process—which is what most RAG-based applications do—this works just fine. But I’m not satisfied with this implementation, so here is my plan for improving it.
The roadmap goes through integrating a state-of-the-art ANN library like faiss. This will enable a persistent, on-disk index that can be loaded instantly. But more importantly, we need to support incremental additions, so that newly added documents don’t force a rebuild of the entire index. The key to achieving this is a large, persistent index plus a small, in-memory, temporary index of new additions that gets folded into the persistent one from time to time. Keeping the two in sync is far from trivial, but I already have a somewhat detailed plan.
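To make that plan concrete, here is a rough sketch of the two-index idea using faiss (assuming faiss-cpu and numpy are installed). The class, names, and merge threshold are mine, not a committed design: a big index persisted on disk, a small in-memory index for fresh additions, searches hit both, and the small one is periodically folded into the big one.
import os

import faiss
import numpy as np

DIM = 384  # example embedding size

class HybridVectorIndex:
    # Persistent main index + small in-memory delta index (illustrative sketch).

    def __init__(self, path: str, merge_every: int = 1000):
        self.path = path
        self.merge_every = merge_every
        self.main = faiss.read_index(path) if os.path.exists(path) else faiss.IndexFlatL2(DIM)
        self.delta = faiss.IndexFlatL2(DIM)  # fresh additions live here

    def add(self, vectors: np.ndarray) -> None:
        # vectors: float32 array of shape (n, DIM).
        # New documents never trigger a rebuild of the big index.
        self.delta.add(vectors.astype("float32"))
        if self.delta.ntotal >= self.merge_every:
            self._merge()

    def search(self, query: np.ndarray, k: int = 5):
        # Query both indexes and merge the results by distance.
        q = query.astype("float32").reshape(1, -1)
        d_main, i_main = self.main.search(q, k)
        d_delta, i_delta = self.delta.search(q, k)
        hits = [(d, ("main", i)) for d, i in zip(d_main[0], i_main[0]) if i != -1]
        hits += [(d, ("delta", i)) for d, i in zip(d_delta[0], i_delta[0]) if i != -1]
        return sorted(hits)[:k]

    def _merge(self) -> None:
        # Fold the delta into the persistent index and flush it to disk.
        self.main.add(self.delta.reconstruct_n(0, self.delta.ntotal))
        self.delta = faiss.IndexFlatL2(DIM)
        faiss.write_index(self.main, self.path)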
Conclusion
Beaver is the result of a couple of weekends of focused frustration, and it’s already the backbone of my own AI prototypes. Even if it never reaches any level of maturity, I’m extremely satisfied with it because it already works for my use cases, and I’ve learned so much building it. Damn, who knew building databases could be so fun!
Now, what’s in it for you? Well, Beaver is not meant to replace your production PostgreSQL cluster. That’s unthinkable. But it might be the “good enough” database that lets you go from an idea to a working prototype in minutes, not hours. So, stop being an accidental DevOps engineer and go build something cool.
You can get started with a simple pip install beaver-db. Check out the GitHub repository for a lot of examples, and leave me any questions or comments either here or in the repository issues.
Have fun!