/POST · ALFIE MILLS

Building forever-llm: three takes on a model that never stops

~4 MIN READ · 20 May 2026

I had a simple, slightly weird idea: what if a language model just kept generating, forever? Not a chatbot waiting for your turn, but an unbroken stream of thought that a few invited people could watch live and occasionally nudge. I built it three times over two days, and each rewrite taught me something about the gap between a fun prototype and something you can actually share.

v1: one file, no dependencies

The first version (forever-llm) is a single stream-server.js, plain Node, no npm install, served alongside a static HTML page. It talks to a local Ollama instance and runs a perpetual loop: ask Ollama to generate, stream the tokens to every connected browser over Server-Sent Events, and the moment generation finishes, immediately start again.
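The fan-out half of that loop can be sketched roughly as follows. This is a minimal illustration, not the actual stream-server.js: `SseClient`, `formatSseEvent`, and `broadcast` are hypothetical names, and a real server would pass Node response objects in as clients. The only fixed part is the Server-Sent Events wire format, where each payload line is prefixed with `data: ` and a blank line ends the event.

```typescript
// Hypothetical sketch: names are illustrative, not taken from stream-server.js.

// Anything with a write() method can act as a client, e.g. an HTTP response.
interface SseClient {
  write(chunk: string): unknown
}

// Encode a payload as one SSE event. Multi-line payloads need one "data:"
// field per line; the trailing blank line terminates the event.
function formatSseEvent(payload: string): string {
  const lines = payload.split(/\r?\n/).map((line) => `data: ${line}`)
  return lines.join("\n") + "\n\n"
}

// Send one token to every connected client and drop clients whose write
// throws. Returns how many clients are still connected afterwards.
function broadcast(clients: Set<SseClient>, token: string): number {
  const frame = formatSseEvent(token)
  for (const client of Array.from(clients)) {
    try {
      client.write(frame)
    } catch {
      clients.delete(client)
    }
  }
  return clients.size
}
```

The generation loop would call `broadcast` once per token and, when a generation finishes, start the next request straight away.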

The interesting bits were all about keeping an infinite generation coherent and interruptible:

Injections. When you type something, it gets queued as [USER: ...] and woven into the stream. The system prompt frames these as "a thought arriving from outside" so the model absorbs them instead of breaking into chat mode. A pending injection aborts the in-flight request so it lands promptly.
Context trimming. Context can't grow forever, so once the accumulated text passes ~8000 chars the server keeps only the trailing slice and replaces everything before it with a marker.
Stop-token scrubbing. Models love to emit </s>, <|eot_id|>, User: and similar. I strip those so the monologue never visibly "ends."
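Those three helpers are small enough to sketch. This is a minimal version under stated assumptions: the function names, the marker text, and the exact token list are made up for illustration, and only the ~8000-char budget and the [USER: ...] framing come from the description above.

```typescript
// Hypothetical sketch: names, marker text, and token list are assumptions.

const MAX_CONTEXT_CHARS = 8000
const TRIM_MARKER = "[...earlier text trimmed...]\n"

// Keep only the trailing slice once the text exceeds the budget, and put a
// marker where the dropped text used to be.
function trimContext(text: string, maxChars: number = MAX_CONTEXT_CHARS): string {
  if (text.length <= maxChars) return text
  return TRIM_MARKER + text.slice(text.length - maxChars)
}

// Literal strings a model tends to emit when it thinks the turn is over.
const STOP_TOKENS = ["</s>", "<|eot_id|>", "User:"]

// Remove every occurrence of each stop token from a piece of output.
function scrubStopTokens(text: string): string {
  let out = text
  for (const token of STOP_TOKENS) {
    out = out.split(token).join("")
  }
  return out
}

// Wrap a viewer's message so the system prompt can frame it as an outside thought.
function formatInjection(message: string): string {
  return `[USER: ${message.trim()}]`
}
```

One caveat: a streaming version would also need to hold back a few characters before scrubbing, since a token such as `<|eot_id|>` can arrive split across two chunks. The sketch ignores that.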

It worked, and it was genuinely fun to watch. But everything lived in module-scope variables, auth didn't exist, and it was hardwired to Ollama on my own machine. Not something I could send to friends.

v1.5: the Nuxt rewrite

The second version (forever_llm) is the same concept rebuilt properly as a Nuxt 3 app on the Nitro Bun preset, with bun:sqlite for persistence. This is where it became shareable:

Magic-link invites. On first boot with an empty DB it prints a bootstrap admin link to stdout. From /admin you mint further invite links. Tokens are SHA-256 hashed in the DB, sessions are httpOnly cookies.
Two providers behind one interface. I added OpenRouter alongside Ollama, hidden behind a single streamCompletion/chatCompletion API. The provider, model, and temperature are all switchable per session from a setup screen, so I could run a free hosted model for friends or a local one for myself.
Backoff that respects the provider. Free tiers rate-limit you. The loop reads Retry-After (header and the JSON error body OpenRouter uses) and otherwise falls back to capped exponential backoff with jitter.
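The delay calculation can be sketched like this. It is a minimal version, not the code from the repo: the function names and the default base and cap are assumptions, and "full jitter" (a random delay between zero and the capped exponential ceiling) is just one common variant. It handles only the Retry-After header, which HTTP defines as either a number of seconds or an HTTP date; the JSON error body is left out because its shape isn't shown here.

```typescript
// Hypothetical sketch: function names and default constants are assumptions.

// Parse a Retry-After header value (seconds or an HTTP date).
// Returns a delay in milliseconds, or null if the value is missing or unusable.
function parseRetryAfter(value: string | null, nowMs: number = Date.now()): number | null {
  if (value === null) return null
  const trimmed = value.trim()
  if (/^\d+$/.test(trimmed)) return Number(trimmed) * 1000
  const dateMs = Date.parse(trimmed)
  if (Number.isNaN(dateMs)) return null
  return Math.max(0, dateMs - nowMs)
}

// Capped exponential backoff with full jitter: a random delay between
// 0 and min(capMs, baseMs * 2^attempt).
function backoffDelay(
  attempt: number,
  baseMs: number = 1000,
  capMs: number = 60000,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * Math.pow(2, attempt))
  return Math.floor(random() * ceiling)
}

// Prefer the provider's hint; fall back to backoff when there is none.
function nextDelay(
  retryAfter: string | null,
  attempt: number,
  random: () => number = Math.random,
): number {
  const hinted = parseRetryAfter(retryAfter)
  return hinted !== null ? hinted : backoffDelay(attempt, 1000, 60000, random)
}
```

Passing `random` in as a parameter keeps the jitter testable: with a fixed `() => 0.5`, attempt 3 always yields 4000 ms.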

The hardest part was honestly the prompt, not the code. Getting a model to produce flowing, turn-less prose (and to never say "in conclusion") took far more iteration on the system prompt than on the loop.

v2: continuous, but a real chat

The third version (forever-llmv2) reframed the whole thing. Instead of one global stream, it's a multi-conversation chat (sidebar, multiple threads, the lot) where each conversation has a Continuous toggle. Flip it on and the assistant doesn't stop: when it would normally finish a turn, it starts another. You can send a message mid-generation and it joins naturally.

The architectural shift was moving from a single global loop to per-conversation state in a Map, each with its own abort controller and loop promise:

async function runLoop(s: ConvState) {
  s.running = true
  try {
    do {
      if (global.killswitch) break
      const messages = loadHistory(s.conversationId)
      // ...stream tokens, persisting a "partial" assistant message as it goes
    } while (s.continuous && !global.killswitch && s.running)
  } finally {
    // Clear the flag even if streaming throws, so the loop can be restarted.
    s.running = false
  }
}

Persisting each assistant message as partial: 1 while it streams, then flipping it to 0 on completion, meant a refresh mid-generation didn't lose anything. SSE events fan out two ways: to viewers of a specific conversation, and globally so the sidebar can show which threads are live.
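The partial-message lifecycle can be illustrated with an in-memory stand-in for the SQLite table. This is a sketch under assumptions: `MessageStore`, its method names, and the row shape are hypothetical, and only the flip of partial from 1 to 0 comes from the description above.

```typescript
// Hypothetical sketch: an in-memory stand-in for the messages table.
// Row shape and method names are assumptions, not the real schema.

interface MessageRow {
  id: number
  conversationId: string
  role: "user" | "assistant"
  content: string
  partial: 0 | 1
}

class MessageStore {
  private rows: MessageRow[] = []
  private nextId = 1

  // Insert an empty assistant row marked partial before the first token arrives.
  beginAssistantMessage(conversationId: string): number {
    const id = this.nextId++
    this.rows.push({ id, conversationId, role: "assistant", content: "", partial: 1 })
    return id
  }

  // Append streamed text to the row so a page refresh sees everything so far.
  appendToken(id: number, token: string): void {
    this.getRow(id).content += token
  }

  // Flip partial to 0 once the generation finishes.
  complete(id: number): void {
    this.getRow(id).partial = 0
  }

  // What a freshly loaded page would read back, including any partial row.
  loadHistory(conversationId: string): MessageRow[] {
    return this.rows
      .filter((r) => r.conversationId === conversationId)
      .map((r) => ({ ...r }))
  }

  private getRow(id: number): MessageRow {
    const row = this.rows.find((r) => r.id === id)
    if (!row) throw new Error(`no message with id ${id}`)
    return row
  }
}
```

A refresh between `appendToken` calls reads back a row with partial set to 1 and whatever text has streamed so far, which is what lets the page pick up mid-generation.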

What I'd keep and what I'd change

The single best decision was the provider abstraction in v1.5: it let me develop against local Ollama and demo on a hosted free model without touching the loop. The killswitch and the per-token speed throttle (server-side, so it's enforced for everyone) also earned their place.

The honest limitation, written right into the README, is that it's single-replica only: SQLite plus in-memory stream state don't shard. That's fine for a handful of invited people, which is exactly the audience, but it's the first thing I'd have to tear out to make this real.

If I did a v3, I'd lift the loop state out of module scope into something durable so a restart doesn't kill an in-flight stream, and I'd unify the v1.5 "stream of consciousness" and the v2 "continuous chat" into one mode you toggle rather than two separate codebases. Three rewrites in, the idea is still the fun part; the engineering was mostly about making "never stop" behave.