Very cool idea. I had something vaguely similar in mind. It's nice to see someone go ahead and implement it. All the Claude Code animations, and not knowing what's happening, how long it will take, or what will come out, really frustrate me. On top of that there is no way to actually limit the scope of things. Opencode's Plan mode and Build mode help a bit.
If a state machine can improve a local LLM to produce better results, it's a welcome addition for tinkerers and solo devs.
I feel you on the Claude pulsing thing. Running (or trying to run) Opencode with any model I could throw at it to perform useful work like the frontier/proprietary models do (a tall order, I know) is where I started. Everyone makes the problem bigger (massive contexts, massive parameter counts, more experts (MoE)), but I started with: how can we make small/local LLMs perform better (do more with less)?
Opencode's plan/build is decent, like it is in Claude Code... state machines are the next evolution: model agnostic, tooling agnostic (where feasible).
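To make the idea concrete, here's a minimal sketch of what a harness-enforced state machine could look like (my own illustration, not anything the parent has built; the phase names are made up):

```python
from enum import Enum

class Phase(Enum):
    PLAN = "plan"
    BUILD = "build"
    REVIEW = "review"
    DONE = "done"

# The harness, not the model, owns this transition table.
TRANSITIONS = {
    Phase.PLAN: {Phase.BUILD},
    Phase.BUILD: {Phase.REVIEW},
    Phase.REVIEW: {Phase.BUILD, Phase.DONE},  # review can bounce work back
    Phase.DONE: set(),
}

def request_transition(current: Phase, requested: Phase) -> Phase:
    """Reject any transition the table doesn't allow, whatever the model asks for."""
    if requested not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.value} -> {requested.value}")
    return requested
```

Because the table lives outside the model, it works the same way for any model or tool harness.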
Yes, but this solves just part of the problem: it stops the agent from doing something. What would be more useful is forcing the agent to do something. To make up an example, let's say you want the agent to change a status in Jira after it completes a task. With this framework you can deny the transition until the model changes the status in Jira, but that doesn't mean the agent will actually do it.
If the agent generates structured JSON checked against a schema, then (to the limit of the agent's ability to generate correct JSON) the trick here is to have the transition request include a non-optional jira-update field. The agent can be malicious and give blank or useless Jira updates, but if, for example, the transition to the next state requires a jira-state field selected from an enum where the only options are valid next states, not necessarily every state (so include "fixed" and "not applicable" but not "open" or "new", or whatever the business logic wants), then it restricts the agent's ability to fail to make forward progress.
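A rough sketch of that schema using the jsonschema library (the field names and enum values are illustrative, not from any real workflow):

```python
from jsonschema import ValidationError, validate

# Hypothetical transition-request schema. Note jira_state only lists valid
# *next* states, and jira_update is required and non-empty.
TRANSITION_SCHEMA = {
    "type": "object",
    "properties": {
        "next_phase": {"enum": ["review", "done"]},
        "jira_state": {"enum": ["fixed", "not applicable"]},
        "jira_update": {"type": "string", "minLength": 1},
    },
    "required": ["next_phase", "jira_state", "jira_update"],
    "additionalProperties": False,
}

def accept_transition(request: dict) -> bool:
    """Deny the transition unless the payload carries a usable Jira update."""
    try:
        validate(instance=request, schema=TRANSITION_SCHEMA)
        return True
    except ValidationError:
        return False
```

The harness still has to make the Jira call itself from the validated payload; the schema only guarantees the agent handed over the information needed to do it.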
I was looking into this just yesterday. The Loki + … comparison is a bit off in the open-source space. The main ones here are SigNoz and ClickStack, both using ClickHouse as the database. They're heavy compared to something like Loki, but they are OTel-native observability platforms rather than log monitoring, so they're not in the same category.
I used SigNoz + ClickStack on a vibe-coded Go server project a few weeks ago. I just made Codex figure out how to set up SigNoz + dependencies via Docker Compose. I even got it to pre-populate SigNoz with dashboards. It wasn't too bad. The whole thing runs in a few GB. I tried to cover metrics, tracing, and logging at the same time. This is not a production-ready setup, but you need to trade off cost vs. utility here. If it's useful enough, that could justify the extra cost.
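For anyone curious about the instrumentation side, the app-facing part is just the standard OpenTelemetry SDK pointed at the collector (Python shown for brevity; the endpoint assumes the stack's OTLP gRPC listener is on the default port 4317):

```python
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Assumption: the docker-compose stack exposes an OTLP gRPC endpoint locally.
provider = TracerProvider(resource=Resource.create({"service.name": "demo-server"}))
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("handle-request"):
    pass  # handler work here; spans are batched and exported automatically
```

Metrics and logs follow the same provider/exporter pattern, which is why covering all three at once isn't much extra work.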
I have a background doing a lot of related work on the Elastic stack, including setting up a big Elastic Fleet-based deployment for one client at some point. It might not be the cheapest, but it does provide awesome filtering and querying capabilities. However, a lot of teams that use it don't really know how to tap into that capability, so it tends to be overengineered for what it actually does in the end. And that extra, underutilized complexity is why a lot of teams are wary of dealing with the stack.
Storing the data is the easy part, but what's the point if you can't run queries against it and produce dashboards and diagnostic tools that actually help you? Prometheus/Grafana or older Graphite-type setups tend to be compromises where you get lots of data but are then limited on the querying front or in the number of metrics. The tradeoff is always between scale and querying flexibility. If you store tens or hundreds of GB of telemetry per day, you need a way to make sense of it. ClickHouse seems to be quite good at scaling and querying. It's basically a column-oriented database. I don't have direct experience with Loki.
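To illustrate why column orientation matters here: an aggregate over a huge telemetry table only reads the referenced columns off disk. A hypothetical query (table and column names invented) via the clickhouse-connect client:

```python
import clickhouse_connect  # pip install clickhouse-connect

client = clickhouse_connect.get_client(host="localhost")

# Hypothetical spans table; only three columns get scanned, however many
# attributes each span row carries.
result = client.query(
    """
    SELECT service, quantile(0.95)(duration_ms) AS p95_ms
    FROM spans
    WHERE timestamp >= now() - INTERVAL 1 DAY
    GROUP BY service
    ORDER BY p95_ms DESC
    LIMIT 10
    """
)
print(result.result_rows)
```

A row-oriented store would have to touch every column of every span to answer the same question, which is where the scale/flexibility tradeoff usually bites.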
But in the end, all that power only matters if people actually use it. And, again, in my experience teams tend not to. They tend to have a lot of unrealized aspirations around their tools and infrastructure. If it's just a dumping ground for data + a few simplistic dashboards, optimize for that. A lot of that data is actually only kept for compliance/auditing reasons. For that, querying is usually a secondary concern and it's OK if queries take a bit longer and are less powerful.
You're absolutely on point with this. I've made the perf tracking opinionated, so it comes preconfigured with SLOs that are good for most projects, the ones where nobody would otherwise bother to set them up.
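As a sketch of what "preconfigured" means here (these numbers are my own illustration, not Traceway's actual defaults):

```python
# Illustrative defaults only, not Traceway's real values.
DEFAULT_SLOS = {
    "http.request": {"p95_ms": 300, "error_rate": 0.01},
    "db.query": {"p95_ms": 100, "error_rate": 0.001},
}

def slo_breaches(op: str, p95_ms: float, error_rate: float) -> list[str]:
    """Report which default SLOs the observed numbers violate."""
    slo = DEFAULT_SLOS[op]
    breaches = []
    if p95_ms > slo["p95_ms"]:
        breaches.append(f"{op}: p95 {p95_ms}ms exceeds {slo['p95_ms']}ms")
    if error_rate > slo["error_rate"]:
        breaches.append(f"{op}: error rate {error_rate:.2%} exceeds {slo['error_rate']:.2%}")
    return breaches
```

The point of shipping defaults like these is that a project gets useful perf alerts on day one and can tune the thresholds later.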
Traceway has custom dashboards, fully supports OTel logs/traces/metrics/exceptions, has session replays for web and Flutter (working on iOS/Android now), has alerting integrations with Slack/email/GitHub, OAuth login with Google/GitHub, and a bunch of other features... All MIT. None behind a paywall.
It has a specific set of trade-offs, and those are by design, but I am also always open to changing them and improving it. If you try it and have any thoughts, the GitHub issues are constantly monitored.
In reality it's a very modular system: the telemetry repositories can be swapped out easily. I have implemented a ClickHouse and a SQLite version (to simplify self-hosting), so adding a Loki-like repository would be a breeze. It's not on the roadmap currently, as I am putting a lot of effort into three different parts right now.
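For readers wondering what swappable telemetry repositories look like in practice, here's the general shape of the pattern (my illustration, not Traceway's actual interfaces):

```python
import json
import sqlite3
from typing import Protocol

class SpanRepository(Protocol):
    """Storage backend for spans; a Loki-like backend would be one more impl."""
    def write(self, span: dict) -> None: ...
    def slowest(self, limit: int) -> list[dict]: ...

class SqliteSpanRepository:
    """Single-file backend: nothing to deploy, ideal for simple self-hosting."""
    def __init__(self, path: str = "spans.db") -> None:
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS spans (duration_ms REAL, body TEXT)"
        )

    def write(self, span: dict) -> None:
        self.db.execute(
            "INSERT INTO spans VALUES (?, ?)",
            (span["duration_ms"], json.dumps(span)),
        )

    def slowest(self, limit: int) -> list[dict]:
        rows = self.db.execute(
            "SELECT body FROM spans ORDER BY duration_ms DESC LIMIT ?", (limit,)
        )
        return [json.loads(body) for (body,) in rows]

# A ClickHouse-backed class would satisfy the same Protocol, so the rest of
# the system never needs to know which backend is wired in.
```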
The truth is that ClickHouse is an incredible DB that scales really well for observability data.
When I was starting Traceway I was heavily inspired by Skylight (skylight.io) from the Ruby ecosystem. I loved their SLOs and perf-issue ranking, but I also wanted the features that Sentry offers, all in one place.
I have written zero skills, so I'm not sure how normal it is. I counted the words in a couple of them and they seem to be in the 2k range, so 5 skills would be around 10k words, or very roughly 13k tokens. Even in a small LLM context of 128k, that's still only around 10%. And for a 1M context window like the big ones have, it barely registers.
If there is anything we have learned in decades of software engineering, it's that "a clear outcome" is not easy to describe. In many cases it's impossible unless people from 4 different domains collaborate. That's why process matters: it allows software to be built in a "semi-standardized" way, where iterations get us closer to the expected outcome, which might only emerge over time.
Yes, not everything I use LLMs for is going to have the same level of ambiguity or complexity of requirements. Optimizing by choosing to skip over parts of the process is exactly what Addy is talking about in this article.
Seriously though, why is it a model "card", safety "card"? I had to look it up to learn that it comes from Hugging Face's vague definition of the "README" in a model's repo. This is such a specific thing that I don't think anyone except a very small population would know it - not the users, not the C-suite.
I don't like Musk or Grok, but not knowing what a safety card is isn't a signal of anything IMO.
The "model card" concept actually comes from a pre-LLM Google paper (https://arxiv.org/abs/1810.03993), where the example cards did fit on a single page. The concept quickly became a standard component of AI governance frameworks, and Hugging Face adopted it as a reasonable standard format for a model README. As LLMs emerged and became more capable at broader ranges of tasks, model cards expanded to the sizes we see today.
That makes sense. I recall a “battle card” (“concise, easy-to-scan document that helps [sales] reps handle competitive conversations, respond to objections, and highlight key differentiators,” per HubSpot) as being about a half-sheet document, which is congruent.
But users don’t need to know. You’re 100% right: you shouldn’t need to know this inside baseball (you didn’t do the polluting and computing, so you didn’t gain the responsibility).
> Seriously though, why is it a model "card", safety "card"?
My assumption is because "card" has a more formal tone than a README, which is more like a quick "how to use the software" guide.
Collins dictionary says this about "cards":
> A card is a piece of stiff paper or thin cardboard on which something is written or printed. (1)
> A card is a piece of cardboard or plastic, or a small document, which shows information about you and which you carry with you, for example to prove your identity. (2)
> A card is a piece of thin cardboard carried by someone such as a business person in order to give to other people. A card shows the name, address, phone number, and other details of the person who carries it. (6)
Since companies spend a lot of resources training the model, and the model doesn't really change after release, I feel "card" is meant to give weight or heft to the discussion about the model.
It's not meant to be updated like a README or other software documents; it's meant to be handed out to others as a firm, unchanging "this is a summary of the model and its specifications", like a business card for models.
Kudos. I set out on this exact journey a couple of days back, and Pi is what I started reading for inspiration as well. I really can't stand the text boxes and the animations of the mainstream harnesses.
I'm not a teenager anymore but I thoroughly enjoyed it, a lot better than some random dev breathlessly talking about how they haven't written a line of code in 6 months, or an article talking about how LLMs lead to the end of programming/the economy/the world, etc etc.