Isn’t that what this codebase is doing? I haven’t grokked it 100% yet.
Recently I’ve been trying to engineer a prompt that I intend to run 1k times.
After noticing GPT-4 bug out on several responses, I talked it through the problem in more detail and asked it to rewrite the prompt. So an automated approach that helps build better prompts based on held-out gold data would be useful to me.
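To make that concrete, here is a minimal sketch of the loop I have in mind: score each candidate prompt against a small gold set and keep the best one. This is only my assumption of how such a thing could look, not what the codebase actually does. `fake_model`, `GOLD`, `CANDIDATES`, and the exact-match scoring are all hypothetical placeholders; in practice `model` would wrap a real LLM call.

```python
from typing import Callable, Sequence, Tuple

GoldSet = Sequence[Tuple[str, str]]  # (input, expected answer) pairs


def score_prompt(template: str, gold: GoldSet, model: Callable[[str], str]) -> float:
    """Return the fraction of gold examples the prompt gets exactly right."""
    if not gold:
        raise ValueError("gold set must not be empty")
    hits = 0
    for question, expected in gold:
        answer = model(template.format(input=question))
        # Exact match after trimming and lowercasing; a crude metric, chosen for brevity.
        hits += answer.strip().lower() == expected.strip().lower()
    return hits / len(gold)


def pick_best_prompt(candidates: Sequence[str], gold: GoldSet,
                     model: Callable[[str], str]) -> str:
    """Return the candidate template with the highest gold-set score."""
    return max(candidates, key=lambda t: score_prompt(t, gold, model))


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM: chatty unless told to be terse."""
    question = prompt.rsplit("Q: ", 1)[-1]
    answer = {"2+2": "4", "capital of France": "Paris"}.get(question, "unknown")
    if "Answer tersely." in prompt:
        return answer
    return f"Sure! The answer is {answer}."


# Hypothetical gold data and candidate prompt templates.
GOLD = [("2+2", "4"), ("capital of France", "Paris")]
CANDIDATES = ["Q: {input}", "Answer tersely. Q: {input}"]

best = pick_best_prompt(CANDIDATES, GOLD, fake_model)
```

With a real model you would score on one slice of the gold data while iterating and keep a separate slice untouched, so the final number isn't inflated by prompts tuned to the examples.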