I ended up training a BERT model on nothing but Python code for the embedding search. The results were crap. Then I used an LLM to write a new docstring for each class/function definition in the training data, and the results were better than state of the art.
There's so much wide open space to explore. It's a shame that everyone is wasting their time with the biggest possible models they can afford.
Do you have any more detailed info on this process? I've played around with using LLMs, but nothing in the training realm. I'd love to see a writeup or guide to the process you used there.
The tools have broken again since then - thanks, TensorFlow data loaders - and my code only works against a version of Python that's no longer supported on LTS Ubuntu/Debian 10+.

I have been mulling over running a subscription service where you get up-to-date code that works, on topics like the above. If you're interested, drop me a line at my profile email and I'll add you to a mailing list when/if I ever get around to doing it.
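In the meantime, the docstring-rewrite step looked roughly like this. This is a sketch from memory, not my actual pipeline: call_llm is a placeholder for whatever completion client you use, and the prompt is just illustrative.

    # Rough sketch: rewrite every class/function docstring in a .py file
    # with an LLM before using the file as embedding-training data.
    # Needs Python 3.9+ for ast.unparse. call_llm is a stand-in, not a real API.
    import ast
    import sys

    def call_llm(prompt: str) -> str:
        raise NotImplementedError("plug in your LLM client of choice here")

    def rewrite_docstrings(source: str) -> str:
        tree = ast.parse(source)
        for node in ast.walk(tree):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                snippet = ast.unparse(node)[:2000]  # keep the prompt small
                new_doc = call_llm(
                    "Write a concise docstring for this Python definition. "
                    "Return only the docstring text.\n\n" + snippet
                )
                doc_node = ast.Expr(value=ast.Constant(value=new_doc))
                if ast.get_docstring(node) is not None:
                    node.body[0] = doc_node        # replace the existing docstring
                else:
                    node.body.insert(0, doc_node)  # add one where it was missing
        return ast.unparse(ast.fix_missing_locations(tree))

    if __name__ == "__main__":
        print(rewrite_docstrings(open(sys.argv[1]).read()))

Note that ast.unparse drops comments and original formatting, so something like this only makes sense when the output just feeds a tokenizer.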
My example was asking for a poem about the headlines (a good example of info they don't have, and something that's very hard to do mechanically).
https://news.ycombinator.com/item?id=37015591