That's cool. I've created a website (https://papertube.site) that essentially transcribes video conversations for reading on Kindle. Right now, I'm relying on third-party APIs, but I was thinking about self-hosting to reduce costs.
It's like a reverse audiobook. But how do you handle video content, given that the visual medium carries more dimensions of information than sound alone?