
I'd argue that multimodal training can improve unimodal and bimodal models.

There is overlap between text-to-image and text-to-video -- image models could help video models render interesting or complex prompts, while video could help image models learn to differentiate features, since there are additional cues in how the scene changes and what stays the same across frames.
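
A minimal sketch of one way the video-helps-image direction could work (everything here -- encoder, names, dims -- is hypothetical, not any particular system): treat two frames sampled a short time apart from the same clip as a positive pair, SimCLR-style, so temporal change supplies the augmentation for free.

    import torch
    import torch.nn.functional as F

    def temporal_nce_loss(encoder, frame_t, frame_t_plus_k, temperature=0.1):
        """frame_t, frame_t_plus_k: batches of frames, row i of each
        taken from the same clip. encoder maps images to feature vectors."""
        z1 = F.normalize(encoder(frame_t), dim=-1)
        z2 = F.normalize(encoder(frame_t_plus_k), dim=-1)
        # Similarity of every frame against every other; the diagonal
        # holds the "same clip, different moment" positives.
        logits = z1 @ z2.t() / temperature
        targets = torch.arange(len(z1), device=z1.device)
        return F.cross_entropy(logits, targets)

The intuition is that the scene content persists between nearby frames while pose, lighting, and occlusion vary -- exactly the "changes vs. stays the same" signal a still-image dataset can't provide.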

There's overlap among audio, text transcripts, and video around learning to animate speech, e.g. by learning how faces move with the corresponding audio/text.

There's overlap between sound and video -- e.g. learning to associate a sound like a dog barking with the dog in frame, without direct labelling of either modality.
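
A hedged sketch of how that association might be learned (CLIP-style contrastive alignment; the encoders, dimensions, and class name are assumptions for illustration): audio and video embeddings from the same clip get pulled together, mismatched pairs pushed apart, so "barking goes with dogs" falls out of co-occurrence alone.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CrossModalAligner(nn.Module):
        """Project audio and video embeddings into a shared space and
        train with a symmetric InfoNCE loss. Co-occurrence in the same
        clip is the only supervision signal -- no labels needed."""
        def __init__(self, audio_dim=512, video_dim=768, shared_dim=256):
            super().__init__()
            self.audio_proj = nn.Linear(audio_dim, shared_dim)
            self.video_proj = nn.Linear(video_dim, shared_dim)
            self.log_temp = nn.Parameter(torch.tensor(2.5))  # learned temperature

        def forward(self, audio_emb, video_emb):
            a = F.normalize(self.audio_proj(audio_emb), dim=-1)
            v = F.normalize(self.video_proj(video_emb), dim=-1)
            logits = a @ v.t() * self.log_temp.exp()
            # Matching audio/video pairs sit on the diagonal.
            targets = torch.arange(len(a), device=a.device)
            return (F.cross_entropy(logits, targets)
                    + F.cross_entropy(logits.t(), targets)) / 2

The same pattern would cover the audio/transcript/face case above: any two modalities recorded together give you aligned pairs for free.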


