It really depends on the job you're trying to accomplish.
I'd venture that it's far too early for horizontal, massive-scale RAG apps.
Most solutions will want to focus on a very specific vertical application where the dataset is much more constrained. That's where this makes more sense.
It depends on your latency requirements. Not every RAG task has a user waiting for an immediate response; for my use case it doesn't matter if an answer takes tens of minutes to generate.
At indexing time:
- run an LLM over every data point multiple times ("gleanings") for entity extraction and constructing a graph index
- run an LLM over the graph multiple times to create clusters ("communities")
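The indexing steps above can be sketched roughly as follows. This is a toy illustration, not GraphRAG's actual implementation: `call_llm` is a deterministic stub standing in for a real extraction prompt, and community detection is reduced to connected components (the real system uses the Leiden algorithm). Only the control flow — repeated "gleaning" passes per chunk, then clustering the resulting graph — is the point.

```python
from collections import defaultdict

def call_llm(prompt: str) -> list[tuple[str, str]]:
    # Stub: pretend the model extracts (entity, entity) relation pairs
    # by pairing up capitalized words. A real call returns structured output.
    words = [w.strip(".,").lower() for w in prompt.split() if w[0].isupper()]
    return list(zip(words, words[1:]))

def extract_graph(chunks: list[str], gleanings: int = 2) -> dict[str, set[str]]:
    """Run the extraction prompt several times per chunk and union the edges."""
    graph: dict[str, set[str]] = defaultdict(set)
    for chunk in chunks:
        for _ in range(gleanings):  # repeated passes catch entities missed earlier
            for a, b in call_llm(chunk):
                graph[a].add(b)
                graph[b].add(a)
    return graph

def communities(graph: dict[str, set[str]]) -> list[set[str]]:
    """Toy community detection: connected components via DFS."""
    seen: set[str] = set()
    out: list[set[str]] = []
    for node in graph:
        if node in seen:
            continue
        comp: set[str] = set()
        stack = [node]
        while stack:
            n = stack.pop()
            if n in comp:
                continue
            comp.add(n)
            stack.extend(graph[n])
        seen |= comp
        out.append(comp)
    return out

docs = ["Alice met Bob in Paris.", "Carol emailed Dave."]
comms = communities(extract_graph(docs))
```

Note that the LLM cost here scales as (chunks × gleanings) extraction calls before a single query is ever served, plus one summarization call per community — which is exactly the cost concern raised below.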
At query time:
- run the LLM across all clusters, creating an answer from each and scoring them
- run the LLM across all but the lowest-scoring answers to produce a "global answer"
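The query stage above is essentially a map-filter-reduce over community summaries. A minimal sketch, with the LLM calls stubbed (`map_answer` is hypothetical; real systems ask the model to answer and self-rate relevance, and the final reduce is another LLM call over the kept answers):

```python
def map_answer(question: str, summary: str) -> tuple[str, int]:
    # Stub scoring: count question words appearing in the summary.
    # A real system would ask the LLM for an answer plus a 0-100 relevance score.
    score = sum(1 for w in question.lower().split() if w in summary.lower())
    return (f"From [{summary[:20]}...]: partial answer", score)

def global_answer(question: str, summaries: list[str], keep_ratio: float = 0.5) -> str:
    # Map: one answer + score per community summary.
    scored = [map_answer(question, s) for s in summaries]
    # Filter: drop the lowest-scoring answers.
    scored.sort(key=lambda t: t[1], reverse=True)
    kept = [a for a, _ in scored[: max(1, int(len(scored) * keep_ratio))]]
    # Reduce: a real system would hand `kept` to the LLM one final time.
    return " | ".join(kept)
```

The map step is the expensive part: it is one LLM call per community per query, which is what makes per-query cost grow with dataset size rather than staying flat as in vanilla vector-search RAG.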
...aren't the compute requirements here untenable for any decent sized dataset?