The latter. The major frameworks, at least, can be run in CPU-only mode, with a hardware abstraction layer for other devices (CUDA-capable cards, TPUs, etc.). So in practice you need an Nvidia GPU to get anywhere in a reasonable amount of time, but if you're not especially latency-sensitive (for inference) then CPU is an option. In principle, a CPU can run much bigger model inputs (at the cost of even more latency) because RAM is typically an order of magnitude more plentiful than VRAM.
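For example, in PyTorch (assuming that's the framework in question) the standard idiom is to pick the device at runtime and fall back to CPU; the tiny `nn.Linear` here is just a stand-in for a real model:

```python
import torch
import torch.nn as nn

# Pick the best available device; falls back to CPU when no CUDA card is present.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(16, 4).to(device)       # toy stand-in for a real model
x = torch.randn(1, 16, device=device)     # input tensor on the same device

with torch.no_grad():                     # inference only, no autograd overhead
    y = model(x)

print(y.shape)  # same result shape whichever device ran it
```

The same code runs unchanged on CPU-only machines; only the wall-clock time differs.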
I was thinking (as someone who knows nothing about this really) that the Apple chips might be interesting because, while they obviously don't have the GPGPU grunt to compete with NVIDIA, they might have a more practical memory:compute ratio... depending on the application of course.
Is there any blocker to having VRAM swap out to RAM or SSD? It would make processing much slower, but that beats nothing at all (an OOM crash), and might beat falling back entirely to the CPU (slower still).
Not sure. I suspect the issue would be lots of memory transfer between the GPU and the CPU, because downstream layers usually need previous layer outputs. It would probably depend on the receptive field of the network? Also on how expensive memory transfer is; maybe it's worth it in some cases. But there's no reason why you couldn't run, say, the first big layers on the CPU and then treat deeper layers (which may take a smaller input) as a separate network to run on the GPU. I suppose you want the largest subgraph in your model that can fit in available VRAM. Certainly the Coral/EdgeTPU will dispatch unsupported operations to the CPU, but that affects all ops beyond that point in the computation graph.
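A minimal sketch of that split in PyTorch (hypothetical layer sizes; on a machine with no GPU both halves just land on the CPU):

```python
import torch
import torch.nn as nn

gpu = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Big early layers stay on the CPU, where RAM is plentiful...
front = nn.Sequential(nn.Linear(4096, 512), nn.ReLU())

# ...while the smaller deep layers run on the GPU as a separate subgraph.
back = nn.Sequential(nn.Linear(512, 64), nn.ReLU(), nn.Linear(64, 10)).to(gpu)

x = torch.randn(8, 4096)   # batch lives in ordinary RAM
with torch.no_grad():
    h = front(x)           # CPU compute on the big input
    h = h.to(gpu)          # the one host-to-device transfer per forward pass
    y = back(h)            # GPU compute on the reduced representation

print(y.shape)
```

Because the activation crossing the boundary (512 floats per sample here) is much smaller than the input, the transfer cost stays low; whether the split pays off depends on exactly the transfer-versus-compute trade-off described above.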