Even without sparsification, AVX-512 CPUs are far more cost-effective for infere...

boulos · on Sept 11, 2021

Usual Disclosure: I used to work on Google Cloud.

> A GPU cloud instance costs $1000+ per month (vs. $10 per month for an AVX-512 CPU). A bargain GPU instance (e.g., Linode) costs $1.50 per hour (and far more on AWS) but an AVX-512 CPU costs maybe $0.02 per hour.

This is a little confused. A T4 on GCP is $.35/hr at on-demand rates. A single thread of a Skylake+ CPU is in the 2-3c/hr range [2] (so $15-20/month w/o any committed or sustained use discounts. Your $10 is close enough).

So roughly the T4 GPU itself is ~10 threads. Both of these are before adding memory, storage, etc., but the T4 is a great inference part and hard to beat.

Comparing a single thread of a CPU to training-optimized GPU parts (like the A100 or V100) is sort of apples and oranges.

[1] https://cloud.google.com/compute/gpus-pricing

[2] https://cloud.google.com/compute/all-pricing

37ef_ced3 · on Sept 11, 2021

$10 per month (i.e., $0.01389 per hour) Skylake cores: https://www.vultr.com/products/cloud-compute/

If you're not holding it for a full month, it's $0.015 per hour.

Here are the $1000 per month ($1.50 per hour) cloud GPUs I use, the cheapest that Linode provides: https://www.linode.com/pricing/

I would like to see a comparison of Linode vs Google Cloud, taking into account network bandwidth costs, etc. Maybe the Google Cloud T4's total cost of ownership is lower? I doubt it: for example, the Linode plan includes 16 terabytes of fast network traffic at no additional cost, and Google Compute charges $0.085 per gigabyte.

And the T4 is still 23x more expensive than an AVX-512 core, hourly.

rrss · on Sept 11, 2021

boulos is 100% right that you are choosing to compare to particularly expensive GPUs instead of cheaper GPU instances that are actually intended for inference at low cost.

EDIT: please stop it with the ninja edits.

ramzyo · on Sept 10, 2021

> Note that Neural Magic's engine is completely closed-source, and their python library communicates with the Neural Magic servers. If you use their engine, your project survives only as long as Neural Magic's business survives.

Could be totally off here, but have a feeling this team is going to get scooped up by one of the big co's. Having deja vu of XNOR.ai (purchased by Apple), when partners got the short end of the stick (see Wyze cam saga: https://www.theverge.com/2019/11/27/20985527/wyze-person-det...)

markurtz · on Sept 11, 2021

Disclosure: I work for Neural Magic.

Hi 37ef_ced3, AVX-512 has certainly helped close the gap for CPUs vs GPUs. Even then, though, most inference engines on CPUs are still very compute bound for networks. Unstructured sparsity enables us to cut the compute requirements down for CPUs leading to most layers becoming memory bound. This then allows us to focus on the unique cache architectures within the CPUs to remove the memory bottlenecks through proprietary techniques. The combination is what's truly unique for Neural Magic and enables the DeepSparse engine to compete with GPUs while leveraging the flexibility and deployability of highly available CPU servers.

Also as boulos said, the numbers quoted here are a bit lopsided. There are more affordable GPUs and GCP has enabled some very cost effective T4 instances. Our performance blog on YOLOv3 walks through a direct comparison to T4s and V100s on GCP for performance and cost: https://neuralmagic.com/blog/benchmark-yolov3-on-cpus-with-d...

Additionally, I'd like to clarify that our open sourced products do not communicate with any of our servers currently. These are stand alone products that do not require Neural Magic to be in the loop for use.

shock-value · on Sept 10, 2021

GCP lists a T4 which is suitable for inference for between $0.11/hour and $0.35/hour (depending on commitment duration and preemptibility).

https://cloud.google.com/compute/gpus-pricing

carbocation · on Sept 10, 2021

Agreed - I priced this out for a specific distributed inference task a few months ago and the T4 was cheaper and faster than GPU.

On 26 million images using a pytorch model that had 41 million parameters, T4 instances were about 32% cheaper than CPU instances, and took about 45% of the time even after accounting for extra GPU startup time.

ml_hardware · on Sept 10, 2021

T4 is a gpu :) NVIDIA Tesla T4: https://www.nvidia.com/en-us/data-center/tesla-t4/

carbocation · on Sept 11, 2021

Yes, I’m using “T4” as a shorthand for “instances otherwise matched to the CPU-based instances but which also have a T4 GPU.”

mkaic · on Sept 10, 2021

OP was most likely referring to the AWS EC2 instance type T4, which runs on Amazon Graviton processors IIRC.

mkl · on Sept 11, 2021

> GCP lists a T4

GCP not AWS, or are you talking about a different OP?

37ef_ced3 · on Sept 10, 2021

Many businesses/services can't saturate the hardware you describe. It's just too much compute power. With CPUs you can scale down to fit your actual needs: all the way down to a single AVX-512 core doing maybe 24 inferences per second (costing a few dollars PER MONTH).

Also, your cost/inference results will change if you use a fast CPU inference engine, instead of something slow like PyTorch (which you appear to be using).

carbocation · on Sept 11, 2021

Thanks - this is something I wasn’t familiar with. Do you have any pointers for CPU inference engines that you’ve had good experience with or that I can look into further?

markurtz · on Sept 11, 2021

Disclosure: I work for Neural Magic.

Hi carbocation, we'd love to see what you think of the performance using the DeepSparse engine for CPU inference: https://github.com/neuralmagic/deepsparse

Take a look through our getting started pages that walk through performance benchmarking, training, and deployment for our featured models: https://sparsezoo.neuralmagic.com/getting-started

pjmlp · on Sept 11, 2021

Well AVX-512 is kind of the survivor of the whole Larrabee failure, so it is no wonder that it has a similar capability.

gautamcgoel · on Sept 11, 2021

I notice that you failed to mention that you are the maintainer of NN-512.

outlace · on Sept 10, 2021

“That said, GPUs are essential for training”

http://learningsys.org/neurips19/assets/papers/18_CameraRead...

zekrioca · on Sept 10, 2021

Read parent's (emphasis mine):

> Even without sparsification, AVX-512 CPUs are far more cost-effective for inference.

fxtentacle · on Sept 10, 2021

I believe that's also why there are so many options for turning fully trained TensorFlow graphs into C++ code.

You use the expensive GPU for building the AI, then dumb it down for mass-deployment on cheap CPUs.