
How are such savings not obvious after putting the amounts in an Excel sheet, and spending an hour over it (and most importantly doing this before spending half a million/year on AWS)?


> and most importantly doing this before spending half a million/year on AWS

AWS is... incentivizing scope creep, to put it mildly. In ye olde days, you had your ESXi blades, and if you were lucky some decent storage attached to them, and you had to make do with what you had - if you needed more resources, you'd have to go through the entire usual corporate bullshit: get quotes from at least three comparable vendors, line up contract details, POs, get approval from multiple levels...

Now? Who cares if you spin up entire servers worth of instances for feature branch environments, and look, isn't that new AI chatbot something we could use... you get the idea. The reason why cloud (not just AWS) is so popular in corporate hellscapes is because it eliminates a lot of the busybody impeders. Shadow IT as a Service.


Those busybodies are also there to keep rogue engineers from burning money on useless features (like AI chat bots) that only serve to bolster their promo packet...


> keep rogue engineers from burning money on useless features (like AI chat bots)

As someone who has worked on an AI chat bot I can assure you it does not come from engineers.

It's coming from the CFO who is salivating at the thought of downsizing their customer support team.


The other reply in this thread might beg to differ: https://news.ycombinator.com/item?id=38295849

I think the reality is no discipline, be it Engineering, Product or Finance, is immune to flights of fancy.


I literally started laughing at this. I worked at a bare-metal shop fairly recently and a guy on my team used a corporate credit card to set up an AWS account and create an AI chatbot.

The dude nearly got fired, but your comment hit the spot. You made my night, thank you.


That depends on how the incentive structures for your corporate purchase department are set up - and there's really a ton of variance there, with results ranging from everyone being happy in the best case to frustrated employees quitting in droves or the company getting burned at employer rating portals.


> That depends on how the incentive structures for your corporate purchase department are set up

Sure, but that seems orthogonal to the pros and cons of having more layers of oversight (busybodies, to use your term) on infra spend. Badly run companies are badly run, and I don't think having the increased flexibility that comes from cloud providers changes that.


So instead of using 500k of engineering time when you need more resources, you're now saving 50k of AWS overspend.


I would be surprised if people didn't know that colocating was cheaper. I certainly evangelize it for workloads that are particularly expensive on AWS.

It's not entirely without downsides, though, and I think many shops are willing to pay more for a different set of them. It is incredibly rewarding work. You get to do magic.

* You do need more experienced people, there's no way around it, and the skills are hard to come by sometimes. We spent probably 3 years looking to hire a senior DBA before we found one. Networking people are also unicorns.

* Having to deal with the full stack is a lot more work, and needing to manage IRL hardware is a PITA. I hated driving 50 miles to swap some hard drives. Rather than using those nice cloud APIs, you are on the other side, implementing them. And all the VM management software sucks in its own unique ways.

* Storage will make you lose sleep. Ceph is a wonder of the technological world, but it will also follow you into a dark alleyway and ruin your night.

* Building true redundancy is harder than you think it should be. "What if your ceph cluster dies?" "What if your ESXi shits the bed?" "What if Consul?" Setting things up so that you don't accidentally have single points of failure is tedious work.

* You have to constantly be looking at your horizons. We made a stupid little doomsday clock web app that we put all the "in the next x days/weeks/months we have to do x or we'll have an outage." Because it will take more time than you think it should to buy equipment.


It's great when you don't need instant elasticity and traffic is very predictable.

I think it's very useful for batch processing, especially owning a GPU cluster could be great for ML startups.

Hybrid cloud + bare metal is probably the way to go (though that does incur the complexity of dealing with both, which is also hard).


Cloud puts spend decisions into individual EMs' or even devs' hands. With bare metal, one team ("infra" or whatever) owns all compute, so spend increases need to be justified by EMs, which they usually don't like ;)


Bare-metal solutions save money, but are costly in terms of development time and lost agility. Basically, they have much more friction.


huh, say wut?

I guess before Amazon invented "the cloud" there weren't any software companies...


AWS isn't just IaaS; they're also PaaS.

So it's a fact that for most use cases it will be significantly easier to manage than bare metal.

Because much of it is being managed for you e.g. object store, databases etc.


Setting up k3s: 2 hours

Setting up Garage for obj store: 1 hour.

Setting up Longhorn for storage: 15 minutes.

Setting up db: 30 minutes.

Setting up Cilium with a pool of ips to use as a lb: 45 mins.

All in: ~5 hours and I'm ready to deploy, spending 300 bucks a month just renting bare-metal servers.

AWS, for far less compute and the same capabilities: approximately 800-1000 bucks a month, and it takes about 3 hours -- and we aren't even counting egress costs yet.

So, for two extra hours on your initial setup, you can save a ridiculous amount of money. Maintenance is actually less work than AWS too.

(source: I'm working on a youtube video)
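For the load-balancer step, recent Cilium versions do this with an IP pool manifest along these lines (a sketch; the pool name and address range are placeholders):

```yaml
# Hypothetical CiliumLoadBalancerIPPool: hands out addresses from a local
# range to Services of type LoadBalancer.
apiVersion: cilium.io/v2alpha1
kind: CiliumLoadBalancerIPPool
metadata:
  name: bare-metal-pool
spec:
  blocks:
    - start: "192.0.2.10"
      stop: "192.0.2.50"
```

Apply it and Cilium assigns external IPs from the range, no cloud LB needed.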


You should stick to making Youtube videos then.

Because there is a world of difference between installing some software and making it robust enough to support a multi-million dollar business. I would be surprised if you can set up and test a proper highly available database with automated backups in < 30 mins.


When you are making multi-million dollars, that’s when you spend the money on the cloud. There is a spot somewhere between “survive on cloud credits” and “survive in the cloud” where “survive on bare metal” makes far more sense. The transitions are hard, but it is worth it.


What do you mean? You install postgres on 2+ machines and configure them.

Installing software is exactly what multi-million dollar companies do.

Backups are not hard either. There are many open-source setups out there, and building your own is not that complex.
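For what it's worth, the "2+ machines" setup really is roughly this much configuration (hostnames and the replication user are made up; assumes PostgreSQL 12+ streaming replication):

```
# primary: postgresql.conf fragment
wal_level = replica
max_wal_senders = 5

# primary: pg_hba.conf entry for a (hypothetical) replication user
host replication replicator 10.0.0.0/24 scram-sha-256

# standby: one-time clone; -R writes standby.signal and primary_conninfo
#   pg_basebackup -h primary.internal -U replicator \
#     -D /var/lib/postgresql/data -R --wal-method=stream
```

The hard part the skeptics are pointing at is everything around this: failover, monitoring, and testing restores.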


Doing this right is easier said than done. I worked at one company that ran their own Postgres instances on EC2 instead of using RDS. Big mistake. The configuration was so screwed up they couldn't even take a backup that wasn't corrupted. They had "experts" working on this for months.


>Backups are not hard either.

I guess you've never set one up, because I've seen numerous attempts at this and none took less than a month to do.


Backups are not hard. I've set them up, and restored from them, numerous times in my career. Even PITR is not hard.

It takes about 10-15 mins to apply the configs and install the stuff, then maybe another 10-15 minutes to run a simple test and verify things.
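For the skeptics: the PITR part really is mostly configuration (the archive destination and timestamp here are made up; the `archive_command`/`restore_command` shapes follow the PostgreSQL docs):

```
# postgresql.conf: continuous WAL archiving
archive_mode = on
archive_command = 'test ! -f /backup/wal/%f && cp %p /backup/wal/%f'

# to restore to a point in time: restore a base backup, create
# recovery.signal, and set on the restored copy:
restore_command = 'cp /backup/wal/%f %p'
recovery_target_time = '2023-11-17 03:00:00'
```

The "simple test" step is the part people skip: actually restoring the backup somewhere and checking the data.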


> You install postgres on 2+ machines and configure them.

Oh sweet summer child... let me know how that goes for you.


It's easy to stand this stuff up initially, but the real work is in scaling, automating, testing and documenting all of that, and many places don't have the people, skills or both to do all of that easily.

Also, with EKS, you get literally all of this (except Cilium and Longhorn, if you need that, which you don't if you use vpc-cni and eks-csi), in ~8 minutes, and it comes with node autoscaling, tie-ins into IAM and a bunch of other stuff for free. This is perfect for a typical lean engineering team that doesn't really do platform stuff but need to out of necessity and/or a platform team that's just getting ramped up on k8s.

You also don't need to test your automation for k8s upgrades or maintain etcd with EKS, which can be big time-savers.
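That ~8 minutes is roughly one eksctl run against a config like this (cluster name, region, and sizes are placeholders):

```yaml
# Hypothetical eksctl ClusterConfig; run with `eksctl create cluster -f cluster.yaml`
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: demo
  region: us-east-1
iam:
  withOIDC: true            # enables IAM roles for service accounts
managedNodeGroups:
  - name: workers
    instanceType: m5.large
    desiredCapacity: 3
```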

(FWIW I love Kubernetes and have made courses/workshops of exactly this work)


Except that Kubernetes, along with all the additional components it needs (you mentioned Longhorn, a database such as PostgreSQL, Cilium, and others), has a multitude of ways to fail unpredictably during updates, and there can be many hidden, minor bugs. The list of open issues on GitHub for these projects is very lengthy.

I'm not suggesting that this isn't a viable solution, but I would prefer to be well-prepared and have a team of experts in their respective fields who are willing to be on call. This team would include a specialist for Longhorn or Ceph, since storage is extremely important; one to set up and maintain a high-availability PostgreSQL database with an operator and automated, thoroughly tested backups; and another for eBPF/Cilium networking complexities, which is also crucial, because if your cluster network fails, it results in an immediate major outage.

Certainly, you can claim that you have sufficient experience to manage all these systems independently, but when do you plan to sleep if you're on call 24/7/365? Therefore, you either need a highly competent team of domain experts, which also incurs a significant cost, or you opt for cloud services where all of this management is taken care of for you. Of course, this service is already included in the price, hence it's more expensive than bare-metal.


I feel like this is a pretty short-sighted solution.

> you either need a highly competent team of domain experts, which also incurs a significant cost

You can build this team and domain knowledge from the ground up. Internal documentation goes a long, long way too. From my time working at bare-metal shops off and on over the last decade, documentation capturing the why and the how is extremely important.

> you opt for cloud services where all of this management is taken care of for you

Is it though? One of the biggest differences between a bare-metal shop and a cloud shop is when shit hits the fan. When the cloud goes down, everyone is sitting around twiddling their thumbs while customer money flies away. There isn't anything anyone can do. Maybe there will be discussions about how we should make the staging/test envs in more regions ... but that's expensive. When the cloud does come back, you'll be spending more hours possibly rebooting things to get everything back into a good state, maybe even shipping code.

When bare-metal goes down, a team of highly competent people, who know it in-and-out are giving you minute-by-minute status updates, predictions on when it will be back, etc. They can even bring certain systems back online in whatever order they want instead of a random order. Thus you can get core services back online pretty quickly, while everything runs degraded.

Like all things in software, there are tradeoffs.


Based on my experience, uptime for the Generally Available (GA) services in the cloud typically ranges between 99.9% and 99.95% (a maximum of roughly 8 hours and 46 minutes of downtime per year in the 99.9% case) for a single region. This aligns with my long-term experience as a Google Cloud Platform (GCP) user. If you use preview or beta versions of services, the reliability may be lower, but then you're taking on that risk yourself.

Should you require greater than 99.95% uptime for particularly critical operations, then opting for a multi-region approach, such as using multi-region storage buckets, is advisable. It's also worth mentioning that I have never experienced a full 8 hours of downtime at once in a given year. It has usually been a case of increased error rates or heightened latency due to the inherent redundancy provided by availability zones within each region. Just make sure your network calls have a retry mechanism and you should be fine in almost all cases.
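Those downtime budgets are easy to sanity-check (back-of-the-envelope, assuming a 365-day year):

```python
# Convert an availability fraction into a yearly downtime budget in hours.
def downtime_hours_per_year(availability: float) -> float:
    return (1.0 - availability) * 365 * 24

for a in (0.999, 0.9995, 0.9999):
    print(f"{a:.2%} -> {downtime_hours_per_year(a):.2f} h/year")
# 99.90% -> 8.76 h/year
# 99.95% -> 4.38 h/year
# 99.99% -> 0.88 h/year
```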


On my first day ever in production on AWS, we had an entire 9 hours of downtime[1]... so maybe I'm a bit biased, especially because we had another one not long after[2], on Christmas freaking Eve. Before we moved to AWS, we could give stakeholders resolution timelines within minutes of discovering an issue. After moving to AWS, we played darts and looked like fools, because there was nothing we could do while the company hemorrhaged money.

The cloud is much more mature these days, as is DevOps in general, but major outages still happen. If you run your own cloud, you'll still have major outages. You can't really escape them.

If you have the expertise, or can get the expertise, to do it in-house, you should do it. Just look at the US and its inability to build an inexpensive rocket, or even just manufacture goods. It outsourced everything (basically) to the point where it is reliant on the rest of the world for basic necessities.

You gotta think long-term, not short-term.

[1]: https://aws.amazon.com/message/680342/

[2]: https://aws.amazon.com/message/680587/


Setting up a DB with backups, PITR, and periodic backup tests in 30 minutes? You made my day :) Also, how about controlling who has access to servers, or tamper-protected activity logs? That's just scratching the surface.


Yeah, it is pretty straightforward to set it up in about 30-45 minutes. I have a cookbook I've been building since 2004, and I keep it updated. Maybe one day I'll publish it, but mostly, it is for me.

> how about controlling who has access to servers

It depends on where you want to control access to and how. If you are talking about SSH, you can bind accounts to a GitHub team with about 20 lines of bash deployed to all the servers. Actually, I have a daemonset that keeps it updated.

Thus only people in my org, on a specific team, can access a server. If I kick them out of the team, they lose access to the servers.

I have something similar for k8s, but it is a bit more complicated and took quite a bit of time to get right.
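A minimal sketch of the GitHub-keys idea (function names here are made up, but GitHub really does serve a user's public SSH keys at github.com/&lt;user&gt;.keys):

```shell
#!/usr/bin/env bash
# Sketch: pull a user's SSH public keys from GitHub. Assumes local
# usernames match GitHub logins; a real setup would map them and check
# team membership via the GitHub API first.
github_keys_url() {
  printf 'https://github.com/%s.keys\n' "$1"
}

fetch_keys() {
  # Suitable as an sshd AuthorizedKeysCommand target.
  curl -fsS "$(github_keys_url "$1")"
}

github_keys_url alice
# prints https://github.com/alice.keys
```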

> tamper protected activity logs?

sudo chattr +au /path/to/file

will make Linux treat the file as append-only: it can be read and appended to, but not overwritten, and deletions effectively become soft-deletes, because the `u` flag tells the filesystem to save the file's contents when it is deleted (assuming you are using a supported filesystem). Things like shell history files, logs, etc. make a ton of sense to set that way. There are probably a couple of edge cases I'm forgetting, but that will get you at least 80% there.


Debugging longhorn outages on a monthly basis: priceless.


I haven’t had any issues with longhorn in a long time. Most issues I run into are network related, in fact, I tend to find a lot of cilium bugs (still waiting on one to be fixed to upgrade, actually).


I hope you like slogging through the depths of Kubernetes api-resources and trudging through a trillion lines of logs across a million different pods.


Because the amount of time your engineers will spend maintaining your little herd of pet servers, and the opportunity cost of not being able to spin up managed service X to try an experiment, are not measurable.


> maintaining your little herd of pet servers

You know bare metal can be an automated fleet of cattle too, right?


Have you ever heard of PXE boot? You should check out Harvester, made by Rancher (IIRC). Basically, it lets you manage bare-metal machines using standard k8s tooling.



