So a goal of the service mesh is to keep it an api agnostic appliance. I'm not s...

tptacek · on Nov 24, 2019

Yeah, you haven't grokked the concept yet.

The idea behind a service mesh is that no matter how you design and implement a microservice architecture, you still have all these little services talking to each other somehow. By "default", they're communicating using native sockets and HTTP or gRPC.

There's lots that can go wrong any time you have a bunch of things talking to each other on a network (see Tanenbaum's "Critique of RPCs" for a classic explanation).

If everything was written in the same language, what you'd probably do is come up with a common networking/RPC library for all your services to use. That library would give you a common interface, so everything used the network the same way, and some measure of observability (maybe just by doing a standard log format), and maybe some security controls like a standard HTTPS connection and a standard token format.

But if you've got multiple languages, that option isn't very attractive anymore, because even if it's feasible to build that library once for each platform, now you're keeping an additional component (the library) in sync between the platforms, and that's a drag.

The insight behind a service mesh is that containers make it cheap and easy to just stack a tiny out-of-process proxy alongside your services. So you can just move all the logic you would have had in that common service library into the proxy. The proxy will give you the same features no matter whether your service is a Rust binary, Clojure running on the JVM, or a shell script.

And, because it's an out-of-process standalone component, it can be built and maintained by a third party, which means the features it provides can be a lot more ambitious than you'd build in your own library. So now instead of just hoping to get some TLS and maybe a standard token, everything can be mTLS with client certificates, a standard dashboard for managing certificates, rule systems for what client certs will allow you to talk to which systems, graphical maps of who's communicating with who and trace collection for specific pairs of services, etc.

This is all very un- like what an API gateway does, and the whole concept sort of revolves around having little reverse proxies running alongside the services.

jacques_chester · on Nov 25, 2019

> Tanenbaum's "Critique of RPCs"

I had trouble finding this under the given title. For those searching, I believe tptacek is referring to Tanenbaum & Renesse, A Critique of the Remote Procedure Call Paradigm [0], published in 1987 or 1988.

A fun read, especially since "it's just the same as local" is a myth each generation gets to revisit. Compare to Waldo, Wyant, Wollrath & Kendall's A Note on Distributed Computing[1], published in 1994.

[0] https://www.win.tue.nl/~johanl/educ/2II45/2010/Lit/Tanenbaum...

[1] https://www.cc.gatech.edu/classes/AY2010/cs4210_fall/papers/...

jayd16 · on Nov 25, 2019

I think I get it. Its underpinned by the assumption that the application doesn't need robust logic around connecting to a local proxy like it would connecting to a remote one. In this sense its going back to the days when instances had a local HAProxy running. That assumption didn't really pan out and we all decided service LBs were better but ok, sure. We can have both.

I still think describing the concept more plainly would help a lot of the confusion.

tptacek · on Nov 25, 2019

I think this is right, yes. We're not worried about the connectivity between the service and the local proxy because they're (always) cotenants of a container and using localhost to communicate.

A more general way to think about service meshes is that the network layer we code to right now is actually really primitive; its service model was fixated in the 1980s, and its programming interface hasn't much evolved from the early 1990s. We'd be happier if we could level up the whole network, so that it had QoS controls, a really expressive security model that didn't rely on magic-number ports and address ranges, and observation capabilities that communicated application-layer details and didn't just try to approximate them the way flow logs do. You can get all that stuff, internally at least, by putting all your services on the same service mesh.

Another thing to look at is Slack's Nebula, which was just released last week:

https://slack.engineering/introducing-nebula-the-open-source...

Nebula is a service mesh that runs at the IP layer (where Istio and Linkerd ride on top of HTTPS proxies, Nebula rides on top of a somewhat Wireguard-ish VPN). Slack has been using it internally for 2 years now. It's solving the same problems Linkerd is, but with a radically different implementation. You can get your laptop connected to a Nebula service mesh in ways that would be clunky to do with a Linkerd mesh.

phlakaton · on Nov 25, 2019

I would say it's even a bit more pessimistic than that: it's the assumption that the application _can't be relied on_ to provide robust logic around connecting to a remote service. You can address that problem in a library or in an intermediary service of some sort, but in an organization with a heterogeneous collection of languages, versions, and stacks, the library solution becomes expensive.

I'm not sure what you mean by "instances had a local HAProxy running" but if you're thinking about a bunch of reverse proxies handling incoming requests, be aware that service meshes are handling both inbound _and_ outbound traffic to/from your service. For example, you might use a reverse proxy in front of your instance to terminate TLS, but you cannot implement something like two-way TLS authentication between your services unless you're putting something in at the client side as well.

jayd16 · on Nov 25, 2019

I'm talking about using HAProxy as a local reverse proxy instead of a remote elastic load balancer. So I guess we're back to it.

erik_seaberg · on Nov 25, 2019

It's an RPC plugin but doing it out-of-process is weird (I need an RPC client to talk to my RPC client?). If you don't need a trust boundary or separate ulimits, loopback network traffic and context switches and reserialization are a really expensive way to mitigate a language having poor FFI plugin support.

If it were in-kernel and supported scatter-gather and zero-copy, that might be different (though some people even avoid going through the kernel).

tptacek · on Nov 25, 2019

The "RPC" that the proxy in Linkerd is doing is radically different and more expressive than the "RPC" that is running between the Linkerd proxy and your application, which is the whole point of the architecture.