Redis cluster, no longer vaporware

ryanpetrich · on Oct 9, 2014

Can't wait to see it get the Jepsen treatment.

jrallison · on Oct 9, 2014

It won't do well, but at least Antirez is up front about the shortcomings.

The article itself states that Redis Cluster will lose acknowledged writes during partitions (no P), will be eventually consistent (no C), and won't be available during partitions (no A).

antirez · on Oct 9, 2014

Yep, Jepsen is more suitable to check systems that claim either linerizability, or at least write safety, during partitions. I guess that a modified version of Jepsen could be used in order to validate the failure modes or to discover other unexpected ones that at human inspection look easy to reproduce in actual production environments. Also I don't know if Jepsen is good at this, but in theory it could be instrumented in order to check how good the implementation is, which is, even if it is not designed for write safety during partitions, how better the countermeasures are working?

bkirwi · on Oct 9, 2014

In theory, Jepsen or a Jepsen-like system should be able to check any of these failure modes.

On the other hand, it sounds like Redis Cluster offers few hard guarantees; instead, it promises that failures should be rare 'in practice'. Which is a fine thing for a tool to do, of course, but it makes things less amenable to the kind of stress-testing Jepsen does -- since running inside Jepsen's little universe is about as far from normal operation as you can get. If you already know that a system can fail in a certain way, getting Jepsen to reproduce that failure tells you very little.

If you'd like to make this kind of testing possible, it would be useful to state as many 'positive' rules as possible, which Redis Cluster should always respect -- things like "if a majority of nodes are fully connected, they should always accept writes" and "an unpartitioned cluster should always agree on the same value" -- alongside the documentation on ways it might fail. This way, clients can be assured of the 'bare minimum' that the system supports, and tools like Jepsen can give you more useful information.

antirez · on Oct 9, 2014

Oh, there are definitely hard rules like that. For example a majority partition never accepts queries, and when there are no partitions at all Redis Cluster guarantees to converge on a single value for each key, and to a single view of the cluster configuration. I'll try to document better this things, but basically they arise from the simple algorithm that makes the configuration eventually consistent.

bkirwi · on Oct 9, 2014

Neat; that should be very useful.

It's great to see that Redis has a official story for clustering / failover out; like you said in the post, the worst distributed systems are the ones you have to rewrite every single time. It's going to be interesting watching this evolve.

swartkrans · on Oct 9, 2014

Seems like given the above problems you would only want to use this with non-critical data of which you didn't care a great deal about its integrity. This doesn't describe a lot of use cases.

It's not too hard to get memcached running for cluster of machines, nor is it really all that difficult to get started with cassandra for basic applications, although I haven't done it. I've heard ReThinkDB wants to be a serious contender and that they've tried to solve these problems, but I haven't read about how great it is nor have I heard anyone use it in production so I don't know where it stands.

seiji · on Oct 9, 2014

This doesn't describe a lot of use cases.

Do you have any idea how many people run master/slave mysql? That's async and will definitely give you data drift over time. There's an entire suite of Percona tools to detect and repair broken mysql datasets because mysql breaks data so often due to replication problems.

The world is much more async and much less consistent than anybody realizes (cash machines are even async), but everything still works pretty much okay. The last line of defense for solving irreconcilable consistency problems is customer service.

jeremyjh · on Oct 9, 2014

Cash machines are definitely not async in normal operations, though the authorization network is configured with rules for stand-in mode when third-party systems are unavailable. Without the primary authorization system immediately available the ATM is hard down.

jrallison · on Oct 9, 2014

async != an incorrect system. Cash machines are a good example of an asynchronous, eventually consistent system that works. As is any well built eventually consistent data store.

Redis/Mysql has it's problems, as does all databases. However, as a practitioner it's important I know the shortcomings of each, so that I can know exactly what a certain database is offering in terms of safety and whether it's a good fit for my use-case. Antirez being open about them is a huge +1 in my book.

Data drift or loss of acknowledged writes isn't correct for a wide range of use-cases. If you know the limits, you can choose the incorrect system, and deal with the fallout when that "rare" event does happen.

seiji · on Oct 9, 2014

async != an incorrect system.

Very correct! Most of the mysql problems derive from it being statement replication and not log shipping. A proper postgres replication setup will be less prone to having data inconstancies.

antirez · on Oct 9, 2014

Note that systems with asynchronous replication and normal failover procedures are all subject to the same rules as Redis Cluster, and I believe there is a great deal of people running RDBMs in this setup. So IMHO actually there are a lot of use cases.

IgorPartola · on Oct 9, 2014

From what I understand, there are two types of systems really (in terms of CAP): those that do async replication and those that do sync replication.

Async replicated systems can be (but are not necessarily) AP: each system can process requests independently of the other, and they'll get consistent eventually (how they arrive at this consistency is different between systems, and is not always "correct" for the general use case).

Sync replicated systems can be (but are not necessarily) CA: as long as replication is not broken they are available and consistent. These systems have the upside in that they allow you to serve read-only data during partitions, but have slow writes because of the sync replication.

Both types can have features removed, for whatever reason. For example, you could have a sync replicated system that doesn't serve any requests when the replication is broken. Or you could have an async system that doesn't provide a mechanism to correct inconsistencies (a la MySQL async replication). What you describe sounds like an imperfect version of the async system: not highly available, not consistent, and not partition tolerant. Given that better AP systems exist, ones that actually provide things like availability during partitions and eventual consistency mechanisms/guarantees, will Redis Cluster eventually support these features?

Edit: please don't take any of this as criticism. Redis is an awesome product that I use and rely on often. Thanks for all the great work.

antirez · on Oct 9, 2014

Hello Igor,

basically we can rule out synchronous systems: product mismatch from the POV of Redis. We are left with AP systems. AP systems trying to get as good as they can from the POV of Availability, and the consistency model provided (even if not "C"), do these two additional things that Redis Cluster is not able to do:

1) They are available in the minority partition. Redis Cluster has some degree of availability, but not to that extend. True "A" means, serve queries even if you are the only node left.

2) They provide more write safety, unless you use it with last-write-win strategies or alike.

"1" is actually related to "2". If you merge values later, you can accept writes even if the node has no clue about what is the state of the rest of the system.

About "2", if you check around, you'll immediately figure out that there are no systems available that are AP and can easily model Redis data structures with the same performances, and having the same number of elements per value. This is because to merge huge stuff is time consuming, requires a lot of meta data, and so forth.

Now there is another constraint, that I don't want a system with application-assisted merging. For example recent Riak versions provide data structures with a default merge strategy, so we have examples of similar stuff. I can assure you that there is either meta-data needed to merge, that would make Redis a lot less memory efficient, and sometimes, there are simply no good strategies. For example take the "Hash" type. There is a partition and in one node I set as "name" field of user 1234 "George" and in another node "Melissa". There is no better merge strategy than last-write-wins, basically.

Now, that said, there are certain things that don't add major overhead, AND, improve write safety. For example if the connection is declared as "SAFE" we could cache all the SADD commands (and others having similar properties) until they are acknowledged by all the nodes serving this key. Unacknowledged writes are retained as a "log" of commands, and are re-played when the node is turned into a slave of a new master.

So there are plans to improve the consistency model, but it is unlikely that we go towards the Riak model, but the good thing is, there are stores like Riak that are exploring this other way of doing data structures without application assisted merge functions, with per element meta data and so forth.

IMHO what there is to ponder is that at the end of the day, is that a distributed database is the sum of what it offers when there are not partitions, and what it offers when there are partitions. Often to improve the latter requires to give up something in the former: simplicity, space efficiency, data model freedom, ...

IgorPartola · on Oct 9, 2014

Thanks for the explanation. Is it possible to add something like "this record was not merged cleanly" flag? Basically, from what I understand without sync replication you cannot theoretically have a system that does the correct merge in all use cases. MySQL would deal with this by letting two replicas contain different data until you scan for this in your application code and detect it. Other systems (I think Cassandra) will attempt to merge things in the background after the partition is gone, but with strategies like "last write wins" or even versioning you can still get bad records. I am thinking of a system that can do whatever strategy you choose, or even just "last write wins" but warns you that a write on this key was lost. That way you can either ignore it if that's appropriate (some types of sensors for example, that frequently update values) or you can bubble this error up your application code until a user can fix it. On the other hand I definitely can see this as a performance issue so it might be more appropriate for a datastore that is aiming for high consistency while still allowing AP mode.

seiji · on Oct 9, 2014

Picking out the three points: not highly available, not consistent, and not partition tolerant.

[reverting to non-standard, dummy definitions below just to explicitly define some things]

Redis Cluster is available as long as (50% + 1) of the master nodes are reachable by each other. So, 15 masters will be more "available" than 3 masters if they are deployed across a sane network topology. Your masters will have replicas as well, so if a single master instance fails, a replica for that master will be elected to be the new master.

Redis Cluster is consistent in that each replica is an exact copy of its attached master (by default, each master owns N/16384 of the keyspace, where N is the total number of masters). You can lose writes if the master doesn't have time to replicate before the master fails, but that's the nature of async replication. (You can request your Redis client "wait" for at least N replicas to receive your commands before the command completes; this gives you some assurances.) It's possible for you to get an "inconsistent read" if you read from an async replica after writing to the master, but you have perfect read-after-write consistency if only talk to master instances.

Redis Cluster nodes on the minority side of a partition will, by default, deny reads and writes until they re-join the cluster. Redis Cluster prefers data integrity over availability by default since Redis Cluster has no merge operations. You can optionally allow the minority side to keep accepting commands, but that definitely will not guarantee any consistency of your data when the whole cluster reappears.

Given that better AP systems exist, ones that actually provide things like availability during partitions and eventual consistency mechanisms/guarantees, will Redis Cluster eventually support these features?

The one limitation for each of those features is: Redis is an in-memory database. Riak uses multiple KB of metadata for each object in the DB. With Redis, it's common to have tens of millions (or hundreds of millions) of keys on one server.

Availability during partitions with eventual consistency (if you want more than last-write-wins) requires CRDT-like things, which requires metadata, which requires more memory usage per-key.

Full and usable availability during partitions is nice to think about, and it would be great to one day have the option to maybe have in-memory Redis "better data consistency for smaller, manageable datasets" with CRDT merge operations so you get a Reis-dynamo type thing.

But, since Redis Cluster is already 4 years under development, it's not worth burning brain cycles on until real users start using Redis Cluster and we can see where the needs-vs-features plot falls. (plus, the issue backlog for other Redis development improvements is about 100 tasks long, so speculative improvements have to fall in line with the balance of Cluster vs. Standalone vs. Master-Replica vs. New Commands vs. Improving Existing Commands vs. Bug Fixes vs. User Contributions vs. Doc Updates vs. Evangelism vs. ...).

giaour · on Oct 9, 2014

That's the use case of everyone running MongoDB, so Redis Cluster should be able to find a significant user base.

vidarh · on Oct 10, 2014

> . This doesn't describe a lot of use cases.

It describes a huge proportion of the data I've ever worked with, because there are massive amount of use cases where what you are working with does not have to be (and as the systems scale: rarely is) the single source of truth for that data.

Redis beats memcached the moment you need more than a bunch of blobs.

E.g. our in-house capacity monitoring uses Redis for a ~1 hour time series view of our systems. We don't care about data loss - the odds of losing the data within an hour of an outage we need to care about is small enough to be justified and if that ever becomes a concern we can run two and split our updates between them. For long term storage, we migrate rolled up data to couchdb at the moment (doesn't matter what - we could use anything really; it's rarely queried and basically to let us get an occasional longer historical view to budget for resources etc).

We don't particularly care about integrity as long as it's "right most of the time" because the data gets constantly corrected, and precision in the averaged longer term numbers is not particularly important either (I want to be able to project when we run out of disk, for example - I don't care if disk usage is at 96% or 98%, as both are way too close for comfort).

Yet memcached is less attractive because it means far more book-keeping in the app. With Redis we encode part of the information in the keys (keys indicate which level of roll-up the data is at, the name of the data item, and the starting timestamp of the period the data is for) in a way that lets us easily retrieve a list of keys to do the roll-up. With Redis' Lua support we could probably do even more of the roll-up process in Redis itself, but I haven't looked at that yet.

There are a lot of apps like this, where your data volume makes hitting disk annoying (this replaced a Graphite based system - Graphite was thoroughly trashing an expensive disk array and still regularly getting too slow to us; the Redis based system uses ~10% of a much slower machine) and where the cost of losing some time window worth of data is low enough not to matter because it just means a blip in a data stream that is rapidly obsoleting itself.

jbinto · on Oct 9, 2014

I didn't get the context, so I looked it up:

http://aphyr.com/tags/Jepsen

> This article is part of Jepsen, a series on network partitions.

Covers MongoDB, redis, RabbitMQ, among others. All articles start with "call me maybe", thus the Jepsen reference.

rurounijones · on Oct 10, 2014

Would love to see Jepsen thrown at Apache Storm.

0xbadcafebee · on Oct 9, 2014

@antirez, what would you say is the most useful lesson you've learned throughout this process?

antirez · on Oct 9, 2014

Argument with people just with technical arguments, don't get into Twitter fights, and at the end of the day, use the feedbacks as a learning process, but do what you believe makes sense. So a social lesson, more than technological.

legedemon · on Oct 10, 2014

Thanks for this lesson! As much as I love using Redis, this learning will help me way more than Redis has done.

chrisan · on Oct 9, 2014

Anyone know how AWS clusters Redis?

I had alwas just assumed Redis released this a while ago and that is why AWS started to support Redis along with Memcache in their clustered ElastiCache product

http://aws.amazon.com/about-aws/whats-new/2013/09/04/amazon-...

ihsw · on Oct 9, 2014

That's failover -- essentially Redis Sentinel.

Redis Cluster on the other hand orchestrates multiple redis instances to function as one, thereby preventing the need for redis servers that are very large and preventing the need for presharding[1].

[1] http://blog.zawodny.com/2011/02/26/redis-sharding-at-craigsl...

toomuchtodo · on Oct 10, 2014

Redis Cluster appears to be similar to using memcached with Facebook's mcrouter [1].

[1] https://github.com/facebook/mcrouter

itamarhaber · on Oct 9, 2014

AFAIK (but I may be wrong here so if anyone knows better please feel free to correct me) the AWS cluster implementation is a single master and multiple read slaves. Failover is done manually with an API call and in an event of the master's failure, EC spins up a new (empty) master automatically before you can promote a slave in its place.

mfenniak · on Oct 9, 2014

I have very little evidence to support this, but they probably use twemproxy. twemproxy is a proxy for redis that automatically manages routing commands to multiple redis servers efficiently.

They definitely don't use redis cluster, because redis cluster requires explicit client-side support. That support is only available in a limited number of clients today.

seiji · on Oct 9, 2014

Nope, they don't (and neither should anybody else... twemproxy is weird).

mfenniak · on Oct 9, 2014

Oh, weird in what way? I was considering using it for horizontal redis scaling recently, under the impression that redis cluster was further away from release than it seems to be.

seiji · on Oct 9, 2014

The code is just strange. To support Redis, it has a 2,500 line Redis adapter. It has special cases for each Redis command because it doesn't extract key position information from Redis directly.

Plus, despite the name and where it's hosted, Twitter never used it in production for Redis. Redis support was just added "because we can."

It probably works for very direct use cases, but it's one of those things that could probably bite you very severely under edge cases, and you'd be left on your own to figure out what happened.

litewulf · on Oct 9, 2014

We run twemproxy over about a terabyte of data in Redis, and haven't had any problems...

zamio · on Oct 9, 2014

pinterest, snapshot, vine, wanelo use the twemproxy in prod for redis

seiji · on Oct 9, 2014

Well, companies also use ejabberd, but that doesn't make ejabberd a valid piece of software either.

Then there's the massive problem of "everyone" using MongoDB and we know how much pain _that_ sucker causes. Playing popularity contests with software isn't useful. People install software either because of social reasons ("I don't know how anything works, but I hear everybody is using X, so I'm going to use X too!") or because they need the functionality ("redis is AMAZING at 30 different things!").

If install base at well known companies were a valid measure, we'd all be using Windows laptops while talking into our Kin and squirting our Zunes all over each other.

dberg · on Oct 9, 2014

why is it weird ?

alphadevx · on Oct 9, 2014

Never used it, but as far as I know twemproxy does not support the select command to select another Redis database on the current connection, which would be a good indicator of it being in use at Amazon or not.

rem7 · on Oct 9, 2014

If I remember correctly they only have read replicas. I don't think they even have proper Fail over. You manually have to promote one of the read replicas to master... Feel free to correct me :)

jabo · on Oct 9, 2014

No, they do have automatic failover to your hot standby via Replication Groups.

brianpgordon · on Oct 10, 2014

> From the point of view of distributed databases, Redis Cluster provides a limited amount of availability during partitions, and a weak form of consistency. Basically it is neither a CP nor an AP system. In other words, Redis Cluster does not achieve the theoretical limits of what is possible with distributed systems, in order to gain certain real world properties.

@antirez Could you elaborate on what these properties are and your thinking behind why they're more important than AP?

itamarhaber · on Oct 9, 2014

Congratulations @antirez and the Redis community - this is a key step forward for this amazingly useful database!

thorway-redis · on Oct 10, 2014

Your communication with your users is wonderful. I'm grateful for it, and appreciate how much effort goes into it.

Thank you.

alphadevx · on Oct 9, 2014

Single-master Redis and replication overhead of current Redis architecture not ideal for getting Redis beyond the "it's a great cache" use case, so I'm been looking forward to this.

merlin83 · on Oct 9, 2014

I created a Fig environment in case anyone is interested in giving it a spin.

https://github.com/merlin83/rediscluster-docker

joshbaptiste · on Oct 10, 2014

Wonder is Antirez is considering starting a commercial Redis alternative providing commercial support a la Nginx.

antirez · on Oct 10, 2014

The world if full of possibilities... for now to be honest I focused a lot in the technical / community side, and the sponsorship model helped a lot with that. However for 10 years before starting this project I was self-employed and started two companies. So yes, doing companies is generally tempting for me, just in the case of Redis so far it was more an evolution of an hacking session, lasting 6 years.

jpgvm · on Oct 9, 2014

I am going to be honest, I am somewhat disappointed.

I was hoping for a CP system with sharding + replication.

Without which we can't have things like multi-key atomic operations etc...

Will wait and see though, maybe things are better than they seem.

antirez · on Oct 10, 2014

I'm pretty sure that if you are a Redis user, whatever you hate or like the current Redis Cluster design, what you never wanted, is a CP system. CP systems need agreement at every query so the latency and OPS/sec figures are not suitable for Redis-alike use cases. Actually with an existing CP system you can easily build the same Redis data structures since CP systems are linearizable so you can have a CP shell that internally runs a Redis kernel. Imagine the CP system is Raft, you just consider Redis your internal state machine, and every operation that is committed is applied to Redis just sending it the command.

jpgvm · on Oct 11, 2014

That is actually exactly what I wanted. :)

I have done something similar with Zookeeper recently (through I wrote it in Ruby so it's terribly slow).

You would be surprised how little CP stores are actually available. There is an abundancy of AP stores but very very few useful CP stores, especially ones with shared-nothing architecture.

You don't need for every query to traverse the consensus algorithm. It's sufficient to use the consensus algorithm to agree on the master shards for each chunk of data and have sync replication.

Considering Redis is designed for 100% in-memory workloads if you are operating on a low latency network this is more than fine. Especially because this behavior would only be required for writes, reads could be served from either the master or a sync replica (and in theory if you had configurable read consistency a possibly out of date async replica).

I guess what I am sad about is just the lack of CP stores that are useful right now, I probably just need to suck it up and start writing my own.

abluecloud · on Oct 10, 2014

>stopped and stopped