The authors say that the colors correlated to similar implementations that result in similar behavior, with red specifically indicative of passing all tests. (I totally agree with you that green would have been a much more logical choice). Which only leaves green and blue. I'm curious what the distinction is between those implementations (e.g, blue passed some of the tests, green didn't pass any tests).