
I think the machine learning aspect is the wrong choice here. I see a more obvious answer, one that can be implemented fairly quickly: integral analysis.

The servers with accumulating errors aren't always the ones with the highest spikes, as shown in the link. They do, however, consistently sit above the others in total error count. So the approach is this:

Since your time series is quantized, the integral is simply a sum over the time window. I'd recommend multiple window sizes, e.g. 10 seconds, 3 minutes, 1 hour, and 12 hours. Compute this for every server.
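As a minimal sketch of that windowed-sum step, assuming a quantized error-count series with one sample per second (the data, window sizes, and the helper name `windowed_error_area` are illustrative, not from the original setup):

```python
# Sketch: the "integral" over a quantized series is just a trailing sum.
# Assumes one error-count sample per second (hypothetical data below).

def windowed_error_area(errors, window):
    """Sum of error counts over the trailing `window` samples."""
    return sum(errors[-window:])

# Hypothetical per-second error counts for one server.
errors = [0, 1, 0, 3, 2, 0, 0, 5, 1, 0, 2, 4]

# Windows in samples; at 1 sample/second, 10 samples = 10 seconds.
windows = {"10s": 10, "5s": 5}
areas = {name: windowed_error_area(errors, w) for name, w in windows.items()}
print(areas)  # → {'10s': 17, '5s': 12}
```

Running this per server, per window size, gives you one "error area" number per (server, window) pair to compare across the fleet.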

Now you have, for each window size, one error-area value per server, normalized with respect to time. Compute the mean and standard deviation of all servers' error areas, then, to reduce CPU load, run a modified Thompson tau test only on values beyond +2σ for outlier detection. Under a normal assumption, roughly 97.7% of values fall below +2σ, so you'd only be inspecting the ~2.3% of your data above it.
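A sketch of that cross-server screen, using the simpler +2σ cutoff rather than the full modified Thompson tau test (which additionally needs a Student-t critical value, e.g. via `scipy.stats.t.ppf`, and rejects points iteratively); server names and error areas are hypothetical:

```python
# Sketch: flag servers whose windowed error area sits beyond n sigma above
# the fleet mean. A full modified Thompson tau test would use a t-based
# cutoff and re-test after removing each outlier; this is the plain screen.
import statistics

def flag_outliers(areas, n_sigma=2.0):
    """Return servers whose error area exceeds mean + n_sigma * stdev."""
    values = list(areas.values())
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)  # sample standard deviation
    cutoff = mean + n_sigma * stdev
    return [name for name, v in areas.items() if v > cutoff]

# Hypothetical error areas for one window size across ten servers.
areas = {"web-01": 12, "web-02": 15, "web-03": 11, "web-04": 14,
         "web-05": 13, "web-06": 12, "web-07": 14, "web-08": 11,
         "web-09": 13, "web-10": 90}
print(flag_outliers(areas))  # → ['web-10']
```

Note that a single extreme value inflates the standard deviation, which is exactly why the modified Thompson tau test removes the worst point and re-tests; with very few servers, a lone bad machine can hide inside the ±2σ band.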

When a server fails an outlier test (i.e. is flagged as an outlier), its historical data can also be investigated. Is the machine getting gradually worse? If so, it could be removed from service until memory and CPU tests can be run. You can also run anomaly detection on the number of outlier flags per hour; crossing a threshold could indicate a faulty machine.
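The "gradually getting worse" check can be sketched as a least-squares slope over a flagged server's recent hourly error areas; the history data and the slope threshold here are illustrative assumptions:

```python
# Sketch: fit a least-squares line to hourly error areas and inspect the
# slope. A persistently positive slope suggests gradual degradation.

def error_trend(hourly_areas):
    """Least-squares slope of error area vs. hour index."""
    n = len(hourly_areas)
    xs = range(n)
    mean_x = (n - 1) / 2
    mean_y = sum(hourly_areas) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, hourly_areas))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

# Hypothetical 8-hour history: error areas creeping upward.
history = [10, 12, 11, 14, 16, 15, 18, 21]
slope = error_trend(history)
if slope > 1.0:  # illustrative threshold: >1 extra error per hour
    print("candidate for removal and hardware test")
```

On this history the slope works out to about 1.44 extra errors per hour, so the machine would be pulled for testing under the assumed threshold.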

My answer also assumes the machines receive equal load. I'm sure that's not the case, in which case you'll have to normalize: errors per request. This shouldn't be a problem.
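A tiny sketch of that normalization, assuming request counts are available per server over the same window (all numbers hypothetical):

```python
# Sketch: normalize raw error counts by request volume so lightly loaded
# but unhealthy machines aren't masked by busy, healthy ones.
errors = {"web-1": 40, "web-2": 400}
requests = {"web-1": 1_000, "web-2": 100_000}
error_rate = {s: errors[s] / requests[s] for s in errors}
print(error_rate)  # → {'web-1': 0.04, 'web-2': 0.004}
```

Here web-2 has 10× the raw errors but web-1 has 10× the error rate, which is the signal the integral analysis should actually run on.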


