Why your AI-based monitoring startup will fail

Please don’t call me if you are selling a monitoring product that uses AI to do amazing things.

Since 1991 I have been contacted by salespeople from startups that use AI to make their monitoring system awesome.

The first couple of times I took their calls and set up a demo. I was always disappointed.

Now I just ignore these phone calls and emails.

None of those companies are in business any more. Why? Because they couldn’t make a product that is actually useful.

I’m going to let other fools waste their time with those demos. Me? I’ll believe it when I see one of these companies actually has enough successful customers to be profitable for 12 consecutive months. Then I’ll take a look.

I hate to be a hard-ass, but in those 27 years I haven’t seen an AI-based monitoring startup that succeeded. Yes, there’s been a lot of progress in AI in the last 10 years. Sadly most of that progress has been in things that look good to investors who don’t know any better.

AI-based monitoring is predestined to fail. Every 3-4 years I’ve been contacted by such a startup. They’ve all gone under.

Why do these companies fail?

None of them made products that actually helped sysadmins (or devops or SREs, etc.). They made products that did what an AI researcher thought would be useful.

By definition, an AI researcher does not have operational experience.

What they do invent looks good on paper, or looks good to a CEO, or an investor. None of those people understand what monitoring and alerting are really about.

Here are the claims I’ve heard:

“We can pinpoint where the problem is” – Look, if I need software to do that, I’ve designed my system wrong.

“We can determine who should be paged” – Whoa whoa whoa. Are there companies that just page one person for every problem and that person has to figure out who should actually get paged? That’s terrible. Each team should set up their own monitoring and alerting rules. How the hell did you get into that situation in the first place?

“We can predict failures. You don’t have to write alert rules any more.” – I’ll believe it when I see it. I can do that with linear algebra and a lot of compute power. Or, I can save money and just do good capacity planning.
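To make the "linear algebra" point concrete, here is a minimal sketch of the kind of capacity planning I mean: fit a straight line to recent usage and see when it crosses the limit. Everything in it is an assumption for illustration (the function names `fit_line` and `days_until_full`, the disk-usage numbers, the 200 GB capacity). A real system would want more data and probably something smarter than a straight line, but it is just least squares, no AI required.

```python
from typing import Optional, Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def days_until_full(
    days: Sequence[float], used_gb: Sequence[float], capacity_gb: float
) -> Optional[float]:
    """Days after the last sample until the fitted trend reaches capacity.

    Returns None when usage is flat or shrinking (the trend never hits capacity).
    """
    slope, intercept = fit_line(days, used_gb)
    if slope <= 0:
        return None
    day_full = (capacity_gb - intercept) / slope
    return max(0.0, day_full - days[-1])


# Hypothetical samples: one disk-usage reading per day, growing 10 GB/day.
sample_days = [0, 1, 2, 3, 4]
sample_used_gb = [100.0, 110.0, 120.0, 130.0, 140.0]
remaining = days_until_full(sample_days, sample_used_gb, capacity_gb=200.0)
print(remaining)  # 6.0 for this made-up data
```

Alert when `remaining` drops below however long it takes you to buy and install more disk. That is a one-line alert rule, written by the team that owns the disk.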

Look, to get real results the AI system would have to be bespoke for my system. You think you can really create a general AI solution? I doubt it.

Here’s why your AI-based monitoring startup is going to fail:

If I have a small system, I don’t have problems big enough for AI to be useful (or at least cost effective).
If I have a big system, components should be clearly defined so that I can route alerts properly without any fancy AI. Heck, why not just add a tag to each alert rule?
If I have one NOC for all of my services and they’re chasing their tail trying to figure out root causes and who is responsible for each service… you have a management problem. First of all, NOCs are bad and you are bad for having one (see my SRE book). Second, there’s no single root cause (cite: every speaker at every DevOps conference in the last 10 years). Third, if you don’t know who is responsible for each service, how the hell would an AI know better than you? Get a damn technical project manager to talk with each team and help them get their monitoring and alerting in order.
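The "just add a tag" idea can be sketched in a few lines. This is a hypothetical illustration, not any particular monitoring product's API: the `AlertRule` class, the `team` tag, the `ONCALL` table, and the rotation names are all made up. The point is only that routing becomes a dictionary lookup once every rule says who owns it.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class AlertRule:
    name: str
    expr: str   # the alert condition, in whatever rule language you use
    team: str   # the tag: which team owns this rule


# Hypothetical mapping from team tag to that team's paging rotation.
ONCALL: Dict[str, str] = {
    "storage": "storage-oncall",
    "web": "web-oncall",
}


def route(rule: AlertRule,
          oncall: Dict[str, str] = ONCALL,
          fallback: str = "unowned-alerts") -> str:
    """Return the rotation to page for a firing rule.

    Rules whose tag is unknown go to a fallback queue so that missing
    ownership shows up as a visible problem instead of a silent drop.
    """
    return oncall.get(rule.team, fallback)


rules = [
    AlertRule("DiskAlmostFull", "disk_used_ratio > 0.9", team="storage"),
    AlertRule("High5xxRate", "http_5xx_ratio > 0.05", team="web"),
    AlertRule("MysteryAlert", "something > 1", team="nobody"),
]

for rule in rules:
    print(f"{rule.name} -> {route(rule)}")
```

Anything that lands in the fallback queue is a rule nobody owns, which is exactly the conversation that technical project manager should be having with each team.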

Look, the bottom line is that no matter how good you are at AI, Google is better at it, and Google doesn’t even use AI in its monitoring system. They use a little linear algebra in their capacity prediction system and that’s about it. They use some machine learning for problems that are so big that only Google would have those problems. You are building a startup for a customer base that doesn’t exist: people with Google-sized problems who don’t have their own AI researchers.

If you work for an AI-based monitoring startup here’s my honest advice: Update your resume and start sending it out. You’ve joined the wrong startup.