python - Machine learning for monitoring servers

I'm looking at using PyBrain to take server monitoring alarms and determine the root cause of a problem. I'm happy with training it using supervised learning and curating the training data sets. The data is structured something like this:

 * Server type **A** #1
   * Alarm type 1
   * Alarm type 2
 * Server type **A** #2
   * Alarm type 1
   * Alarm type 2
 * Server type **B** #1
   * Alarm type **99**
   * Alarm type 2

So there are n servers, each with x alarms that can be up or down. Both n and x are variable.

If server A1 has alarms 1 & 2 down, then we can say that service a is down on that server and is the cause of the problem.

If alarm 1 is down on all servers, then we can say that service a is the cause.

There can potentially be multiple options for the cause, so straight classification doesn't seem appropriate.

I would also like to tie later sources of data into the net, such as scripts that ping some external service.

All the appropriate alarms may not be triggered at once, due to serial service checks, so it can start with one server down and then another server down 5 minutes later.

I'm trying to do some basic stuff at first:

    from pybrain.tools.shortcuts import buildNetwork
    from pybrain.datasets import SupervisedDataSet
    from pybrain.supervised.trainers import BackpropTrainer

    INPUTS = 2
    OUTPUTS = 1

    # Build the network: 2 input, 3 hidden, and 1 output neurons
    net = buildNetwork(INPUTS, 3, OUTPUTS)

    # Build a dataset with 2 inputs and 1 output
    ds = SupervisedDataSet(INPUTS, OUTPUTS)

    # Add one sample: an iterable of inputs and an iterable of outputs
    ds.addSample((0, 0), (0,))

    # Train the network with the dataset
    trainer = BackpropTrainer(net, ds)

    # Train for 10 epochs
    for x in range(10):
        trainer.train()

    # Keep training until the error rate converges
    trainer.trainUntilConvergence()

    # Run an input through the network
    result = net.activate([2, 1])

But I'm having a hard time mapping variable numbers of alarms to a static number of inputs. For example, if we add an alarm to a server, or add a server, the whole net needs to be rebuilt. If that is something that needs to be done, I can do it, but I want to know if there's a better way.

Another option I'm trying to think of is to have a different net for each type of server, but I don't see how I can draw an environment-wide conclusion, since it will just make evaluations on a single host instead of all hosts at once.

Which type of algorithm should I use, and how do I map the dataset to draw environment-wide conclusions as a whole with variable inputs?

I'm open to any algorithm that will work. Go would be even better than Python.

This is a challenging problem, actually.

Representation of labels

It's difficult to represent the target labels for learning. As you pointed out,

> If server A1 has alarms 1 & 2 down, then we can say that service a is down on that server and is the cause of the problem. If alarm 1 is down on all servers, then we can say that service a is the cause. There can potentially be multiple options for the cause ...

I guess you need to list all the possible options, otherwise we cannot expect an ML algorithm to generalize. To make it simple, let's say you have only two possible causes of the problem:

 1. service problem
 2. server problem

Site-wise binary classifier

Suppose that the first ML model only has to distinguish between the two causes above. We are then working on a site-wise binary classifier. Logistic regression is a good place to start, since it is interpretable.

Finding out which server or which service has the problem can be a second step. To solve the second step, based on your example:

 * If it is a service problem, I think decision rules can be derived manually so that the service name can be pinpointed. The idea is that you should see a significant number of servers triggering the same alarm, right? See the advanced readings at the end for more options.
 * If it is a server problem, you can construct a second binary classifier (an individual server-side classifier), which runs on each server using only features coming from that server and answers the question: "Do I have a problem?"
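As a rough illustration of the site-wise classifier, here is a minimal logistic regression written in plain Python with batch gradient descent. The two features (fraction of servers with alarm 1 down, fraction of servers with any alarm down), the toy snapshots, and the names `train_logistic` and `predict_proba` are all assumptions made up for this sketch; in practice a library implementation would likely be a better choice.

```python
import math


def sigmoid(z):
    # Logistic function, written to avoid overflow for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(weights, bias, x):
    # Probability that a site snapshot is a "service problem" (label 1).
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train_logistic(rows, labels, lr=0.5, epochs=2000):
    # Fit weights and bias by batch gradient descent on the log loss.
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    m = float(len(rows))
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            err = predict_proba(weights, bias, x) - y
            for j, xi in enumerate(x):
                grad_w[j] += err * xi
            grad_b += err
        weights = [w - lr * g / m for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / m
    return weights, bias


# Hypothetical site-wide snapshots (made-up numbers):
# [fraction of servers with alarm 1 down, fraction of servers with any alarm down]
ROWS = [[0.9, 0.9], [0.8, 1.0], [1.0, 0.8], [0.1, 0.1], [0.0, 0.2], [0.2, 0.1]]
# 1 = service problem, 0 = server problem
LABELS = [1, 1, 1, 0, 0, 0]

weights, bias = train_logistic(ROWS, LABELS)
p_service = predict_proba(weights, bias, [0.95, 0.9])
p_server = predict_proba(weights, bias, [0.05, 0.1])
```

The learned `weights` can be read directly: a large positive weight on a feature means that a high value of that feature pushes the prediction towards "service problem", which is the interpretability argument for starting with logistic regression.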
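The manual decision rule for the service case could look something like the sketch below: flag every alarm that is down on a large share of the servers that monitor it. The function name `widespread_alarms`, the thresholds, and the shape of the input dictionary are assumptions for illustration; mapping an alarm name back to a service name is left out because it depends on your monitoring setup.

```python
from collections import Counter


def widespread_alarms(site_state, min_fraction=0.6, min_servers=2):
    """Return alarms that are down on at least min_fraction of the servers
    that monitor them. Alarms monitored on fewer than min_servers servers
    are skipped, since one host cannot show a site-wide pattern."""
    monitored = Counter()
    down = Counter()
    for alarms in site_state.values():
        for alarm, is_down in alarms.items():
            monitored[alarm] += 1
            if is_down:
                down[alarm] += 1
    return sorted(
        alarm
        for alarm, count in monitored.items()
        if count >= min_servers and down[alarm] / float(count) >= min_fraction
    )


# Hypothetical snapshot: server name -> {alarm name: True if the alarm is down}
site = {
    "a1": {"alarm1": True, "alarm2": False},
    "a2": {"alarm1": True, "alarm2": False},
    "b1": {"alarm99": True, "alarm2": False},
}
suspects = widespread_alarms(site)
```

Here `alarm1` is flagged because it is down on both servers that monitor it, while `alarm99` is ignored because only one server reports it, which points to a server problem rather than a service problem.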

Features for the site-wise binary classifier

I assume alarms are the best source of features. I guess that using summary statistics of the data as features would help the site-wise classifier more here. For example,

 * the percentage of servers on which alarm A is down
 * the average length of time, across servers, that alarm B has been down
 * across the servers on which alarm B is down, the percentage that also have alarm A down
 * ...
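The summary statistics listed above can be computed so that the feature vector keeps the same length no matter how many servers or alarms exist, which sidesteps the problem of rebuilding the model whenever a server is added. The sketch below assumes a hypothetical input format (server name mapped to the number of seconds each alarm has been down, with 0 meaning up) and averages the alarm B downtime only over servers where B is actually down; both choices are assumptions.

```python
def site_features(down_seconds, alarm_a, alarm_b):
    """Turn a variable-sized site snapshot into a fixed-length feature list:
    [share of servers with alarm_a down,
     mean downtime of alarm_b over servers where it is down,
     share of alarm_b-down servers that also have alarm_a down]."""
    n_servers = len(down_seconds)
    a_down = set(s for s, alarms in down_seconds.items() if alarms.get(alarm_a, 0) > 0)
    b_down = [s for s, alarms in down_seconds.items() if alarms.get(alarm_b, 0) > 0]

    pct_a = len(a_down) / float(n_servers) if n_servers else 0.0
    if b_down:
        avg_b = sum(down_seconds[s][alarm_b] for s in b_down) / float(len(b_down))
        pct_a_given_b = sum(1 for s in b_down if s in a_down) / float(len(b_down))
    else:
        avg_b = 0.0
        pct_a_given_b = 0.0
    return [pct_a, avg_b, pct_a_given_b]


# Hypothetical snapshot: server -> {alarm: seconds down, 0 means the alarm is up}
snapshot = {
    "a1": {"alarm1": 300, "alarm2": 120},
    "a2": {"alarm1": 0, "alarm2": 60},
    "b1": {"alarm99": 0, "alarm2": 0},
}
features = site_features(snapshot, "alarm1", "alarm2")

# Adding a server changes the values, not the number of inputs.
snapshot["a3"] = {"alarm1": 30, "alarm2": 0}
features_after = site_features(snapshot, "alarm1", "alarm2")
```

Each call returns three numbers, so a classifier trained on these features does not need to change shape when the environment grows.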

Features for the server-side binary classifier

You should explicitly use the alarm signals as features for the server-side classifier. However, at training time, you should use the data from all servers. The labels are just "has-problem" or "has-no-problem". The training data will look like:

    alarm A on, alarm B on, alarm C on, ..., alarm Z on, has-problem
    yes,        yes,        no,              yes,        yes
    no,         yes,        no,              no,         no
    ?,          no,         yes,             no,         no

Note that I used "?" to indicate that some alarms might have missing data (unknown state), which can be used to describe the situation below:

> All the appropriate alarms may not be triggered at once, due to serial service checks, so it can start with one server down and then another server down 5 minutes later.
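One possible way to feed the "?" state to a classifier is to give each alarm two numeric inputs: whether it is on, and whether its state is known at all. This is only one encoding among several, and the function name `encode_row` and the "yes"/"no"/"?" strings are assumptions taken from the table above.

```python
def encode_row(alarm_states, alarm_order):
    """Encode one server's alarms as a flat list of (is_on, is_known) pairs.
    "yes" -> 1, 1   "no" -> 0, 1   "?" or missing -> 0, 0"""
    encoded = []
    for alarm in alarm_order:
        state = alarm_states.get(alarm, "?")
        if state == "yes":
            encoded.extend([1, 1])
        elif state == "no":
            encoded.extend([0, 1])
        else:
            # Unknown state: the check has not run yet or returned no data.
            encoded.extend([0, 0])
    return encoded


ALARM_ORDER = ["A", "B", "C"]  # fixed column order shared by all servers
row = encode_row({"A": "?", "B": "no", "C": "yes"}, ALARM_ORDER)
partial = encode_row({"A": "yes"}, ALARM_ORDER)  # B and C not checked yet
```

With this layout a server whose checks have not all run yet still produces a full-length row, and the model can learn to treat "not known yet" differently from "known to be off".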

Some advanced readings

This problem is related to a few topics, e.g., alarm correlation, event correlation, and fault diagnosis.
