AI Reducing False Positives in an InfoSec SOC

Some algebraic background

A false positive occurs when your system alerts users to an event or anomaly that does not exist. In other words, it is a classification error that causes unnecessary effort, and so it needs to be reduced.

In a Security Operations Center (SOC), analysts work on security events created by various information security (IS) controls. As the number of false positive events grows, SOC efficiency drops, because analysts spend their time resolving false events instead of working on the true positives.

False positives are inherently problematic in environments where the population of events has a low incidence rate, one that is well below the false positive rate. In that case false positives will generally outnumber true positives, even when the detector looks accurate on paper!

This is not intuitive at all, so consider an example. Assume an organization with 10,000 computers, a virus infection affecting 1% of them, and an antivirus that reports infections with a false positive rate of 5%:

  • The number of infected computers is 0.01*10,000 = 100 (the true positives, assuming the antivirus detects every infection).
  • The number of false positives (uninfected computers flagged as infected) is 0.99*10,000*0.05 = 495, i.e. the number of uninfected computers multiplied by the false positive rate.

So we get more false positives than true positives. Moreover, if we look at a case that the IMS reports to the SOC as positive, the probability that it is truly infected is only 100/(100+495) ≈ 17%, even though the antivirus is expected to give 95% accurate results!
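The arithmetic above can be reproduced with a few lines of Python. This is only a sketch: the function name and the `detection_rate` parameter are illustrative, and the default of 1.0 encodes the assumption that every real infection is detected.

```python
def positive_predictive_value(population, incidence, false_positive_rate,
                              detection_rate=1.0):
    """Return (true_positives, false_positives, ppv) for an alert source.

    detection_rate is an assumed parameter; 1.0 means every real
    infection raises an alert, as in the example in the text.
    """
    infected = population * incidence
    clean = population - infected
    true_positives = infected * detection_rate
    false_positives = clean * false_positive_rate
    ppv = true_positives / (true_positives + false_positives)
    return true_positives, false_positives, ppv


tp, fp, ppv = positive_predictive_value(10_000, 0.01, 0.05)
print(f"true positives: {tp:.0f}, false positives: {fp:.0f}, PPV: {ppv:.1%}")
# prints: true positives: 100, false positives: 495, PPV: 16.8%
```

Changing the incidence argument shows how quickly the share of real infections among the alerts falls as infections become rarer.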

Using supervised learning to reduce workload

SOC analysts record their decision on each case after the analysis. Ideally this field is mandatory and chosen from a closed list, so that it can easily be used as training data. We had to make do with an unstructured text field, and used regexes and other methods to extract whether each case was indeed a true or a false positive.
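As a rough illustration of the regex step, the sketch below turns a free-text closing note into a label. The keyword lists and the `label_case` function are hypothetical; the real phrases would have to come from reading the actual case notes.

```python
import re

# Hypothetical phrases, not the ones used in the project.
FALSE_POSITIVE = re.compile(
    r"\b(false[\s-]*positive|benign|not\s+relevant|no\s+threat)\b",
    re.IGNORECASE,
)
TRUE_POSITIVE = re.compile(
    r"\b(true[\s-]*positive|confirmed|infected)\b",
    re.IGNORECASE,
)


def label_case(note):
    """Return 'false_positive', 'true_positive', or None if the note is ambiguous."""
    is_fp = bool(FALSE_POSITIVE.search(note))
    is_tp = bool(TRUE_POSITIVE.search(note))
    if is_fp == is_tp:
        # Neither pattern or both patterns matched: leave for manual review.
        return None
    return "false_positive" if is_fp else "true_positive"


print(label_case("Closed as false positive, scheduled scan"))  # false_positive
print(label_case("Escalated to tier 2"))                       # None
```

Returning None for ambiguous notes keeps doubtful cases out of the training data. Negated phrases such as "not infected" would still need extra patterns.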

The next step was to choose an analytical environment. If the organization already has a data analytics platform, you can stay within it. Most such products, like Splunk or Tableau, by now have AI modules that are accessed from the UI and wrap tried and tested libraries like scikit-learn or Torch. In our case the available environment was Splunk, so we worked with it and avoided the tool selection process.

Choosing the model was rather trivial. You basically select from the existing classification algorithms in Splunk, run them on the test data, and choose the best one according to the confusion matrix. In our case the most accurate one was random forest.
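Outside Splunk, the same comparison can be sketched in plain Python. The helper names are illustrative, the model names and predictions are made up, and accuracy is only one assumed way of ranking confusion matrices.

```python
def confusion_matrix(actual, predicted):
    """Count outcomes for boolean labels, where True means 'real incident'."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if p:
            counts["tp" if a else "fp"] += 1
        else:
            counts["fn" if a else "tn"] += 1
    return counts


def accuracy(cm):
    """Share of cases the model classified correctly."""
    return (cm["tp"] + cm["tn"]) / sum(cm.values())


def pick_best(actual, predictions_by_model):
    """Return (name, confusion matrix) of the most accurate model."""
    scored = {name: confusion_matrix(actual, preds)
              for name, preds in predictions_by_model.items()}
    best = max(scored, key=lambda name: accuracy(scored[name]))
    return best, scored[best]


# Made-up held-out labels and predictions, for illustration only.
actual = [True, False, False, True, False, False]
predictions = {
    "model_a": [True, True, False, False, False, True],
    "model_b": [True, False, False, True, False, True],
}
print(pick_best(actual, predictions))
# prints: ('model_b', {'tp': 2, 'fp': 1, 'tn': 3, 'fn': 0})
```

With a low incidence rate, accuracy alone can flatter a model that rarely raises alerts, so it is worth reading the false positive and false negative counts directly as well.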

Once we had the model, we used it to provide the SIEM (Security Information and Event Management) system with a truthfulness score for each case, which is basically the estimated probability that the case is a true positive. This score was presented to the analysts so that they could choose and prioritize events accordingly.
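How the score helps with prioritization can be shown with a small sketch. The `Event` type, its field names and the sample scores are hypothetical; in practice the SIEM would carry the score as one more field on the case.

```python
from dataclasses import dataclass


@dataclass
class Event:
    event_id: str
    score: float  # estimated probability that the event is a true positive


def prioritize(events):
    """Order events so the most likely true positives are reviewed first."""
    return sorted(events, key=lambda e: e.score, reverse=True)


# Made-up events and scores, for illustration only.
queue = prioritize([
    Event("evt-1", 0.12),
    Event("evt-2", 0.87),
    Event("evt-3", 0.45),
])
print([e.event_id for e in queue])
# prints: ['evt-2', 'evt-3', 'evt-1']
```

Nothing is dropped here: low-scoring events stay in the queue and simply move to the back, which leaves the final decision with the analyst.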

Getting reinforcement learning for free

Reinforcement learning is a hyped area of machine learning because it is a general term for a self-improvement process. Your algorithm gets a reward per action; you feed the reward and the state back to the algorithm, and it does better next time.

We feed the system's recommendation at time t-1 into the prediction at time t. In our case, when the SOC analysts receive the AI prediction, check the event, and afterwards classify it as not relevant (i.e. a false positive), that verdict changes how similar events are scored next time, because it plays the role of the cost function.
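The feedback loop can be illustrated with a toy scorer. This is a deliberately simple stand-in, not the Splunk model: it keeps a running count of analyst verdicts per alert rule, and the class name, rule name and priors are all assumptions.

```python
from collections import defaultdict


class FeedbackScorer:
    """Toy per-rule estimate of P(true positive), updated by analyst verdicts."""

    def __init__(self, prior_true=1.0, prior_false=1.0):
        # Assumed smoothing priors: an unseen rule starts at a score of 0.5.
        self.true_counts = defaultdict(lambda: prior_true)
        self.false_counts = defaultdict(lambda: prior_false)

    def score(self, rule):
        t = self.true_counts[rule]
        f = self.false_counts[rule]
        return t / (t + f)

    def record_verdict(self, rule, was_true_positive):
        """Feed the analyst's decision back so the next score for this rule shifts."""
        if was_true_positive:
            self.true_counts[rule] += 1
        else:
            self.false_counts[rule] += 1


scorer = FeedbackScorer()
print(scorer.score("av_heuristic"))                  # 0.5
scorer.record_verdict("av_heuristic", False)         # analyst closed it as a false positive
print(round(scorer.score("av_heuristic"), 3))        # 0.333
```

Each false positive verdict lowers the score the analysts see for that rule next time. A real deployment would achieve the same effect by retraining the classifier on the growing set of labeled cases.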

Epilog

AI building blocks are getting bigger and better all the time. You can actually do quite well with little coding and only a basic understanding: those who implemented the procedures described here had no training in AI at all.

Keep it simple. Most of the gain comes from coarse work, using the Pareto principle and simple methods.

Utilise what you have to add value to the organization. Start small and pick the low-hanging fruit.