Sentinel Health Monitoring

Jake Quirke, Senior Security Consultant.

Introduction

Logs are the bread and butter of a Security Operations Centre (SOC). An absence of logs means no visibility, and no visibility means analysts have nothing to analyse. Without analysis, what was once the final bastion of defence against the Dark Arts becomes a very expensive hole into which you pour money.

We recently had an issue in one of our clients’ Azure subscriptions where, while reviewing the beautifully designed Overview blade in Azure Sentinel, an analyst noticed a drastic reduction in the number of logs being received. With the aforementioned ‘logs are bread and butter’ premise in mind, this caused some consternation amongst the SOC team. After a bit of digging, we discovered it had been caused by a rolling restart of our domain controllers. A quick restart of some MMA agents and we were away, receiving logs again.

However, given it had taken most of the day for this to be noticed, the scenario highlighted an interesting and urgent problem.

How do you detect missing logs in Sentinel?


The solution…

Sentinel has lots of fancy data analysis features, but nothing to tell you when you are no longer receiving data to analyse. In some ways, this makes sense. Depending on their architecture, organisations won’t all be receiving logs every minute of every day, and they won’t all have the same usage patterns. For example, one company I worked for finished at 12pm on Fridays, which inherently created a significant drop in the volume of data received. So there’s not really a ‘one-size-fits-all’ approach to monitoring log volumes. Nevertheless, I set out to address (or at least mitigate) the issue.

There are two scenarios where we want to be alerted:

  1. Complete loss of logs from a given log source
  2. Severe reduction in logs from a given log source

Logs in Sentinel are stored in Log Analytics tables, and there are (broadly) two kinds of tables:

  1. Tables populated from a single log source/connector
  2. Tables populated from multiple log sources/connectors

Examples of the former include SigninLogs, which is populated from the Azure AD connector, or the OfficeActivity table, which is populated by the Office 365 connector. Examples of the latter include SecurityEvent, which is populated by the MMA agent on each monitored machine, or CommonSecurityLog, which can be populated by various firewalls and other sources sending logs in Common Event Format (CEF). This is important to be aware of, as just looking for a loss of logs into a table doesn’t address the possibility of a single event source going down.

I figured I’d be able to address the requirements for complete and partial loss of logs in a single query, but detecting loss of events from tables populated by multiple sources would require logic bespoke to each table. With that in mind, I decided to tackle the two kinds of tables separately.
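As a taste of what that bespoke logic looks like, a minimal sketch for SecurityEvent might group the count by the reporting host rather than by table (CommonSecurityLog could similarly be grouped by DeviceVendor and DeviceProduct); the one-hour window here is just a placeholder. Note that, exactly as with tables, a host that has gone completely silent simply won’t appear in the results, which is the same zero-count problem tackled below.

// Per-source view of a multi-source table: events per reporting host in the last hour
SecurityEvent
| where TimeGenerated >= ago(1h)
| summarize LogsLastHour = count() by Computer
| order by LogsLastHour asc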

When writing KQL queries, it is often useful to break the query down into a series of problems (or, if you prefer, challenges) to solve.

The first problem was one of convenience: how to easily search and count logs across all tables?

The obvious answer is to use the aptly named search operator, which does exactly that, and exposes a column named “$table” to show which table each result was found in.

search *
| summarize count() by $table

This will count all logs per table in your Log Analytics workspace, for the time period set in the time range picker above the query editor. This is fine for ad-hoc checks, but analytic rules run on a schedule without that picker, so I added a timespan variable and included it in a where clause.

let lookback = 1h;
search *
| where TimeGenerated >= ago(lookback)
| summarize count() by $table

This query shows us how many logs have been received per table within the last hour. However, if a table has not received any logs in the last hour, it will not appear in the list, as there was nothing to count and therefore no table name to summarise against. So, the next question to answer is: how do we show a zero count?

The solution is to get a list of the tables you want to check, and then join our counts onto that result set. In some flavours of Kusto you can run a statement like ‘.show tables’, but simply listing every table doesn’t work for our purpose: what if a table is no longer being populated due to a change in technology, but is being kept for compliance purposes? We wouldn’t want to generate an alert for that table each time the detection runs. To address this, I decided to look at the same hour a week ago, which would also feed into the second requirement: tracking a severe reduction in logs.

Incorporating this into our current query, we get:

let lookback = 1h;
let Tables = search *
| where TimeGenerated between (ago(7d + lookback) .. ago(7d))
| summarize count() by Table=$table;
let lastHour = search *
| where TimeGenerated >= ago(lookback)
| summarize count() by Table=$table;
Tables
| join kind=leftouter lastHour on Table
| project Table, LogsLastWeek=count_, LogsLastHour=coalesce(count_1, 0)
| where LogsLastHour == 0

Now we have a query that will inform us of any tables that received logs during the same period last week but didn’t receive any in the previous lookback period, which satisfies requirement 1.

To account for severe reduction, we need a baseline to compare against. We already have the same lookback period from the previous week, but it would be better to compare against an average value for that period. To do so, we can adjust our first query and our variable setup to give us a better sample of data and make our analysis more accurate. We’ll also need a standard deviation to tell us how much the log volume normally varies, and a modifier to adjust how far from the average we can deviate before triggering an alert. Typically, a value that does not fall within three standard deviations of the average is considered statistically anomalous, but depending on your environment you may wish to be alerted before that point.

Additionally, we found that due to the sheer volume of logs in some tables, using the standard deviation to detect anomalous values generated many false positives. We eventually settled on adding a criterion that log volumes should also fall below a given percentage of the average log volume.
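To make the two criteria concrete, here is a small hypothetical worked example (the numbers are invented purely for illustration) applying the same logic as the final query below, with stdMod = 1.5 and a 2% threshold:

// Hypothetical numbers: a table that normally averages 100,000 logs with a standard deviation of 20,000
print average = 100000.0, dev = 20000.0, CountToday = 1500
| extend deviation_threshold = max_of(average - (dev * 1.5), 0.0)
| extend percentageOfAverage = CountToday / average
| extend alert = (percentageOfAverage <= 0.02 and CountToday <= deviation_threshold) or CountToday == 0

With an average of 100,000 and a standard deviation of 20,000, the deviation threshold is 70,000 and 2% of the average is 2,000, so a period with only 1,500 logs trips both checks. The final query additionally requires average > 999, which keeps low-volume tables out of this statistical comparison altogether.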

Here is the final, fully commented query:

// ----- VARIABLES -----
// What size period (in hours) to analyse
let binTime = 3h;
// Multiplier for the standard deviation threshold. Set to 0 to ignore std dev.
// As a guide, values that don't fall within 3 standard deviations of the average are considered statistically anomalous
let stdMod = 1.5;
// Get the end of the previous hour. E.g. if it is 7:03pm, the value will be 6:59:59.999 PM
let end = now() - (totimespan(strcat("00:", format_datetime(now(), "mm:ss.ffff"))) + 1ms);
// Offset from the end of the previous hour by our lookback period.
let start = end - (time("00:59:59.999") + (binTime - 1h));
// Tables that are omitted from the health check for various reasons (not expected to receive a consistent volume of logs)
let excludedTables = dynamic(['DnsInventory', 'SecurityBaseline', 'SecurityBaselineSummary', 'SecurityAlert', 'SecurityDetection', 'WindowsFirewall', 'ThreatIntelligenceIndicator']);
// The threshold for triggering the loss of logs, as a percentage / 100 (e.g. 0.02 = 2%)
let threshold = 0.02;
// Get tables and log volumes for the same time period over the last week, then work out the average and standard deviation
let Tables = find where TimeGenerated between ((start - 8d) .. (end - 1d))
    | extend Bin=bin_at(TimeGenerated, binTime, (start - 8d)), Table=source_
    | where TimeGenerated between ((start - 8d) .. (end - 1d)) and hourofday(start) == hourofday(Bin)
    | where Table !in (excludedTables)
    | summarize Logs=count() by Bin, Table
    | summarize average=avg(Logs), dev=stdev(Logs) by Hour=hourofday(Bin), Table;
// Get log volumes for the time period just passed
let lastPeriod = find where TimeGenerated between (start .. end)
    | where TimeGenerated between (start .. end)
    | extend Bin=bin_at(TimeGenerated, binTime, start), Table=source_
    | summarize today=count() by Hour=hourofday(Bin), Table;
// Join Tables and lastPeriod together
Tables
| join kind=leftouter lastPeriod on Hour, Table
| extend CountToday=coalesce(today, 0)
// Calculate how many logs we have received as a percentage of the average, and the deviation threshold.
// The deviation threshold is the mean average minus the standard deviation multiplied by the modifier, floored at 0.
| project Table, Hour, CountToday, average, percentageOfAverage=CountToday/average, dev, deviation_threshold=iif(average - (dev * stdMod) < 0, toreal(0), average - (dev * stdMod))
// Skip the statistical checks for tables that receive a small volume of logs, but always alert if today's count is 0.
| where (percentageOfAverage <= threshold and CountToday <= deviation_threshold and average > 999) or CountToday == 0

 

A reliable and effective solution

In this query, we have used the find operator as opposed to the search operator. This is because, at the time of writing, the search operator is not supported in analytic rule queries.
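For reference, the earlier ad-hoc per-table count translates to find almost directly; the main difference is that find exposes the originating table in the source_ column rather than $table:

// find equivalent of the earlier "search *" count, usable inside an analytic rule query
find where TimeGenerated >= ago(1h)
| summarize count() by Table = source_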

This query is by no means a perfect solution, but from our testing over the last six months it has proven reliable and has significantly improved our response time to feed outages.  

Need help with Azure Sentinel?

At Bridewell, we are always looking to improve our SOC and Security Orchestration, Automation and Response (SOAR) offering by finding gaps in the toolset and taking on challenges like this one to close them.

If you or your business are struggling with automated responses to security events, or perhaps you find the day-to-day demands of operating a SOC are hindering your ability to develop a more streamlined and effective solution, then we will be more than happy to help.

You can reach the team on 03303 110 940 or via bc@bridewellconsulting.com
