EPIC KQL Crash Course!

EPIC Kusto Query Language (KQL) Crash Course!

There is a 4+ hour version of this on YouTube. I encourage you to check it out.

The point of this is to introduce you to Kusto Query Language, or KQL. This isn’t a 100% coverage course on how every last corner of the language functions. Rather the intention is just to get you to a place of a reasonable degree of functional fluency that should allow you to start profiling the behavior of some system which is generating data in a store you can query with KQL. This probably shouldn’t be your first exposure to programming, but who am I to stop an eager learner.

My background with KQL and also similar query languages is mostly in an investigative capacity, troubleshooting problems by way of telemetry. My aim is to be general enough in my approach though that this will be a suitable introduction regardless of your use-case.

With programming languages in general, but especially with these types of query languages, I think it's healthiest to approach them as you typically would a Role Playing Game, or RPG. In such games, there are different categories of skillsets, and usually you aim to level up all of the categories of skills at approximately the same rate for a balanced gameplay experience.

If I were to draft a list of such skill categories for KQL, I’d probably make it like this:

  1. Language Recognition/Manipulation
  2. Table Manipulation
  3. Statistics
  4. Domain Knowledge

I’m going to try to keep this curriculum balanced so you develop each of the first three skillset categories to a working fluency. In the real world, domain knowledge about the business meaning of the data sets you’re querying will also be useful to you.

I advise you to follow along in the Azure Data Explorer Help Clusters.1 There are a bunch of test data tables in there provided by Microsoft, and in some cases I’ll provide fake tables of data I conjured up in a smattering of different scripting languages. It’ll be easier to understand the example queries I’m giving if you run them so you can see how tables are transformed rather than just imagining.

KQL is a language whose focus is transforming tables. The general idea is to provide a pipeline of commands, where the table resulting from the previous command is given as input to the next command in the pipeline. The operations you define in your commands are often performed on each row in the table at that point in the pipeline before moving on to the next command.
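
Here's a tiny preview of what that looks like (both commands are explained in the next section). The first command hands a 100-row table to the second, which then counts those 100 rows rather than the whole table:

// help/SampleLogs
TraceLogs
| take 100
| count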

Filtering Data

The first step to success with KQL is not grabbing the whole table, especially if the data you’re querying is absolutely massive in volume. This is also colloquially known as getting your head in the game. Try to think of yourself like a hunter, not one of those deep sea trawlers, okay? You’re like a detective hot on the very heels of a demented serial killer, not a band of hopeless traffic cops doing a dragnet by setting up roadblocks all over town. Seriously, get a grip!

There are two general ways to approach the problem of not grabbing an entire table at once.

  1. Based on a set of conditions regarding the contents of the rows in the table.
  2. NOT based on the contents of the rows in the table.

Important distinction.

The first skill you should learn is limiting the results of your search regardless of the nature of the table’s data. There are a few ways to do this. A common time that you might use this skill is during your first encounter with a table you haven’t used before; if you haven’t seen the table before, you probably need to get a sense of what the records are like before writing more detailed queries. So you start by using the following 2 strategies to constrain the amount of data you’re pulling from the table.

Content-Ignorant Filtering

take or limit

take and limit are synonyms and do the same thing. This is the command in this category I use most commonly. You just give it an integer and in return it gives you that number of records from the table. Even if the table doesn't change in between subsequent runs of this command, you might not get the same set of records each time you run it. If I've never used a particular table before and just need a quick glance at what kind of things might be inside of it, I'm probably going to use this to return a small constant quantity of records to have a gander.

// help/SampleLogs
TraceLogs 
| take 10 

rand()

Let’s say instead of a constant number, you need a slightly more relative way of returning a limited quantity of records from the table. You want to limit how much data you’re grabbing, but rather than having to specify a literal number of rows you’d instead prefer to specify an approximate percentage of the records from the table.

The rand function as called below returns a random (I'm assuming pseudo-random) number in the range of 0.0 to 1.0. The where expression will run once for each record in the table (also don't take that kind of statement too literally; while that may be the semantic meaning we are expressing in the query, we shouldn't assume too much about how this meaning is implemented behind-the-scenes) and generate a "random" number in that range; if the number produced is less than the literal numeric threshold we provided, then and only then will the record be included. Since this strategy involves chance, expect the number of records returned to vary a bit around the target percentage from run to run.

Make sure you do the back of the envelope math first before running such a query. To do that, you can run a query like the following first to assess how many records are in the table:

// help/SampleLogs
TraceLogs | count

Let’s say that returns 158K records and you’d be okay with grabbing 1.5K or so records from the table. Maybe then you’d try to sample like this:

// help/SampleLogs
TraceLogs
| where rand() < 0.01

Content-Aware Filtering

We now enter the domain of content-aware filtering. I made up that term just like content-ignorant filtering :D

We’ll go over more of the comparison operators later, but here are a few examples of KQL queries utilizing where conditions:

// help/SampleLogs
TraceLogs
| where Component startswith "INGESTOR_EXECUTER"
| take 10
// help/SampleLogs
TraceLogs
| where ClientRequestId == "5a848f70-9996-eb17-15ed-21b8eb94bf0e" and Component startswith "INGESTOR_EXECUTER"
| take 10

In this case you’re basically going to be using the where command to retrieve only the records in the table that meet a set of conditions you specify. If you’ve used languages like SQL before, you’ll probably be somewhat familiar with this part.

The thing you really have to be strictly careful of is avoiding the narrow set of situations where your where condition could inadvertently force a full table scan. I mean this is computer science after all, so you should always be considering how to avoid linear searches. In KQL with the where command, the main situation where this could happen is when you write a where condition that is comparing two columns to one another (as opposed to one column to a literal value as in the conditions above). The docs mention that situation as one which can’t use any indexes and so will force a full table scan. If you find yourself unable to avoid this situation, the advice given is to always place that where condition last in your sequence of where conditions, so that it is scanning the smallest table size possible.

The real thing you’re going to concern yourself with far more often with where clauses in KQL was just hinted at; it’s going to be putting your most restrictive filters as early as possible in the where chain. This means if you have columns A and B, and you have conditions pertaining to both of them, then the most sensical order for specifying those conditions is to put the condition first that tends to return fewer records.

My Most-Used where Comparison Operators: ==, startswith, and has

Of course I use other operators, but these are the most common. Each one does come with a corresponding negated version: !=, !startswith, !has

This2 is the best summary in the docs of the string comparison operators. As the docs mention, prefer case-sensitive operators wherever possible for performance; but if it’s your first encounter with a table and you don’t have a good handle on the kinds of values in it yet you may want to start with case-insensitive operators and try converting to case-sensitive later.
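
For instance, on a first pass you might use the case-insensitive =~ operator while you're still learning a column's values, then tighten it up to == once you know the exact casing. A quick sketch; assuming the column only ever uses the uppercase spelling, both queries return the same kind of rows:

// help/SampleLogs
TraceLogs
| where Component =~ "ingestor_executer"   // case-insensitive while exploring
| take 10

// help/SampleLogs
TraceLogs
| where Component == "INGESTOR_EXECUTER"   // case-sensitive once you know the casing
| take 10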

Let’s go through some examples involving some of these operators on the TraceLogs table.

Here we’re using a positive prefix filter and a negative suffix filter to try to find a few rows where the Component column starts with INGESTOR but ends with something besides “EXECUTER”. You should find a sample of rows where the Component columns contains values like: “INGESTOR_GATEWAY”

// help/SampleLogs
TraceLogs
| where Component startswith_cs "INGESTOR" and Component !endswith_cs "EXECUTER"
| take 10

There is a limit to prefix and suffix searching though. Sometimes you need a slightly more stateful approach to filtering.

For that, you can use matches regex. Just use it mindfully, as if it’s not needed you will probably pay a price for using it performance-wise. By the way, in case the regex I’m going to show you goes over your head, don’t feel bad; while I’m breezing through this now, there is a full intro to regex much later on, so consider this just a sneak peek.

Let’s say you want to pull only those records where the Node name ends in a number between 700 and 999 inclusive. For that you could search with a regex pattern like this:

// help/SampleLogs
TraceLogs
| where Node matches regex "^Engine[0]*[7-9][0-9]{2}$"

This pattern in natural language means each value in the Node column, from beginning to end, should look like this: the literal text Engine, then zero or more 0 characters, then one digit from 7 through 9, then exactly two more digits.

If you need more complicated logic pertaining to the Engine numbers, you may want to extract the number into its own column and continue filtering.
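
Here's a hedged sketch of that idea, using the extract function (covered properly in the regex section later) to pull the number into a new EngineNumber column and then filter on it numerically:

// help/SampleLogs
TraceLogs
| extend EngineNumber=toint(extract("^Engine0*([0-9]+)$", 1, Node))
| where EngineNumber between (700 .. 850)
| take 10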

Let’s do one more example.

// help/SecurityLogs
OutboundBrowsing
| where url contains_cs "vertically-binds"
| take 10

This was one of the few cases I could legitimately find for contains instead of has. contains is supposed to be about matching any substring, while has is supposed to be about matching KQL’s interpretation of alphanumeric substrings. In theory that makes it sound like contains should be more powerful, right? Maybe in theory, but in practice, I find KQL’s contains support for matching punctuation to behave counterintuitively so I prefer to stick to the realm of has until I need to make the jump to full-on matches regex. That’s just personal preference though.
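
To show what I mean about terms: has matches whole alphanumeric terms, and values like "INGESTOR_EXECUTER" get split on the underscore, so the following sketch should match both the EXECUTER and GATEWAY varieties:

// help/SampleLogs
TraceLogs
| where Component has "INGESTOR"
| take 10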

Matching Based on Set Membership

Sometimes, you’ll want to accept a record if the value in a particular column is a member of a particular set or list. There are a few ways.

The first way is simplest, where you provide a literal list of values (such as an array of strings), indicating that a record may be accepted if the value in the column you specify is somewhere within that list of strings. Note that the in operator has case-insensitive variants.

In this next case, we’re looking for a parent_process_name equal to (case insensitively) cmd.exe and with a process_name which is either chrome.exe or excel.exe (also case insensitive). You don’t have to make your list span multiple lines, I just like making lists that way to make reading the code friendlier to those on mobile devices.

// help/SecurityLogs
ProcessEvents
| where parent_process_name =~ "cmd.exe" and process_name in~ (
    "chrome.exe",
    "excel.exe"
)
| take 100

That’s the simplest way of matching based on set membership. It’s more of a static approach.

You can also do this but more dynamically. With the dynamic approach you might have your list populated with a table from another query. I guess you might call it a subquery. The subquery produces a table, then you check if the value in the column you specify is in that table produced by the subquery.

BUT, before we get there, I have to say at this point without teaching you some other essential KQL tools, the examples of that are going to be extremely contrived. So instead we’re going to shift gears now and talk about two other topics for a while before returning to this.

Producing Statistical Summaries

By now you’ve probably realized, cool we know how to filter and pare down a set of records, but we’re still humans. We’re not robots, are we? So the records, as humans, we have to read them one-by-one, right? It’s time-consuming. Our time is valuable. Also, as awesome as we are, we’re prone to fallacious thinking, falling prey to anecdotes, etc. Now that you have a set of records, you probably want to comprehend them in a few different ways. So this is where you start using KQL as a tool to understand things in a more statistically significant way. This is much faster and more accurate than just reading through a load of log files manually, just with the caveat that you have to have a sense of how to do it.

The main way you do this in KQL is with the summarize command. In other languages, it might be called other things, like stats for example.

The first thing you need for the summarize command (after your filtered record set you’re feeding it) is what’s called an aggregation function. An aggregation function is typically given a table, and sometimes a specific column from that table in particular, and in some valuable way aggregates the values for you.

Let’s go through some of these functions.

count

This function counts the number of records in the table piped to summarize. There are basically two ways to use it.

On its own, it does exactly what I just said above.

// help/SampleMetrics
TransformedServerMetrics
| where MetricType == "sqlserver_performance"
| summarize count()

When I run that I get a table with one column called count_, and a single value in that column is 259,364,288. If you wanted the column to be called something else, such as Count, you’d change it to this:

// help/SampleMetrics
TransformedServerMetrics
| where MetricType == "sqlserver_performance"
| summarize Count=count()

The other way to use this function is with a by clause. This is where you need to start thinking in terms of each column's cardinality, which is a measure of how many unique elements it contains. This is important to consider because with a by clause the count function will run once for each unique value contained in the column you specify (or combination of unique values for multiple columns).

Here’s the same kind of query, but where we’re producing a count per DBName.

// help/SampleMetrics
TransformedServerMetrics
| where MetricType == "sqlserver_performance"
| summarize PerformanceRecordCount=count() by DBName

This gives us a table with a record count for each unique value in the DBName column.

You can also specify more than one column name in the by clause, which means every unique combination of the columns you specify. Just be wary of combinatorial explosion!

// help/SampleMetrics
TransformedServerMetrics
| where Host == "adx-vm"
| summarize count() by MetricType, DBName

So how can you preemptively assess the cardinality of a column? That’s the next aggregation function, dcount.

dcount

This is a distinct counting function. There’s a similar one called count_distinct, but I usually just use dcount. Instead of giving you the overall number of records in general in the table passed to it, it gives you the number of unique values contained in the column you specify.

// help/ContosoSales
Customers
| summarize UniqueCountryNames=dcount(RegionCountryName)

This function is a key tool for assessing the cardinality of data in tables. It’s valuable because there are many times when you’re very much interested in avoiding producing way too many combinations of things, which is easier to do than it sounds.

sum

It’s important to understand the difference between count and sum. These two functions are not the same. The sum function is for when you have a column that contains numerical values, and you want to produce a sum of those numbers. This is very different from counting the records in a table.

Let’s say you had a table with a single column and 3 records, where each value was the same value of 1,000,000. The count function would give you a value of 3, whereas the sum function would give you a value of 3,000,000. So, important to remember the difference.

Here’s an example:

// help/ContosoSales
SalesFact
| summarize TotalSalesRevenue=sum(SalesAmount), TotalSalesCost=sum(TotalCost)

We’re so close to understanding how much profit ContosoSales business has made! Soon enough we’ll discuss that too.

For now, let’s split these sums out by ProductKey to see which are the best and which are the worst.

// help/ContosoSales
SalesFact
| summarize TotalSalesRevenue=sum(SalesAmount), TotalSalesCost=sum(TotalCost) by ProductKey

Based on this it’s clear we have one ProductKey that Contoso needs to stop selling immediately! Can you figure it out? If not, we’ll revisit the question in the section on Adding Columns.

avg

This is perhaps the most used and also most abused function in statistics. People use the average function as though it is some kind of holistic summary of a list of numbers. While it can contribute to giving you a greater understanding of a list of numbers (compared to, for example, just scanning through the list), a proper summary of a set of numbers will also consider other statistical aggregation functions, but that’s a different statistics lesson I think.

Here’s how you can produce an average:

// help/ContosoSales
SalesTable
| summarize avg(SalesAmount) by Country

The French must be smart shoppers!

max and min

These give you the biggest and smallest figures respectively in a column of numbers. Here’s one query demonstrating both:

// help/ContosoSales
SalesTable
| summarize min(SalesAmount), max(SalesAmount) by Country

percentile and percentiles

If you have a list of numbers, asking for the nth percentile is generally speaking like asking for the number from that list where n% of the other values are less than it. That’s the general idea, but you should read more about percentiles and these functions in particular if what you’re doing is mission critical.

Here’s an example of asking for the 95th percentile, juxtaposing the 95th percentile with the figures produced by the average and maximum functions for comparison:

// help/ContosoSales
SalesTable
| summarize avg(SalesAmount), percentile(SalesAmount,95), max(SalesAmount)

If you understand how that works, then maybe you’re ready for the percentiles function, which is the same function just for specifying that you want multiple percentiles calculated in one shot (rather than having to specify separate calls to percentile for each one you want):

// help/ContosoSales
SalesTable
| summarize percentiles(SalesAmount,1,5,50,95,99)

stdev

The name of this function is short for standard deviation. The number produced by this function is a measure of the spread or dispersal of a list of numbers. Another way it is sometimes explained is that it's kind of like if you were to calculate on average how far each data point in a list is from the mean. A bigger standard deviation typically means a list of numbers is more spread out, while a smaller one typically means the opposite. As with all of these functions, there are caveats, as well as pros and cons, but it's an important tool in certain situations.

// help/ContosoSales
SalesTable
| summarize stdev(SalesAmount), avg(SalesAmount)

There are a few limitations to this figure I should point out:

  1. As numbers get bigger, their average can increase and when that happens, so can the standard deviation. In some more objective sense, this doesn’t necessarily mean the numbers are more spread out. More on this later.
  2. Standard deviation might not be able to tell you if a list of numbers is skewed and more spread out on one side of the mean or median and not the other (e.g. more spread out on the high end). To detect such a thing, one could argue you’d be better served considering percentiles.

Simple Table Manipulations

You’ve been filtering tables and even producing tables of summary statistics up until now. At this point, it’s time to upgrade your ability to manipulate these tables and take your analysis capabilities to the next level!

Adding Columns

Sometimes the columns that come with the table aren’t enough, and you need to add another one. You do that by extending the table. The extend command adds a new column with the name you specify, setting the value to the result of a KQL expression. This expression can take a few forms, let’s look at some examples.

This one adds a field called NickName to every record, setting the value in every case to the literal string value provided.

// help/ContosoSales
Customers
| extend NickName="Rumpelstiltskin"
| take 10

Just to be clear, when you’re doing this, you’re not permanently adding a new column to the actual table in the data store, at least not with the kind of KQL we’re discussing. This is adding a new column to the in-memory table you’re working with in your pipeline of commands.

You might think the example I gave is useless; who would need the same literal value to be provided to every record? It’s true these fictional datasets can make it difficult to imagine, but there are real situations where you might need that.

Usually though, the extend command is used a bit more dynamically. For instance, the following query was used earlier to demonstrate standard deviation, but now we’re calculating a new field based on the values contained in the others:

// help/ContosoSales
SalesTable
| summarize s=stdev(SalesAmount), a=avg(SalesAmount) by Country
| extend CoefficientVariation= s / a

Coefficient of Variation is the ratio of the Standard Deviation to the Mean. It overcomes the first weakness of the standard deviation discussed earlier and so can give a more objective perspective of data spread. In cases where that ratio exceeds 1, you might explain that as meaning the average distance of the data points from their mean is even greater in magnitude than the mean itself. That could be an indicator of significant variation in a dataset.

Let’s say you want to round these numbers a bit to make them easier to look at. You can modify the last query to the following, also re-writing the values of the standard deviation and average so that they’re rounded – all using the extend command.

// help/ContosoSales
SalesTable
| summarize s=stdev(SalesAmount), a=avg(SalesAmount) by Country
| extend CoefficientVariation=round(s / a, 2)
| extend s=round(s,2)
| extend a=round(a,2)

With our new power in hand, let’s revisit another query we ran earlier and improve it.

// help/ContosoSales
SalesFact
| summarize TotalSalesRevenue=sum(SalesAmount), TotalSalesCost=sum(TotalCost) by ProductKey
| extend TotalSalesProfit = round( TotalSalesRevenue - TotalSalesCost, 2 )

When I run this, I see ContosoSales has a loss of 37568169.31 on ProductKey 176. Yikes! Stop selling that one, Contoso.

Concatenation

A lot of times, you need to create a new field which is just the strings contained in other columns spliced onto one another, end-to-end; we call that concatenation. Again, it may be hard to imagine why you would need to do this. But I promise you, if you use KQL, eventually you will need to do this. One common scenario in which you’ll need to do this is when you’re building time series tables and graphs, but we’ll discuss that use case later. For now, trust me that you need to know how to concatenate multiple fields. The function we’ll use to accomplish this when extending is called strcat.

// help/SecurityLogs
FileCreationEvents
| extend NetworkFilePath=strcat(
    @"\\",
    hostname,
    @"\",
    replace_string(
        path, 
        ":", 
        "$"
    )
)
| summarize count() by NetworkFilePath
| top 100 by count_

Our new field is called NetworkFilePath. It has the values contained in the hostname and path columns concatenated. However, note that we're giving literal strings before and after hostname. This is a convention in Windows for network file sharing. I don't know why you'd want to do this with such a table, but I thought it could come in useful for you.

You may not have seen the syntax with the @ symbol before the double quotes. The reason I'm using it is that with ordinary KQL strings delimited by double quotes, backslashes are an escape character and I didn't trust myself to count correctly. So I avoided that by using the @ version of KQL strings, where backslashes are just literal backslashes; if you ever need to include a literal double quote inside an @ KQL string, you just need to escape the double quote with a second double quote (a convention shared with the CSV file format).
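
Here's a small sketch showing the difference. The first two values are the same path written both ways, and the third shows how a double quote gets doubled inside an @ string:

print escaped="\\\\server\\share", verbatim=@"\\server\share", quoted=@"say ""hello"""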

The convention for network file sharing in Windows avoids the colon character traditionally associated with the drive letter in favor of a $ character, so I’m taking advantage of that to demo for you the replace_string function.

Finally, I’m using the top command to grab at most the top 100 by count from the summary table I produced, since I didn’t assess the cardinality of the resulting NetworkFilePath column beforehand.

Allowlisting Columns

Eventually, you’re doing so many operations in your pipeline, you may get the sense there’s a lot of data contained in your in-memory table that you don’t really need and never will. It can feel cumbersome to keep all that in the table through multiple stages in your query pipeline. So if you want to optimize for performance and memory, get rid of the unneeded columns.

One way of doing this is to provide a list of columns which you’d like to retain. I like this option the most because it feels most explicit.

// help/SecurityLogs
FileCreationEvents
| extend NetworkFilePath=strcat(
    @"\\",
    hostname,
    @"\",
    replace_string(
        path, 
        ":", 
        "$"
    )
)
| summarize count() by NetworkFilePath
| top 100 by count_
| project NetworkFilePath

There I modified the previous query to change the resulting table so that it doesn’t have the count column anymore and only keeps the NetworkFilePath column.

You can do other things with that command, too. Here’s an example of renaming the columns you keep:

// help/ContosoSales
SalesTable
| summarize stdev(SalesAmount), avg(SalesAmount)
| project stdv=stdev_SalesAmount, mean=avg_SalesAmount

In fact, in the same fell swoop we can also make other changes like rounding the figures, and even add new columns that didn’t exist before!

// help/ContosoSales
SalesTable
| summarize stdev(SalesAmount), avg(SalesAmount)
| project stdv=round(
    stdev_SalesAmount, 
    2
), mean=round(
    avg_SalesAmount, 
    2
), cv=round(
    stdev_SalesAmount/avg_SalesAmount, 
    2
)

There are other variants of the project command for you to experiment with, but I don’t use them as much. They offer other features like being able to select which columns to keep or get rid of with a wildcard search.
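
For instance, here's a hedged sketch (assuming you want to drop every column whose name starts with Total): project-away with a wildcard removes matching columns, and project-rename changes a column's name without touching anything else.

// help/ContosoSales
SalesFact
| take 10
| project-away Total*
| project-rename Revenue=SalesAmount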

A Cheap-A^$ Intro to REGEX

Regex is a core tool supported by a lot of different platforms and programming languages, not just KQL. Regex is a technology related to the notion of a “regular language” in formal language theory. There are different flavors of regex with some differences as far as rules are concerned. But all flavors share some core logical features, even if they may at times have slightly different syntaxes or rules.

When you write a regex pattern you’re declaring a logical pattern to which any input text (i.e., sequence of characters) will or will not adhere. The point of the regex is to tell you if the input text adheres to that pattern or not, and also deliver to you any subsequences of characters from the input text which you asked it to remember for you.

Next, we’ll discuss the components of these logical patterns. While we won’t cover 100% of the logical components available to you in regex patterns, we’re going to cover enough of the core elements to get you up and running.

Character Classes

Character classes are just a syntax for you to define a logical set of characters. And to make it even simpler to understand, consider that underneath it all a character to a computer is on some level just a number; ergo, a character class is on some level just a set of numbers.

A character class is typically demarcated on the left and right by square brackets. Between, you define the characters which are a part of the set:

[aeiou]

Above is a set of lowercase vowels. If you provided only that character class as a regex pattern, it would require that the input text has (at least) one character somewhere which is one of those vowels.

You typically, but not always, differentiate between letter case. Below accepts either a lower or uppercase vowel:

[aeiouAEIOU]

You’re not limited to only letters:

[0123456789,.]

You see what I mean.

Easy to see how this could become cumbersome though, especially since regex does not have variables in the way that many Turing complete programming languages do. Thankfully there are some shortcuts. Remember how we said characters are just numbers? That means you can use range syntax to specify a range of characters rather than list them all manually:

[a-zA-Z0-9]

That’s better! Only 9 characters to specify the character ranges a through z, A through Z, and 0 through 9; that’s a set of 62 characters specified with only 9!

This syntax has some side effects, though.

For instance it means if you need a literal hyphen - character in your set, you need to use a slightly unusual syntax where it is at the end of the set:

[a-z-]

That’s a set that accepts the lowercase vowels and the literal hyphen character.

Sometimes even this can become too complicated or difficult to work around. When this happens, you can work around it by taking advantage of the fact that most regex flavors allow you to specify character ranges in the form of hexadecimal values or other schemes. That’s outside the scope of this introduction, though.

But two final tricks to help you.

First, sometimes it would be cumbersome to calculate the character ranges to accept, when really you’re more concerned about not allowing a particular character. There’s a syntax for this kind of negated character class, starting with a caret character ^. The following accepts any character except lowercase letters:

[^a-z]

Second, there are shorthands for many commonly used character classes built into most regex flavors. Here are a few, taken from here:3

  1. \d - any digit character
  2. \w - any "word" character (a letter, digit, or underscore)
  3. . - any character except newline characters
  4. \s - any whitespace character
  5. \S - any non-whitespace character

The latter two, \s and \S, together comprise a conventional character class that overcomes the limitation of the . class listed above:

[\s\S]

This means the set of all whitespace and all non-whitespace characters, which is another way of saying, every character including newline characters.

Concatenation & Quantifiers

So now you know how to specify at least one character in regex.

Concatenation is a technique that allows you to define more complex logical patterns that span more than just one character. You accomplish this by specifying character classes one after another.

Here’s one way to specify a sequence of three characters where each character is one of the lowercase letters a, b, or c:

[abc][abc][abc]

The above will match all of the following input strings (and more): abc, aaa, bca

They don’t have to be the same though, and commonly they won’t be. Say you want to match a word where all letters are lowercase except the first letter, which could be lower or uppercase:

[dD][a][v][e]

Above matches the names dave or Dave. However, for instances where a character class is only comprised of one character and the character in question isn’t a special one as far as regex syntax is concerned, you can do away with the square brackets. The following is equivalent to the above:

[Dd]ave

What if the sequence of characters you want to match is extremely long? For instance, you want to match a sequence of your character class which is 100 in length? That would seem like a lot to type out!

For that, we would instead use a quantifier. A quantifier is a suffix given to your character class which indicates how many characters to match. Let’s say you wanted to match a string that’s 100 characters long and where each character is from the following set of characters: x, X, o, O

[xXoO]{100}

That will match exactly 100 characters from that set. You’re not that restricted though. What if you would accept anything between 75 and 200 characters in length? You can use the same syntax to provide a range:

[xXoO]{75,200}

Because quantifiers are so handy, there are a few commonly used shorthand alternatives to the curly bracket syntax:

  1. ? - zero or one occurrence, equivalent to {0,1}
  2. * - zero or more occurrences, equivalent to {0,}
  3. + - one or more occurrences, equivalent to {1,}

All of these quantifiers with ranges shown above are known as greedy quantifiers. That means since they have a range of valid lengths, if given an option to match more or fewer characters from the character class, they'll try to match as many as possible up to the limit. In the case of x's and o's, this means if you had 100 of them, a greedy quantifier wouldn't stop matching once it got to 75 characters, it would keep matching all the way to 100!

The opposite is a lazy quantifier, which tries to match as few as possible. To turn any of the above quantifiers into lazy versions, just append a trailing ?:

[xXoO]{75,200}?

Contrary to the earlier example, if you had 100 x’s and o’s, this lazy quantifier would still match but it would stop matching once it got to 75 characters!

We’ll test this out in a little while, but first let’s discuss a few more logical operations.

Anchors

Anchors are also known as atomic zero-width assertions. They don’t consume any characters from the input text. Rather, if you visualize the current location of the regex as it’s analyzing an input text as a cursor, then it’s saying that at that specific point in the pattern, the cursor should occupy a particular kind of location.

Still seems abstract? Let me give you a few examples.

[abc]{3}
[^abc]{3}
^[abc]{3}

The above three patterns mean different things.

The first one indicates the input text must have a sequence of three characters somewhere inside of it, where each character is from the following set: a, b, and c

The second one means a sequence of three characters where each character is a member of the set comprised of anything except for one of these letters: a, b, c

The third one which starts with a caret ^ outside of the character class indicates that the input string must start with such a sequence of three characters, where each character is from the following set: a, b, and c

The caret is the "starting" anchor. When used this way, it appears at the very beginning of the pattern, outside of any character class. I admonish you to not forget the difference between the starting anchor and negated character classes, as it is an important distinction.

There’s another anchor to ensure matches must abut the end of an input sequence: $

[abc]{3}$

Above means that the input string must end with a sequence of three characters, where each character is from the following set: a, b, and c

You can use these in conjunction, too. The following indicates a string which must be of length 3 with each character from the set we’ve been discussing:

^[abc]{3}$

When you use both a starting and ending anchor like this in the same regex pattern, the pattern is said to be fully anchored.

There are other anchors, but I’ll leave those for you to experiment with.

Non-Capturing Groups

Non-capturing groups allow you to specify many logical patterns that would otherwise be tedious (or impossible? that’s a question for the formal language theorists). And, because they’re non-capturing, they can be used even in regex patterns which aren’t designed to extract text. If that last sentence confused you, don’t worry; we’ll discuss capturing groups later.

One powerful capability of non-capturing groups is the ability to apply quantifiers to whole sequences of character classes.

Before I show you such a pattern, let’s develop a testing setup in the KQL workspace we’ve been using. It will be nice to be able to test out some of these abstract language ideas without requiring any specific kind of data in the test data tables provided by Microsoft.

datatable(data:string, expectsMatch:bool) [ 
    "aBC", true,
    "aBCBC", true,
    "aBCBCBC", true,
    "aBCBCBCBC", false
]

That’s essentially a fake or temporary in-memory table you can create on the fly populated with literal data. It’s a handy tool to test out different KQL and regex ideas you may have. Just like the named tables we referenced earlier, it can be “piped” to a KQL command pipeline. Such as in the following example, which has the first regex I want to demo:

datatable(data:string, expectsMatch:bool) [ 
    "aBC", true,
    "aBCBC", true,
    "aBCBCBC", true,
    "aBCBCBCBC", false
]
| where data matches regex "^a(?:BC){1,3}$"

Here the non-capturing group is demarcated at the start by (?: and at the end by ). Everything between those boundaries is a part of the group, in this case the sequence BC.

A quantifier is applied to the group, requiring between 1 and 3 repetitions of the sequence BC.

Finally, this pattern has both a start and end anchor.

Try running this, and you'll see the 4th string where expectsMatch == false does not appear in the resulting table, as it does not fit the logical pattern we have defined. This is because it has 4 repetitions of BC, exceeding the upper bound of 3.

You can even nest these groups inside one another!

datatable(data:string, expectsMatch:bool) [ 
    "abccc", true,
    "abcbcbcc", true,
    "abcccccccccccccccbc", true,
    "abbc", false
]
| where data matches regex "^a(?:b(?:c)+)+$"

That pattern means something like this: an a followed by one or more of the following type of sequence; a single b followed by one or more c.

Alternation

Alternation is a technique that allows you to specify alternative options for a regex to try matching. It can be applied at a “top-level” within a regex pattern, or it can also be applied within groups.

I urge caution at this juncture if you find yourself tempted to use alternation (and even quantifiers in some situations). It can become confusing to understand how many times a regex will do certain things owing to features like “backtracking”. Spend time researching the rules and constraints you’re working with so you can avoid catastrophic backtracking.

Here’s one example of alternation, which is specified using the pipe | character:

datatable(data:string, expectsMatch:bool) [ 
    "abcd", true,
    "aBCD", true,
    "abCDbCDcdCDbcd", true,
    "abBcd", false,
    "abCd", false
]
| where data matches regex "^a(?:[bB](?:cd|CD)+)+$"

An a followed by one or more of the following sequence: one b or B, followed by one or more of this kind of sequence: either cd or CD.

Capturing Groups

At long last, we make it to capturing groups!

These allow your regex to remember subsequences of text that are matched by the pattern, which afterwards can be new columns in your table.

This is a very powerful feature of KQL’s flavor of regex, as it allows you to use KQL to compose your own parsing routines for the weird custom file formats you might encounter in the wild! Using this new tool, no longer will the cardinality of data be obscured and obfuscated within the baroque file formats which contain it, and you will find yourself performing valid statistical analyses just where your peers least suspect it! MWAHAHAHAHAHAHAA!

Okay, delusions of grandeur over.

Let’s talk about how to use them. Capturing groups come in two categories. They can be named OR unnamed. The syntax for unnamed is easiest, as it is just a pair of parentheses. We will use it with the extract function next, which requires these arguments: the pattern, an index number (we provide 1, indicating we want the first captured value, rather than 0, which is the entire matched string), and finally the original string field to which we are applying the regex pattern.

In the following, we want to extract into a new column, from a sample of Employee email addresses, the parts before the @ symbol preceding the domain name:

// help/SecurityLogs
Employees
| take 100
| project email_addr
| extend emailPrefix=extract(
    "^([^@]+)[@]", 
    1, 
    email_addr
) 

This regex means, starting at the beginning of the string, match and capture a sequence of one or more characters made of any characters except an @. That sequence to capture must also be followed by an @ (but without capturing that final @).

Here’s the same query, but using a named capturing group. In this case, it’s named prefix. Note that the syntax for named capturing groups can vary a bit across the different regex engines out there.

// help/SecurityLogs
Employees
| take 100
| project email_addr
| extend emailPrefix=extract(
    "^(?P<prefix>[^@]+)[@]", 
    1, 
    email_addr
)

Let’s revisit an earlier example. We were earlier matching sequences of 75 to 200 x’s or o’s (upper and lowercase):

[xXoO]{75,200}?

This was a lazy match, right? Let’s test this out and prove with a capturing group that this will only capture 75 characters, even if there are more valid characters available:

datatable(data:string)[
    "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
]
| extend originalLength=strlen(data)
| extend extractedLength=strlen(
    tostring(
        extract(
            "^([xXoO]{75,200}?)",
            1,
            data
        )
    )
)

If you run this, you’ll see 80 characters are available, but only 75 are consumed. Remove the trailing ? from the pattern and you’ll find that in that case all 80 characters are consumed. This is the difference between regex eagerness and laziness.

Sometimes you will want your regex pattern to extract more than one value by using multiple capturing groups. This is permissible, but you need to use the extract_all function.

// help/SecurityLogs
Employees
| take 100
| project email_addr
| extend emailParts=extract_all(
    "^([^@]+)[@](.*)$",
    email_addr
)
| extend emailPrefix=emailParts[0][0]
| extend emailDomain=emailParts[0][1]
| project-away emailParts

Here we complete the regex pattern we made earlier by adding a second capturing group to grab any of the remaining characters up to the end of the string (.*)$.

The resulting value from the extract_all is an array containing another array with the captured values, hence why we are using multiple array accessors to get at the values and assign them to new fields called emailPrefix and emailDomain.

Does that seem weird that there exists a seemingly needlessly nested array? The reason is that, if you recall, groups may be nested. Hence, when extracting multiple things with multiple capturing groups, the return value also needs to be in a nested or nestable format.
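
Here's a hedged sketch to make the nesting visible: with two capturing groups and multiple matches in the same string, extract_all returns one inner array per match. So pairs[0][0] is "a" (the first group of the first match) and pairs[1][1] is "2" (the second group of the second match):

datatable(data:string)[
    "a=1;b=2;c=3"
]
| extend pairs=extract_all(@"(\w+)=(\d+)", data)
| extend firstKey=tostring(pairs[0][0]), secondValue=toint(pairs[1][1])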

Table Joins

There’s a lot to say about table joining, and we’re not going to go through everything because it would balloon this conversation to an intolerable length.

The best summary I can give is that table joining is about applying all the operations you can do with a Venn diagram to tables. How can you do that? It works when multiple tables each have a column containing some of the same values. Those columns are what determine which part of which circle in the Venn diagram a record occupies.

Here’s one of the simplest kinds of examples. Adding a little bit of Product details to a sample of records from a Sales table:

// help/ContosoSales
SalesTable
| take 100 
| join kind=inner Products on ProductKey

SalesTable and Products both have a ProductKey column. This is the simplest kind of join because you don’t even need to write a join- or sub-query. You just provide the name of the table and poof you get extra columns.

A common scenario is if you want to retrieve records from one table, call it table A, but only those records which are referenced (such as by an ID) by records in another table, call it table B.

Still seem abstract? Let’s look at an example:

// help/SecurityLogs
AuthenticationEvents
| take 500
| project timestamp, hostname, src_ip, username, result
| where result!="Successful Login"
| join kind=inner (
    Employees
    | project username, name, ip_addr, email_addr, role
) on username
| summarize count() by username, ip_addr, email_addr, role
| sort by count_ desc

Here we’re taking a sample of logs from the AuthenticationEvents table. We’re getting rid of some unneeded columns before filtering out all the successful logins so we are left with only failed login events.

Then the join starts, where we are providing a subquery to fetch a set of results from the Employees table. The column which is shared in both tables is the username column! So after we have fetched the Employees, we specify on username to indicate we are joining the two tables on the values in that column. You can cut the query off at that point in the pipeline if you like to see the result. We now have additional fields attached to our failed login events indicating, for example, the role of the user that failed to login.

Then we’re doing some summarizing and sorting to figure out who from our sample of 500 events was the worst offender as far as failed logins. When I ran this, this was at the top of the list:

mastephens  192.168.0.234   marissa_stephens@envolvelabs.com    IT associate    7

Somebody get Marissa Stephens on the horn!

So far you saw the inner join. This produces only the intersecting parts of the two tables when you’re considering that column that has values in both tables. In certain SQL situations, this is referred to as a “foreign key” column.

Just to help you understand the difference between an inner join and the various other types of joins, you can use the following setup to explore the difference in KQL:

let humans = datatable(id:int, name:string, pet_id:int)[
    1, "Huey", int(null),
    2, "Dewey", 4,
    3, "Louie", 5
];
let animals = datatable(id:int, name:string, species:string)[
    4, "Spot", "dog",
    5, "Ace", "dog",
    6, "Bruiser", "cat"
];
humans
| join kind=inner (
    animals
    | project pet_id=id, pet_name=name, species
) on pet_id

Right now, it’s set to inner. Replace kind=inner with kind=leftouter, and see how it also gives you the record from the humans table that does not have an associated record in the other table. This is an important difference, as sometimes it can make or break the logic of your query if you’re doing joins.

Since it can be counterintuitive, let me state it explicitly. The nature of joining is that the process has the potential to filter records out of both tables before producing the final result table. How and when exactly this filtering happens is dictated by the type of join that you choose.

Here are the different kinds of joins you can try with the above table:

  1. innerunique (the default if you don't specify a kind)
  2. inner
  3. leftouter
  4. rightouter
  5. fullouter
  6. leftsemi and rightsemi
  7. leftanti and rightanti

Check out the Venn diagrams on this page4 showing the difference between the different types.

My proposal is that Huey adopt Bruiser!
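
As a parting sketch, a leftanti join returns only the rows from the left table that have no match in the right table, which is one way to find the pet-less human in that setup (swap the tables around, or try rightanti, to find the human-less pet):

let humans = datatable(id:int, name:string, pet_id:int)[
    1, "Huey", int(null),
    2, "Dewey", 4,
    3, "Louie", 5
];
let animals = datatable(id:int, name:string, species:string)[
    4, "Spot", "dog",
    5, "Ace", "dog",
    6, "Bruiser", "cat"
];
humans
| join kind=leftanti (
    animals
    | project pet_id=id, pet_name=name, species
) on pet_id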

Time Series

Earlier I introduced you to the summarize command for calculating statistics based on your table columns. If you play around with the summarize command enough, you’ll eventually find it’s easy to mislead yourself. For example, averages taken over a very large number of values in a column will often seem to conceal the amount of variation contained in that column. The maximum and minimum functions will seem meaningless for almost all but the smallest of tables or most specific of use cases. These kinds of problems, if not overcome, can easily get in the way of your analysis and hamper its accuracy and incisiveness.

This is why Santa Claus invented the time series. For those of us for whom one or two averages are not enough, he said hey why not calculate a separate average for each day of the year! That way when Santa is considering a child’s naughtiness, he can easily separate any particularly naughty outlier days from the rest, the kind of excessively naughty days which might otherwise screw a child’s average.

So that’s what a time series is really about. Instead of calculating a stats function like average, maximum, or minimum (or any of the others) once or twice, you first “bin” all the values in the table into separate bins of particular length in time (1 second, 1 minute, 1 hour, 6 hours, 5 days, etc), and then calculate that function once for each such bin of time across a larger span of time (1 day, 1 week, 1 month, 1 year, etc).

How does this “binning” process really work? Simpler than you think. Say we have a bunch of records, each with a really fine to-the-millisecond timestamp. If we are binning the records into bins with a span of 1 day, then that means any timestamp falling within the same 24 hour period will be “re-written” to the same string or number representing that particular day.

Let’s look at an example:

// help/SecurityLogs
AuthenticationEvents
| where  result!="Successful Login"
| extend timestamp=todatetime(timestamp)
| summarize count() by bin(timestamp,1m)
| render timechart

The summarize command here is a lot like previous runs I’ve shown you, only now the “by clause” contains, instead of a solo column name, a call to the bin function into which we pass the name of the timestamp column and the bin size. In this case the bin size is 1m or one minute.

Finally, we request the time series to be rendered as a timechart. You’ll see a graph drawn in one tab with time on the x-axis and quantity on the y-axis. The other tab contains the table which is visualized in the graph.

Honestly I usually just call it a timechart. Calling it a time series always felt like calling an attorney “esquire”… technically correct, but some folks might assume you’re a try-hard.

Timecharts are an important tool for your anomaly hunting arsenal. You can use any of the stats functions with them: avg, min, max, etc.

You can produce a separate time series column for each distinct value in a particular column.

// help/SecurityLogs
let topTenFailedUsers=AuthenticationEvents
| where  result!="Successful Login"
| extend timestamp=todatetime(timestamp)
| summarize count() by username
| top 10 by count_
| project username;
AuthenticationEvents
| where result!="Successful Login" and username in (topTenFailedUsers)
| extend timestamp=todatetime(timestamp)
| summarize count() by bin(timestamp,1m), username
| render columnchart

Notice how, after the by clause in the above time series summarization command, we specify a comma and then username? This would seem to suggest that we could keep expanding that list of columns to include more than just username. In that case, the meaning you’re trying to write into your query would be something like, for each time range bin, calculate the statistical functions once for each unique combination of the following list of one or more columns: username, …, etc.

However, you’ll find support after the bin function is limited to specifying only one column. This is not necessarily a shortcoming of KQL. Other languages have a similar limitation. It does not follow that the meaning you’re trying to imbue your query with is impossible; it just means you’ll need to take a slightly different approach to work around this limitation.

The key to working around this limitation when building time series is concatenation. At first it may feel like a cheesy trick, but in time it will become a key part of your behavioral analysis arsenal. In this approach, you want to produce a new round of statistical calculations within each time range bin and for each unique combination of values from a list of columns. To specify that unique combination of values, you just concatenate those columns into a single string and store that string in a new column; that new column becomes the column you specify after the call to bin.

As an example, let’s say in the above query, after running it, you realized you wanted not unique combinations of users, but unique combinations of users and their password_hash. This actually produces a lot of combinations, so to stay under the recommended limit in dataexplorer, I’m also changing the earlier call to the top command to produce only 5 users instead of 10. See how this approach is implemented with the strcat and extend commands below to concatenate the values of the other fields:

// help/SecurityLogs
let topTenFailedUsers=AuthenticationEvents
| where  result!="Successful Login"
| extend timestamp=todatetime(
  timestamp
)
| summarize count() by username
| top 5 by count_
| project username;
AuthenticationEvents
| where result!="Successful Login" and username in (topTenFailedUsers)
| extend timestamp=todatetime(
  timestamp
)
| extend username_passwdHash=strcat(
  username, 
  " - ", 
  password_hash
)
| summarize count() by bin(
  timestamp,
  1m
), username_passwdHash
| render columnchart

So far, these types of timecharts have focused on summarizing and visualizing data over time from one point in history to another. The form of the output table and shape of output graph bear a direct resemblance to the input table; for example, if the earliest record’s timestamp in the input table was two weeks ago and the latest record’s timestamp one week ago, then the timechart spans that time range in history exactly, from two weeks ago to one week ago.

Sometimes that kind of timechart does not quite scratch the itch you have in the anomaly hunting department. Sometimes there is a “seasonality” to phenomena that you want to understand more, maybe because no Sherlock was able to crack the case the first go-round and the problem keeps happening. I want to give you one more tool for such occasions.

This approach involves constructing custom time-range bin labels for your records. The type of label you’ll create doesn’t help you understand which slice of overall chronological time in history a record occupies. Rather, it helps you understand which recurring slice of chronological time a record occupies.

This is the difference between knowing that event X occurred at 7:13 AM on Tuesday the 23rd of January, 2024 and knowing that event X occurred at 7:13 AM on some Tuesday in history (but which Tuesday exactly is not known).

Humans often schedule things that repeat on an hourly or daily or weekly or monthly basis. You might want to know, for example, is there a trend as far as which hour of the day or hour of the week that a particular phenomenon is happening?

I believe when the aggregation function, which you are using in each bin, is the count function, then this may be considered a form of histogram. However, I encourage you to experiment with other aggregation functions using this approach as well to see what trends you can find.

Before we consider an example, we need to iron out the details of making custom time-range bin labels.

One key to constructing custom timestamp bin field values is to ensure their lexicographic order and their chronological order coincide. The advantage is that when a list of these bin field values is sorted using common lexical sorting techniques (i.e., like those in KQL table user interfaces or via KQL’s sort function or even in a spreadsheet program) they are in chronological order for displaying on a graph.

Commonly, this involves ensuring numbers are all a fixed-width in terms of characters/digits (regardless of magnitude), padding them with 0’s on the left side as needed.

As an example, if the largest number in your group is 100, then consider the number 9: 9 is less than 100 in a logical ordering according to magnitude, but in a lexicographic or alphabetic ordering 100 comes first. To fix this, pad the 9 with two 0’s on the left side: 009. Now both orderings will be the same.

Many languages have a function for padding strings on left or right, but since I didn’t see one in KQL we will construct a User-Defined Function (UDF) for these scalar operations:

let padStringLeft = (
  str:string,
  padChar:string,
  minLength:int
){
  let ch = iif(
    strlen(
      padChar
    ) < 1,
    "0",
    substring(
      padChar,
      0, 1
    )
  );
  iif(
    strlen(
      str
    ) >= minLength,
    str,
    strcat(
      strcat_array(
        repeat(
          ch,
          minLength - strlen(
            str
          )
        ),
        ""
      ),
      str
    )
  );
};

Our UDF is named padStringLeft, and it has three parameters defined:

  1. str - This is the string to pad. Naturally, this function is probably going to be used with numbers, so this just means to use this function you must cast your numerical value tostring(myNum) first.
  2. padChar - This is the character intended to be used for padding on the left-hand side of the string. You pass in a string that should be one character long, but if not it will be truncated or coerced to a default value of 0.
  3. minLength - This number refers to the minimum length of the string to output; typically, you will input here the length of the longest string in your set of number strings (e.g., if max number is 1000, then the string “1000” is 4 characters/digits long, so 4 will be this argument).

Inside the UDF body, the assignment expression let ch = iif(...); does what I described above. It checks whether the padChar string is empty (length 0) and, if so, gives a default value of "0". Otherwise, we grab the first character from the specified padChar string using the substring function (just in case it’s longer than one character).

The remaining iif(...); call produces the ultimate value returned from the function, our padded string. We check if the length of str is already greater than or equal to the minimum length; if so, we just return the value of str as it is already. Otherwise, we need to produce a properly padded version of the string.

There are a few layers to producing the padded string. The outer strcat() call has two arguments to concatenate:

  1. The “pad” string - e.g., when making “009” from “9”, the pad string is “00”.
  2. The original input string - e.g., when making “009” from “9”, this is “9”.

To produce the pad string, we use the repeat(...) function to generate an array where each element is the pad character ch we derived earlier and where the length of the array is the difference between the minimum length and the length of the string that needs padding. Then we take this array and concatenate all of its values (with an empty string delimiter "") into a single string using the strcat_array(...) function.
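
If you want to see that inner step by itself, a quick one-liner does the trick (the column alias pad is arbitrary):

// repeat("0", 3) builds the array ["0","0","0"]; strcat_array glues it into "000"
print pad = strcat_array(repeat("0", 3), "")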

You can test the function like this:

print(padStringLeft(
  tostring(9),
  "0",
  4
))
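
If everything is wired up correctly, that should print 0009: three padding zeros followed by the original 9.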

Now that we can pad numbers, let’s consider an example. Let’s convert an earlier time series query we made to this format:

let topTenFailedUsers=AuthenticationEvents
| where  result!="Successful Login"
| extend timestamp=todatetime(timestamp)
| summarize count() by username
| top 10 by count_
| project username;
AuthenticationEvents
| where result!="Successful Login" and username in (topTenFailedUsers)
| extend timestamp=todatetime(timestamp)
| extend username_passwdHash=strcat(username, " - ", password_hash)
| extend tsHr=datetime_part("hour", timestamp)
| extend tsWeekday=dayofweek(timestamp)
| extend customBinLabel=strcat(
  toint(
    tsWeekday / 1d
  ),
  " ~ ",
  padStringLeft(
    tostring(tsHr),
    "0",
    2
  )
)
| summarize count() by customBinLabel, username_passwdHash
| sort by customBinLabel asc
| render columnchart

Our customBinLabel values each represent one of the 168 possible hours of the week. We create this new field for each record by first extracting, via datetime_part(...), the hour value contained in the original timestamp. See here5 for the different possible values you can use with that function.

The day of the week is not among those possible values, so we use a different function called dayofweek to extract it (0 through 6 after conversion) from the original timestamp.

Next we concatenate those two new fields into a single string, with a delimiter of " ~ " between them (arbitrarily chosen), to make the custom bin label. When concatenating the values, we take care to pad the hour values first using our UDF. The day-of-week value must also be divided by one increment of its unit (1d, one day) and cast to an integer, because dayofweek gives us a timespan rather than a plain number.
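
If that timespan-divided-by-1d dance feels opaque, here’s a standalone peek at the pieces, using the same hypothetical timestamp from earlier (column names here are arbitrary):

print ts = datetime(2024-01-23 07:13:00)
| extend hourOfDay = datetime_part("hour", ts)     // 7
| extend weekday = dayofweek(ts)                   // a timespan: 2.00:00:00 (Tuesday)
| extend weekdayNum = toint(dayofweek(ts) / 1d)    // 2, a plain integer we can concatenate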

At last, in the summarize, we swap our original bin(...) call for the new custom bin label field, and add one last command to sort the table by that field so everything is in relative chronological order.

Notice any trends as far as which hour of the week tends to have more failed login events from the top ten failed users?

Sure, this is a contrived example from fake data sets, so there may be no real conclusions we can draw here. But remember this tool nonetheless!

Parsing JSON

JSON is pretty ubiquitous at the time I’m writing this, so I assume you’ve been exposed to it. But if not, it’s just a text-based data format used for many purposes. Given how common it is, it’s pretty handy that KQL comes with functions out-of-the-box for parsing JSON-encoded data into KQL fields.

Since KQL in some circumstances (I believe based on ingestion settings) automatically parses JSON fields into the dynamic type, I provided the following fake data table of JSON strings. You can see from the call to getschema that these values really are still just strings (holding JSON), not parsed dynamic objects.

let computers = datatable (json:string)[
'{"id": 1, "brand": "TechMaster", "model": "X9000", "cpu": "QuadCore 3.0GHz", "ram": "16GB", "storage": "512GB SSD", "graphics": "NVIDIA GTX 1080", "price": "$1299.99"}',
'{"id": 2, "brand": "QuantumTech", "model": "UltraCore", "cpu": "OctaCore 2.5GHz", "ram": "32GB", "storage": "1TB HDD", "graphics": "AMD RX 5700", "price": "$1499.99"}',
'{"id": 3, "brand": "CyberFusion", "model": "StealthX", "cpu": "HexaCore 2.8GHz", "ram": "64GB", "storage": "256GB NVMe", "graphics": "GeForce RTX 3060", "price": "$1699.99"}',
'{"id": 4, "brand": "SysGenius", "model": "ElitePro", "cpu": "QuadCore 3.2GHz", "ram": "8GB", "storage": "1TB SSD", "graphics": "Intel UHD Graphics", "price": "$1099.99"}',
'{"id": 5, "brand": "MegaByte", "model": "PowerSpeed", "cpu": "OctaCore 3.5GHz", "ram": "128GB", "storage": "2TB HDD", "graphics": "NVIDIA RTX 3080", "price": "$1999.99"}',
'{"id": 6, "brand": "NeoGaming", "model": "TitanX", "cpu": "HexaCore 3.0GHz", "ram": "16GB", "storage": "512GB NVMe", "graphics": "AMD RX 5600 XT", "price": "$1299.99"}',
'{"id": 7, "brand": "Futurix", "model": "FusionWave", "cpu": "QuadCore 2.7GHz", "ram": "32GB", "storage": "1TB SSD", "graphics": "GeForce GTX 1660", "price": "$1399.99"}',
'{"id": 8, "brand": "TechWave", "model": "SwiftForce", "cpu": "OctaCore 2.2GHz", "ram": "64GB", "storage": "256GB SSD", "graphics": "NVIDIA GTX 1070", "price": "$1599.99"}',
'{"id": 9, "brand": "QuantumPulse", "model": "Xplorer", "cpu": "HexaCore 2.5GHz", "ram": "8GB", "storage": "512GB HDD", "graphics": "AMD RX 550", "price": "$1099.99"}',
'{"id": 10, "brand": "EpicGen", "model": "ThunderBlaze", "cpu": "QuadCore 3.5GHz", "ram": "128GB", "storage": "2TB SSD", "graphics": "Intel Iris Xe", "price": "$1899.99"}',
'{"id": 11, "brand": "CyberPulse", "model": "NovaSpeed", "cpu": "OctaCore 3.2GHz", "ram": "16GB", "storage": "1TB NVMe", "graphics": "NVIDIA RTX 3070", "price": "$1799.99"}',
'{"id": 12, "brand": "TechFusion", "model": "VelocityX", "cpu": "HexaCore 2.8GHz", "ram": "32GB", "storage": "512GB SSD", "graphics": "AMD RX 570", "price": "$1299.99"}',
'{"id": 13, "brand": "InnoTech", "model": "BlazePro", "cpu": "QuadCore 2.5GHz", "ram": "64GB", "storage": "1TB HDD", "graphics": "GeForce GTX 1650", "price": "$1499.99"}',
'{"id": 14, "brand": "FusionTech", "model": "SwiftNova", "cpu": "OctaCore 3.0GHz", "ram": "8GB", "storage": "256GB NVMe", "graphics": "NVIDIA GTX 1060", "price": "$1199.99"}',
'{"id": 15, "brand": "MegaTech", "model": "TurboWave", "cpu": "HexaCore 2.2GHz", "ram": "128GB", "storage": "2TB SSD", "graphics": "AMD RX 580", "price": "$1699.99"}',
'{"id": 16, "brand": "QuantumFusion", "model": "RapidX", "cpu": "QuadCore 3.2GHz", "ram": "16GB", "storage": "512GB HDD", "graphics": "Intel UHD Graphics", "price": "$1299.99"}',
'{"id": 17, "brand": "EpicPulse", "model": "BlitzTech", "cpu": "OctaCore 2.8GHz", "ram": "32GB", "storage": "1TB SSD", "graphics": "NVIDIA RTX 3060", "price": "$1499.99"}',
'{"id": 18, "brand": "TechNova", "model": "XtremeSpeed", "cpu": "HexaCore 3.5GHz", "ram": "64GB", "storage": "256GB SSD", "graphics": "GeForce GTX 1660", "price": "$1599.99"}',
'{"id": 19, "brand": "CyberBlaze", "model": "ThunderX", "cpu": "QuadCore 2.2GHz", "ram": "8GB", "storage": "512GB SSD", "graphics": "AMD RX 5600 XT", "price": "$1199.99"}',
'{"id": 20, "brand": "FutureTech", "model": "SwiftFusion", "cpu": "HexaCore 3.0GHz", "ram": "128GB", "storage": "1TB NVMe", "graphics": "NVIDIA GTX 1070", "price": "$1799.99"}'
];
computers | getschema

Right now the table only has one column, a string. It has a lot of different fields encoded in that one string though, and we would like to pull them out into separate columns so we can enjoy the full set of advantages associated with proper tabular formats, like ease of calculating statistics.

We will not use regex alone to pull these values out into columns because KQL has dedicated functions for JSON parsing out-of-the-box.

computers 
| extend parsed=parse_json(json)

Now you’ll see the new parsed column holds a dynamic object whose individual members you can access. Let’s use project to split these things into separate columns, additionally splitting out further details nested in some of the strings via regex:

computers 
| extend parsed=parse_json(json)
| project id=toint(
    parsed.id
), brand=tostring(
    parsed.brand
), model=tostring(
    parsed.model
), cpuSpecs=extract_all(
    "^([Qq][Uu][Aa][Dd]|[Hh][Ee][Xx][Aa]|[Oo][Cc][Tt][Aa])[Cc][Oo][Rr][Ee][ ](\\d+[.]?\\d*)GHz$",
    tostring(
        parsed.cpu
    )
), ramGB=extract(
    "^([0-9]+)[Gg][Bb]$",
    1,
    tostring(
        parsed.ram
    )
), storage=extract_all(
    "^([0-9]+)([GgTt][Bb])[ ]([A-Za-z0-9_.-]+)$",
    tostring(
        parsed.storage
    )
), graphics=tostring(
    parsed.graphics
), priceDollars=extract(
    "^[$](.*)$",
    1,
    tostring(
        parsed.price
    )
)

Where are we going with this regex business? Well, some of the values in those strings are numerical values we might want to sort the table of computers by. Tack the following onto your command pipeline, and you’ll see how we create new columns with the values we extracted:

| project id, brand, model, graphics, priceDollars, cpuCores=case(
    cpuSpecs[0][0] =~ "quad", 4,
    cpuSpecs[0][0] =~ "hexa", 6,
    cpuSpecs[0][0] =~ "octa", 8,
    -1
), cpuGHz=todouble(
    cpuSpecs[0][1]
),
storageSize=toint(
    storage[0][0]
), storageUnit=storage[0][1], storageType=storage[0][2]
| extend storageGB=case(
    storageUnit =~ "GB",
    storageSize,
    storageSize*1000
)
| project-away storageSize, storageUnit

This kind of became about more than just JSON parsing. BUT, do you see all the stuff we could do that we otherwise couldn’t without the parsing step? At least, we couldn’t have done so unless we wanted to get a lot more complicated with regex and parsing logic.
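
For instance, just as a sketch, you could tack something like this onto the end of the pipeline to summarize price by core count, which would have been miserable against the raw JSON strings (the column aliases are arbitrary):

// average price and machine count per CPU core count (priceDollars is still a string, so convert it)
| summarize avgPriceDollars = avg(todouble(priceDollars)), machines = count() by cpuCores
| sort by cpuCores asc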

Sometimes, the data in KQL tables changes. Some dev you haven’t met made a change, and now your query is broken because the records in the table look like this now:

datatable(data:string)[
    @"{""SKU"":""CMP067"",""data"":""{\""id\"": 1, \""brand\"": \""TechMaster\"", \""model\"": \""X9000\"", \""cpu\"": \""QuadCore 3.0GHz\"", \""ram\"": \""16GB\"", \""storage\"": \""512GB SSD\"", \""graphics\"": \""NVIDIA GTX 1080\"", \""price\"": \""$1299.99\""}""}",
    @"{""SKU"":""CMP048"",""data"":""{\""id\"": 2, \""brand\"": \""QuantumTech\"", \""model\"": \""UltraCore\"", \""cpu\"": \""OctaCore 2.5GHz\"", \""ram\"": \""32GB\"", \""storage\"": \""1TB HDD\"", \""graphics\"": \""AMD RX 5700\"", \""price\"": \""$1499.99\""}""}",
    @"{""SKU"":""CMP088"",""data"":""{\""id\"": 3, \""brand\"": \""CyberFusion\"", \""model\"": \""StealthX\"", \""cpu\"": \""HexaCore 2.8GHz\"", \""ram\"": \""64GB\"", \""storage\"": \""256GB NVMe\"", \""graphics\"": \""GeForce RTX 3060\"", \""price\"": \""$1699.99\""}""}",
    @"{""SKU"":""CMP090"",""data"":""{\""id\"": 4, \""brand\"": \""SysGenius\"", \""model\"": \""ElitePro\"", \""cpu\"": \""QuadCore 3.2GHz\"", \""ram\"": \""8GB\"", \""storage\"": \""1TB SSD\"", \""graphics\"": \""Intel UHD Graphics\"", \""price\"": \""$1099.99\""}""}",
    @"{""SKU"":""CMP017"",""data"":""{\""id\"": 5, \""brand\"": \""MegaByte\"", \""model\"": \""PowerSpeed\"", \""cpu\"": \""OctaCore 3.5GHz\"", \""ram\"": \""128GB\"", \""storage\"": \""2TB HDD\"", \""graphics\"": \""NVIDIA RTX 3080\"", \""price\"": \""$1999.99\""}""}",
    @"{""SKU"":""CMP096"",""data"":""{\""id\"": 6, \""brand\"": \""NeoGaming\"", \""model\"": \""TitanX\"", \""cpu\"": \""HexaCore 3.0GHz\"", \""ram\"": \""16GB\"", \""storage\"": \""512GB NVMe\"", \""graphics\"": \""AMD RX 5600 XT\"", \""price\"": \""$1299.99\""}""}",
    @"{""SKU"":""CMP094"",""data"":""{\""id\"": 7, \""brand\"": \""Futurix\"", \""model\"": \""FusionWave\"", \""cpu\"": \""QuadCore 2.7GHz\"", \""ram\"": \""32GB\"", \""storage\"": \""1TB SSD\"", \""graphics\"": \""GeForce GTX 1660\"", \""price\"": \""$1399.99\""}""}",
    @"{""SKU"":""CMP017"",""data"":""{\""id\"": 8, \""brand\"": \""TechWave\"", \""model\"": \""SwiftForce\"", \""cpu\"": \""OctaCore 2.2GHz\"", \""ram\"": \""64GB\"", \""storage\"": \""256GB SSD\"", \""graphics\"": \""NVIDIA GTX 1070\"", \""price\"": \""$1599.99\""}""}",
    @"{""SKU"":""CMP067"",""data"":""{\""id\"": 9, \""brand\"": \""QuantumPulse\"", \""model\"": \""Xplorer\"", \""cpu\"": \""HexaCore 2.5GHz\"", \""ram\"": \""8GB\"", \""storage\"": \""512GB HDD\"", \""graphics\"": \""AMD RX 550\"", \""price\"": \""$1099.99\""}""}",
    @"{""SKU"":""CMP024"",""data"":""{\""id\"": 10, \""brand\"": \""EpicGen\"", \""model\"": \""ThunderBlaze\"", \""cpu\"": \""QuadCore 3.5GHz\"", \""ram\"": \""128GB\"", \""storage\"": \""2TB SSD\"", \""graphics\"": \""Intel Iris Xe\"", \""price\"": \""$1899.99\""}""}",
    @"{""SKU"":""CMP069"",""data"":""{\""id\"": 11, \""brand\"": \""CyberPulse\"", \""model\"": \""NovaSpeed\"", \""cpu\"": \""OctaCore 3.2GHz\"", \""ram\"": \""16GB\"", \""storage\"": \""1TB NVMe\"", \""graphics\"": \""NVIDIA RTX 3070\"", \""price\"": \""$1799.99\""}""}",
    @"{""SKU"":""CMP071"",""data"":""{\""id\"": 12, \""brand\"": \""TechFusion\"", \""model\"": \""VelocityX\"", \""cpu\"": \""HexaCore 2.8GHz\"", \""ram\"": \""32GB\"", \""storage\"": \""512GB SSD\"", \""graphics\"": \""AMD RX 570\"", \""price\"": \""$1299.99\""}""}",
    @"{""SKU"":""CMP019"",""data"":""{\""id\"": 13, \""brand\"": \""InnoTech\"", \""model\"": \""BlazePro\"", \""cpu\"": \""QuadCore 2.5GHz\"", \""ram\"": \""64GB\"", \""storage\"": \""1TB HDD\"", \""graphics\"": \""GeForce GTX 1650\"", \""price\"": \""$1499.99\""}""}",
    @"{""SKU"":""CMP056"",""data"":""{\""id\"": 14, \""brand\"": \""FusionTech\"", \""model\"": \""SwiftNova\"", \""cpu\"": \""OctaCore 3.0GHz\"", \""ram\"": \""8GB\"", \""storage\"": \""256GB NVMe\"", \""graphics\"": \""NVIDIA GTX 1060\"", \""price\"": \""$1199.99\""}""}",
    @"{""SKU"":""CMP044"",""data"":""{\""id\"": 15, \""brand\"": \""MegaTech\"", \""model\"": \""TurboWave\"", \""cpu\"": \""HexaCore 2.2GHz\"", \""ram\"": \""128GB\"", \""storage\"": \""2TB SSD\"", \""graphics\"": \""AMD RX 580\"", \""price\"": \""$1699.99\""}""}",
    @"{""SKU"":""CMP019"",""data"":""{\""id\"": 16, \""brand\"": \""QuantumFusion\"", \""model\"": \""RapidX\"", \""cpu\"": \""QuadCore 3.2GHz\"", \""ram\"": \""16GB\"", \""storage\"": \""512GB HDD\"", \""graphics\"": \""Intel UHD Graphics\"", \""price\"": \""$1299.99\""}""}",
    @"{""SKU"":""CMP014"",""data"":""{\""id\"": 17, \""brand\"": \""EpicPulse\"", \""model\"": \""BlitzTech\"", \""cpu\"": \""OctaCore 2.8GHz\"", \""ram\"": \""32GB\"", \""storage\"": \""1TB SSD\"", \""graphics\"": \""NVIDIA RTX 3060\"", \""price\"": \""$1499.99\""}""}",
    @"{""SKU"":""CMP084"",""data"":""{\""id\"": 18, \""brand\"": \""TechNova\"", \""model\"": \""XtremeSpeed\"", \""cpu\"": \""HexaCore 3.5GHz\"", \""ram\"": \""64GB\"", \""storage\"": \""256GB SSD\"", \""graphics\"": \""GeForce GTX 1660\"", \""price\"": \""$1599.99\""}""}",
    @"{""SKU"":""CMP046"",""data"":""{\""id\"": 19, \""brand\"": \""CyberBlaze\"", \""model\"": \""ThunderX\"", \""cpu\"": \""QuadCore 2.2GHz\"", \""ram\"": \""8GB\"", \""storage\"": \""512GB SSD\"", \""graphics\"": \""AMD RX 5600 XT\"", \""price\"": \""$1199.99\""}""}",
    @"{""SKU"":""CMP016"",""data"":""{\""id\"": 20, \""brand\"": \""FutureTech\"", \""model\"": \""SwiftFusion\"", \""cpu\"": \""HexaCore 3.0GHz\"", \""ram\"": \""128GB\"", \""storage\"": \""1TB NVMe\"", \""graphics\"": \""NVIDIA GTX 1070\"", \""price\"": \""$1799.99\""}""}"
]
| project data=parse_json(data)

Even after you parse the JSON, most of the data is still inside a single string!

This is nested JSON. It means after an object was serialized to JSON, a developer assigned that JSON string to a field inside a different object. Then, that other object was serialized to JSON.

No worries, just run the parser again on that nested JSON payload! The tail end of the pipeline now looks like this:

| project data=parse_json(data)
| project SKU=tostring(data.SKU), parsed=parse_json(tostring(data.data))

See the (seemingly) extra call to tostring near the end there? That has to happen because at that point in the pipeline data.data is already a dynamic value, and parse_json passes dynamic input through as-is; casting it to a string forces the nested JSON to actually be parsed.

Note that these two commands can simply be inserted ahead of most of the rest of the old query to make it work just like it used to, owing to our choice of column names.

I call this “shimming the old query”, though there’s probably already a term for this. Your table changed, so rather than re-engineering the whole query, you insert some logic early in the pipeline to make the new table “fit” the old query. It’s a handy technique that allows you to preserve the intentions of a previous query author without inserting your own opinions.

Only a slight modification is needed to let the SKU persist through to the final table. Here’s the full pipeline, duplicate SKUs and all ;)

datatable(data:string)[
    @"{""SKU"":""CMP067"",""data"":""{\""id\"": 1, \""brand\"": \""TechMaster\"", \""model\"": \""X9000\"", \""cpu\"": \""QuadCore 3.0GHz\"", \""ram\"": \""16GB\"", \""storage\"": \""512GB SSD\"", \""graphics\"": \""NVIDIA GTX 1080\"", \""price\"": \""$1299.99\""}""}",
    @"{""SKU"":""CMP048"",""data"":""{\""id\"": 2, \""brand\"": \""QuantumTech\"", \""model\"": \""UltraCore\"", \""cpu\"": \""OctaCore 2.5GHz\"", \""ram\"": \""32GB\"", \""storage\"": \""1TB HDD\"", \""graphics\"": \""AMD RX 5700\"", \""price\"": \""$1499.99\""}""}",
    @"{""SKU"":""CMP088"",""data"":""{\""id\"": 3, \""brand\"": \""CyberFusion\"", \""model\"": \""StealthX\"", \""cpu\"": \""HexaCore 2.8GHz\"", \""ram\"": \""64GB\"", \""storage\"": \""256GB NVMe\"", \""graphics\"": \""GeForce RTX 3060\"", \""price\"": \""$1699.99\""}""}",
    @"{""SKU"":""CMP090"",""data"":""{\""id\"": 4, \""brand\"": \""SysGenius\"", \""model\"": \""ElitePro\"", \""cpu\"": \""QuadCore 3.2GHz\"", \""ram\"": \""8GB\"", \""storage\"": \""1TB SSD\"", \""graphics\"": \""Intel UHD Graphics\"", \""price\"": \""$1099.99\""}""}",
    @"{""SKU"":""CMP017"",""data"":""{\""id\"": 5, \""brand\"": \""MegaByte\"", \""model\"": \""PowerSpeed\"", \""cpu\"": \""OctaCore 3.5GHz\"", \""ram\"": \""128GB\"", \""storage\"": \""2TB HDD\"", \""graphics\"": \""NVIDIA RTX 3080\"", \""price\"": \""$1999.99\""}""}",
    @"{""SKU"":""CMP096"",""data"":""{\""id\"": 6, \""brand\"": \""NeoGaming\"", \""model\"": \""TitanX\"", \""cpu\"": \""HexaCore 3.0GHz\"", \""ram\"": \""16GB\"", \""storage\"": \""512GB NVMe\"", \""graphics\"": \""AMD RX 5600 XT\"", \""price\"": \""$1299.99\""}""}",
    @"{""SKU"":""CMP094"",""data"":""{\""id\"": 7, \""brand\"": \""Futurix\"", \""model\"": \""FusionWave\"", \""cpu\"": \""QuadCore 2.7GHz\"", \""ram\"": \""32GB\"", \""storage\"": \""1TB SSD\"", \""graphics\"": \""GeForce GTX 1660\"", \""price\"": \""$1399.99\""}""}",
    @"{""SKU"":""CMP017"",""data"":""{\""id\"": 8, \""brand\"": \""TechWave\"", \""model\"": \""SwiftForce\"", \""cpu\"": \""OctaCore 2.2GHz\"", \""ram\"": \""64GB\"", \""storage\"": \""256GB SSD\"", \""graphics\"": \""NVIDIA GTX 1070\"", \""price\"": \""$1599.99\""}""}",
    @"{""SKU"":""CMP067"",""data"":""{\""id\"": 9, \""brand\"": \""QuantumPulse\"", \""model\"": \""Xplorer\"", \""cpu\"": \""HexaCore 2.5GHz\"", \""ram\"": \""8GB\"", \""storage\"": \""512GB HDD\"", \""graphics\"": \""AMD RX 550\"", \""price\"": \""$1099.99\""}""}",
    @"{""SKU"":""CMP024"",""data"":""{\""id\"": 10, \""brand\"": \""EpicGen\"", \""model\"": \""ThunderBlaze\"", \""cpu\"": \""QuadCore 3.5GHz\"", \""ram\"": \""128GB\"", \""storage\"": \""2TB SSD\"", \""graphics\"": \""Intel Iris Xe\"", \""price\"": \""$1899.99\""}""}",
    @"{""SKU"":""CMP069"",""data"":""{\""id\"": 11, \""brand\"": \""CyberPulse\"", \""model\"": \""NovaSpeed\"", \""cpu\"": \""OctaCore 3.2GHz\"", \""ram\"": \""16GB\"", \""storage\"": \""1TB NVMe\"", \""graphics\"": \""NVIDIA RTX 3070\"", \""price\"": \""$1799.99\""}""}",
    @"{""SKU"":""CMP071"",""data"":""{\""id\"": 12, \""brand\"": \""TechFusion\"", \""model\"": \""VelocityX\"", \""cpu\"": \""HexaCore 2.8GHz\"", \""ram\"": \""32GB\"", \""storage\"": \""512GB SSD\"", \""graphics\"": \""AMD RX 570\"", \""price\"": \""$1299.99\""}""}",
    @"{""SKU"":""CMP019"",""data"":""{\""id\"": 13, \""brand\"": \""InnoTech\"", \""model\"": \""BlazePro\"", \""cpu\"": \""QuadCore 2.5GHz\"", \""ram\"": \""64GB\"", \""storage\"": \""1TB HDD\"", \""graphics\"": \""GeForce GTX 1650\"", \""price\"": \""$1499.99\""}""}",
    @"{""SKU"":""CMP056"",""data"":""{\""id\"": 14, \""brand\"": \""FusionTech\"", \""model\"": \""SwiftNova\"", \""cpu\"": \""OctaCore 3.0GHz\"", \""ram\"": \""8GB\"", \""storage\"": \""256GB NVMe\"", \""graphics\"": \""NVIDIA GTX 1060\"", \""price\"": \""$1199.99\""}""}",
    @"{""SKU"":""CMP044"",""data"":""{\""id\"": 15, \""brand\"": \""MegaTech\"", \""model\"": \""TurboWave\"", \""cpu\"": \""HexaCore 2.2GHz\"", \""ram\"": \""128GB\"", \""storage\"": \""2TB SSD\"", \""graphics\"": \""AMD RX 580\"", \""price\"": \""$1699.99\""}""}",
    @"{""SKU"":""CMP019"",""data"":""{\""id\"": 16, \""brand\"": \""QuantumFusion\"", \""model\"": \""RapidX\"", \""cpu\"": \""QuadCore 3.2GHz\"", \""ram\"": \""16GB\"", \""storage\"": \""512GB HDD\"", \""graphics\"": \""Intel UHD Graphics\"", \""price\"": \""$1299.99\""}""}",
    @"{""SKU"":""CMP014"",""data"":""{\""id\"": 17, \""brand\"": \""EpicPulse\"", \""model\"": \""BlitzTech\"", \""cpu\"": \""OctaCore 2.8GHz\"", \""ram\"": \""32GB\"", \""storage\"": \""1TB SSD\"", \""graphics\"": \""NVIDIA RTX 3060\"", \""price\"": \""$1499.99\""}""}",
    @"{""SKU"":""CMP084"",""data"":""{\""id\"": 18, \""brand\"": \""TechNova\"", \""model\"": \""XtremeSpeed\"", \""cpu\"": \""HexaCore 3.5GHz\"", \""ram\"": \""64GB\"", \""storage\"": \""256GB SSD\"", \""graphics\"": \""GeForce GTX 1660\"", \""price\"": \""$1599.99\""}""}",
    @"{""SKU"":""CMP046"",""data"":""{\""id\"": 19, \""brand\"": \""CyberBlaze\"", \""model\"": \""ThunderX\"", \""cpu\"": \""QuadCore 2.2GHz\"", \""ram\"": \""8GB\"", \""storage\"": \""512GB SSD\"", \""graphics\"": \""AMD RX 5600 XT\"", \""price\"": \""$1199.99\""}""}",
    @"{""SKU"":""CMP016"",""data"":""{\""id\"": 20, \""brand\"": \""FutureTech\"", \""model\"": \""SwiftFusion\"", \""cpu\"": \""HexaCore 3.0GHz\"", \""ram\"": \""128GB\"", \""storage\"": \""1TB NVMe\"", \""graphics\"": \""NVIDIA GTX 1070\"", \""price\"": \""$1799.99\""}""}"
]
| project data=parse_json(data)
| project SKU=tostring(data.SKU), parsed=parse_json(tostring(data.data))
| project SKU, id=toint(
    parsed.id
), brand=tostring(
    parsed.brand
), model=tostring(
    parsed.model
), cpuSpecs=extract_all(
    "^([Qq][Uu][Aa][Dd]|[Hh][Ee][Xx][Aa]|[Oo][Cc][Tt][Aa])[Cc][Oo][Rr][Ee][ ](\\d+[.]?\\d*)GHz$",
    tostring(
        parsed.cpu
    )
), ramGB=extract(
    "^([0-9]+)[Gg][Bb]$",
    1,
    tostring(
        parsed.ram
    )
), storage=extract_all(
    "^([0-9]+)([GgTt][Bb])[ ]([A-Za-z0-9_.-]+)$",
    tostring(
        parsed.storage
    )
), graphics=tostring(
    parsed.graphics
), priceDollars=extract(
    "^[$](.*)$",
    1,
    tostring(
        parsed.price
    )
)
| project SKU, id, brand, model, graphics, priceDollars, cpuCores=case(
    cpuSpecs[0][0] =~ "quad", 4,
    cpuSpecs[0][0] =~ "hexa", 6,
    cpuSpecs[0][0] =~ "octa", 8,
    -1
), cpuGHz=todouble(
    cpuSpecs[0][1]
),
storageSize=toint(
    storage[0][0]
), storageUnit=storage[0][1], storageType=storage[0][2]
| extend storageGB=case(
    storageUnit =~ "GB",
    storageSize,
    storageSize*1000
)
| project-away storageSize, storageUnit

Custom File Formats

There are many developers you may encounter out in the wild who don’t have an overly refined sense of when they are departing from the safer realms of well-defined, grammatically correct file formats. This isn’t a dig on such folks, but they often have other priorities. Consequently, you may find the data you’re looking for is encapsulated within a file format that doesn’t have a ready-made parser for you within KQL. When that happens, you’ll have to effectively compose an ad-hoc parsing routine in order to extract the data in question into columns.

The example that follows is admittedly contrived. BUT, it’s a perfect example of the kind of grammatically ill-defined language you might encounter in the wild. It would be ideal for us if all logging formats were in perfectly tabular formats, but alas the real world is dirtier than that. This lesson is about not letting unusual types of string encoding and logging formats get in the way of your quest to comprehend data in a statistically valid way.

Given the following data table, you may be tempted to try to write a master regex pattern to extract all of the relevant data at once. Not so fast! Part of being a good user of regex is knowing its limits. Even if it were possible given the limited nature of the next example, imagine the following table instead had 40 million records; in such situations, it can be commonplace to encounter bizarre edge cases. I submit to you that it’s good to be able to use regex (and we will) but also just as important to use standard parsing features when they’re already available.

Before we start coding, let’s have a gander at the lunacy in the table below:

let looneyLogs = datatable(doc:string)[
    "computer = server016; <batches><batch>{\"DurationMS\": 90, \"Payload\": \"- Name: John Doe\\n  DOB: 1990-05-15\\n  Email: john.doe@email.com\\n- Name: Jane Smith\\n  DOB: 1985-08-22\\n  Email: jane.smith@email.com\\n- Name: Alex Brown\\n  DOB: 1995-11-10\\n  Email: alex.brown@email.com\"}</batch></batches>",
    "computer = server081; <batches><batch>{\"DurationMS\": 81, \"Payload\": \"- name: Emily White\\n  dob: 1992-09-18\\n  email: emily.white@email.com\\n- name: Mark Johnson\\n  dob: 1980-03-05\\n  email: mark.johnson@email.com\\n- name: Sarah Miller\\n  dob: 1995-11-30\\n  email: sarah.miller@email.com\"}</batch><batch>{\"DurationMS\": 65, \"Payload\": \"- name: Alex Turner\\n  dob: 1987-06-24\\n  email: alex.turner@email.com\\n- name: Jessica Adams\\n  dob: 1994-12-15\\n  email: jessica.adams@email.com\\n- name: Brian Parker\\n  dob: 1982-08-03\\n  email: brian.parker@email.com\\n- name: Rachel Lee\\n  dob: 1998-02-09\\n  email: rachel.lee@email.com\\n- name: Kevin Scott\\n  dob: 1989-05-17\\n  email: kevin.scott@email.com\"}</batch></batches>",
    "computer = server056; <batches><batch>{\"DurationMS\": 72, \"Payload\": \"- name: Olivia Jenkins\\n  dob: 1991-10-07\\n  email: olivia.jenkins@email.com\\n- name: Marcus Turner\\n  dob: 1985-04-18\\n  email: marcus.turner@email.com\"}</batch></batches>",
    "computer = server081; <batches><batch>{\"DurationMS\": 15, \"Payload\": \"- name: Laura Davis\\n  dob: 1993-08-26\\n  email: laura.davis@email.com\\n- name: Ethan Wilson\\n  dob: 1984-12-03\\n  email: ethan.wilson@email.com\\n- name: Amanda Parker\\n  dob: 1990-02-14\\n  email: amanda.parker@email.com\\n- name: Tyler Mitchell\\n  dob: 1988-06-11\\n  email: tyler.mitchell@email.com\"}</batch><batch>{\"DurationMS\": 48, \"Payload\": \"- name: Grace Roberts\\n  dob: 1997-03-29\\n  email: grace.roberts@email.com\"}</batch><batch>{\"DurationMS\": 7, \"Payload\": \"- name: Benjamin Hall\\n  dob: 1981-09-22\\n  email: benjamin.hall@email.com\\n- name: Natalie Adams\\n  dob: 1996-11-15\\n  email: natalie.adams@email.com\\n- name: Christopher Hill\\n  dob: 1986-07-01\\n  email: christopher.hill@email.com\"}</batch></batches>",
    "computer = server056; <batches><batch>{\"DurationMS\": 33, \"Payload\": \"- name: Lily Moore\\n  dob: 1999-01-12\\n  email: lily.moore@email.com\\n- name: Jackson Brown\\n  dob: 1983-05-04\\n  email: jackson.brown@email.com\\n- name: Chloe Taylor\\n  dob: 1995-09-17\\n  email: chloe.taylor@email.com\"}</batch><batch>{\"DurationMS\": 13, \"Payload\": \"- name: Daniel Wright\\n  dob: 1989-12-30\\n  email: daniel.wright@email.com\\n- name: Sophia Smith\\n  dob: 1992-04-23\\n  email: sophia.smith@email.com\"}</batch></batches>",
    "computer = server016; <batches><batch>{\"DurationMS\": 9, \"Payload\": \"- name: Ryan Miller\\n  dob: 1987-11-08\\n  email: ryan.miller@email.com\"}</batch></batches>",
    "computer = server091; <batches><batch>{\"DurationMS\": 8, \"Payload\": \"- name: Emma Turner\\n  dob: 1994-06-20\\n  email: emma.turner@email.com\\n- name: Noah Johnson\\n  dob: 1980-02-01\\n  email: noah.johnson@email.com\\n- name: Madison Scott\\n  dob: 1998-07-14\\n  email: madison.scott@email.com\\n- name: Owen Harris\\n  dob: 1982-10-27\\n  email: owen.harris@email.com\"}</batch></batches>"];

First, it’s kind of a lot of text, so I’ve assigned it to a variable named looneyLogs, and hereafter in this article I’ll just refer to that table name. When you’re practicing, feel free to pipe the datatable directly into the rest of your query; I’m only doing this to avoid excessive copying and pasting.

Second, these logs are kind of an eyesore! True to the situations you might find yourself in out in the wilds of computing, I’m not going to provide you with a formal grammar describing the language in these strings/documents.

But, I will give you an informal grammar, the kind you might come up with scanning through the list:

  1. Each document starts with computer = someServerName, terminated by a semicolon.
  2. After the semicolon comes an XML fragment: a batches element wrapping one or more batch elements.
  3. The text content of each batch element is a JSON object with two fields: DurationMS (a number) and Payload (a string).
  4. Each Payload string looks like a YAML list of person records, each spanning three lines: name, dob, and email (capitalization varies).

We have our work cut out for us. Let’s parse out the info contained in these monstrous strings into a beautiful little table. If you’re coding along at home and don’t want the answer spoiled, now’s the time to pause and try it out yourself!

looneyLogs
| extend docId=new_guid()

Since I’m not 100% sure whether it’s important to logically associate the ultimate rows in the resulting table with the original records they came from, I’m playing it safe and assigning each document a unique ID called docId.

If you hadn’t heard of it before, a guid is a globally unique identifier: a really large (128-bit) random value you can use as an id with very low risk of accidentally repeating the number. Most languages and platforms have a way of making these.
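
If you’ve never seen one, you can conjure one up on the spot in KQL:

// prints something like 8f4e2c1a-...; every run gives a different value
print new_guid()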

Next, let’s use some regex to save the computer name and xml document parts of the string into their own columns! First the computer name:

| extend computerName=extract(
    "^[cC]omputer[ ]*[=][ ]*([^;]+)[;][ ]*", 
    1, 
    doc
)

We’re letting it start with a lower or uppercase c to be permissive. Keeping with the permissive theme, we’ll allow any number of spaces (including 0) before and after the =.

We want to start capturing a sequence of one or more characters and keep going as long as each character is not a semicolon; that will be our computer name! Then we match the final semicolon and any optional spaces afterward.
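
You can sanity-check the pattern against a stripped-down, made-up input before pointing it at the real column:

// should print: server016
print extract("^[cC]omputer[ ]*[=][ ]*([^;]+)[;][ ]*", 1, "computer = server016; <batches></batches>")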

That’s it for capturing the computer name. Next, we will re-use, modify, and extend the same pattern to extract the xml part of the document. Technically, we could use a single regex pattern to extract both things in one shot, but it’s also okay to repeat yourself a little bit in the early stages of trying to solve a problem. You can always tidy up and optimize later.

| extend xmlDocument=extract(
    "^[cC]omputer[ ]*[=][ ]*[^;]+[;][ ]*([\\s\\S]+)$",
    1,
    doc
)
| project docId, computerName, xmlDocument

We use the same strategy to match all the way through what we matched last time, only without the capturing group around the portion that matches the computer name. Afterwards we append a capturing group to match all characters (both whitespace and non-whitespace) through the end of the string. The actual capturing group is ([\s\S]+), but in the kind of string we are using to specify our regex pattern backslashes are an escape character, so we needed to provide double backslashes instead.
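
As an aside, KQL also has verbatim string literals (prefixed with @, like the datatable rows above) in which backslashes are treated literally, so the same extract could be written with single backslashes if you prefer:

// equivalent to the pattern above, just using a verbatim string so no double backslashes
| extend xmlDocument=extract(
    @"^[cC]omputer[ ]*[=][ ]*[^;]+[;][ ]*([\s\S]+)$",
    1,
    doc
)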

It’s true there is some redundancy between these two consecutive regex patterns. They could be consolidated into a single pattern, but it is okay to keep things a little sketchy when you’re first developing a solution.

Finally, we list the only columns we still care about to get rid of the original doc column. Congrats! We’re well on our way to exorcising the formatting demons out of this poor information.

We’re not writing a custom parsing routine for XML, especially since there is already such functionality out of the box in KQL. Though that’s a fun idea for another day. Invoking the XML parser is easy:

| extend parsedXML=parse_xml(xmlDocument)

Great, now the new parsedXML column is a kind of KQL data structure containing the contents of the XML document. I say data structure, but of course I mean that in the more colloquial sense of a hodgepodge of hash tables and arrays.

We can see inside it now the keys representing the old batches and batch elements. It looks like the parser consolidated the cases of multiple batch elements into a single key called batch holding an array where each element is the inner content from one of the batch elements (i.e., a JSON string). That makes these batch elements what is known as a multi-valued field, since that field in a given single record can itself contain multiple values.

So how do we cope with this? At this point, the way forward is going to be to split these records based on that multi-value field. A record with n values in its multi-valued field will be “cloned” (I use that word loosely, we never want to assume too much about what KQL does under the hood) so there are n copies of it in the table; each of the n records will be given one of the values from that multi-valued field.
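
If the idea is new to you, here’s a toy table (entirely made up) you can expand to see the cloning behavior before we apply it to the real data:

// two rows in, four rows out: one row per element of the multi-valued column
datatable(id:int, vals:dynamic)[
    1, dynamic(["a", "b", "c"]),
    2, dynamic(["x"])
]
| mv-expand vals to typeof(string)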

Make sense? If not, try running the pipeline of commands we have built with and without the following mv-expand command to develop a sense of what is happening:

| mv-expand jsonDocument=parsedXML.batches.batch to typeof(string)

You can see that jsonDocument is the name of the new column, and we’re assigning to it a reference to the multi-valued batch field: parsedXML.batches.batch. And just to be clear, after this multi-value field is “expanded”, we want the resulting type stored in the jsonDocument column to be a string.

Whew!

Let’s give each of the records an id to differentiate which batch element any further data came from, in case we split out more data using multi-valued field techniques (ahem, foreshadowing, ahem). Also, it’s maybe a good time for a fresh call to project to declutter our table a bit before proceeding.

| extend batchId=new_guid()
| project docId, computerName, jsonDocument, batchId

Now we have to pull values out of the JSON.

Are we going to use custom regex-based parsing logic to pull the key-value pairs out of the JSON strings?

No, we are not. Mainly because KQL already has off-the-shelf JSON parsing functions (similar to the XML one we used), the JSON strings do seem well-formed and so compatible with said functions, and also because we aren’t crazy.

| extend parsedJSON=parse_json(jsonDocument)
| extend DurationMS=parsedJSON.DurationMS
| extend Payload=parsedJSON.Payload

Boom! Now we pulled out the two fields from each JSON string into their own columns. We are in the home stretch.

The Payload column contains what we earlier surmised to be sets of YAML formatted records. Feels like we’ll need to separate these records, this time without the help of a dedicated parsing function.

This part can be done a number of different ways. My approach will be tailored to demonstrating another multi-value function called mv-apply:

| extend lines=split(Payload, "\n")
| mv-apply with_itemindex=index recordIndexes=lines on (
    where index % 3 == 0
    | extend recordId=new_guid()
    | project index, recordId
)

We split the Payload field on newline characters and assign the resulting array of strings to the lines column. mv-apply is similar to mv-expand, but we gain the ability to run a subquery (between the parentheses) which can reduce the number of records the original multi-valued field is expanded into.

In our case, we want one record returned for each logical record in the YAML string. A logical record in the YAML string comprises three lines, or three slots in the lines array. We may not know how many records are in each YAML string, but it’s a safe bet the line count of all the records will be a multiple of three.

So if we have n lines, we want to split the multi-valued field into n / 3 records, where each record has a new column containing one of the starting indices of the records in the YAML string. Using that piece of information along with the lines array, we can derive whole records. In addition to the index column we generate in the subquery, we also generate an ID for each record in the YAML string.

Before we move on, note the index field in the where clause within the subquery. This field contains the position within the multi-valued field’s array that the subquery is currently on; unless that value is a multiple of three (which includes the first slot, index zero), we don’t want it!
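
If mv-apply still feels abstract, here’s a toy version (with a completely made-up array, and arbitrary aliases item and firstLine) that mirrors what we’re doing: six lines in one record become two records, one per group of three lines:

datatable(id:int, arr:dynamic)[
    1, dynamic(["- name: A", "  dob: ...", "  email: ...", "- name: B", "  dob: ...", "  email: ..."])
]
| mv-apply with_itemindex=index item=arr on (
    where index % 3 == 0          // keep only the slots that start a logical record
    | project index, firstLine=item
)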

Now we can use the index to target the appropriate slots in the lines array to extract the remaining values via regex. Starting with name:

| extend name=extract(
    "^[-][ ][nN]ame[:][ ]+([\\s\\S]+)$",
    1,
    tostring(lines[index])
)

Nothing new is introduced in this regex pattern; it should be old hat to you by now! The pattern targets the string in the lines array at the position held in the index column.

The process for extracting the DOB is similar, except it lives in the subsequent slot of the lines array. And finally, the email can be extracted from the slot just after that.

| extend DOB=extract(
    "^[ ]+[dD][oO][bB][:][ ]+([\\s\\S]+)$",
    1,
    tostring(lines[index+1])
)
| extend email=extract(
    "^[ ]+[eE]mail[:][ ]+([\\s\\S]+)$",
    1,
    tostring(lines[index+2])
)

Great! We literally have all our columns now! Let’s clean up the table a bit:

| project-away jsonDocument, parsedJSON, Payload, lines, index

And the final version of the table is beautiful to behold. Remember the monstrosity we started with? This is way better! We could actually do stats on these values now.
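
For example, tacking something like this onto the end of the pipeline (just a sketch, with arbitrary column aliases) gives a quick summary that would have been hopeless against the original blobs:

// how many parsed people and distinct batches each computer produced
| summarize people = count(), batches = dcount(tostring(batchId)) by computerName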

Here’s our final query:

looneyLogs
| extend docId=new_guid()
| extend computerName=extract(
    "^[cC]omputer[ ]*[=][ ]*([^;]+)[;][ ]*", 
    1, 
    doc
)
| extend xmlDocument=extract(
    "^[cC]omputer[ ]*[=][ ]*[^;]+[;][ ]*([\\s\\S]+)$",
    1,
    doc
)
| project docId, computerName, xmlDocument
| extend parsedXML=parse_xml(xmlDocument)
| mv-expand jsonDocument=parsedXML.batches.batch to typeof(string)
| extend batchId=new_guid()
| project docId, computerName, jsonDocument, batchId
| extend parsedJSON=parse_json(jsonDocument)
| extend DurationMS=parsedJSON.DurationMS
| extend Payload=parsedJSON.Payload
| extend lines=split(Payload, "\n")
| mv-apply with_itemindex=index recordIndexes=lines on (
    where index % 3 == 0
    | extend recordId=new_guid()
    | project index, recordId
)
| extend name=extract(
    "^[-][ ][nN]ame[:][ ]+([\\s\\S]+)$",
    1,
    tostring(lines[index])
)
| extend DOB=extract(
    "^[ ]+[dD][oO][bB][:][ ]+([\\s\\S]+)$",
    1,
    tostring(lines[index+1])
)
| extend email=extract(
    "^[ ]+[eE]mail[:][ ]+([\\s\\S]+)$",
    1,
    tostring(lines[index+2])
)
| project-away jsonDocument, parsedJSON, Payload, lines, index

In the end, it’s important to note that this table might not even be the ideal final destination format. There are a few criticisms you could make depending on what your goals are. For one, the dates could be split into separate numerical values for years, months, and days. I leave such refinements as optional exercises for the reader.

Footnotes


  1. https://dataexplorer.azure.com/clusters/help/databases/SampleLogs

  2. https://learn.microsoft.com/en-us/azure/data-explorer/kusto/query/datatypes-string-operators

  3. https://learn.microsoft.com/en-us/azure/data-explorer/kusto/query/re2

  4. https://learn.microsoft.com/en-us/azure/data-explorer/kusto/query/join-operator?pivots=azuredataexplorer

  5. https://learn.microsoft.com/en-us/azure/data-explorer/kusto/query/datetime-part-function