There is a 4+ hour version of this on YouTube. I encourage you to check it out.
The point of this is to introduce you to Kusto Query Language, or KQL. This isn’t a 100% coverage course on how every last corner of the language functions. Rather the intention is just to get you to a place of a reasonable degree of functional fluency that should allow you to start profiling the behavior of some system which is generating data in a store you can query with KQL. This probably shouldn’t be your first exposure to programming, but who am I to stop an eager learner.
My background with KQL and also similar query languages is mostly in an investigative capacity, troubleshooting problems by way of telemetry. My aim is to be general enough in my approach though that this will be a suitable introduction regardless of your use-case.
With such query languages, and maybe even programming languages in general, but especially with these types of query languages, I think it’s healthiest to approach them as you typically would a Role Playing Game, or RPG. In such games, there are different categories of skillsets, and usually you aim to level up all of the categories of skills at approximately the same rate for a balanced gameplay experience.
If I were to draft a list of such skill categories for KQL, I’d probably make it like this:
I’m going to try to keep this curriculum balanced so you develop each of the first three skillset categories to a working fluency. In the real world, domain knowledge about the business meaning of the data sets you’re querying will also be useful to you.
I advise you to follow along in the Azure Data Explorer Help Clusters.1 There are a bunch of test data tables in there provided by Microsoft, and in some cases I’ll provide fake tables of data I conjured up in a smattering of different scripting languages. It’ll be easier to understand the example queries I’m giving if you run them so you can see how tables are transformed rather than just imagining.
KQL is a language whose focus is transforming tables. The general idea is to provide a pipeline of commands, where the table resulting from the previous command is given as input to the next command in the pipeline. The operations you define in your commands are often performed on each row in the table at that point in the pipeline before moving on to the next command.
The first step to success with KQL is not grabbing the whole table, especially if the data you’re querying is absolutely massive in volume. This is also colloquially known as getting your head in the game. Try to think of yourself like a hunter, not one of those deep sea trawlers, okay? You’re like a detective hot on the very heels of a demented serial killer, not a band of hopeless traffic cops doing a dragnet by setting up roadblocks all over town. Seriously, get a grip!
There are two general ways to approach the problem of not grabbing an entire table at once.
Important distinction.
The first skill you should learn is limiting the results of your search regardless of the nature of the table’s data. There are a few ways to do this. A common time that you might use this skill is during your first encounter with a table you haven’t used before; if you haven’t seen the table before, you probably need to get a sense of what the records are like before writing more detailed queries. So you start by using the following 2 strategies to constrain the amount of data you’re pulling from the table.
take and limit are synonyms and do the same thing. This is the command in this category I use most commonly. You just give it an integer and in return it gives you that number of records from the table. Even if the table doesn't change in between subsequent runs of this command, you might not get the same set of records each time you run it. If I've never used a particular table before and just need a quick glance at what kind of things might be inside of it, I'm probably going to use this to return a small constant quantity of records to have a gander.
// help/SampleLogs
TraceLogs
| take 10
Let’s say instead of a constant number, you need a slightly more relative way of returning a limited quantity of records from the table. You want to limit how much data you’re grabbing, but rather than having to specify a literal number of rows you’d instead prefer to specify an approximate percentage of the records from the table.
The rand function as called below returns a random (I'm assuming pseudo-random) number in the range of 0.0 to 1.0. The where expression will run once for each record in the table (also, don't take that kind of statement too literally; while that may be the semantic meaning we are expressing in the query, we shouldn't assume too much about how this meaning is implemented behind-the-scenes) and generate a "random" number in that range; if the number produced is less than the literal numeric threshold we provided, then and only then will the record be included. Since this strategy involves odds, you should expect some margin of error in how many records you actually get back.
Make sure you do the back of the envelope math first before running such a query. To do that, you can run a query like the following first to assess how many records are in the table:
// help/SampleLogs
TraceLogs | count
Let’s say that returns 158K records and you’d be okay with grabbing 1.5K or so records from the table. Maybe then you’d try to sample like this:
// help/SampleLogs
TraceLogs
| where rand() < 0.01
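Since the filter is probabilistic, the number of rows you get back will drift a bit between runs. If you want to sanity-check the size of the sample you actually got, you can tack a count onto the end; a quick sketch reusing the same filter:
// help/SampleLogs
TraceLogs
| where rand() < 0.01
| count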
We now enter the domain of content-aware filtering. I made up that term just like content-ignorant filtering :D
We’ll go over more of the comparison operators later, but here are a
few examples of KQL queries utilizing where
conditions:
// help/SampleLogs
TraceLogs
| where Component startswith "INGESTOR_EXECUTER"
| take 10
// help/SampleLogs
TraceLogs
| where ClientRequestId == "5a848f70-9996-eb17-15ed-21b8eb94bf0e" and Component startswith "INGESTOR_EXECUTER" | take 10
In this case you’re basically going to be using the where command to retrieve only the records in the table that meet a set of conditions you specify. If you’ve used languages like SQL before, you’ll probably be somewhat familiar with this part.
The thing you really have to be strictly careful of is avoiding the
narrow set of situations where your where
condition could
inadvertently force a full table scan. I mean this is computer science
after all, so you should always be considering how to avoid linear
searches. In KQL with the where
command, the main situation
where this could happen is when you write a where
condition
that is comparing two columns to one another (as opposed to one column
to a literal value as in the conditions above). The docs mention that
situation as one which can’t use any indexes and so will force a full
table scan. If you find yourself unable to avoid this situation, the
advice given is to always place that where
condition last
in your sequence of where
conditions, so that it is
scanning the smallest table size possible.
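Here's a sketch of what that ordering looks like in practice, using the ProcessEvents columns that appear later in this section; the column-to-column comparison (finding processes that share a name with their parent, chosen purely for illustration) goes last so it only scans rows that already passed the literal filter:
// help/SecurityLogs
ProcessEvents
| where parent_process_name =~ "cmd.exe"        // literal comparison: can use indexes
| where process_name == parent_process_name     // column-to-column comparison: scans what's left
| take 10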
The real thing you’re going to concern yourself with far more often
with where
clauses in KQL was just hinted at; it’s going to
be putting your most restrictive filters as early as possible in the
where
chain. This means if you have columns A and B, and
you have conditions pertaining to both of them, then the most sensical
order for specifying those conditions is to put the condition first that
tends to return fewer records.
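As a sketch, reusing the TraceLogs filters from above and assuming a single ClientRequestId matches far fewer rows than a whole Component prefix, the equality check goes first:
// help/SampleLogs
TraceLogs
| where ClientRequestId == "5a848f70-9996-eb17-15ed-21b8eb94bf0e"
| where Component startswith "INGESTOR_EXECUTER"
| take 10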
where Comparison Operators: ==, startswith, and has
Of course I use other operators, but these are the most common. Each one does come with a corresponding negated version: !=, !startswith, !has
This2 is the best summary in the docs of the string comparison operators. As the docs mention, prefer case-sensitive operators wherever possible for performance; but if it’s your first encounter with a table and you don’t have a good handle on the kinds of values in it yet you may want to start with case-insensitive operators and try converting to case-sensitive later.
Let’s go through some examples involving some of these operators on the TraceLogs table.
Here we're using a positive prefix filter and a negative suffix filter to try to find a few rows where the Component column starts with INGESTOR but ends with something besides "EXECUTER". You should find a sample of rows where the Component column contains values like "INGESTOR_GATEWAY".
// help/SampleLogs
TraceLogs
| where Component startswith_cs "INGESTOR" and Component !endswith_cs "EXECUTER"
| take 10
There is a limit to prefix and suffix searching though. Sometimes you need a slightly more stateful approach to filtering.
For that, you can use matches regex. Just use it mindfully: if it's not needed, you will probably pay a performance price for using it. By the way, in case the regex I'm going to show you goes over your head, don't feel bad; while I'm breezing through this now, there is a full intro to regex much later on, so consider this just a sneak peek.
Let’s say you want to pull only those records where the Node name ends in a number between 700 and 999 inclusive. For that you could search with a regex pattern like this:
// help/SampleLogs
TraceLogs
| where Node matches regex "^Engine[0]*[7-9][0-9]{2}$"
This pattern in natural language means each value in the Node column, from beginning to end, should look like this:
Engine - The text Engine
[0]* - 0 or more of the character 0
[7-9] - Exactly one of these three characters: 7, 8, or 9
[0-9]{2} - Exactly 2 characters, each from the set of digits, 0 through 9
If you need more complicated logic pertaining to the Engine numbers, you may want to extract the number into its own column and continue filtering.
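Here's a rough sketch of that approach. It uses the extract function to pull the trailing digits of Node into a new numeric column; extract is covered properly in the regex section later, so treat this as a preview:
// help/SampleLogs
TraceLogs
| extend EngineNumber=toint(extract("^Engine0*([0-9]+)$", 1, Node))
| where EngineNumber between (700 .. 999)
| take 10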
Let’s do one more example.
// help/SecurityLogs
OutboundBrowsing
| where url contains_cs "vertically-binds"
| take 10
This was one of the few cases I could legitimately find for contains instead of has. contains is supposed to be about matching any substring, while has is supposed to be about matching KQL's interpretation of alphanumeric substrings. In theory that makes it sound like contains should be more powerful, right? Maybe in theory, but in practice, I find KQL's contains support for matching punctuation to behave counterintuitively, so I prefer to stick to the realm of has until I need to make the jump to full-on matches regex. That's just personal preference though.
Sometimes, you’ll want to accept a record if the value in a particular column is a member of a particular set or list. There are a few ways.
The first way is simplest, where you provide a literal list of values
(such as an array of strings), indicating that a record may be accepted
if the value in the column you specify is somewhere within that list of
strings. Note that the in
operator has case-insensitive
variants.
In this next case, we’re looking for a parent_process_name equal to
(case insensitively) cmd.exe
and with a process_name which
is either chrome.exe
or excel.exe
(also case
insensitive). You don’t have to make your list span multiple lines, I
just like making lists that way to make reading the code friendlier to
those on mobile devices.
// help/SecurityLogs
ProcessEvents
| where parent_process_name =~ "cmd.exe" and process_name in~ (
"chrome.exe",
"excel.exe"
)
| take 100
That’s the simplest way of matching based on set membership. It’s more of a static approach.
You can also do this but more dynamically. With the dynamic approach you might have your list populated with a table from another query. I guess you might call it a subquery. The subquery produces a table, then you check if the value in the column you specify is in that table produced by the subquery.
BUT, before we get there, I have to say at this point without teaching you some other essential KQL tools, the examples of that are going to be extremely contrived. So instead we’re going to shift gears now and talk about two other topics for a while before returning to this.
By now you've probably realized: cool, we know how to filter and pare down a set of records, but we're still humans. We're not robots, are we? As humans, we have to read those records one by one, and that's time-consuming. Our time is valuable. Also, as awesome as we are, we're prone to fallacious thinking, falling prey to anecdotes, and so on. Now that you have a set of records, you probably want to comprehend them in a few different ways. This is where you start using KQL as a tool to understand things in a more statistically significant way. It's much faster and more accurate than just reading through a load of log files manually, with the caveat that you have to have a sense of how to do it.
The main way you do this in KQL is with the summarize command. In other languages, it might be called other things, like stats for example.
The first thing you need for the summarize command (after your filtered record set you're feeding it) is what's called an aggregation function. An aggregation function is typically given a table, and sometimes a specific column from that table in particular, and in some valuable way aggregates the values for you.
Let’s go through some of these functions.
count
This function counts the number of records in the table piped to summarize. There are basically two ways to use it.
On its own, it does exactly what I just said above.
// help/SampleMetrics
TransformedServerMetrics
| where MetricType == "sqlserver_performance"
| summarize count()
When I run that I get a table with one column called count_, and a single value in that column is 259,364,288. If you wanted the column to be called something else, such as Count, you'd change it to this:
// help/SampleMetrics
TransformedServerMetrics
| where MetricType == "sqlserver_performance"
| summarize Count=count()
The other way to use this function is with a by clause. This is where you need to start thinking in terms of each column's cardinality, which is a measure of how many unique elements it contains. This is important to consider because with a by clause the count function will run once for each unique value contained in the column you specify (or combination of unique values for multiple columns).
Here’s the same kind of query, but where we’re producing a count per DBName.
// help/SampleMetrics
TransformedServerMetrics
| where MetricType == "sqlserver_performance"
| summarize PerformanceRecordCount=count() by DBName
This gives us a table with a record count for each unique value in the DBName column.
You can also specify more than one column name in the by clause, which means every unique combination of the columns you specify. Just be wary of combinatorial explosion!
// help/SampleMetrics
TransformedServerMetrics
| where Host == "adx-vm"
| summarize count() by MetricType, DBName
So how can you preemptively assess the cardinality of a column?
That's the next aggregation function, dcount.
dcount
This is a distinct counting function. There's a similar one called count_distinct, but I usually just use dcount. Instead of giving you the overall number of records in the table passed to it, it gives you the number of unique values contained in the column you specify.
// help/ContosoSales
Customers
| summarize UniqueCountryNames=dcount(RegionCountryName)
This function is a key tool for assessing the cardinality of data in tables. It’s valuable because there are many times when you’re very much interested in avoiding producing way too many combinations of things, which is easier to do than it sounds.
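For instance, before grouping by several columns at once you can check their cardinalities in a single pass; a quick sketch using columns we've already queried:
// help/SampleMetrics
TransformedServerMetrics
| summarize dcount(MetricType), dcount(DBName), dcount(Host)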
sum
It's important to understand the difference between count and sum. These two functions are not the same. The sum function is for when you have a column that contains numerical values, and you want to produce a sum of those numbers. This is very different from counting the records in a table.
Let's say you had a table with a single column and 3 records, where each value was the same value of 1,000,000. The count function would give you a value of 3, whereas the sum function would give you a value of 3,000,000. So, it's important to remember the difference.
Here’s an example:
// help/ContosoSales
SalesFact
| summarize TotalSalesRevenue=sum(SalesAmount), TotalSalesCost=sum(TotalCost)
We're so close to understanding how much profit the ContosoSales business has made! Soon enough we'll discuss that too.
For now, let’s split these sums out by ProductKey to see which are the best and which are the worst.
// help/ContosoSales
SalesFact
| summarize TotalSalesRevenue=sum(SalesAmount), TotalSalesCost=sum(TotalCost) by ProductKey
Based on this it’s clear we have one ProductKey that Contoso needs to stop selling immediately! Can you figure it out? If not, we’ll revisit the question in the section on Adding Columns.
avg
This is perhaps the most used and also most abused function in statistics. People use the average function as though it is some kind of holistic summary of a list of numbers. While it can contribute to giving you a greater understanding of a list of numbers (compared to, for example, just scanning through the list), a proper summary of a set of numbers will also consider other statistical aggregation functions, but that’s a different statistics lesson I think.
Here’s how you can produce an average:
// help/ContosoSales
SalesTable
| summarize avg(SalesAmount) by Country
The French must be smart shoppers!
max and min
These give you the biggest and smallest figures respectively in a column of numbers. Here’s one query demonstrating both:
// help/ContosoSales
SalesTable
| summarize min(SalesAmount), max(SalesAmount) by Country
percentile and percentiles
If you have a list of numbers, asking for the nth percentile is generally speaking like asking for the number from that list where n% of the other values are less than it. That’s the general idea, but you should read more about percentiles and these functions in particular if what you’re doing is mission critical.
Here’s an example of asking for the 95th percentile, juxtaposing the 95th percentile with the figures produced by the average and maximum functions for comparison:
// help/ContosoSales
SalesTable
| summarize avg(SalesAmount), percentile(SalesAmount,95), max(SalesAmount)
If you understand how that works, then maybe you're ready for the percentiles function, which is the same function, just for specifying that you want multiple percentiles calculated in one shot (rather than having to specify separate calls to percentile for each one you want):
// help/ContosoSales
SalesTable
| summarize percentiles(SalesAmount,1,5,50,95,99)
stdev
The name of this function is short for standard deviation. The number produced by this function is a measure of the spread or dispersal of a list of numbers. Another way it is sometimes explained is that it's kind of like calculating, on average, how far each data point in a list is from the mean. A bigger standard deviation typically means a list of numbers is more spread out, while a smaller one typically means the opposite. As with all of these functions, there are caveats, as well as pros and cons, but it's an important tool in certain situations.
// help/ContosoSales
SalesTable
| summarize stdev(SalesAmount), avg(SalesAmount)
There are a few limitations to this figure I should point out. The main one is that it is expressed in the same units and scale as the data itself, so whether a given standard deviation is "big" or "small" is only meaningful relative to the magnitude of the values; and, like the average, it is sensitive to outliers. We'll address the first of these shortly.
You’ve been filtering tables and even producing tables of summary statistics up until now. At this point, it’s time to upgrade your ability to manipulate these tables and take your analysis capabilities to the next level!
Sometimes the columns that come with the table aren't enough, and you need to add another one. You do that by extending the table. The extend command adds a new column with the name you specify, setting the value to the result of a KQL expression. This expression can take a few forms; let's look at some examples.
This one adds a field called NickName
to every record,
setting the value in every case to the literal string value
provided.
// help/ContosoSales
Customers
| extend NickName="Rumpelstiltskin"
| take 10
Just to be clear, when you’re doing this, you’re not permanently adding a new column to the actual table in the data store, at least not with the kind of KQL we’re discussing. This is adding a new column to the in-memory table you’re working with in your pipeline of commands.
You might think the example I gave is useless; who would need the same literal value to be provided to every record? It’s true these fictional datasets can make it difficult to imagine, but there are real situations where you might need that.
Usually though, the extend
command is used a bit more
dynamically. For instance, the following query was used earlier to
demonstrate standard deviation, but now we’re calculating a new field
based on the values contained in the others:
// help/ContosoSales
SalesTable
| summarize s=stdev(SalesAmount), a=avg(SalesAmount) by Country
| extend CoefficientVariation= s / a
Coefficient of Variation is the ratio of the standard deviation to the mean. It overcomes the scale limitation of the standard deviation discussed earlier and so can give a more objective perspective of data spread. In cases where that ratio exceeds 1, you might explain that as meaning the average distance of the data points from their mean is even greater in magnitude than the mean itself. That could be an indicator of significant variation in a dataset.
Let’s say you want to round these numbers a bit to make them easier to look at. You can modify the last query to the following, also re-writing the values of the standard deviation and average so that they’re rounded – all using the extend command.
// help/ContosoSales
SalesTable
| summarize s=stdev(SalesAmount), a=avg(SalesAmount) by Country
| extend CoefficientVariation=round(s / a, 2)
| extend s=round(s,2)
| extend a=round(a,2)
With our new power in hand, let’s revisit another query we ran earlier and improve it.
// help/ContosoSales
SalesFact
| summarize TotalSalesRevenue=sum(SalesAmount), TotalSalesCost=sum(TotalCost) by ProductKey
| extend TotalSalesProfit = round( TotalSalesRevenue - TotalSalesCost, 2 )
When I run this, I see ContosoSales has a loss of 37568169.31 on ProductKey 176. Yikes! Stop selling that one, Contoso.
A lot of times, you need to create a new field which is just the strings contained in other columns spliced onto one another, end-to-end; we call that concatenation. Again, it may be hard to imagine why you would need to do this. But I promise you, if you use KQL, eventually you will need to do this. One common scenario in which you'll need to do this is when you're building time series tables and graphs, but we'll discuss that use case later. For now, trust me that you need to know how to concatenate multiple fields. The function we'll use to accomplish this when extending is called strcat.
// help/SecurityLogs
FileCreationEvents
| extend NetworkFilePath=strcat(
@"\\",
hostname,
@"\",
replace_string(
path,
":",
"$"
)
)
| summarize count() by NetworkFilePath
| top 100 by count_
Our new field is called NetworkFilePath. It has the values contained in the hostname and path columns concatenated. However, note that we're giving literal strings before and after hostname. This is a convention in Windows for network file sharing. I don't know why you'd want to do this with such a table, but I thought it could come in useful for you.
You may not have seen the syntax with the @ symbol before the double quotes. That's because with ordinary KQL strings delimited by double quotes, backslashes are an escape character, and I didn't trust myself to count correctly. So I avoided that by using the @ version of KQL strings, where backslashes are just literal backslashes; if you ever need to include a literal double quote inside an @ KQL string, you just need to escape the double quote with a second double quote (a convention shared with the CSV file format).
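Here's a small sketch you can run on its own to see the two string styles (and the doubled-quote escape) side by side:
print verbatim = @"\\server\share", escaped = "\\\\server\\share", quoted = @"she said ""hi"""
All three produce the literal text you'd expect: the first two are the same UNC-style path written both ways, and the third embeds double quotes inside an @ string.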
The convention for network file sharing in Windows avoids the colon character traditionally associated with the drive letter in favor of a $ character, so I'm taking advantage of that to demo the replace_string function for you.
Finally, I’m using the top command to grab at most the top 100 by count from the summary table I produced, since I didn’t assess the cardinality of the resulting NetworkFilePath column beforehand.
Eventually, you’re doing so many operations in your pipeline, you may get the sense there’s a lot of data contained in your in-memory table that you don’t really need and never will. It can feel cumbersome to keep all that in the table through multiple stages in your query pipeline. So if you want to optimize for performance and memory, get rid of the unneeded columns.
One way of doing this is to provide a list of columns which you’d like to retain. I like this option the most because it feels most explicit.
// help/SecurityLogs
FileCreationEvents
| extend NetworkFilePath=strcat(
@"\\",
hostname,
@"\",
replace_string(
path,
":",
"$"
)
)
| summarize count() by NetworkFilePath
| top 100 by count_
| project NetworkFilePath
There I modified the previous query to change the resulting table so that it doesn't have the count column anymore and only keeps the NetworkFilePath column.
You can do other things with that command, too. Here’s an example of renaming the columns you keep:
// help/ContosoSales
SalesTable
| summarize stdev(SalesAmount), avg(SalesAmount)
| project stdv=stdev_SalesAmount, mean=avg_SalesAmount
In fact, in one fell swoop we can also make other changes like rounding the figures, and even add new columns that didn't exist before!
// help/ContosoSales
SalesTable
| summarize stdev(SalesAmount), avg(SalesAmount)
| project stdv=round(
stdev_SalesAmount,
2
), mean=round(
avg_SalesAmount,
2
), cv=round(
stdev_SalesAmount/avg_SalesAmount,
2
)
There are other variants of the project
command for you
to experiment with, but I don’t use them as much. They offer other
features like being able to select which columns to keep or get rid of
with a wildcard search.
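For example, project-away accepts wildcards. A quick sketch against SalesFact (which, as we saw above, has a TotalCost column) that drops every column whose name starts with Total:
// help/ContosoSales
SalesFact
| take 10
| project-away Total*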
Regex is a core tool supported by a lot of different platforms and programming languages, not just KQL. Regex is a technology related to the notion of a “regular language” in formal language theory. There are different flavors of regex with some differences as far as rules are concerned. But all flavors share some core logical features, even if they may at times have slightly different syntaxes or rules.
When you write a regex pattern you’re declaring a logical pattern to which any input text (i.e., sequence of characters) will or will not adhere. The point of the regex is to tell you if the input text adheres to that pattern or not, and also deliver to you any subsequences of characters from the input text which you asked it to remember for you.
Next, we’ll discuss the components of these logical patterns. While we won’t cover 100% of the logical components available to you in regex patterns, we’re going to cover enough of the core elements to get you up and running.
Character classes are just a syntax for you to define a logical set of characters. And to make it even simpler to understand, consider that underneath it all a character to a computer is on some level just a number; ergo, a character class is on some level just a set of numbers.
A character class is typically demarcated on the left and right by square brackets. Between the brackets, you define the characters which are a part of the set:
[aeiou]
Above is a set of lowercase vowels. If you provided only that character class as a regex pattern, it would require that the input text has (at least) one character somewhere which is one of those vowels.
You typically, but not always, differentiate between letter case. Below accepts either a lower or uppercase vowel:
[aeiouAEIOU]
You’re not limited to only letters:
[0123456789,.]
You see what I mean.
Easy to see how this could become cumbersome though, especially since regex does not have variables in the way that many Turing complete programming languages do. Thankfully there are some shortcuts. Remember how we said characters are just numbers? That means you can use range syntax to specify a range of characters rather than list them all manually:
[a-zA-Z0-9]
That's better! Only 9 characters to specify the character ranges a through z, A through Z, and 0 through 9; that's a set of 62 characters specified with only 9!
This syntax has some side effects, though.
For instance, it means if you need a literal hyphen - character in your set, you need to use a slightly unusual syntax where it is at the end of the set:
[a-z-]
That's a set that accepts the lowercase letters and the literal hyphen character.
Sometimes even this can become too complicated or difficult to work around. When this happens, you can work around it by taking advantage of the fact that most regex flavors allow you to specify character ranges in the form of hexadecimal values or other schemes. That’s outside the scope of this introduction, though.
But two final tricks to help you.
First, sometimes it would be cumbersome to calculate the character ranges to accept, when really you're more concerned about not allowing a particular character. There's a syntax for this kind of negated character class, starting with a caret character ^. The following accepts any character except lowercase letters:
[^a-z]
Second, there are shorthands for many commonly used character classes built into most regex flavors. Here are a few, taken from here:3
. - “any” character, though commonly does not match newline characters
\d - the set of digits
\D - any characters except digits
\s - the set of whitespace characters
\S - any characters except whitespace characters
The latter two, \s and \S, together comprise a conventional character class that overcomes the limitation of the . class listed above:
[\s\S]
This means the set of all whitespace and all non-whitespace characters, which is another way of saying, every character including newline characters.
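You can see the difference with a quick test. KQL's regex flavor doesn't let . match a newline by default, while [\s\S] happily does; a small sketch using print:
print sample = "line1\nline2"
| extend dotMatches = sample matches regex "line1.line2", anyMatches = sample matches regex @"line1[\s\S]line2"
dotMatches comes back false and anyMatches comes back true.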
So now you know how to specify at least one character in regex.
Concatenation is a technique that allows you to define more complex logical patterns that span more than just one character. You accomplish this by specifying character classes one after another.
Here's one way to specify a sequence of three characters where each character is one of the lowercase letters a, b, or c:
[abc][abc][abc]
The above will match all of the following input strings (and more):
abc, aaa, bca
They don’t have to be the same though, and commonly they won’t be. Say you want to match a word where all letters are lowercase except the first letter, which could be lower or uppercase:
[dD][a][v][e]
Above matches the names dave or Dave. However, for instances where a character class is only comprised of one character and the character in question isn't a special one as far as regex syntax is concerned, you can do away with the square brackets. The following is equivalent to the above:
[Dd]ave
What if the sequence of characters you want to match is extremely long? For instance, you want to match a sequence of your character class which is 100 in length? That would seem like a lot to type out!
For that, we would instead use a quantifier. A quantifier is a suffix given to your character class which indicates how many characters to match. Let's say you wanted to match a string that's 100 characters long and where each character is from the following set of characters: x, X, o, O
[xXoO]{100}
That will match exactly 100 characters from that set. You’re not that restricted though. What if you would accept anything between 75 and 200 characters in length? You can use the same syntax to provide a range:
[xXoO]{75,200}
Because quantifiers are so handy, there are a few commonly used shorthand alternatives to the curly bracket syntax:
* - the equivalent of {0,}; the absence of a number on the right side of the comma indicates no upper bound, so this means 0 or more characters from the preceding character class
+ - the equivalent of {1,}; this means 1 or more characters from the preceding character class
? - the equivalent of {0,1}; this means 0 or 1 character from the preceding character class
All of these quantifiers with ranges shown above are known as greedy quantifiers. That means since they have a range of valid lengths, if given an option to match more or fewer characters from the character class, they'll try to match as many as possible up to the limit. In the case of x's and o's, this means if you had 100 of them, a greedy quantifier wouldn't stop matching once it got to 75 characters; it would keep matching all the way to 100!
The opposite is a lazy quantifier, which tries to match as few as possible. To turn any of the above quantifiers into a lazy version, just append a trailing ?:
[xXoO]{75,200}?
Contrary to the earlier example, if you had 100 x’s and o’s, this lazy quantifier would still match but it would stop matching once it got to 75 characters!
We’ll test this out in a little while, but first let’s discuss a few more logical operations.
Anchors are also known as atomic zero-width assertions. They don’t consume any characters from the input text. Rather, if you visualize the current location of the regex as it’s analyzing an input text as a cursor, then it’s saying that at that specific point in the pattern, the cursor should occupy a particular kind of location.
Still seems abstract? Let me give you a few examples.
[abc]{3}
[^abc]{3}
^[abc]{3}
The above three patterns mean different things.
The first one indicates the input text must have a sequence of three characters somewhere inside of it, where each character is from the following set: a, b, and c.
The second one means a sequence of three characters where each character is a member of the set comprised of anything except for one of these letters: a, b, c.
The third one, which starts with a caret ^ outside of the character class, indicates that the input string must start with such a sequence of three characters, where each character is from the following set: a, b, and c.
The caret is the "starting" anchor. It must be the first character in a regex pattern if it is used. I admonish you not to forget the difference between the starting anchor and negated character classes, as it is an important distinction.
There's another anchor to ensure matches must abut the end of an input sequence: $
[abc]{3}$
Above means that the input string must end with a sequence of three characters, where each character is from the following set: a, b, and c.
You can use these in conjunction, too. The following indicates a string which must be of length 3 with each character from the set we’ve been discussing:
^[abc]{3}$
When you use both a starting and ending anchor like this in the same regex pattern, the pattern is said to be fully anchored.
There are other anchors, but I’ll leave those for you to experiment with.
Non-capturing groups allow you to specify many logical patterns that would otherwise be tedious (or impossible? that’s a question for the formal language theorists). And, because they’re non-capturing, they can be used even in regex patterns which aren’t designed to extract text. If that last sentence confused you, don’t worry; we’ll discuss capturing groups later.
One powerful capability of non-capturing groups is the ability to apply quantifiers to whole sequences of character classes.
Before I show you such a pattern, let’s develop a testing setup in the KQL workspace we’ve been using. It will be nice to be able to test out some of these abstract language ideas without requiring any specific kind of data in the test data tables provided by Microsoft.
datatable(data:string, expectsMatch:bool) [
"aBC", true,
"aBCBC", true,
"aBCBCBC", true,
"aBCBCBCBC", false
]
That’s essentially a fake or temporary in-memory table you can create on the fly populated with literal data. It’s a handy tool to test out different KQL and regex ideas you may have. Just like the named tables we referenced earlier, it can be “piped” to a KQL command pipeline. Such as in the following example, which has the first regex I want to demo:
datatable(data:string, expectsMatch:bool) [
"aBC", true,
"aBCBC", true,
"aBCBCBC", true,
"aBCBCBCBC", false
]
| where data matches regex "^a(?:BC){1,3}$"
Here the non-capturing group is demarcated at the start by (?: and at the end by ). Everything between those boundaries is a part of the group, in this case the sequence BC.
A quantifier is applied to the group, requiring between 1 and 3 repetitions of the sequence BC.
Finally, this pattern has both a start and end anchor.
Try running this, and you'll see the 4th string, where expectsMatch == false, does not appear in the resulting table, as it does not fit the logical pattern we have defined. This is because it has 4 repetitions of BC, exceeding the upper bound of 3.
You can even nest these groups inside one another!
datatable(data:string, expectsMatch:bool) [
"abccc", true,
"abcbcbcc", true,
"abcccccccccccccccbc", true,
"abbc", false
]
| where data matches regex "^a(?:b(?:c)+)+$"
That pattern means something like this: an a followed by one or more of the following type of sequence: a single b followed by one or more c's.
Alternation is a technique that allows you to specify alternative options for a regex to try matching. It can be applied at a “top-level” within a regex pattern, or it can also be applied within groups.
I urge caution at this juncture if you find yourself tempted to use alternation (and even quantifiers in some situations). It can become confusing to understand how many times a regex will do certain things owing to features like “backtracking”. Spend time researching the rules and constraints you’re working with so you can avoid catastrophic backtracking.
Here's one example of alternation, which is specified using the pipe | character:
datatable(data:string, expectsMatch:bool) [
"abcd", true,
"aBCD", true,
"abCDbCDcdCDbcd", true,
"abBcd", false,
"abCd", false
]
| where data matches regex "^a(?:[bB](?:cd|CD)+)+$"
An a followed by one or more of the following sequence: one b or B, followed by one or more of this kind of sequence: either cd or CD.
At long last, we make it to capturing groups!
These allow your regex to remember subsequences of text that are matched by the pattern, which afterwards can be new columns in your table.
This is a very powerful feature of KQL’s flavor of regex, as it allows you to use KQL to compose your own parsing routines for the weird custom file formats you might encounter in the wild! Using this new tool, no longer will the cardinality of data be obscured and obfuscated within the baroque file formats which contain it, and you will find yourself performing valid statistical analyses just where your peers least suspect it! MWAHAHAHAHAHAHAA!
Okay, delusions of grandeur over.
Let's talk about how to use them. Capturing groups come in two categories: they can be named OR unnamed. The syntax for unnamed is easiest, as it is just a pair of parentheses. We will use it with the extract function next, which requires these arguments: the pattern, an index number (we provide 1, indicating we want the first captured value, rather than 0, which is the entire matched string), and finally the original string field to which we are applying the regex pattern.
In the following, we want to extract into a new column, from a sample of Employee email addresses, the part before the @ symbol that precedes the domain name:
// help/SecurityLogs
Employees
| take 100
| project email_addr
| extend emailPrefix=extract(
"^([^@]+)[@]",
1,
email_addr
)
This regex means: starting at the beginning of the string, match and capture a sequence of one or more characters made of any characters except an @. That sequence to capture must also be followed by an @ (but without capturing that final @).
Here's the same query, but using a named capturing group. In this case, it's named prefix. Note that the syntax for named capturing groups can vary a bit across the different regex engines out there.
// help/SecurityLogs
Employees
| take 100
| project email_addr
| extend emailPrefix=extract(
"^(?P<prefix>[^@]+)[@]",
1,
email_addr
)
Let’s revisit an earlier example. We were earlier matching sequences of 75 to 200 x’s or o’s (upper and lowercase):
[xXoO]{75,200}?
This was a lazy match, right? Let’s test this out and prove with a capturing group that this will only capture 75 characters, even if there are more valid characters available:
datatable(data:string)[
"xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
]
| extend originalLength=strlen(data)
| extend extractedLength=strlen(
tostring(
extract(
"^([xXoO]{75,200}?)",
1,
data
)
)
)
If you run this, you'll see 80 characters are available, but only 75 are consumed. Remove the trailing ? from the pattern and you'll find that in that case all 80 characters are consumed. This is the difference between regex greediness and laziness.
Sometimes you will want your regex pattern to extract more than one
value by using multiple capturing groups. This is permissible, but you
need to use the extract_all
function.
// help/SecurityLogs
Employees
| take 100
| project email_addr
| extend emailParts=extract_all(
"^([^@]+)[@](.*)$",
email_addr
)
| extend emailPrefix=emailParts[0][0]
| extend emailDomain=emailParts[0][1]
| project-away emailParts
Here we complete the regex pattern we made earlier by adding a second capturing group to grab any of the remaining characters up to the end of the string: (.*)$
The resulting value from extract_all is an array containing another array with the captured values, hence why we are using multiple array accessors to get at the values and assign them to new fields called emailPrefix and emailDomain.
Does that seem weird that there exists a seemingly needlessly nested array? The reason is that, if you recall, groups may be nested. Hence, when extracting multiple things with multiple capturing groups, the return value also needs to be in a nested or nestable format.
There’s a lot to say about table joining, and we’re not going to go through everything because it would balloon this conversation to an intolerable length.
The best summary I can give is that table joining is about applying all the operations you can do with a Venn diagram to tables. How can you do that? It's based on multiple tables each having a column containing some of the same values. Those columns are what determine which part of which circle in the Venn diagram a record occupies.
Here’s one of the simplest kinds of examples. Adding a little bit of Product details to a sample of records from a Sales table:
// help/ContosoSales
SalesTable
| take 100
| join kind=inner Products on ProductKey
SalesTable and Products both have a ProductKey
column.
This is the simplest kind of join because you don’t even need to write a
join- or sub-query. You just provide the name of the table and
poof you get extra columns.
A common scenario is if you want to retrieve records from one table, call it table A, but only those records which are referenced (such as by an ID) by records in another table, call it table B.
Still seem abstract? Let’s look at an example:
// help/SecurityLogs
AuthenticationEvents
| take 500
| project timestamp, hostname, src_ip, username, result
| where result!="Successful Login"
| join kind=inner (
Employees
| project username, name, ip_addr, email_addr, role
) on username
| summarize count() by username, ip_addr, email_addr, role
| sort by count_ desc
Here we’re taking a sample of logs from the AuthenticationEvents table. We’re getting rid of some unneeded columns before filtering out all the successful logins so we are left with only failed login events.
Then the join starts, where we are providing a subquery to fetch a
set of results from the Employees table. The column which is shared in
both tables is the username
column! So after we have
fetched the Employees, we specify on username
to indicate
we are joining the two tables on the values in that column. You can cut
the query off at that point in the pipeline if you like to see the
result. We now have additional fields attached to our failed login
events indicating, for example, the role of the user that failed to
login.
Then we’re doing some summarizing and sorting to figure out who from our sample of 500 events was the worst offender as far as failed logins. When I ran this, this was at the top of the list:
mastephens 192.168.0.234 marissa_stephens@envolvelabs.com IT associate 7
Somebody get Marissa Stephens on the horn!
So far you saw the inner join. This produces only the intersecting parts of the two tables when you’re considering that column that has values in both tables. In certain SQL situations, this is referred to as a “foreign key” column.
Just to help you understand the difference between an inner join and the various other types of joins, you can use the following setup to explore the difference in KQL:
let humans = datatable(id:int, name:string, pet_id:int)[
1, "Huey", int(null),
2, "Dewey", 4,
3, "Louie", 5
];
let animals = datatable(id:int, name:string, species:string)[
4, "Spot", "dog",
5, "Ace", "dog",
6, "Bruiser", "cat"
];
humans
| join kind=inner (
animals
| project pet_id=id, pet_name=name, species
) on pet_id
Right now, it's set to inner. Replace kind=inner with kind=leftouter, and see how it also gives you the record from the humans table that does not have an associated record in the other table. This is an important difference, as sometimes it can make or break the logic of your query if you're doing joins.
Since it can be counterintuitive, let me state it explicitly. The nature of joining is that the process has the potential to filter records out of both tables before producing the final result table. How and when exactly this filtering happens is dictated by the type of join that you choose.
Here are the different kind of joins you can try with the above table:
leftouter
leftsemi
leftanti
fullouter
inner
innerunique
rightouter
rightsemi
rightanti
Check out the Venn diagrams on this page4 showing the difference between the different types.
My proposal is that Huey adopt Bruiser!
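For instance, a leftanti join keeps only the rows from the left table that have no match on the right, which is exactly how you'd surface Huey; a sketch reusing the tables above:
let humans = datatable(id:int, name:string, pet_id:int)[
    1, "Huey", int(null),
    2, "Dewey", 4,
    3, "Louie", 5
];
let animals = datatable(id:int, name:string, species:string)[
    4, "Spot", "dog",
    5, "Ace", "dog",
    6, "Bruiser", "cat"
];
humans
| join kind=leftanti (
    animals
    | project pet_id=id, pet_name=name, species
) on pet_id
// returns only Huey, the human with no matching pet record
Swap kind=leftanti for kind=rightanti and you get Bruiser, the only animal with no matching human.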
Earlier I introduced you to the summarize
command for
calculating statistics based on your table columns. If you play around
with the summarize command enough, you’ll eventually find it’s easy to
mislead yourself. For example, averages taken over a very large number
of values in a column will often seem to conceal the amount of variation
contained in that column. The maximum and minimum functions will seem
meaningless for almost all but the smallest of tables or most specific
of use cases. These kinds of problems, if not overcome, can easily get
in the way of your analysis and hamper its accuracy and
incisiveness.
This is why Santa Claus invented the time series. For those of us for whom one or two averages are not enough, he said: hey, why not calculate a separate average for each day of the year! That way when Santa is considering a child's naughtiness, he can easily separate any particularly naughty outlier days from the rest, the kind of excessively naughty days which might otherwise skew a child's average.
So that’s what a time series is really about. Instead of calculating a stats function like average, maximum, or minimum (or any of the others) once or twice, you first “bin” all the values in the table into separate bins of particular length in time (1 second, 1 minute, 1 hour, 6 hours, 5 days, etc), and then calculate that function once for each such bin of time across a larger span of time (1 day, 1 week, 1 month, 1 year, etc).
How does this “binning” process really work? Simpler than you think. Say we have a bunch of records, each with a really fine to-the-millisecond timestamp. If we are binning the records into bins with a span of 1 day, then that means any timestamp falling within the same 24 hour period will be “re-written” to the same string or number representing that particular day.
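You can watch that rewriting happen directly with the bin function; a tiny sketch:
// bin() rewrites a precise timestamp down to the start of its bin
print bin(datetime(2024-01-23 07:13:42.123), 1d)   // returns 2024-01-23 00:00:00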
Let’s look at an example:
// help/SecurityLogs
AuthenticationEvents
| where result!="Successful Login"
| extend timestamp=todatetime(timestamp)
| summarize count() by bin(timestamp,1m)
| render timechart
The summarize
command here is a lot like previous runs
I’ve shown you, only now the “by clause” contains, instead of a solo
column name, a call to the bin
function into which we pass
the name of the timestamp column and the bin size. In this case the bin
size is 1m
or one minute.
Finally, we request the time series to be rendered as a timechart. You’ll see a graph drawn in one tab with time on the x-axis and quantity on the y-axis. The other tab contains the table which is visualized in the graph.
Honestly I usually just call it a timechart. Calling it a time series always felt like calling an attorney “esquire”… technically correct, but some folks might assume you’re a try-hard.
Timecharts are an important tool for your anomaly hunting arsenal.
You can use any of the stats functions with them: avg, min, max, etc.
You can produce a separate time series for each distinct value in a particular column.
// help/SecurityLogs
let topTenFailedUsers=AuthenticationEvents
| where result!="Successful Login"
| extend timestamp=todatetime(timestamp)
| summarize count() by username
| top 10 by count_
| project username;
AuthenticationEvents
| where result!="Successful Login" and username in (topTenFailedUsers)
| extend timestamp=todatetime(timestamp)
| summarize count() by bin(timestamp,1m), username
| render columnchart
Notice how, after the by clause in the above time series summarization command, we specify a comma and then username? This would seem to suggest that we could keep expanding that list of columns to include more than just username. In that case, the meaning you're trying to write into your query would be something like: for each time range bin, calculate the statistical functions once for each unique combination of the following list of one or more columns: username, …, etc.
However, you’ll find support after the bin
function is
limited to specifying only one column. This is not necessarily a
shortcoming of KQL. Other languages have a similar limitation. It does
not follow that the meaning you’re trying to imbue your query with is
impossible; it just means you’ll need to take a slightly different
approach to work around this limitation.
The key to working around this limitation when building time series
is concatenation. At first it may feel like a cheesy trick, but in time
it will become a key part of your behavioral analysis arsenal. In this
approach, you want to produce a new round of statistical calculations
within each time range bin and for each unique combination of values
from a list of columns. To specify that unique combination of
values, you just concatenate those columns into a single string and
store that string in a new column; that new column becomes the column
you specify after the call to bin.
As an example, let's say in the above query, after running it, you realized you wanted not unique combinations of users, but unique combinations of users and their password_hash. This actually produces a lot of combinations, so to stay under the recommended limit in Data Explorer, I'm also changing the earlier call to the top command to produce only 5 users instead of 10. See how this approach is implemented with the strcat and extend commands below to concatenate the values of the other fields:
// help/SecurityLogs
let topTenFailedUsers=AuthenticationEvents
| where result!="Successful Login"
| extend timestamp=todatetime(
timestamp
)
| summarize count() by username
| top 5 by count_
| project username;
AuthenticationEvents
| where result!="Successful Login" and username in (topTenFailedUsers)
| extend timestamp=todatetime(
timestamp
)
| extend username_passwdHash=strcat(
username,
" - ",
password_hash
)
| summarize count() by bin(
timestamp,
1m
), username_passwdHash
| render columnchart
So far, these types of timecharts have focused on summarizing and visualizing data over time from one point in history to another. The form of the output table and shape of output graph bear a direct resemblance to the input table; for example, if the earliest record’s timestamp in the input table was two weeks ago and the latest record’s timestamp one week ago, then the timechart spans that time range in history exactly, from two weeks ago to one week ago.
Sometimes that kind of timechart does not quite scratch the itch you have in the anomaly hunting department. Sometimes there is a “seasonality” to phenomena that you want to understand more, maybe because no Sherlock was able to crack the case the first go-round and the problem keeps happening. I want to give you one more tool for such occasions.
This approach involves constructing custom time-range bin labels for your records. The type of label you’ll create doesn’t help you understand which slice of overall chronological time in history a record occupies. Rather, it helps you understand which recurring slice of chronological time a record occupies.
This is the difference between knowing that event X occurred at 7:13 AM on Tuesday the 23rd of January, 2024 and knowing that event X occurred at 7:13 AM on some Tuesday in history (but which Tuesday exactly is not known).
Humans often schedule things that repeat on an hourly or daily or weekly or monthly basis. You might want to know, for example, is there a trend as far as which hour of the day or hour of the week that a particular phenomenon is happening?
I believe when the aggregation function, which you are using in each bin, is the count function, then this may be considered a form of histogram. However, I encourage you to experiment with other aggregation functions using this approach as well to see what trends you can find.
Before we consider an example, we need to iron out the details of making custom time-range bin labels.
One key to constructing custom timestamp bin field values is to ensure their lexicographic order and their chronological order coincide. The advantage is that when a list of these bin field values is sorted using common lexical sorting techniques (i.e., like those in KQL table user interfaces or via KQL’s sort function or even in a spreadsheet program) they are in chronological order for displaying on a graph.
Commonly, this involves ensuring numbers are all a fixed-width in terms of characters/digits (regardless of magnitude), padding them with 0’s on the left side as needed.
As an example, if the largest number in your group is 100, then consider the number 9: 9 is less than 100 in a logical ordering according to magnitude, but in a lexicographic or alphabetic ordering 100 comes first. To fix this, pad the 9 with two 0’s on the left side: 009. Now both orderings will be the same.
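Here's a tiny sketch of both the problem and the fix:
datatable(label:string)[ "9", "100", "009" ]
| sort by label asc
Unpadded, "100" sorts before "9"; the zero-padded "009" lands where its magnitude says it should.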
Many languages have a function for padding strings on left or right, but since I didn’t see one in KQL we will construct a User-Defined Function (UDF) for these scalar operations:
let padStringLeft = (
str:string,
padChar:string,
minLength:int
){
let ch = iif(
strlen(
padChar
) < 1,
"0",
substring(
padChar,
0, 1
)
);
iif(
strlen(
str
) >= minLength,
str,
strcat(
strcat_array(
repeat(
ch,
minLength - strlen(
str
)
),
""
),
str
)
);
};
Our UDF is named padStringLeft, and it has three parameters defined:
str - This is the string to pad. Naturally, this function is probably going to be used with numbers, so this just means to use this function you must cast your numerical value tostring(myNum) first.
padChar - This is the character intended to be used for padding on the left-hand side of the string. You pass in a string that should be one character long, but if not it will be truncated or coerced to a default value of 0.
minLength - This number refers to the minimum length of the string to output; typically, you will input here the length of the longest string in your set of number strings (e.g., if the max number is 1000, then the string "1000" is 4 characters/digits long, so 4 will be this argument).
Inside the UDF body, the assignment expression
let ch = iif(...);
does what I described above. It checks
if the padChar
string is the empty string with length of 0,
and if so gives a default value of 0. Otherwise, we grab
the first character from the specified padChar
string using
the substring
function (just in case it’s longer than one
character).
The remaining iif(...); call produces the ultimate value returned from the function, our padded string. We check if the length of str is already greater than or equal to the minimum length; if so, we just return the value of str as it is. Otherwise, we need to produce a properly padded version of the string.
There are a few layers to producing the padded string. The outer
strcat()
call has two arguments to concatenate:
To produce the pad string, we use the repeat(...)
function to generate an array where each element is the pad character
ch
we derived earlier and where the length of the array is
the difference between the minimum length and the length of the string
that needs padding. Then we take this array and concatenate all of its
values (with an empty string delimiter "") into a single string using
the strcat_array(...)
function.
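If you want to see those two inner functions in isolation, here's a quick sketch:
// repeat("0", 2) produces the array ["0", "0"]; strcat_array glues it into "00"; strcat prepends it to "9"
print padded = strcat(strcat_array(repeat("0", 2), ""), "9")  // "009"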
You can test the function like this:
print(padStringLeft(tostring(9), "0", 4))  // "0009"
Now that we can pad numbers, let’s consider an example. Let’s convert an earlier time series query we made to this format:
let topTenFailedUsers=AuthenticationEvents
| where result!="Successful Login"
| extend timestamp=todatetime(timestamp)
| summarize count() by username
| top 10 by count_
| project username;
AuthenticationEvents
| where result!="Successful Login" and username in (topTenFailedUsers)
| extend timestamp=todatetime(timestamp)
| extend username_passwdHash=strcat(username, " - ", password_hash)
| extend tsHr=datetime_part("hour", timestamp)
| extend tsWeekday=dayofweek(timestamp)
| extend customBinLabel=strcat(
toint(
tsWeekday / 1d
),
" ~ ",
padStringLeft(
tostring(tsHr),
"0",
2
)
)
| summarize count() by customBinLabel, username_passwdHash
| sort by customBinLabel asc
| render columnchart
Our customBinLabel
values each represent one of the
possible 168 hours of the week. We create this new field for each record
by first extracting via datetime_part(...)
the hour value
contained in the original timestamp. See here5 for the different possible values you can use with that function.
The day of the week is not one of those possible values, so we use a different function called dayofweek to extract that value from the original timestamp.
Next we concatenate those two new fields into a single string with a delimiter of " ~ " between them (arbitrarily chosen) to make the custom bin label. When concatenating the values, we take care to pad the hour values first using our UDF, and the day of the week value must be cast to an integer after dividing it by one increment of its unit (1d, for one day) to give us a plain number rather than the timespan we are originally given by the dayofweek function.
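Here's a small sketch of what those two date-part functions return for an arbitrary timestamp (2024-01-03 was a Wednesday):
print rawDayOfWeek = dayofweek(datetime(2024-01-03)),                      // 3.00:00:00 (a timespan)
      dayNumber    = toint(dayofweek(datetime(2024-01-03)) / 1d),          // 3
      hourOfDay    = datetime_part("hour", datetime(2024-01-03 14:30:00))  // 14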
At last, in the summarize, we replace our original bin(...) call with our new custom bin label field, and add one last command to sort the table by that field so everything is in relative chronological order.
Notice any trends as far as which hour of the week tends to have more failed login events from the top ten failed users?
Sure, this is a contrived example from fake data sets, so there may be no real conclusions we can draw here. But remember this tool nonetheless!
JSON is pretty ubiquitous at the time I'm writing this, so I assume you've been exposed to it. But if not, it's just a text-based format for encoding structured data, used for many purposes. Given how common it is, it's pretty handy that KQL comes with functions out-of-the-box for parsing JSON-encoded data into KQL fields.
Since KQL in some circumstances (I believe based on ingestion settings) automatically parses JSON fields into the dynamic type, I provided the following fake data table of JSON strings. You can see by the call to getschema that these JSON documents are indeed still just strings, so the parsing will be up to us.
let computers = datatable (json:string)[
'{"id": 1, "brand": "TechMaster", "model": "X9000", "cpu": "QuadCore 3.0GHz", "ram": "16GB", "storage": "512GB SSD", "graphics": "NVIDIA GTX 1080", "price": "$1299.99"}',
'{"id": 2, "brand": "QuantumTech", "model": "UltraCore", "cpu": "OctaCore 2.5GHz", "ram": "32GB", "storage": "1TB HDD", "graphics": "AMD RX 5700", "price": "$1499.99"}',
'{"id": 3, "brand": "CyberFusion", "model": "StealthX", "cpu": "HexaCore 2.8GHz", "ram": "64GB", "storage": "256GB NVMe", "graphics": "GeForce RTX 3060", "price": "$1699.99"}',
'{"id": 4, "brand": "SysGenius", "model": "ElitePro", "cpu": "QuadCore 3.2GHz", "ram": "8GB", "storage": "1TB SSD", "graphics": "Intel UHD Graphics", "price": "$1099.99"}',
'{"id": 5, "brand": "MegaByte", "model": "PowerSpeed", "cpu": "OctaCore 3.5GHz", "ram": "128GB", "storage": "2TB HDD", "graphics": "NVIDIA RTX 3080", "price": "$1999.99"}',
'{"id": 6, "brand": "NeoGaming", "model": "TitanX", "cpu": "HexaCore 3.0GHz", "ram": "16GB", "storage": "512GB NVMe", "graphics": "AMD RX 5600 XT", "price": "$1299.99"}',
'{"id": 7, "brand": "Futurix", "model": "FusionWave", "cpu": "QuadCore 2.7GHz", "ram": "32GB", "storage": "1TB SSD", "graphics": "GeForce GTX 1660", "price": "$1399.99"}',
'{"id": 8, "brand": "TechWave", "model": "SwiftForce", "cpu": "OctaCore 2.2GHz", "ram": "64GB", "storage": "256GB SSD", "graphics": "NVIDIA GTX 1070", "price": "$1599.99"}',
'{"id": 9, "brand": "QuantumPulse", "model": "Xplorer", "cpu": "HexaCore 2.5GHz", "ram": "8GB", "storage": "512GB HDD", "graphics": "AMD RX 550", "price": "$1099.99"}',
'{"id": 10, "brand": "EpicGen", "model": "ThunderBlaze", "cpu": "QuadCore 3.5GHz", "ram": "128GB", "storage": "2TB SSD", "graphics": "Intel Iris Xe", "price": "$1899.99"}',
'{"id": 11, "brand": "CyberPulse", "model": "NovaSpeed", "cpu": "OctaCore 3.2GHz", "ram": "16GB", "storage": "1TB NVMe", "graphics": "NVIDIA RTX 3070", "price": "$1799.99"}',
'{"id": 12, "brand": "TechFusion", "model": "VelocityX", "cpu": "HexaCore 2.8GHz", "ram": "32GB", "storage": "512GB SSD", "graphics": "AMD RX 570", "price": "$1299.99"}',
'{"id": 13, "brand": "InnoTech", "model": "BlazePro", "cpu": "QuadCore 2.5GHz", "ram": "64GB", "storage": "1TB HDD", "graphics": "GeForce GTX 1650", "price": "$1499.99"}',
'{"id": 14, "brand": "FusionTech", "model": "SwiftNova", "cpu": "OctaCore 3.0GHz", "ram": "8GB", "storage": "256GB NVMe", "graphics": "NVIDIA GTX 1060", "price": "$1199.99"}',
'{"id": 15, "brand": "MegaTech", "model": "TurboWave", "cpu": "HexaCore 2.2GHz", "ram": "128GB", "storage": "2TB SSD", "graphics": "AMD RX 580", "price": "$1699.99"}',
'{"id": 16, "brand": "QuantumFusion", "model": "RapidX", "cpu": "QuadCore 3.2GHz", "ram": "16GB", "storage": "512GB HDD", "graphics": "Intel UHD Graphics", "price": "$1299.99"}',
'{"id": 17, "brand": "EpicPulse", "model": "BlitzTech", "cpu": "OctaCore 2.8GHz", "ram": "32GB", "storage": "1TB SSD", "graphics": "NVIDIA RTX 3060", "price": "$1499.99"}',
'{"id": 18, "brand": "TechNova", "model": "XtremeSpeed", "cpu": "HexaCore 3.5GHz", "ram": "64GB", "storage": "256GB SSD", "graphics": "GeForce GTX 1660", "price": "$1599.99"}',
'{"id": 19, "brand": "CyberBlaze", "model": "ThunderX", "cpu": "QuadCore 2.2GHz", "ram": "8GB", "storage": "512GB SSD", "graphics": "AMD RX 5600 XT", "price": "$1199.99"}',
'{"id": 20, "brand": "FutureTech", "model": "SwiftFusion", "cpu": "HexaCore 3.0GHz", "ram": "128GB", "storage": "1TB NVMe", "graphics": "NVIDIA GTX 1070", "price": "$1799.99"}'
];
computers | getschema
Right now the table only has one column, a string. It has a lot of different fields encoded in that one string though, and we would like to pull them out into separate columns so we can enjoy the full set of advantages associated with proper tabular formats, like ease of calculating statistics.
We will not use regex alone to pull these values out into columns because KQL has dedicated functions for JSON parsing out-of-the-box.
computers
| extend parsed=parse_json(json)
Now you’ll see the new parsed
column has a dynamic
object of which you can access individual members. Let’s use project to
split these things into separate columns, additionally splitting out
further details nested in some of the strings via regex:
computers
| extend parsed=parse_json(json)
| project id=toint(
parsed.id
), brand=tostring(
parsed.brand
), model=tostring(
parsed.model
), cpuSpecs=extract_all(
"^([Qq][Uu][Aa][Dd]|[Hh][Ee][Xx][Aa]|[Oo][Cc][Tt][Aa])[Cc][Oo][Rr][Ee][ ](\\d+[.]?\\d*)GHz$",
tostring(
parsed.cpu
)
), ramGB=extract(
"^([0-9]+)[Gg][Bb]$",
1,
tostring(
parsed.ram
)
), storage=extract_all(
"^([0-9]+)([GgTt][Bb])[ ]([A-Za-z0-9_.-]+)$",
tostring(
parsed.storage
)
), graphics=tostring(
parsed.graphics
), priceDollars=extract(
"^[$](.*)$",
1,
tostring(
parsed.price
)
)
Where are we going with this regex business? Well, some of the values in those strings are numerical values we might want to sort the table of computers by. Tack the following onto your command pipeline, and you’ll see how we create new columns with the values we extracted:
| project id, brand, model, graphics, priceDollars, cpuCores=case(
cpuSpecs[0][0] =~ "quad", 4,
cpuSpecs[0][0] =~ "hexa", 6,
cpuSpecs[0][0] =~ "octa", 8,
-1
), cpuGHz=todouble(
cpuSpecs[0][1]
),
storageSize=toint(
storage[0][0]
), storageUnit=storage[0][1], storageType=storage[0][2]
| extend storageGB=case(
storageUnit =~ "GB",
storageSize,
storageSize*1000
)
| project-away storageSize, storageUnit
This became about more than just JSON parsing. BUT, do you see all the stuff we can now do that we otherwise couldn't without the parsing step? At least, not unless we wanted to get a lot more complicated with regex and parsing logic.
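As a taste of what that buys us, here's a sketch of a summarize you could append to the pipeline above, now that the numeric columns exist (note priceDollars was extracted as a string, so it needs a todouble cast):
// Average price and storage per CPU core count -- impossible while everything was one JSON string
| summarize avgPriceDollars = avg(todouble(priceDollars)),
            avgStorageGB    = avg(storageGB)
         by cpuCores
| sort by cpuCores asc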
Sometimes, the data in KQL tables changes. Some dev you haven’t met made a change, and now your query is broken because the records in the table look like this now:
datatable(data:string)[
@"{""SKU"":""CMP067"",""data"":""{\""id\"": 1, \""brand\"": \""TechMaster\"", \""model\"": \""X9000\"", \""cpu\"": \""QuadCore 3.0GHz\"", \""ram\"": \""16GB\"", \""storage\"": \""512GB SSD\"", \""graphics\"": \""NVIDIA GTX 1080\"", \""price\"": \""$1299.99\""}""}",
@"{""SKU"":""CMP048"",""data"":""{\""id\"": 2, \""brand\"": \""QuantumTech\"", \""model\"": \""UltraCore\"", \""cpu\"": \""OctaCore 2.5GHz\"", \""ram\"": \""32GB\"", \""storage\"": \""1TB HDD\"", \""graphics\"": \""AMD RX 5700\"", \""price\"": \""$1499.99\""}""}",
@"{""SKU"":""CMP088"",""data"":""{\""id\"": 3, \""brand\"": \""CyberFusion\"", \""model\"": \""StealthX\"", \""cpu\"": \""HexaCore 2.8GHz\"", \""ram\"": \""64GB\"", \""storage\"": \""256GB NVMe\"", \""graphics\"": \""GeForce RTX 3060\"", \""price\"": \""$1699.99\""}""}",
@"{""SKU"":""CMP090"",""data"":""{\""id\"": 4, \""brand\"": \""SysGenius\"", \""model\"": \""ElitePro\"", \""cpu\"": \""QuadCore 3.2GHz\"", \""ram\"": \""8GB\"", \""storage\"": \""1TB SSD\"", \""graphics\"": \""Intel UHD Graphics\"", \""price\"": \""$1099.99\""}""}",
@"{""SKU"":""CMP017"",""data"":""{\""id\"": 5, \""brand\"": \""MegaByte\"", \""model\"": \""PowerSpeed\"", \""cpu\"": \""OctaCore 3.5GHz\"", \""ram\"": \""128GB\"", \""storage\"": \""2TB HDD\"", \""graphics\"": \""NVIDIA RTX 3080\"", \""price\"": \""$1999.99\""}""}",
@"{""SKU"":""CMP096"",""data"":""{\""id\"": 6, \""brand\"": \""NeoGaming\"", \""model\"": \""TitanX\"", \""cpu\"": \""HexaCore 3.0GHz\"", \""ram\"": \""16GB\"", \""storage\"": \""512GB NVMe\"", \""graphics\"": \""AMD RX 5600 XT\"", \""price\"": \""$1299.99\""}""}",
@"{""SKU"":""CMP094"",""data"":""{\""id\"": 7, \""brand\"": \""Futurix\"", \""model\"": \""FusionWave\"", \""cpu\"": \""QuadCore 2.7GHz\"", \""ram\"": \""32GB\"", \""storage\"": \""1TB SSD\"", \""graphics\"": \""GeForce GTX 1660\"", \""price\"": \""$1399.99\""}""}",
@"{""SKU"":""CMP017"",""data"":""{\""id\"": 8, \""brand\"": \""TechWave\"", \""model\"": \""SwiftForce\"", \""cpu\"": \""OctaCore 2.2GHz\"", \""ram\"": \""64GB\"", \""storage\"": \""256GB SSD\"", \""graphics\"": \""NVIDIA GTX 1070\"", \""price\"": \""$1599.99\""}""}",
@"{""SKU"":""CMP067"",""data"":""{\""id\"": 9, \""brand\"": \""QuantumPulse\"", \""model\"": \""Xplorer\"", \""cpu\"": \""HexaCore 2.5GHz\"", \""ram\"": \""8GB\"", \""storage\"": \""512GB HDD\"", \""graphics\"": \""AMD RX 550\"", \""price\"": \""$1099.99\""}""}",
@"{""SKU"":""CMP024"",""data"":""{\""id\"": 10, \""brand\"": \""EpicGen\"", \""model\"": \""ThunderBlaze\"", \""cpu\"": \""QuadCore 3.5GHz\"", \""ram\"": \""128GB\"", \""storage\"": \""2TB SSD\"", \""graphics\"": \""Intel Iris Xe\"", \""price\"": \""$1899.99\""}""}",
@"{""SKU"":""CMP069"",""data"":""{\""id\"": 11, \""brand\"": \""CyberPulse\"", \""model\"": \""NovaSpeed\"", \""cpu\"": \""OctaCore 3.2GHz\"", \""ram\"": \""16GB\"", \""storage\"": \""1TB NVMe\"", \""graphics\"": \""NVIDIA RTX 3070\"", \""price\"": \""$1799.99\""}""}",
@"{""SKU"":""CMP071"",""data"":""{\""id\"": 12, \""brand\"": \""TechFusion\"", \""model\"": \""VelocityX\"", \""cpu\"": \""HexaCore 2.8GHz\"", \""ram\"": \""32GB\"", \""storage\"": \""512GB SSD\"", \""graphics\"": \""AMD RX 570\"", \""price\"": \""$1299.99\""}""}",
@"{""SKU"":""CMP019"",""data"":""{\""id\"": 13, \""brand\"": \""InnoTech\"", \""model\"": \""BlazePro\"", \""cpu\"": \""QuadCore 2.5GHz\"", \""ram\"": \""64GB\"", \""storage\"": \""1TB HDD\"", \""graphics\"": \""GeForce GTX 1650\"", \""price\"": \""$1499.99\""}""}",
@"{""SKU"":""CMP056"",""data"":""{\""id\"": 14, \""brand\"": \""FusionTech\"", \""model\"": \""SwiftNova\"", \""cpu\"": \""OctaCore 3.0GHz\"", \""ram\"": \""8GB\"", \""storage\"": \""256GB NVMe\"", \""graphics\"": \""NVIDIA GTX 1060\"", \""price\"": \""$1199.99\""}""}",
@"{""SKU"":""CMP044"",""data"":""{\""id\"": 15, \""brand\"": \""MegaTech\"", \""model\"": \""TurboWave\"", \""cpu\"": \""HexaCore 2.2GHz\"", \""ram\"": \""128GB\"", \""storage\"": \""2TB SSD\"", \""graphics\"": \""AMD RX 580\"", \""price\"": \""$1699.99\""}""}",
@"{""SKU"":""CMP019"",""data"":""{\""id\"": 16, \""brand\"": \""QuantumFusion\"", \""model\"": \""RapidX\"", \""cpu\"": \""QuadCore 3.2GHz\"", \""ram\"": \""16GB\"", \""storage\"": \""512GB HDD\"", \""graphics\"": \""Intel UHD Graphics\"", \""price\"": \""$1299.99\""}""}",
@"{""SKU"":""CMP014"",""data"":""{\""id\"": 17, \""brand\"": \""EpicPulse\"", \""model\"": \""BlitzTech\"", \""cpu\"": \""OctaCore 2.8GHz\"", \""ram\"": \""32GB\"", \""storage\"": \""1TB SSD\"", \""graphics\"": \""NVIDIA RTX 3060\"", \""price\"": \""$1499.99\""}""}",
@"{""SKU"":""CMP084"",""data"":""{\""id\"": 18, \""brand\"": \""TechNova\"", \""model\"": \""XtremeSpeed\"", \""cpu\"": \""HexaCore 3.5GHz\"", \""ram\"": \""64GB\"", \""storage\"": \""256GB SSD\"", \""graphics\"": \""GeForce GTX 1660\"", \""price\"": \""$1599.99\""}""}",
@"{""SKU"":""CMP046"",""data"":""{\""id\"": 19, \""brand\"": \""CyberBlaze\"", \""model\"": \""ThunderX\"", \""cpu\"": \""QuadCore 2.2GHz\"", \""ram\"": \""8GB\"", \""storage\"": \""512GB SSD\"", \""graphics\"": \""AMD RX 5600 XT\"", \""price\"": \""$1199.99\""}""}",
@"{""SKU"":""CMP016"",""data"":""{\""id\"": 20, \""brand\"": \""FutureTech\"", \""model\"": \""SwiftFusion\"", \""cpu\"": \""HexaCore 3.0GHz\"", \""ram\"": \""128GB\"", \""storage\"": \""1TB NVMe\"", \""graphics\"": \""NVIDIA GTX 1070\"", \""price\"": \""$1799.99\""}""}"
]
| project data=parse_json(data)
Even after you parse the JSON, most of the data is still inside a single string!
This is nested JSON. It means after an object was serialized to JSON, a developer assigned that JSON string to a field inside a different object. Then, that other object was serialized to JSON.
No worries, just run the parser again on that nested JSON payload! Append the following to your pipeline:
| project data=parse_json(data)
| project SKU=tostring(data.SKU), parsed=parse_json(tostring(data.data))
See the (seemingly) extra call to tostring near the end there? That has to happen because at that point in the pipeline data.data is a dynamic value, and parse_json returns dynamic input as-is without re-parsing it; casting to a string first forces the nested JSON to actually get parsed.
Note that this whole thing can just be inserted before most of the rest of the old query to make it work just like it used to work, owing to our choice of column names.
I call this “shimming the old query”, though there’s probably already a term for this. It’s just where your table changed so, to avoid re-engineering a whole query, you just insert some logic early in the query to make the table “fit” the old query. It’s a handy technique that allows you to preserve the intentions of a previous query author without inserting your own opinions.
Only a slight modification is needed to allow the SKU to persist through to the final form of the table. Here's the full pipeline, duplicate SKUs and all ;)
datatable(data:string)[
@"{""SKU"":""CMP067"",""data"":""{\""id\"": 1, \""brand\"": \""TechMaster\"", \""model\"": \""X9000\"", \""cpu\"": \""QuadCore 3.0GHz\"", \""ram\"": \""16GB\"", \""storage\"": \""512GB SSD\"", \""graphics\"": \""NVIDIA GTX 1080\"", \""price\"": \""$1299.99\""}""}",
@"{""SKU"":""CMP048"",""data"":""{\""id\"": 2, \""brand\"": \""QuantumTech\"", \""model\"": \""UltraCore\"", \""cpu\"": \""OctaCore 2.5GHz\"", \""ram\"": \""32GB\"", \""storage\"": \""1TB HDD\"", \""graphics\"": \""AMD RX 5700\"", \""price\"": \""$1499.99\""}""}",
@"{""SKU"":""CMP088"",""data"":""{\""id\"": 3, \""brand\"": \""CyberFusion\"", \""model\"": \""StealthX\"", \""cpu\"": \""HexaCore 2.8GHz\"", \""ram\"": \""64GB\"", \""storage\"": \""256GB NVMe\"", \""graphics\"": \""GeForce RTX 3060\"", \""price\"": \""$1699.99\""}""}",
@"{""SKU"":""CMP090"",""data"":""{\""id\"": 4, \""brand\"": \""SysGenius\"", \""model\"": \""ElitePro\"", \""cpu\"": \""QuadCore 3.2GHz\"", \""ram\"": \""8GB\"", \""storage\"": \""1TB SSD\"", \""graphics\"": \""Intel UHD Graphics\"", \""price\"": \""$1099.99\""}""}",
@"{""SKU"":""CMP017"",""data"":""{\""id\"": 5, \""brand\"": \""MegaByte\"", \""model\"": \""PowerSpeed\"", \""cpu\"": \""OctaCore 3.5GHz\"", \""ram\"": \""128GB\"", \""storage\"": \""2TB HDD\"", \""graphics\"": \""NVIDIA RTX 3080\"", \""price\"": \""$1999.99\""}""}",
@"{""SKU"":""CMP096"",""data"":""{\""id\"": 6, \""brand\"": \""NeoGaming\"", \""model\"": \""TitanX\"", \""cpu\"": \""HexaCore 3.0GHz\"", \""ram\"": \""16GB\"", \""storage\"": \""512GB NVMe\"", \""graphics\"": \""AMD RX 5600 XT\"", \""price\"": \""$1299.99\""}""}",
@"{""SKU"":""CMP094"",""data"":""{\""id\"": 7, \""brand\"": \""Futurix\"", \""model\"": \""FusionWave\"", \""cpu\"": \""QuadCore 2.7GHz\"", \""ram\"": \""32GB\"", \""storage\"": \""1TB SSD\"", \""graphics\"": \""GeForce GTX 1660\"", \""price\"": \""$1399.99\""}""}",
@"{""SKU"":""CMP017"",""data"":""{\""id\"": 8, \""brand\"": \""TechWave\"", \""model\"": \""SwiftForce\"", \""cpu\"": \""OctaCore 2.2GHz\"", \""ram\"": \""64GB\"", \""storage\"": \""256GB SSD\"", \""graphics\"": \""NVIDIA GTX 1070\"", \""price\"": \""$1599.99\""}""}",
@"{""SKU"":""CMP067"",""data"":""{\""id\"": 9, \""brand\"": \""QuantumPulse\"", \""model\"": \""Xplorer\"", \""cpu\"": \""HexaCore 2.5GHz\"", \""ram\"": \""8GB\"", \""storage\"": \""512GB HDD\"", \""graphics\"": \""AMD RX 550\"", \""price\"": \""$1099.99\""}""}",
@"{""SKU"":""CMP024"",""data"":""{\""id\"": 10, \""brand\"": \""EpicGen\"", \""model\"": \""ThunderBlaze\"", \""cpu\"": \""QuadCore 3.5GHz\"", \""ram\"": \""128GB\"", \""storage\"": \""2TB SSD\"", \""graphics\"": \""Intel Iris Xe\"", \""price\"": \""$1899.99\""}""}",
@"{""SKU"":""CMP069"",""data"":""{\""id\"": 11, \""brand\"": \""CyberPulse\"", \""model\"": \""NovaSpeed\"", \""cpu\"": \""OctaCore 3.2GHz\"", \""ram\"": \""16GB\"", \""storage\"": \""1TB NVMe\"", \""graphics\"": \""NVIDIA RTX 3070\"", \""price\"": \""$1799.99\""}""}",
@"{""SKU"":""CMP071"",""data"":""{\""id\"": 12, \""brand\"": \""TechFusion\"", \""model\"": \""VelocityX\"", \""cpu\"": \""HexaCore 2.8GHz\"", \""ram\"": \""32GB\"", \""storage\"": \""512GB SSD\"", \""graphics\"": \""AMD RX 570\"", \""price\"": \""$1299.99\""}""}",
@"{""SKU"":""CMP019"",""data"":""{\""id\"": 13, \""brand\"": \""InnoTech\"", \""model\"": \""BlazePro\"", \""cpu\"": \""QuadCore 2.5GHz\"", \""ram\"": \""64GB\"", \""storage\"": \""1TB HDD\"", \""graphics\"": \""GeForce GTX 1650\"", \""price\"": \""$1499.99\""}""}",
@"{""SKU"":""CMP056"",""data"":""{\""id\"": 14, \""brand\"": \""FusionTech\"", \""model\"": \""SwiftNova\"", \""cpu\"": \""OctaCore 3.0GHz\"", \""ram\"": \""8GB\"", \""storage\"": \""256GB NVMe\"", \""graphics\"": \""NVIDIA GTX 1060\"", \""price\"": \""$1199.99\""}""}",
@"{""SKU"":""CMP044"",""data"":""{\""id\"": 15, \""brand\"": \""MegaTech\"", \""model\"": \""TurboWave\"", \""cpu\"": \""HexaCore 2.2GHz\"", \""ram\"": \""128GB\"", \""storage\"": \""2TB SSD\"", \""graphics\"": \""AMD RX 580\"", \""price\"": \""$1699.99\""}""}",
@"{""SKU"":""CMP019"",""data"":""{\""id\"": 16, \""brand\"": \""QuantumFusion\"", \""model\"": \""RapidX\"", \""cpu\"": \""QuadCore 3.2GHz\"", \""ram\"": \""16GB\"", \""storage\"": \""512GB HDD\"", \""graphics\"": \""Intel UHD Graphics\"", \""price\"": \""$1299.99\""}""}",
@"{""SKU"":""CMP014"",""data"":""{\""id\"": 17, \""brand\"": \""EpicPulse\"", \""model\"": \""BlitzTech\"", \""cpu\"": \""OctaCore 2.8GHz\"", \""ram\"": \""32GB\"", \""storage\"": \""1TB SSD\"", \""graphics\"": \""NVIDIA RTX 3060\"", \""price\"": \""$1499.99\""}""}",
@"{""SKU"":""CMP084"",""data"":""{\""id\"": 18, \""brand\"": \""TechNova\"", \""model\"": \""XtremeSpeed\"", \""cpu\"": \""HexaCore 3.5GHz\"", \""ram\"": \""64GB\"", \""storage\"": \""256GB SSD\"", \""graphics\"": \""GeForce GTX 1660\"", \""price\"": \""$1599.99\""}""}",
@"{""SKU"":""CMP046"",""data"":""{\""id\"": 19, \""brand\"": \""CyberBlaze\"", \""model\"": \""ThunderX\"", \""cpu\"": \""QuadCore 2.2GHz\"", \""ram\"": \""8GB\"", \""storage\"": \""512GB SSD\"", \""graphics\"": \""AMD RX 5600 XT\"", \""price\"": \""$1199.99\""}""}",
@"{""SKU"":""CMP016"",""data"":""{\""id\"": 20, \""brand\"": \""FutureTech\"", \""model\"": \""SwiftFusion\"", \""cpu\"": \""HexaCore 3.0GHz\"", \""ram\"": \""128GB\"", \""storage\"": \""1TB NVMe\"", \""graphics\"": \""NVIDIA GTX 1070\"", \""price\"": \""$1799.99\""}""}"
]
| project data=parse_json(data)
| project SKU=tostring(data.SKU), parsed=parse_json(tostring(data.data))
| project SKU, id=toint(
parsed.id
), brand=tostring(
parsed.brand
), model=tostring(
parsed.model
), cpuSpecs=extract_all(
"^([Qq][Uu][Aa][Dd]|[Hh][Ee][Xx][Aa]|[Oo][Cc][Tt][Aa])[Cc][Oo][Rr][Ee][ ](\\d+[.]?\\d*)GHz$",
tostring(
parsed.cpu
)
), ramGB=extract(
"^([0-9]+)[Gg][Bb]$",
1,
tostring(
parsed.ram
)
), storage=extract_all(
"^([0-9]+)([GgTt][Bb])[ ]([A-Za-z0-9_.-]+)$",
tostring(
parsed.storage
)
), graphics=tostring(
parsed.graphics
), priceDollars=extract(
"^[$](.*)$",
1,
tostring(
parsed.price
)
)
| project SKU, id, brand, model, graphics, priceDollars, cpuCores=case(
cpuSpecs[0][0] =~ "quad", 4,
cpuSpecs[0][0] =~ "hexa", 6,
cpuSpecs[0][0] =~ "octa", 8,
-1
), cpuGHz=todouble(
cpuSpecs[0][1]
),
storageSize=toint(
storage[0][0]
), storageUnit=storage[0][1], storageType=storage[0][2]
| extend storageGB=case(
storageUnit =~ "GB",
storageSize,
storageSize*1000
)
| project-away storageSize, storageUnit
There are many developers you may encounter out in the wild who don’t have an overly refined sense of when they are departing from the safer realms of well-defined, grammatically correct file formats. This isn’t a dig on such folks, but they often have other priorities. Consequently, you may find the data you’re looking for is encapsulated within a file format that doesn’t have a ready-made parser for you within KQL. When that happens, you’ll have to effectively compose an ad-hoc parsing routine in order to extract the data in question into columns.
The example that follows is admittedly contrived. BUT, it’s a perfect example of the kind of grammatically ill-defined language you might encounter in the wild. It would be ideal for us if all logging formats were in perfectly tabular formats, but alas the real world is dirtier than that. This lesson is about not letting unusual types of string encoding and logging formats get in the way of your quest to comprehend data in a statistically valid way.
Given the following data table, you may be tempted to try to write a master regex pattern to extract all of the relevant data at once. Not so fast! Part of being a good user of regex is knowing its limits. Even if it were possible given the limited nature of the next example, imagine the following table instead had 40 million records; in such situations, it can be commonplace to encounter bizarre edge cases. I submit to you that it’s good to be able to use regex (and we will) but also just as important to use standard parsing features when they’re already available.
Before we start coding, let’s have a gander at the lunacy in the table below:
let looneyLogs = datatable(doc:string)[
"computer = server016; <batches><batch>{\"DurationMS\": 90, \"Payload\": \"- Name: John Doe\\n DOB: 1990-05-15\\n Email: john.doe@email.com\\n- Name: Jane Smith\\n DOB: 1985-08-22\\n Email: jane.smith@email.com\\n- Name: Alex Brown\\n DOB: 1995-11-10\\n Email: alex.brown@email.com\"}</batch></batches>",
"computer = server081; <batches><batch>{\"DurationMS\": 81, \"Payload\": \"- name: Emily White\\n dob: 1992-09-18\\n email: emily.white@email.com\\n- name: Mark Johnson\\n dob: 1980-03-05\\n email: mark.johnson@email.com\\n- name: Sarah Miller\\n dob: 1995-11-30\\n email: sarah.miller@email.com\"}</batch><batch>{\"DurationMS\": 65, \"Payload\": \"- name: Alex Turner\\n dob: 1987-06-24\\n email: alex.turner@email.com\\n- name: Jessica Adams\\n dob: 1994-12-15\\n email: jessica.adams@email.com\\n- name: Brian Parker\\n dob: 1982-08-03\\n email: brian.parker@email.com\\n- name: Rachel Lee\\n dob: 1998-02-09\\n email: rachel.lee@email.com\\n- name: Kevin Scott\\n dob: 1989-05-17\\n email: kevin.scott@email.com\"}</batch></batches>",
"computer = server056; <batches><batch>{\"DurationMS\": 72, \"Payload\": \"- name: Olivia Jenkins\\n dob: 1991-10-07\\n email: olivia.jenkins@email.com\\n- name: Marcus Turner\\n dob: 1985-04-18\\n email: marcus.turner@email.com\"}</batch></batches>",
"computer = server081; <batches><batch>{\"DurationMS\": 15, \"Payload\": \"- name: Laura Davis\\n dob: 1993-08-26\\n email: laura.davis@email.com\\n- name: Ethan Wilson\\n dob: 1984-12-03\\n email: ethan.wilson@email.com\\n- name: Amanda Parker\\n dob: 1990-02-14\\n email: amanda.parker@email.com\\n- name: Tyler Mitchell\\n dob: 1988-06-11\\n email: tyler.mitchell@email.com\"}</batch><batch>{\"DurationMS\": 48, \"Payload\": \"- name: Grace Roberts\\n dob: 1997-03-29\\n email: grace.roberts@email.com\"}</batch><batch>{\"DurationMS\": 7, \"Payload\": \"- name: Benjamin Hall\\n dob: 1981-09-22\\n email: benjamin.hall@email.com\\n- name: Natalie Adams\\n dob: 1996-11-15\\n email: natalie.adams@email.com\\n- name: Christopher Hill\\n dob: 1986-07-01\\n email: christopher.hill@email.com\"}</batch></batches>",
"computer = server056; <batches><batch>{\"DurationMS\": 33, \"Payload\": \"- name: Lily Moore\\n dob: 1999-01-12\\n email: lily.moore@email.com\\n- name: Jackson Brown\\n dob: 1983-05-04\\n email: jackson.brown@email.com\\n- name: Chloe Taylor\\n dob: 1995-09-17\\n email: chloe.taylor@email.com\"}</batch><batch>{\"DurationMS\": 13, \"Payload\": \"- name: Daniel Wright\\n dob: 1989-12-30\\n email: daniel.wright@email.com\\n- name: Sophia Smith\\n dob: 1992-04-23\\n email: sophia.smith@email.com\"}</batch></batches>",
"computer = server016; <batches><batch>{\"DurationMS\": 9, \"Payload\": \"- name: Ryan Miller\\n dob: 1987-11-08\\n email: ryan.miller@email.com\"}</batch></batches>",
"computer = server091; <batches><batch>{\"DurationMS\": 8, \"Payload\": \"- name: Emma Turner\\n dob: 1994-06-20\\n email: emma.turner@email.com\\n- name: Noah Johnson\\n dob: 1980-02-01\\n email: noah.johnson@email.com\\n- name: Madison Scott\\n dob: 1998-07-14\\n email: madison.scott@email.com\\n- name: Owen Harris\\n dob: 1982-10-27\\n email: owen.harris@email.com\"}</batch></batches>"];
First, it's kind of a lot of text, so I've assigned it to a variable named looneyLogs, and hereafter in this article I will just refer to that table name. When you're practicing, feel free to pipe the table directly into the rest of your query. I'm only doing this to avoid excessive copying and pasting.
Second, these logs are kind of an eyesore! True to the situations you might find yourself in out in the wilds of computing, I'm not going to provide you with a formal grammar describing the language in these strings/documents.
But, I will give you an informal grammar, the kind you might come up with scanning through the list:

- Each document starts with a computer name assignment terminated by a semicolon; the rest of the string is an XML fragment with a root element called batches, the closing tag of which is located at the end of the string.
- Inside batches is one or more batch elements. The inner content of each batch element looks like a JSON document.
- Each of those JSON documents has a number called DurationMS, and a string called Payload.
- The Payload strings appear to contain user info, with a seemingly random number of user records per string. Are these strings formatted in YAML?

We have our work cut out for us. Let's parse out the info contained in these monstrous strings into a beautiful little table. If you're coding along at home and don't want the answer spoiled, now's the time to pause and try it out yourself!
looneyLogs
| extend docId=new_guid()
Since I’m not 100% sure if it’s important to logically associate the
ultimate rows in the resulting table to the original records they came
from, I’m playing it safe and assigning each a unique ID called
docId
.
If you hadn’t heard of it before, a guid is a globally unique id, which is a kind of really large random value you can use as an id with very low risk of accidentally repeating the number. Most languages and platforms have a way of making these.
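In KQL it's just a call to new_guid(); for example:
print sampleGuid = new_guid()  // a fresh random GUID each time you run it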
Next, let’s use some regex to save the computer name and xml document parts of the string into their own columns! First the computer name:
| extend computerName=extract(
"^[cC]omputer[ ]*[=][ ]*([^;]+)[;][ ]*",
1,
doc
)
We’re letting it start with a lower or uppercase c
to be
permissive. Keeping with the permissive theme, we’ll allow any number of
spaces (including 0) before and after the =
.
We want to start capturing a sequence of one or more characters and keep doing so as long as each character is not a semicolon. That will be our computer name! Then we match the final semicolon and any optional spaces afterward.
That’s it for capturing the computer name. Next, we will re-use, modify, and extend the same pattern to extract the xml part of the document. Technically, we could use a single regex pattern to extract both things in one shot, but it’s also okay to repeat yourself a little bit in the early stages of trying to solve a problem. You can always tidy up and optimize later.
| extend xmlDocument=extract(
"^[cC]omputer[ ]*[=][ ]*[^;]+[;][ ]*([\\s\\S]+)$",
1,
doc
)
| project docId, computerName, xmlDocument
We use the same strategy to match all the way through what we matched
last time, only minus the capturing aspect for the portion that matches
the computer name. Afterwards we append a capturing group to match all
characters (both whitespace and non-whitespace) through the end of the
string. The actual capturing group is ([\s\S]+)
, but in the
kind of string we are using to specify our regex pattern backslashes are
an escape character so we needed to provide double backslashes
instead.
It's true there is some redundancy within these two consecutive regex patterns. The two of them could be consolidated into a single pattern, but it is okay to keep things sketchy when you're first developing solutions.
Finally, we list the only columns we still care about to get rid of
the original doc
column. Congrats! We’re well on our way to
exorcising the formatting demons out of this poor information.
We’re not writing a custom parsing routine for XML, especially since there is already such functionality out of the box in KQL. Though that’s a fun idea for another day. Invoking the XML parser is easy:
| extend parsedXML=parse_xml(xmlDocument)
Great, now the new parsedXML
column is a kind of KQL
data structure containing the contents of the XML document. I say
data structure, but of course I mean that in the more
colloquial sense of a hodgepodge of hash tables and arrays.
We can see inside it now the keys representing the old
batches
and batch
elements. It looks like the
parser consolidated the cases of multiple batch
elements
into a single key called batch
holding an array where each
element is the inner content from one of the batch
elements
(i.e., a JSON string). That makes these batch
elements what
is known as a multi-valued field, since that field in a
given single record can itself contain multiple values.
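If you want to peek at that multi-valued field before doing anything about it, you could append something like this to the pipeline right after the parse_xml step (batchArray is just a name I'm making up for the peek):
// batchArray holds the inner content of the batch element(s); for records with multiple
// batch elements it is an array with one slot per batch
| extend batchArray = parsedXML.batches.batch
| project docId, computerName, batchArray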
So how do we cope with this? At this point, the way forward is going to be to split these records based on that multi-value field. A record with n values in its multi-valued field will be “cloned” (I use that word loosely, we never want to assume too much about what KQL does under the hood) so there are n copies of it in the table; each of the n records will be given one of the values from that multi-valued field.
Make sense? If not, try running the pipeline of commands we have
built with and without the following mv-expand
command to
develop a sense of what is happening:
| mv-expand jsonDocument=parsedXML.batches.batch to typeof(string)
You can see that jsonDocument
is the name of the new
column, and we’re assigning to it a reference to the multi-valued
batch
field: parsedXML.batches.batch
. And just
to be clear, after this multi-value field is “expanded”, we want the
resulting type stored in the jsonDocument
column to be a
string.
Whew!
Let’s give each of the records an id to differentiate which batch
element any further data came from, in case we split out more data using
multi-valued field techniques (ahem, foreshadowing, ahem).
Also, it’s maybe a good time for a fresh call to project
to
declutter our table a bit before proceeding.
| extend batchId=new_guid()
| project docId, computerName, jsonDocument, batchId
Now we have to pull values out of the JSON.
Are we going to use custom regex-based parsing logic to pull the key-value pairs out of the JSON strings?
No, we are not. Mainly because KQL already has off-the-shelf JSON parsing functions (similar to the XML one we used), the JSON strings do seem well-formed and so compatible with said functions, and also because we aren’t crazy.
| extend parsedJSON=parse_json(jsonDocument)
| extend DurationMS=parsedJSON.DurationMS
| extend Payload=parsedJSON.Payload
Boom! Now we pulled out the two fields from each JSON string into their own columns. We are in the home stretch.
The Payload
column contains what we earlier surmised to
be sets of YAML formatted records. Feels like we’ll need to separate
these records, this time without the help of a dedicated parsing
function.
This part can be done a number of different ways. My approach will be
tailored to demonstrating another multi-value function called
mv-apply
:
| extend lines=split(Payload, "\n")
| mv-apply with_itemindex=index recordIndexes=lines on (
where index % 3 == 0
| extend recordId=new_guid()
| project index, recordId
)
We split the Payload field on newline characters and assign the
resulting array of strings to the lines
column.
mv-apply is similar to mv-expand, but we gain the ability to run a subquery (between parentheses) which can reduce the number of records the original multi-valued field is expanded into.
In our case, we want one record returned for each logical record in
the YAML string. A logical record in the YAML string comprises three
lines, or three slots in the lines
array. We may not know
how many records are in each YAML string, but it’s a safe bet the line
count of all the records will be a multiple of three.
So if we have n lines, we want to split the multi-valued
field into n / 3
records, where each record has a new
column containing one of the starting indices of the records in the YAML
string. Using that piece of information along with the
lines
array, we can derive whole records. In addition to
the index
column we generate in the subquery, we also
generate an ID for each record in the YAML string.
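To convince yourself of that three-lines-per-record shape, here's a sketch that splits a fragment lifted from the first Payload in our fake data:
// Yields a 6-element array: two logical records, starting at indices 0 and 3
print lines = split("- Name: John Doe\n DOB: 1990-05-15\n Email: john.doe@email.com\n- Name: Jane Smith\n DOB: 1985-08-22\n Email: jane.smith@email.com", "\n")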
Before we move on, note the index field in the where clause within the subquery. This field contains the index within the multi-value field array that the subquery is currently on; unless that value is zero or another multiple of three (i.e., the start of a logical record), we don't want it!
Now we can use the index to target the appropriate slots in the lines
array to extract the remaining values via regex. Starting with
name
:
| extend name=extract(
"^[-][ ][nN]ame[:][ ]+([\\s\\S]+)$",
1,
tostring(lines[index])
)
Nothing new is introduced in this regex pattern; it should be old hat to you by now! The pattern targets the string in the lines array which can be accessed with the value in the index column.
The process for extracting the DOB is similar, except the value lives in the subsequent slot of the lines array. And finally, email can be extracted from the slot just after that.
| extend DOB=extract(
"^[ ]+[dD][oO][bB][:][ ]+([\\s\\S]+)$",
1,
tostring(lines[index+1])
)
| extend email=extract(
"^[ ]+[eE]mail[:][ ]+([\\s\\S]+)$",
1,
tostring(lines[index+2])
)
Great! We literally have all our columns now! Let’s clean up the table a bit:
| project-away jsonDocument, parsedJSON, Payload, lines, index
And the final version of the table is beautiful to behold. Remember the monstrosity we started with? This is way better! We could actually do stats on these values now.
Here’s our final query:
looneyLogs
| extend docId=new_guid()
| extend computerName=extract(
"^[cC]omputer[ ]*[=][ ]*([^;]+)[;][ ]*",
1,
doc
)
| extend xmlDocument=extract(
"^[cC]omputer[ ]*[=][ ]*[^;]+[;][ ]*([\\s\\S]+)$",
1,
doc
)
| project docId, computerName, xmlDocument
| extend parsedXML=parse_xml(xmlDocument)
| mv-expand jsonDocument=parsedXML.batches.batch to typeof(string)
| extend batchId=new_guid()
| project docId, computerName, jsonDocument, batchId
| extend parsedJSON=parse_json(jsonDocument)
| extend DurationMS=parsedJSON.DurationMS
| extend Payload=parsedJSON.Payload
| extend lines=split(Payload, "\n")
| mv-apply with_itemindex=index recordIndexes=lines on (
where index % 3 == 0
| extend recordId=new_guid()
| project index, recordId
)
| extend name=extract(
"^[-][ ][nN]ame[:][ ]+([\\s\\S]+)$",
1,
tostring(lines[index])
)
| extend DOB=extract(
"^[ ]+[dD][oO][bB][:][ ]+([\\s\\S]+)$",
1,
tostring(lines[index+1])
)
| extend email=extract(
"^[ ]+[eE]mail[:][ ]+([\\s\\S]+)$",
1,
tostring(lines[index+2])
)
| project-away jsonDocument, parsedJSON, Payload, lines, index
In the end, it’s important to note, this table might not even be the ideal ultimate destination format. There are a few criticisms you could make depending on what your goals are. For one, the dates could be split into separate numerical values for years, months, days. I leave such goals as optional exercises for the reader.
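For instance, here's a sketch of one way to tackle the date-splitting exercise, appended to the end of the final query (dobYear/dobMonth/dobDay are names I'm making up here):
// DOB values look like "1990-05-15", so splitting on "-" gives [year, month, day]
| extend dobParts = split(DOB, "-")
| extend dobYear  = toint(dobParts[0]),
         dobMonth = toint(dobParts[1]),
         dobDay   = toint(dobParts[2])
| project-away dobParts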
1. https://dataexplorer.azure.com/clusters/help/databases/SampleLogs
2. https://learn.microsoft.com/en-us/azure/data-explorer/kusto/query/datatypes-string-operators
3. https://learn.microsoft.com/en-us/azure/data-explorer/kusto/query/re2
4. https://learn.microsoft.com/en-us/azure/data-explorer/kusto/query/join-operator?pivots=azuredataexplorer
5. https://learn.microsoft.com/en-us/azure/data-explorer/kusto/query/datetime-part-function