SQL team Monin

SQL Bad Practices Caught in the Wild – Part 2

I’m Peter Kruis, SQL engineer at Monin and in this blog series, I’m highlighting some of the bad practices I frequently encounter when visiting our customers. In the previous post, I discussed the pitfalls of using SELECT * and why it may lead to problems. If you missed it, you can catch up here.

This time, I’m diving into another common troublemaker: User-Defined Scalar Functions. I’m going to discuss the following topics.

  • What are user defined scalar functions and how can we use them
  • What is the problem with using them
  • How can we fix those problems

A small introduction to User Defined Scalar Functions

When I was a starting professional, I worked as a .NET developer. I was taught the importance of using functions in software development. Functions make your code more maintainable, and they come with some benefits like:

  • Reusability: Functions can be used to encapsulate (complex) logic and calculations that are frequently used across multiple places in the code. This gives the possibility to reuse and simplification of code.
  • Consistency: By centralizing logic in a single function, you only have one place to update when requirements change, which helps keeping your code consistent.

When I switched over to more SQL Development, I was happy to see that MSSQL also provided a way to encapsulate logic in reusable functions, just as I had done in .NET. These functions allow us to streamline the queries by reusing the logic in our database.

With the use of SQL Server Management Studio, you can find Scalar Functions within a database here:

With the use of SQL Server Management Studio, you can find Scalar Functions within a database here

To have a good understanding of how a Scalar Function works, let’s take a look at an example functions named ‘dbo.IsPostOld’. Here’s the creation script of the function:

Script functions

This functions takes a ‘PostId’ as input and checks if the ‘CreationDate’ of the corresponding post is within the past year. When this is the case, it will return ‘0’ which indicates that the post is not old. Otherwise, it returns ‘1’, indicating that the post is old.

This seems like a nice and clean way to encapsulate our business logic, as we know, can be subject to change.

We can execute the function for a single item like this:

SELECT dbo.IsPostOld(123456) AS IsPostOld;

Which will return:

Screenshot

Or, we can include it in a SELECT statement:

Which will return:

Alternatively, it can be used in the WHERE statement, Computed Columns, Constraints and some more. (Please don’t do this 😉).

Okay, that all looks great, so what’s the deal?!

The issue is that while Scalar Functions seem clean and convenient, they can have a great impact on your query performance. Let’s dive into the problems.

Row-by-row execution

One of the biggest performance killers is that the Scalar Functions are evaluated row by row. This means that for every single row in your query, SQL Server must execute the function separately. While this might not seem like a big deal for small datasets, this can have a huge impact when the number of rows increases.

Let’s illustrate this with an example, when we use our ‘dbo.IsPostOld’ function in a query for a user with almost 10.000 comments, we can see that for each row in the result set, the function is being executed:

This is not what we would like to see. SQL Server us designed to work efficiently with sets of data, but this is being executed row-by-row, which is much less performant.

When we look at the execution statistics, we can see that the query runs in about 1.3 seconds to execute. This might seem not a big of an issue, but when this query is running a lot, a small difference can already be a good improvement. (also, compared to what it can be, 1.3 seconds is a lot).

Prevention of parallelism

SQL Server can use parallelism to divide a query’s workload across multiple processors when it needs to do a lot of work. This will speed up execution for complex or heavy queries. But here’s the catch: Scalar Functions disable parallelism in your query.

For example, let’s count all the comments in the dbo.Comments table. This query runs in parallel:

The query executes in about 2 seconds, and we can see the parallelism sign across the operators.

Now, let’s modify the query to use the scalar function, even for just a single hard-coded value:

This query takes more than 9 seconds to execute. Looking at the plan afterwards, we can see that there is no parallelism at all!

When we open the properties of the ‘SELECT’ operator, we see that parallelism is disabled due to the use of a scalar function. And it get’s worse; if the scalar function is used as a computed column in your table, then all related queries will be forced to execute serially as well.

Since SQL Server 2019 there’s been some improvements on the performance with scalar functions due to the addition of Scalar Function Inlining. If your database is in the correct compability mode (150 or higher), SQL Server may inline the function, threating it as a part of the main query and therefor removing the downsides of the scalar function. However, this functionality is limited. For example, in our case, the use of GETDATE() inside the function prevents inlining. You can read more about the requirements here.

Other issues

Scalar Functions also hide information, such as reads and other metrics, making it harder to pinpoint performance problems. For example, in our initial query, we can see information about the reads on the Comments table, but the reads from the Posts table are hidden. SQL Server still retrieves the data from Posts, but this activity is masked by the Scalar Function, making trouble shooting more difficult.

So how can we fix this problem?

There are 2 possible solutions.

  1. Upgrade to SQL Server 2019 or later
  2. Rewrite the query to avoid using scalar functions

 

Upgrading to SQL Server 2019 or newer can help in some cases, thanks to Scalar Function Inlining. Let’s try this:

  1. First we set the compatibility mode for our database.
  2. Run the same query again

Unfortunately, there is no change in performance here. Why? Because the use of GETDATE() in our function, it is prevented from being inlined.

The most effective solution is to rewrite the query and avoid Scalar Functions altogether. This might seem inefficient at first, but the performance benefits are most of the times worth it.

Let’s replace the Scalar Function with its logic, embedded directly into the query. Here’s how:

This query is now set-based and allows SQL Server to optimize it properly. When it is being executed, the query runs 10 times faster than the original one using the Scalar Function.

 

Key takeaways

 

  • Scalar Functions might seem like a great help, but they are often performance killers due to row-by-row execution and the prevention of parallelism.
  • SQL Server 2019 and later introduces Scalar Function Inlining, but it has limitations and doesn’t always help.
  • The best way to improve performance is to avoid Scalar Functions altogether and rewrite your queries to be set-based.

With this approach, you can significantly improve query performance and reduce the resource burden on your database.

You can have some quick wins with upgrading to SQL Server 2019, so you might benefit from Scalar Function Inlining, or go for the best (but more time consuming) solution, and rewrite the queries suffering from scalar functions. In both cases, we are happy to help, whether it is on upgrading your SQL Server environment, or tracking and rewriting your queries. When you need to upgrade, you can start with our other whitepaper.