Natural Born Killers (The series) – Day 1 Statistics

To those that don’t know me “Natural Born Killers (The Series)” sounds like a very strange choice for a SQL Server blog title and to be fair it is.

Those who do know me, will know that I have been presenting a session this year titled “Natural Born Killers, performance issues to avoid” around the UK. The “Series” element of the title is there because I have decided to serialise the session for the benefit of those who have been unable to see it thus far.

The plan is to release one topic from the presentation each day of SQL Relay 2013. Today is the first day of SQL Relay held in Glasgow, unfortunately I can’t make it there but I will be in Leeds tomorrow. If you have registered for the Birmingham, Norwich or Cardiff events you will be able to see it live, so you might want to read this after the event as a reminder of everything that was covered.

Before we start digging into the crux of the post I should point out the caveat that I am just looking at statistics from a fairly simplistic angle as the idea of the talk is to introduce concepts that if you don’t know about and do some further research on will kill the performance of the database on which your applications sit. In this post there is a further caveat that SQL Server 2014 has been announced and it’s rumoured that there are improvements to the way statistics work so the information posted here will need to be checked moving forward. It’s a sound practice to test everything you read anyway; 1000 monkeys, 1000 typewriters and all that.

Introducing Statistics

When it comes to creating a query plan it doesn’t matter how good the optimizer is, if you don’t have the right details about your data, you will never get a fantastic plan. Why is this?

The optimizer used by SQL Server is a cost based optimizer, this means it works out how much an operation might cost if it were to perform it against a number of similar operators for the same amount of data.

And how does it know how much data there is and how it’s distributed, or the cardinality of the data to give it its proper term? The answer is of course, Statistics

Thankfully by default statistics are automatically maintained by SQL Server. You can see in the screenshot that there are a couple of different options to help create and maintain them.

image

 

Creating Statistics

Setting Auto create statistics will create statistics on a single column or attribute when the optimizer feels that it is required. The key thing here is that if you required statistics across multiple attributes then you would either need to create the statistics for them manually, or if you created an index with those attributes the optimizer will create statistics for you, which is a very good reason for indexing foreign keys!

If you did want to create statistics manually then you can use the CREATE STATISTICS command or the stored procedure sp_createstats which will create a statistic for each eligible attribute in each user table in the database. Swings and roundabouts on this one really, the benefit of doing this would be that each attribute would have its own set of statistics and you may not have to wait for statistics to be created when the optimizer is compiling a plan. The downside is of course that SQL Server then has to maintain the statistics on that attribute which may never be used as a predicate, which could cause an overhead (albeit negligible).

 

Updating Statistics

As with creating statistics, there are two way of updating statistics. Manually and automatically.

Automatically updating statistics

Automatic updates don’t occur every time data is changed as this could be very costly. Instead there are set thresholds and the statistics are only updated when they reach one of these. The threshold levels are listed in the table below, what is interesting is that the threshold are slightly different for temporary tables (note I didn’t say table variables, more on that on day 7) and permanent tables.

Permanent Table Temporary table
1st record 1st record
6 records
LT 500 recs, 500 changes LT 500 recs, 500 changes
GT 500 recs, 500 changes + 20% GT 500 recs, 500 changes + 20%

If you are lucky enough to be running SQL Server 2008 R2 SP1 or above then there is a trace flag 2371 which will lower the threshold limits, you can read more about this here.

The Asynchronous option will allow programmable code to trigger the fact that statistics need to be updated, but will not stop that iteration of the code and make it wait for the statistics to be updated it will use the old stats instead.

Manually updating statistics

As with creating statistics manually there are again two options for updating your statistics manually, these are to run the stored procedure “Sp_updatestats” This runs update statistics against all user and internal tables in the current database where data has changed since the last time statistics were created/updated and “UPDATE STATISTICS”. For more information on specific parameters for these options please refer to books online.

 

Further resources:

Books Online: Statistics

SQLCAT Team: Using Filtered Statistics with Partitioned Tables.

Christian Bolton presenting “Vital Statistics” at SQLBits (1hr Video)

Controlling Autostat (AUTO_UPDATE_STATISTICS) behavior in SQL Server

Statistics on Ascending Columns

ToadWord: Updating Statistics

 

I hope you have found this a useful introduction and hope to see you at a community event soon.