Earlier this week I wrote about a perplexing problem I was having where identical servers were producing different execution plans for the same ad-hoc query. I got a lot of great feedback, and Wes Brown (Blog | Twitter) even contacted Conor Cunningham (Blog), query optimizer team lead, for me. But it was this suggestion from Paul Randal (Blog | Twitter) and an IM from Jonathan Kehayias (Blog | Twitter) that helped me uncover what was going on:
Quick Recap
Recall that my problem started out like this: A simple ad-hoc query executed against 9 servers with (more or less) identical hardware, identical schema, identical data, and identical index statistics produced different execution plans. I looked at a variety of things to try and figure it out: SET options on the connection, collations, compatibility levels, server configuration options, build versions, and index statistics. Ultimately everyone who looked at it figured that the root cause of the problem was probably related to this: In Plan "A" the estimated number of rows for an index seek differed from Plan "B", and that in turn lead the query optimizer to choose an index seek + key lookup vs. an index scan on a different table later in the execution plan.
Doesn't That Look Like A Problem With Statistics?
Yes, you'd think that…except that I updated statistics for the index in question (well, for every index on the table actually) with FULLSCAN specified. That should have taken care of the problem, right? Not exactly. One thing I didn't think about was that AUTO_CREATE_STATISTICS was set to ON for every database (read more about this option here). With this option on the query optimizer will create statistics on individual columns, as necessary, to improve estimates for query plans. You can see which columns have had statistics created for them by using sp_helpstats (BOL entry here).
A Eureka Moment
A check against the 9 servers revealed that the 4 servers currently with plan "B" had statistics auto created for the column used in the query's join and the 5 servers with plan "A" did not. I picked one of the servers using plan "B", dropped the auto created statistics, and…the query optimizer was now picking plan "A"! I did the same against every server using plan "B", and every server switched to using plan "A". Ahh, I love repeatability! (Note that dropping auto created statistics isn't something you want to casually do – the query optimizer created them for a reason, after all)
So The Answer Is…
In the end, it had nothing to do with build versions, slight variations in CPU models, fragmentation, SET options, server options, or any of the multitude of things I looked at…it was the auto created column statistics that affected which plan the query optimizer choose. The takeaway? Add that as one of the things to check when a query isn't behaving like you think it should.
Thanks again to Paul Randal and Jonathan Kehayias for pointing me in the right direction. I can once again sleep at night now knowing that there's a reasonable explanation for what was happening.
8 comments
So then why did some servers create the stats on that column, while others didn't? Doesn't tracking this down become your next issue?
Great sleuthing, great post. Thanks for sharing.
Nice find Kendall!
Nice post. Glad to hear you came to some explanation. Curious, did the comment on your original post about the query optimizer terminating early and not completing its optimization end up having anything to do with it?
Good follow up, though seems like still a piece or two of mystery remaining. And didn't Jack and I ask if the stats match??!
Well, I've tried to reproduce this numerous times, and get very inconsistent results. It's not difficult to get the optimiser to choose a non-optimal plan with both sets of stats in place, but dropping the column stats doesn't always make it revert to the good plan. Also, running UPDATE STATISTICS almost always seems to resolve the "baad-plan" issue too.
Just once, I saw behaviour that seemed to mirror what you describe in your article, but I've only managed it once, even when running a single test script for repeatability and consistency.
Furthermore, I see very inconsistent results when trying to get the column stats to update automatically. Even following large table updates (5,000 rows - 50% of the table), the auto stats remain unchanged
Interesting thoughhts
Post a Comment