Tuesday, May 5, 2015

Optimizer statistics vs stable plans

The topic was raised by Arup Nanda:
Well, statistics collection is tricky. I'll share my ten years of mostly-OLTP experience on the subject and provide recommendations.

I focus on the application development part here, not the "database support" part. In other words, I provide recommendations for application developers on how to avoid performance problems in production.

There are definitely cases when a DBA has to deal with given software as-is while trying to make bad queries run faster. That is a somewhat different story, and it is out of scope of this post.
I do think, though, that an application that follows the approach listed here is easier to support.

I would have preferred to fold the post into a single tweet, but I do not have enough time. I hope you enjoy the reading.

Do we need plan changes?

To make a long story short: "one should typically want stable execution plans that are repeatable across DEV/QA/UAT/PROD environments". The reasoning is below.

Optimizer statistics, as well as database parameters, baselines, SQL patches, etc., influence the query optimizer when it produces execution plans.

A common piece of advice is: "collect system and table stats often to get good execution plans".
It is good and bad advice at the same time.

Suppose you are developing a new database query. You know what data is supposed to be in the tables (number of rows, number of distinct values, etc.). You know how many rows the query is supposed to return. You know your target hardware, so you know the optimal execution plan for the query. And you do stress volume testing at the end, don't you?

So, you have just created the optimal execution plan for that particular query by applying proper indices and hints. Here comes the question: do you think stats gathering/optimizer tuning would improve the query?
How could it, given the query has already been tuned?
The best stats gathering could do is keep the plan intact. What is the purpose of gathering if we know that, at best, it would keep execution plans intact and, at worst, destroy the performance? I think there is not much purpose.

Do you like performance issues that are reproducible "only in production"?
I do not.

A rough approximation of the optimizer is a function: QUERY_PLAN = OPTIMIZER(sql_fulltext, table_statistics, system_statistics, optimizer_parameters, hardware_configuration, ...).
In other words, given the same statistics and the same parameters, the optimizer comes up with the same plan for a given query. That is good. All we need to get the same execution plan is to use production statistics, production parameters, etc.

Corner-cases

There are lots of corner-cases. Please feel free to add a comment if I have missed something.

A. "I do not have a production data dump, so a plan from my dev server would not be relevant for the production"

  1. You can obtain a dump with all sensitive information replaced with ***. That works well for testing query duration, different indices, etc.
  2. You are still better off having a stress volume testing (SVT) environment to validate solution performance. Suppose the query ran fast in SVT. Why would you want it to have an execution plan in production that differs from the one you tested?
  3. SQL is declarative in theory only. In practice it is a leaky abstraction. You have to know the business logic, ensure the plan is good (as in "how a sensible person would want the query to be executed"), and ensure the query plan is stable. Even for a "well-modeled" database schema, Oracle cannot collect statistics that describe a 2-table join (there are no join cardinality statistics). So you had better add hints to ensure the query plan uses the indices appropriate for the task.

B. "My development server does not have enough capacity to hold a production dump. My statistics and plans would be off anyway"

  1. Even if you cannot afford a full dump, it is very easy to afford "production statistics" in all environments.
    dbms_stats.export_schema_stats and dbms_stats.import_schema_stats allow you to export statistics from one server and apply them to another one (see the sketch after this list).
    It is somewhat fine to have a tiny dev server while the statistics tell Oracle that the tables are huge. At least the plans would be much closer to the production ones.
  2. Gathering full statistics on a production system is both time- and resource-consuming. You cannot safely use sampling, otherwise you might miss common values; and you cannot gather statistics often with a 100% estimate, since that requires scanning all the tables and all the indices.
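
A minimal sketch of moving the statistics, assuming a hypothetical APP_OWNER schema and a STATS_TAB staging table (both names are made up):

  -- on the reference (production) server: stage the schema stats into a regular table
  begin
    dbms_stats.create_stat_table(ownname => 'APP_OWNER', stattab => 'STATS_TAB');
    dbms_stats.export_schema_stats(ownname => 'APP_OWNER', stattab => 'STATS_TAB');
  end;
  /
  -- move APP_OWNER.STATS_TAB to the target server (e.g. via Data Pump), then apply:
  begin
    dbms_stats.import_schema_stats(ownname => 'APP_OWNER', stattab => 'STATS_TAB');
  end;
  /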

C. "I do not have a production server yet. It's just the start of the project"

  1. Consider generating some data, so you can test query performance.
  2. If you are replacing an existing system, consider using the migrated data as the foundation of your development environment.

D. "Wait, you just said I should never collect statistics. Should I live with all defaults?"

  1. You definitely do not want to live without statistics. That would cause dynamic sampling, leading to excessive load and unpredictable query plans.
    Consider gathering statistics on the data from C.1/C.2 and incorporating the stats into your import_schema_stats script (see the sketch below).
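
For instance (a sketch, reusing the hypothetical APP_OWNER/STATS_TAB names from the B.1 sketch): gather once on the representative data, export the result, and ship it with the application:

  begin
    -- one-time gathering on the representative dataset from C.1/C.2
    dbms_stats.gather_schema_stats(ownname          => 'APP_OWNER',
                                   estimate_percent => 100,
                                   cascade          => true);
    -- stage the result so it can be imported in every environment
    dbms_stats.export_schema_stats(ownname => 'APP_OWNER', stattab => 'STATS_TAB');
  end;
  /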

E. "My table is created via dynamic SQL, and/or it is temporary, so I cannot use fixed statistics for it"

  1. Consider importing the desired statistics right after the table is created (see the sketch below). Remember that the whole point is to have repeatable plans that were battle-tested in development and load testing.
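
For example (a sketch with made-up numbers and a hypothetical TMP_ORDERS table), right after creating the table you can pin the statistics you validated in testing:

  begin
    dbms_stats.set_table_stats(ownname => user,
                               tabname => 'TMP_ORDERS',
                               numrows => 1000000,  -- the volume you tested against
                               numblks => 20000,
                               avgrlen => 120);
  end;
  /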

F. "The optimal plan for my query varies depending on the bind value. Let the optimizer figure out the plan"

  1. It does not work. Well, if you do plug-and-pray kind of development, the optimizer could figure out the plan for you; however, I would prefer fewer "unpredictable" issues.
  2. If your query indeed requires multiple plans, then it should be multiple different queries (see the sketch after this list). The idea is: a single query has a single "good" execution plan.
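
A sketch of the idea, with a hypothetical ORDERS table and a skewed STATUS column; the application picks the statement based on the bind value, and each statement carries its own "good" plan:

  -- rare status value: the index is the sane choice
  select /*+ index(o orders_status_idx) */ *
    from orders o
   where o.status = :rare_status;

  -- popular status value: a full scan is the sane choice
  select /*+ full(o) */ *
    from orders o
   where o.status = :popular_status;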

G. "I do not want to hint each and every table and index. That's the job of the query optimizer"

  1. The optimizer cannot understand business logic (see A.3), so it cannot always join tables properly. If you are lucky, it could create a good plan.
  2. Suppose you have a 20-table join query. The optimizer would never consider all the possible 20! join orders. Add hints to prune the search space.
  3. Even if the optimizer generates a proper plan, pruning the search space by applying hints (e.g. no_merge, etc.) might help, since it speeds up plan generation.
  4. Use sensible statistics as the base for the optimizer's decisions (so you get sensible plans). Hint the table access order, the key indices to use, etc., to ensure the plan stays good in production (see the sketch after this list).
    Hints not only help the optimizer, they also help on subsequent modifications of the query: it is easier for the next engineer to follow the desired plan.
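
A sketch of the kind of hinting I mean, with hypothetical CUSTOMERS/ORDERS tables; the hints spell out the join order and the key index, which both prunes the search space and documents the intended plan:

  select /*+ leading(c o) use_nl(o) index(o orders_customer_idx) */
         c.customer_id, o.order_id, o.amount
    from customers c
    join orders o on o.customer_id = c.customer_id
   where c.region = :region;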

H. "My Oracle 27Y has super-duper-advanced query feedback. Why don't we just trust it?"

  1. Currently (as of Oracle 12c) it takes at least a couple of bad executions to trigger the "adaptive logic". In other words, your clients would experience those bad executions. It is better to stick with "good" plans from the start.
  2. I have faced multiple cases when the "clever" 11g optimizer tried to "tune" a multi-hour job query (that had already been tuned by trained professionals), and it resulted in a never-ending query. Consider disabling optimizer_adaptive_features (12c) or setting _optimizer_use_feedback=false (11g); see the sketch after this list.
  3. Not to mention "adaptive tuning of DML queries". Do you trust your database to re-execute each and every query multiple times? Of course it should not try committing the updates during the "query tuning" phase, but, you know, bugs are everywhere.
  4. "Automatic tuning" requires non-trivial resources, so be prepared to have unlimited capacity (which costs you $$$ in terms of database licenses).
  5. "Automatic tuning" cannot handle dynamic queries. In other words, when queries differ just a bit, the optimizer has to build the plan from scratch. It is much better to include proper hints in the query generator.
  6. Even Oracle 12c cannot reverse the direction of a connect-by query. Connect by has been around since ancient versions, so do not expect magic from the optimizer.
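
The settings mentioned in H.2, as a sketch (underscore parameters are undocumented, so check with Oracle Support before touching them):

  -- 12c
  alter system set optimizer_adaptive_features = false;
  -- 11g (a hidden parameter, hence the quotes)
  alter system set "_optimizer_use_feedback" = false;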

I. "My production box is much more powerful than the development one. I should definitely gather workload stats in PROD"

  1. Remember: the key point of having statistics is to have reasonable execution plans. It makes no sense to collect statistics for the sole purpose of statistics gathering.
  2. If the PROD box is more capable, that is good: it would probably execute the same execution plan faster. On the contrary, it would be bad if your well-tuned, index-using query went mad with full scans in production just because the "workload stats" are different.


J. "I've just imported an additional 20-30% of data into my production system. I should gather statistics to reflect that"

  1. You should not; see I.1.
  2. If you want stable plans, consider not gathering the statistics.
  3. If the import touches brand-new tables that were not previously used, consider using fixed statistics for them, as in C.1/C.2.

Pitfalls

P1. There are settings that you would miss in development anyway

For instance, db_cache_size influences query plans. That is bad, since it is something you cannot avoid unless you have an exact copy of the production hardware for development and testing.
In my experience, there were just a couple of cases when this parameter triggered a plan change, so just try making your load-test and UAT environments as close to PROD as they can be.

P2. Not all the features are available in development

For instance, the parallel query feature is available in Enterprise Edition only, and it could cost a fortune to put Enterprise Edition on all the development and test servers.
Here you just assume that you test parallel queries in SVT/UAT, and make sure you do not get unexpected parallel queries (e.g. by explicitly setting the degree of parallelism to 1 for all your tables and indices; see the sketch below).
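
A sketch of forcing serial plans for a hypothetical ORDERS table, plus a generator for a whole schema:

  alter table orders noparallel;
  alter index orders_pk noparallel;

  -- or generate the statements for every table you own:
  select 'alter table ' || table_name || ' noparallel;'
    from user_tables;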

P3. Oracle tries to gather user statistics on a scheduled basis

Starting with 10g, there is a job that gathers statistics on a schedule. It is a nightmare, since it often results in unpredictable performance degradation. Make sure you fix your stats (e.g. lock the stats and/or reconfigure the job; see the sketch below).
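
A sketch of both options, reusing the hypothetical APP_OWNER schema name:

  -- lock the statistics you imported, so the job cannot overwrite them
  exec dbms_stats.lock_schema_stats('APP_OWNER');

  -- 10g: disable the scheduled job itself
  exec dbms_scheduler.disable('SYS.GATHER_STATS_JOB');

  -- 11g+: disable the corresponding autotask
  begin
    dbms_auto_task_admin.disable(
      client_name => 'auto optimizer stats collection',
      operation   => null,
      window_name => null);
  end;
  /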


P4. Statistics do include min and max values

Table statistics (the column part of them) include the minimum and maximum value for each column. This might lead to unexpected results, since Oracle 10g+ uses these values to estimate the selectivity of predicates (see more on this in Cost-Based Oracle Fundamentals by J. Lewis).
Suppose you have a column that increases over time (e.g. a timestamp, an id column, etc.). If you freeze the max value in the statistics at a certain point in time, queries that use values falling outside of the min-max range would get insane plans. The database would think: "well, I know there could not be values there". This happens even with bind variables, through bind peeking.
To mitigate the risk, consider manually setting the min-max values to "infinity", so your max value never expires (see the sketch below).
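
A sketch of widening the min-max range for a hypothetical ORDERS.CREATED date column; the pattern follows the standard dbms_stats get/prepare/set sequence, and the "infinity" endpoints are up to you:

  declare
    l_distcnt number;
    l_density number;
    l_nullcnt number;
    l_avgclen number;
    l_srec    dbms_stats.statrec;
    l_dates   dbms_stats.datearray;
  begin
    -- read the current column statistics
    dbms_stats.get_column_stats(ownname => user, tabname => 'ORDERS', colname => 'CREATED',
                                distcnt => l_distcnt, density => l_density,
                                nullcnt => l_nullcnt, srec => l_srec, avgclen => l_avgclen);
    -- replace min/max with endpoints that will not expire any time soon
    l_dates := dbms_stats.datearray(date '2000-01-01', date '2100-01-01');
    l_srec.epc := 2;        -- exactly two endpoints: min and max
    l_srec.bkvals := null;  -- no histogram
    dbms_stats.prepare_column_values(l_srec, l_dates);
    -- write the modified statistics back
    dbms_stats.set_column_stats(ownname => user, tabname => 'ORDERS', colname => 'CREATED',
                                distcnt => l_distcnt, density => l_density,
                                nullcnt => l_nullcnt, srec => l_srec, avgclen => l_avgclen);
  end;
  /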


Are fixed statistics a silver bullet?

Of course they are not. I am sorry, I do not have much experience in the "when fixed statistics do not help" area, but I'll try to fill some gaps here:
  1. Ad-hoc, single-shot queries. When the queries are ad-hoc, you want the database to understand the costs of a full table scan vs. random access, etc. Having relevant statistics would probably help here. However, a hardware upgrade might be more sensible than parameter tweaking. For instance, upgrading to newer CPU SKUs, adding more RAM, or using storage with lower response times (e.g. SSD for data and temp files) is free in terms of database license.
  2. You might need to tune a query without access to the development team, for instance when you tune a third-party application that cannot add hints for some known reason.
    In this case, dbms_advanced_rewrite (10g+), outlines, baselines, SQL patches, table statistics, and system statistics can help you get the desired plan while keeping the application intact (see the sketch after this list).
  3. When the PROD hardware is drastically different from your dev environment, it might be the case that instance parameters are a good way to keep plans sane. For instance, "automatic degree of parallelism" is one of the parameters that make sense on an instance-by-instance basis.
    In other words, in development you determine that a query should use a parallel plan, and you let the degree of parallelism be configurable.
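
For instance, the dbms_advanced_rewrite option from item 2 looks roughly like this (a sketch with placeholder statement texts; it needs the EXECUTE privilege on the package and query_rewrite_integrity = trusted in the executing session):

  begin
    sys.dbms_advanced_rewrite.declare_rewrite_equivalence(
      name             => 'FIX_THIRD_PARTY_Q1',
      source_stmt      => 'select ... the exact text the application sends ...',
      destination_stmt => 'select /*+ leading(t1 t2) */ ... the same columns ...',
      validate         => false);
  end;
  /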

What can we do about it?

I believe the best practices are:
  1. Use the same OS, DB version, and parameters as in production for your development, whenever possible.
  2. Create fixed statistics and include them in your delivery, so query plans are repeatable.
  3. Do not gather workload statistics.
  4. Gather dictionary/fixed-object (X$... kind of objects) statistics.
  5. Use production-like volume for development and testing.
  6. Use hints when there are at least two non-insane execution plans (G.2).
  7. Avoid /*+ RULE */ and /*+ ORDERED */. Think of hints as "business-level statistics for the optimizer".
  8. Disable adaptive optimizer features (H.2).

Change log:
2015/05/07 -- B2 added (statistics gathering is time- and resource-consuming)
2015/05/06 -- G2, G3 added. J added (additional data import vs statistics). P4 added (min-max values).