Tuesday, May 5, 2015

Optimizer statistics vs stable plans

The topic was raised by Arup Nanda.
Well, statistics collection is tricky. I'll share my ten years of OLTP-mostly experience on the subject and provide recommendations.

I focus on the application development part here, not the "database support" part. In other words, I provide recommendations that help application developers avoid performance problems in production.

There are definitely cases when a DBA has to deal with given software as is while trying to make bad queries run faster. That is a somewhat different story, and it is out of scope for the current post.
I think that if an application follows the approach listed below, it will be easier to support.

I would prefer to fold the post into a single tweet, but I do not have enough time. Hope you enjoy the read.

Do we need plan changes?

To make a long story short: "one should typically want stable execution plans that are repeatable across DEV/QA/UAT/PROD environments". The reasoning is below.

Optimizer statistics, as well as database parameters, baselines, SQL patches, etc., influence the query optimizer when it produces execution plans.

One man says: "collect system and table stats often to get good execution plans".
This is a good and a bad piece of advice at the same time.

Suppose you are developing a new database query. You know what data is supposed to be in the tables (number of rows, number of distinct values, etc.). You know how many rows the query is supposed to return. You know your target hardware, so you know the optimal execution plan for the query. In the end, you do stress volume testing, don't you?

So, you have just created the optimal execution plan for this particular query by applying proper indices and hints. Here goes the question: do you think stats gathering/optimizer tuning would improve the query?
How can that be the case, provided the query has already been tuned?
The best stats gathering could do is keep the plan intact. What is the purpose of the gathering if we know that at best it would keep execution plans intact and at worst destroy the performance? I think there is not much purpose.

Do you like performance issues that are reproducible "only in production"?
I do not.

A rough approximation of the optimizer is a function: QUERY_PLAN = OPTIMIZER(sql_fulltext, table_statistics, system_statistics, optimizer_parameters, hardware_configuration, ...).
In other words, given the same statistics and the same parameters, the optimizer will come up with the same plan for a given query. That is good. All we need in order to get the same execution plan is to use production statistics, production parameters, etc.

Corner-cases

There are lots of corner cases. Please feel free to add a comment if I have missed something.

A. "I do not have a production data dump, so a plan from my dev server would not be relevant for the production"

  1. You can obtain a dump with all sensitive information replaced with ***. That works well for testing query duration, different indices, etc.
  2. You are still better off having a stress volume testing (SVT) environment to validate solution performance. Suppose the query ran fast in SVT. Why would you want it to have an execution plan that is different from production?
  3. SQL is declarative in theory only. In practice it is a leaky abstraction. You have to know the business logic, ensure the plan is good (as in "how a sensible person would want the query to be executed"), and you need to ensure the query plan is stable. Even for a well-modeled database schema, Oracle cannot collect statistics for a 2-table join. So you had better add hints to ensure the query plan uses the indices appropriate for the task.

B. "My development server does not have enough capacity to hold production dump. My statistics and plans would be off anyway"

  1. Even if you cannot afford a full dump, it is very easy to afford "production statistics" in all environments.
    dbms_stats.export_schema_stats and dbms_stats.import_schema_stats allow you to export statistics from one server and apply them to another one (see the sketch after this list).
    It is somewhat fine to have a tiny dev server while the statistics tell Oracle that the tables are huge. At least the plans will be much closer to the production ones.
  2. Gathering full statistics on a production system is both time and resource consuming. You cannot safely use sampling, otherwise you might miss common values, and you cannot gather statistics often with a 100% estimate since scanning all the tables and all the indices requires resources.
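
Here is a minimal sketch of shipping production statistics to another environment. The APP schema and the APP_STATS staging table are placeholder names.

  -- On the source (production) database: stage the statistics in a regular table.
  begin
    dbms_stats.create_stat_table(ownname => 'APP', stattab => 'APP_STATS');
    dbms_stats.export_schema_stats(ownname => 'APP', stattab => 'APP_STATS');
  end;
  /
  -- Move APP.APP_STATS to the target database (e.g. via Data Pump), then apply it there:
  begin
    dbms_stats.import_schema_stats(ownname => 'APP', stattab => 'APP_STATS');
  end;
  /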

C. "I do not have production server yet. It's just start of the project"

  1. Consider generating some data so you can test query performance (see the sketch after this list).
  2. If you are replacing an existing system, consider using the migrated data as the foundation of your development environment.
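
A minimal sketch of generating production-like volume and skew; the ORDERS table, its columns, and the numbers are made up, so adjust them to the data you expect.

  create table orders (
    order_id    number primary key,
    customer_id number not null,
    status      varchar2(10) not null,
    created_at  date not null
  );

  insert /*+ append */ into orders (order_id, customer_id, status, created_at)
  select level,                                                        -- surrogate key
         mod(level, 100000) + 1,                                       -- ~100k distinct customers
         case when mod(level, 50) = 0 then 'OPEN' else 'CLOSED' end,   -- skewed status column
         sysdate - dbms_random.value(0, 365)                           -- spread over the last year
    from dual
  connect by level <= 1000000;                                         -- target volume

  commit;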

D. "Wait, you just said I should never collect statistics. Should I live with all defaults?"

  1. You definitely do not want to live without statistics. That would cause dynamic sampling, excessive load, and unpredictable query plans.
    Consider gathering statistics at the C.1/C.2 stage and incorporating the stats into your import_schema_stats script (a sketch follows below).
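
A sketch of that gathering step, assuming the representative data set from C.1/C.2 lives in a placeholder APP schema; the result can then be staged and shipped exactly as in B.1.

  begin
    dbms_stats.gather_schema_stats(
      ownname          => 'APP',
      estimate_percent => 100,    -- full scan; we gather only once, on representative data
      cascade          => true);  -- include index statistics
  end;
  /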

E. "My table is created via dynamic SQL, and/or it is temporary, so I cannot use fixed statistics for it"

  1. Consider importing the desired statistics right after the table is created (see the sketch below). Remember that the whole point is to have repeatable plans that were battle-tested in development and load testing.
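
A sketch, assuming a hypothetical TMP_ORDERS table created on the fly; the figures are whatever volume the query was tuned against, not real numbers.

  begin
    execute immediate '
      create global temporary table tmp_orders (
        order_id number, customer_id number, created_at date
      ) on commit preserve rows';

    -- Pin the statistics the plan was tuned against instead of relying on dynamic sampling.
    dbms_stats.set_table_stats(
      ownname => user,
      tabname => 'TMP_ORDERS',
      numrows => 500000,
      numblks => 8000,
      avgrlen => 120);
  end;
  /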

F. "The optimal plan for my query varies depending on bind value. Let's optimizer figure out the plan"

  1. It does not work. Well, if you do plug-and-pray kind of development, the optimizer could figure out the plan for you; however, I would prefer fewer "unpredictable" issues.
  2. If your query indeed requires multiple plans, then it should be multiple different queries (see the sketch after this list). The idea is: a single query has a single "good" execution plan.
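
An illustration with made-up table, column, and index names: two hinted statements instead of one, so the application picks the statement that matches the data pattern.

  -- Selective status values: indexed access.
  select /*+ index(o orders_status_idx) */ o.*
    from orders o
   where o.status = :rare_status;

  -- Popular status values: a full scan is the plan we actually want.
  select /*+ full(o) */ o.*
    from orders o
   where o.status = :popular_status;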

G. "I do not want to hint each and every table and index. It's the job for query optimizer"

  1. The optimizer cannot understand business logic (see A.3), so it cannot always join tables properly. If you are lucky, it will create a good plan.
  2. Suppose you have a 20-table join query. The optimizer will never consider all the possible 20! permutations. Add hints to prune the search space.
  3. Even if the optimizer generates a proper plan, pruning the search space by applying hints (e.g. no_merge, etc.) might help since it speeds up plan generation.
  4. Use sensible statistics as a base for the optimizer's decisions (so you get sensible plans). Hint the table access order, the key indices to use, etc. to ensure the plan will be good in production (see the sketch after this list).
    Hints not only help the optimizer, they also help on subsequent modifications of the query: it is easier for the next engineer to follow the desired plan.
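
A sketch of G.4 in practice, pinning the join order and access paths; the tables, columns, and index names are made up.

  select /*+ leading(o c) use_nl(c)
             index(o orders_created_idx) index(c customers_pk) */
         o.order_id, c.customer_name
    from orders o
    join customers c on c.customer_id = o.customer_id
   where o.created_at >= sysdate - 1;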

H. "My Oracle 27Y has super-duper-advanced query feedback. Why don't we just trust it?"

  1. Currently (as of Oracle 12c) it takes at least a couple of bad executions to trigger the "adaptive logic". In other words, your clients would experience those bad executions. It is better to stick with "good" plans from the start.
  2. I have faced multiple cases when the "clever" 11g optimizer tried to "tune" a multi-hour job query (one that had already been tuned by trained professionals) and it resulted in a never-ending query. Consider disabling optimizer_adaptive_features (12c) or setting _optimizer_use_feedback=false (11g); see the sketch after this list.
  3. Not to mention "adaptive tuning of DML queries". Do you trust your database to re-execute each and every query multiple times? Of course it should not try committing the updates during the "query tuning" phase, but, you know, bugs are everywhere.
  4. "Automatic tuning" requires non-trivial resources, so be prepared to have unlimited capacity (which costs you $$$ in terms of database license).
  5. "Automatic tuning" cannot handle dynamic queries. In other words, when queries differ just a little bit, the optimizer has to build the plan from scratch. It is much better to include proper hints in the query generator.
  6. Even Oracle 12c cannot reverse the direction of a connect-by query. Connect by has been around since very early Oracle releases, so do not expect magic from the optimizer.
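
A sketch of the settings mentioned in H.2; test this on your exact version, and remember that changing hidden (underscore) parameters is best discussed with Oracle Support first.

  -- 12c (12.1): umbrella parameter for adaptive plans and statistics feedback.
  alter system set optimizer_adaptive_features = false scope = both;

  -- 11g: cardinality feedback is controlled by a hidden parameter.
  alter system set "_optimizer_use_feedback" = false scope = both;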

I. "My production box is much more powerful than a development one. I should definitely gather workload stats in PROD"

  1. Remember: the key point of having statistics is to have reasonable execution plans. It does not make sense to collect statistics for the sole purpose of statistics gathering.
  2. If the PROD box is more capable, that is good. It will probably execute the same execution plan faster. On the contrary, it would be bad if your well-tuned, index-using query goes mad with full scans in production just because the "workload stats" are different.


J. "I've just imported additional 20-30% of data to my production system. I should gather statistics to reflect that"

  1. You should not. See I.1.
  2. If you want stable plans, consider not gathering the statistics.
  3. If the import touches brand-new tables that were not previously used, consider using fixed statistics for them like in C.1, C.2.

Pitfalls


P1. There are settings that you would miss in development anyway

For instance, db_cache_size influences query plans. That is bad, since it is something you cannot avoid unless you have an exact copy of the production hardware for development and testing.
In my experience, there were just a couple of cases when this parameter triggered a plan change, so just try to make your load test and UAT environments as close to PROD as they can be.


P2. Not all the features are available in development

For instance, the parallel query feature is available in Enterprise Edition only. It could cost a fortune to run Enterprise Edition on all the development and test servers.
Here you just assume that you test parallel queries in SVT/UAT, and make sure you do not get unexpected parallel queries (e.g. by explicitly setting the degree of parallelism to 1 for all your tables and indices, as in the sketch below).
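
For example (ORDERS and ORDERS_PK are placeholder names; repeat for every table and index that could otherwise inherit an unexpected degree of parallelism):

  alter table orders noparallel;
  alter index orders_pk noparallel;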

P3. Oracle tries to gather user statistics on a scheduled basis

Starting from 10g, there is a job that gathers statistics on a scheduled basis. It is a nightmare since it often results in unpredictable performance degradation. Make sure you fix your stats (e.g. lock the stats or reconfigure the job, as in the sketch below).
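
Two sketches, assuming a placeholder APP schema; the second one switches off the 11g+ auto task (in 10g the equivalent was the GATHER_STATS_JOB scheduler job).

  -- Option 1: lock the statistics so the nightly job cannot overwrite them.
  begin
    dbms_stats.lock_schema_stats(ownname => 'APP');
  end;
  /

  -- Option 2 (11g+): disable the automatic optimizer statistics collection task.
  begin
    dbms_auto_task_admin.disable(
      client_name => 'auto optimizer stats collection',
      operation   => null,
      window_name => null);
  end;
  /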


P4. Statistics include min and max values

Table statistics (the column part of them) include the minimum and maximum value for each column. This might lead to unexpected results, since Oracle 10g+ uses these values to estimate the selectivity of a predicate (see more on this in Cost-Based Oracle Fundamentals by J. Lewis).
Suppose you have a column that increases over time (e.g. a timestamp, an id column, etc.). If you freeze the max value in the statistics at a certain point in time, queries that use values falling outside of the min-max range will get insane plans. The database thinks: "well, I know there cannot be values there". This happens even for bind variables, through bind peeking.
To mitigate the risk, consider manually setting the min-max values to "infinity", so your max value never expires (see the sketch below).
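
A sketch of widening the high value, with ORDERS.ORDER_ID standing in for an ever-growing key column and 1e15 as an arbitrary "never reached" value; re-check the recipe on your version.

  declare
    l_distcnt number;
    l_density number;
    l_nullcnt number;
    l_avgclen number;
    l_srec    dbms_stats.statrec;
    l_vals    dbms_stats.numarray;
  begin
    -- Start from the column statistics that are already in place.
    dbms_stats.get_column_stats(
      ownname => user, tabname => 'ORDERS', colname => 'ORDER_ID',
      distcnt => l_distcnt, density => l_density, nullcnt => l_nullcnt,
      srec => l_srec, avgclen => l_avgclen);

    -- Replace the low/high endpoints with 0 and a value the application will never reach.
    l_vals        := dbms_stats.numarray(0, 1e15);
    l_srec.epc    := 2;
    l_srec.bkvals := null;
    dbms_stats.prepare_column_values(l_srec, l_vals);

    dbms_stats.set_column_stats(
      ownname => user, tabname => 'ORDERS', colname => 'ORDER_ID',
      distcnt => l_distcnt, density => l_density, nullcnt => l_nullcnt,
      srec => l_srec, avgclen => l_avgclen);
  end;
  /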


Are fixed statistics a silver bullet?

Of course, they are not. I'm sorry, I do not have much experience in the "when fixed statistics do not help" area, but I will try to fill some gaps here:
  1. Ad-hoc, single-shot queries. When the queries are ad-hoc, you want the database to understand the costs of a table full scan vs random access, etc. Having relevant statistics would probably help here. However, a hardware upgrade might be more sensible than parameter tweaking. For instance, upgrading to newer CPU SKUs, adding more RAM, or using storage with lower response times (e.g. SSD for data and temp files) is free in terms of database license.
  2. You might need to tune a query without access to the development team. For instance, when you tune a third-party application that cannot add hints for some known reason.
    In this case, dbms_advanced_rewrite (10g+), outlines, baselines, SQL patches, table statistics, and system statistics can help you get the desired plan while keeping the application intact (see the sketch after this list).
  3. When the PROD hardware is drastically different from your dev environment, it might be the case that instance parameters are a good way to keep plans sane. For instance, "automatic degree of parallelism" is one of the parameters that makes sense to set on an instance-by-instance basis.
    In other words, in development you determine that the query should use a parallel plan, and you let the degree of parallelism be configurable.
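
A sketch of the dbms_advanced_rewrite option from item 2; the statements and the index name are illustrative, and the rewrite typically requires query_rewrite_enabled=true and query_rewrite_integrity=trusted to kick in.

  begin
    sys.dbms_advanced_rewrite.declare_rewrite_equivalence(
      name             => 'FIX_VENDOR_QUERY',  -- arbitrary equivalence name
      source_stmt      => 'select * from orders where status = ''OPEN''',
      destination_stmt => 'select /*+ index(orders orders_status_idx) */ * ' ||
                          'from orders where status = ''OPEN''',
      validate         => false);              -- skip the result-equivalence check in this sketch
  end;
  /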

What can we do about it?

I believe the best practices are:
  1. Use the same OS, DB version, and parameters as in production whenever possible for your development.
  2. Create fixed statistics and include them in your delivery, so query plans are repeatable.
  3. Do not gather workload statistics.
  4. Gather dictionary/fixed object (X$... kind of objects) statistics.
  5. Use production-like volume for development and testing.
  6. Use hints when there are at least two non-insane execution plans (G.2).
  7. Avoid /*+ RULE */ and /*+ ORDERED */. Think of hints as "business-level statistics for the optimizer".
  8. Disable adaptive optimizer features (H.2).
Change log:
2015/05/07 -- B2 added (statistics are time and resource consuming to gather)
2015/05/06 -- G2, G3 added. J added (additional data import vs statistics). P4 added (min-max values).

9 comments:

  1. Hi,

    Very good points here. I don't like changing plans (which can bring down prod on occasion and make it impossible to test improvements), but I don't like hints either (no error/warning when they are not used; they must be documented and re-evaluated when upgrading Oracle or the data model).

    Both from theory and from experience, I think that plans change when there is no obvious optimal plan for the optimizer. With good data modeling and query tuning from dev (rare these days), doing large I/O calls a few milliseconds faster, or having a few additional rows coming from a predicate filter, should not bring the optimizer to an orthogonal decision.

    Regards,
    Franck.

    Replies
    1. Franck, do you consider user_tables, user_constraints, etc. to be "good data modelling"?

      I assume you have faced performance issues when joining dictionary views (even with all the statistics collected). For a human being, there is an obvious single good plan, while the optimizer often fails to push a predicate here and there.

  2. It sounds great, but only if you have just a few different well-designed systems. When you are alone and there are several absolutely different, constantly changing systems (siebel, documentum, oebs, informatica, several different sap systems, different card and banking systems, etc.), it is not so easy.

    Don't forget that data grows exponentially and can be skewed differently at different moments, and many old hinted queries become suboptimal at a certain point. Sometimes even partitioning schemes may become unsuitable. Btw, if you use interval partitioning or have a time-based partition splitting process, even copying statistics becomes not as trivial as it was at the start.

    Strictly speaking, your points are very similar to the points of RBO or ORM (CRUD) fans.

    Replies
    1. >It sounds great, but only if you have just a few different well-designed systems.

      I meant to list not the best practices of managing 3rd-party software, but the best practices of developing the software in the first place. I've added that to the beginning.

      > When you are alone and there are several absolutely different constantly changing systems

      That's true. Do you think management of the zoo would become easier if the animals followed the approach?

      > Don't forget that data grows exponentially and can be skewed differently at different moments, and many old hinted queries become suboptimal at a certain point

      A.2 partially applies here. The queries should be tested with a proper data volume in the first place.

    2. > Strictly speaking, your points are very similar to the points of RBO or ORM (CRUD) fans

      Do you know a successfully used CBO planner that has no hints?

      For instance: a CPU has a prefetch instruction, which is a hint. JIT compilers often have "force inline" hints. Good database optimizers have hints.

    3. Vladimir, there is a big difference between the development stage and long-term usage. Tons of code doesn't change, although the data constantly does. Your plan from the beginning may become suboptimal several years/Oracle versions later. And we cannot always correctly guess the "proper data volume" for the future, especially if your software will be used by many different clients with different needs and different hardware.

      Even hints may become invalid or deprecated.

      > Do you think management of the zoo would become easier if the animals followed the approach?
      Just imagine: previously you had two tigers, three lions, and one panda, and ten years later you have just one lion and ten pandas. Would you buy the same food and the same amount? Don't we need to count the animals?

      >Do you know a successfully used CBO planner that has no hints?
      I don't understand what you mean. How does the presence of hints lead to including hints in all queries or to disallowing statistics gathering?

    4. Let me ask it the following way: have you seen cases when the same query got different (as in completely different) good (as in "if just a single plan were tightly hinted, the query would execute much longer") execution plans under different "clients/needs/hardware"?

      Are those cases frequent?
      Do you think it happens more frequently than "a non-hinted query goes mad"?

      From my experience, it is a very, very rare case when the optimizer can figure out a good plan under different conditions. Much more often I see a query go mad due to a missing hint and/or missing statistics.

      I do not include toy examples like "a single table with a single index" here. They are good for showing what the clustering factor is, but in practice I do not see cases when a 3-20 table join query uses a reasonable plan on its own consistently and predictably.

    5. > From my experience, it is a very, very rare case when the optimizer
      > can figure out a good plan under different conditions.
      > Much more often I see a query go mad due to a missing
      > hint and/or missing statistics.

      Have you checked percentage?
      For example,
      count(suboptimal_hinted) / count(hinted) from v$sql
      and
      count(suboptimal_not_hinted) / count(not_hinted) ?

      Also "missing statistics" sounds strange to me in this context. that was my key point, that statistics must be always actual and relevant, but not old fixed statistics from development time.

    6. >Have you checked percentage?

      Let's first agree that there are queries that have a decent plan anyway. For instance, a dumb "insert into table values" does not need plan hints.
      The same goes for other trivial queries.
      It is not something that would "get a better plan out of better statistics".

      When it comes to important queries, we aim to always hint.
      How else would you tune a query?
      Or do you mean all of the queries are tuned automatically "just" by using statistics?

      >count(suboptimal_hinted) / count(hinted) from v$sql

      Unfortunately there is no easy way to get "suboptimal" queries.

      I would say: "a query with more than 3-5 tables almost always needs hinting".
      In tuning, we almost never "remove hints completely".
      Cases when we have to re-tune queries are rare.
      There are cases when we split a query into several ones (each dedicated to a specific data pattern).

      Here's sample data:

      with sqls as (
        select sql_id, sql_fulltext
          from v$sql
         where command_type != 47 /* pl/sql block */
           and parsing_user_id != 0
      )
      select count(distinct sql_id) total_sqls from sqls
      union all
      select count(distinct sql_id) hinted_sqls from sqls where sql_fulltext like '%*+%'

      This gives 28075 total, 15402 hinted among non-SYS, non-PL/SQL queries on a dev server.

      > statistics must be always actual and relevant

      See B2 -- it would take you a long while to recompute all the statistics.
      As you say, "relevant", thus estimate_percent => 100. Partitioning is a separate beast since it is costly, and incremental statistics management is a black art.

      > statistics must be always actual and relevant,

      There are _often_ cases when a query gets a bad plan due to stale/missing statistics.
      For instance:
      1) The application loads data and launches a query to process it. It either uses zero statistics or dynamic sampling.
      2) The "oracle re-gather job" collected statistics over application tables and ruined the plans. This happened quite often before we disabled the job.
      3) Some person gathers statistics. Even with estimate_percent=100, queries often go mad. The bad thing is that unless you fix your statistics, you never know which query will be next.

      All three cases are healed by fixing the statistics.

      I wonder how you would ensure the application gathers statistics right after data population and before any usage of the table.


      Did you estimate how many $$$ you spend on CPUs that "gather relevant statistics" for you?
