next up previous
Next: About this document

Multi-Query Optimization

S. Sudarshan
Computer Science and Engineering Dept
I.I.T. Bombay
sudarsha@cse.iitb.ernet.in
http://www.cse.iitb.ernet.in/~sudarsha

Joint work with: Prasan Roy, S. Seshadri, Siddhesh Bhobhe

Overview

  • Database Query Languages and Query Optimization
  • Issues in common subexpressions
  • DAG representation of alternative query evaluation plans
  • Simple heuristics
  • Greedy heuristic and optimizations
  • Performance study

Relational Databases

  • Data represented in the form of tables
  • Employee data: relation emp

    1
    emp-name dept-name age sal
    Sudarshan CSE 33 100,000
    Phatak SIT 50 400,000
    Prasan CSE 25 108,000
    Arvind SOM 24 180,000

  • Department data: relation dept

    1
    dept-name dept-addr head
    CSE CSE Bldg, IIT B Dhamdhere
    SIT Math Bldg, IIT B Phatak
    SOM SOM Bldg, IIT B Korgaonkar

Database Query Languages

  • Database query languages are declarative
  • Philosophy: say what you want, not how to get it
  • Example: SQL
       Q1:  select emp.name, dept.addr
            from   emp, dept
            where  emp.dept-name = dept.dept-name
    
       Q2:  select dept-name, avg(salary)
            from   emp
            where  age < 35
            group by dept-name

Query Optimization

  • Relational algebra: an algebra on relations
    select: ()
    join: ()
    project: ()
    aggregation: ()
  • User level languages translated to relational algebra
  • Multiple expressions for same result (similar to (a+b)*c = a*b+a*c)
  • E.g.
  • Different expressions have different costs
  • Query optimizer responsible for finding best expression

Why Multi-Query Optimization

  • To tackle complex decision support queries
    • Sequence (batch) of related queries
    • Queries with common subexpressions
      • Esp. with use of views and WITH clause
      • Multi-output operators
        E.g. Engage.Fusion
  • Data warehouses, OLAP, ...

Multi-Query Optimization

  • Given: query, or a batch of queries
  • Goal: exploit common subexpressions
  • Problem: many alternative query plans with different costs
  • E.g.:
    • and - no common subexpressions
    • Rewrite : - now tex2html_wrap_inline582 common subexpression
    • Can compute tex2html_wrap_inline584 , store (materialize) and reuse
  • Problem: expression form with CSE may or may not be cheaper!
  • Problem: lots of alternative plans tex2html_wrap_inline586 expensive

Steps in MQO

  • Recognize possibilities of sharing
    • Extend Volcano's Query-DAG generation algorithm
  • Find best plan taking sharing into account
    • Local best plans may not be globally optimal
    • Exhaustive algorithms are doubly exponential
    • Need good heuristics

Related Work

  • Plain Query Optimization
    • System R: bottom-up
    • Volcano: top-down, extensible
  • Early MQO Papers
    Sellis et al. [TODS88], Park and Segev [ICDE88], Cosar et al. [CIKM93]
    • Not integrated with basic optimizer tightly enough, inefficient
  • Common Subexpression Elimination
    Shivakumar and Subramaniam [SIGMOD98]
  • Invariants in Nested Queries
    Rao and Ross [SIGMOD98]

Related Work (contd.)

  • Materialized View/Index Selection
    Roussopoulus [TODS82], Ross et al. [SIGMOD96], Gupta [ICDT97], Labio et al. [ICDE97], Yang, Karlapalem and Li [VLDB97]
    • Needs to handle updates
    • MQO is a subproblem
  • Optimization in Presence of Materialized Views
    Chaudhari et al. [ICDE95]

AND-OR DAG Representation

tex2html_wrap666

  • tex2html_wrap_inline588 : equivalence nodes
  • tex2html_wrap_inline590 : operation nodes

DAG Generation

  • Apply transformations
    • E.g:
    • Result of a transformation on E is equivalent to E.
      Added under the same equivalence node
    • New equivalence nodes may be created
    • Hashing based scheme for detecting duplicate derivations used in Volcano (a.k.a hash consing)
    • We use above for detecting common subexpressions
  • Physical DAG models physical properties e.g. sort order

Detecting Repeated/Common Subexpressions

  • Hashing scheme of Volcano
    • gives ids to equiv. nodes
    • hash function on (op, ids-of-inputs) help locate duplicate operations
    • detects identical expressions / (duplicate derivations)
    • transformations create new equiv nodes iff expression created is new
    • Equiv. node parents of identical expressions different tex2html_wrap_inline586 common subexpression
  • Additional unification step merges equivalence classes

Expression Subsumption

  • Given: and
  • Given and
  • Bottom-up traversal of the DAG adds new subsumption derivations (and new equivalence nodes)

Volcano Search Algorithm

  • Finds local best plan recursively, ignoring sharing possibilities
  • Uses costlimit for a directed search (branch & bound)
  • Cost of operation node = local cost + sum of input costs
  • Cost of equivalence node = min cost of children
  • Best plan of equivalence node:
    lowest cost child, plus plans for its inputs

Volcano-SH

  • Run plain Volcano on queries to find local best plans
  • Find common subexpressions amongst local best plans
  • Choose to materialize and share a common subexpression if:

  • Note: recomp-cost changes depending on materialization of descendants
  • May choose to materialize and share only part of a common subexpression
  • Number of uses depends on which shared parents are materialized

Volcano-RU

  • Given
  • When optimizing
    • Find best local plan, with special case for equivalence nodes that are in best local plans for :
      • set cost of node to min ( )
  • Note: Once a node is chosen to be materialised, mat-cost is not added for further sharing.
  • After optimizing all s, run Volcano-SH to find and share further common subexpressions (within queries)

The Greedy Heuristic

  1. Compute DAG and best plans for each node using Volcano
  2. Find benefit of materializing each unmaterialized node
    1. mark a node n as materialized
    2. set cost of node n = tex2html_wrap_inline630 of n
    3. recompute best plans for all nodes with above cost
      • this step done incrementally
    4. of n + tex2html_wrap_inline638 of n +
      sum best costs for root of DAG
    5. benefit of n = reduction in total cost due to materializing n
    6. unmark n, and unset cost of node n
  3. Choose node with max cost benefit,
    mark it, materialize it and set its cost to tex2html_wrap_inline630
  4. Repeat from Step 2, until no decrease in cost

Incremental Cost Update Propagation

  • Recomputing best plans from scratch on each benefit computation very expensive
  • Observation: Marking a node as materialized only affects its ancestors
  • Incremental Propagation:
    1. Start from marked node, and propagate cost changes up the DAG
    2. If cost of a node does not change, no need to propagate changes further up
  • Use topological sort numbering to decide which node to propagate changes to next (ensures no revisits)

Improvements to Greedy: Monotonicity Heuristic

  • Monotonicity Assumption: Benefit of materializing a node never increases as other nodes are materialized.
  • Monotonicity tex2html_wrap_inline586 old benefit estimates are overestimates on current benefit
  • ``old'' tex2html_wrap_inline654 computed in an earlier iteration of greedy
  • Key idea: Keep heap of nodes sorted on benefit overestimates
    • Repeat forever
      1. Pop top node t
      2. If benefit t is , break
      3. if benefit of t is current, t is the node with maximum benefit; delete from heap and materialize it
      4. else update its benefit, mark as current, reinsert in heap
  • Monotonicity generally holds; reduces opt. run time by 90%

Extensions

  • Index Selection
  • Nested Queries
  • Parameterized Queries with common subparts
    e.g. multiple calls on query/stored procedure with different parameter values
  • Space Constraints
    Greedy with benefit per unit space

Performance Study

  • Implemented all algorithms (17K lines Volcano + 3K lines MQO)
  • Benchmarks
    • Stand-Alone TPCD
    • Batched TPCD
    • Scaleup

Performance of Stand-Alone TPCD

4.5in tex2html_wrap668
4.5in tex2html_wrap670

Performance on SQL-Server 6.5

3.0in tex2html_wrap672

Performance of Batched TPCD

4.5in tex2html_wrap674
4.5in tex2html_wrap676

Scaleup Analysis

4.5in tex2html_wrap678
4.5in tex2html_wrap680

Empirical Complexity of Greedy

4.5in tex2html_wrap682
4.5in tex2html_wrap684

Conclusion

  • Greedy is best overall, followed by Volcano-RU and Volcano-SH
  • Volcano-SH, Volcano-RU have neglegible overheads.
    Overhead for Greedy acceptable for complex expensive queries
  • Greedy shows linear scaleup with number of queries
  • Benefits show up on TPCD benchmark running on Microsoft SQL-Server 6.5
  • Advise:
    Use Volcano-RU if query very cheap and syntactically complex.
    Use Greedy for all other cases

Future Work

  • Space Constraints
  • Query Caching
    • Sequence of queries
    • Need to predict future
    • Decide what to maintain in the cache
  • Large workloads: Query Abstraction
  • Scheduling - try to avoid materialization
  • View/Index Selection: Update Costs



next up previous
Next: About this document

Ajay K Dhawale
Fri Apr 30 17:56:43 IST 1999