Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-20473

Optimization for materialized views

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Closed
    • Critical
    • Resolution: Duplicate
    • 3.0.0
    • 4.0.0-alpha-2
    • Hive
    • Can be reproduced on a Single node pseudo cluster.

    • Important

    Description

      Optimizer is taking advantage of materialized view only when the query syntax matches the way view was created. Here is an example.

      Source table on which materialized views are created

      +----------------------------------------------------+
      |                   createtab_stmt                   |
      +----------------------------------------------------+
      | CREATE TABLE `mysource`(                           |
      |   `id` int,                                        |
      |   `name` string,                                   |
      |   `start_date` date)                               |
      | ROW FORMAT SERDE                                   |
      |   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'  |
      | WITH SERDEPROPERTIES (                             |
      |   'field.delim'=',',                               |
      |   'serialization.format'=',')                      |
      | STORED AS INPUTFORMAT                              |
      |   'org.apache.hadoop.mapred.TextInputFormat'       |
      | OUTPUTFORMAT                                       |
      |   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' |
      | LOCATION                                           |
      |   'hdfs://xlhive3.openstacklocal:8020/warehouse/tablespace/managed/hive/mysource' |
      | TBLPROPERTIES (                                    |
      |   'bucketing_version'='2',                         |
      |   'transactional'='true',                          |
      |   'transactional_properties'='insert_only',        |
      |   'transient_lastDdlTime'='1535392655')            |
      +----------------------------------------------------+
      

      One of the materialized views "view_1" is created to fetch the data between IDs 1 and 2 using this statement

      select `mysource`.`id`, `mysource`.`name`, `mysource`.`start_date` from `default`.`mysource` where `mysource`.`id` between 1 and 2
      

      *When a SELECT is executed against the source table using the following SELECT statement, this works fine and can be validated with the explain plan.
      *

      0: jdbc:hive2://localhost:10000/default> explain select * from mysource where id between 1 and 2;
      INFO  : Compiling command(queryId=hive_20180828062847_b313e0aa-686c-42f5-94e2-252dd836501c): explain select * from mysource where id between 1 and 2
      INFO  : Semantic Analysis Completed (retrial = false)
      INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:Explain, type:string, comment:null)], properties:null)
      INFO  : Completed compiling command(queryId=hive_20180828062847_b313e0aa-686c-42f5-94e2-252dd836501c); Time taken: 0.224 seconds
      INFO  : Executing command(queryId=hive_20180828062847_b313e0aa-686c-42f5-94e2-252dd836501c): explain select * from mysource where id between 1 and 2
      INFO  : Starting task [Stage-1:EXPLAIN] in serial mode
      INFO  : Completed executing command(queryId=hive_20180828062847_b313e0aa-686c-42f5-94e2-252dd836501c); Time taken: 0.006 seconds
      INFO  : OK
      +----------------------------------------------------+
      |                      Explain                       |
      +----------------------------------------------------+
      | STAGE DEPENDENCIES:                                |
      |   Stage-0 is a root stage                          |
      |                                                    |
      | STAGE PLANS:                                       |
      |   Stage: Stage-0                                   |
      |     Fetch Operator                                 |
      |       limit: -1                                    |
      |       Processor Tree:                              |
      |         TableScan                                  |
      |           alias: default.view_1                    |
      |           Select Operator                          |
      |             expressions: id (type: int), name (type: string), start_date (type: date) |
      |             outputColumnNames: _col0, _col1, _col2 |
      |             ListSink                               |
      |                                                    |
      +----------------------------------------------------+
      

      If the rewrite of the same SELECT is written using >= and <=, which should yield the same result, the optimizer does not take advantage of the materialized view, unless of course we create another view with this >= and <= syntax.

      0: jdbc:hive2://localhost:10000/default> explain select * from mysource where id >= 1 and <=2;
      Error: Error while compiling statement: FAILED: ParseException line 1:49 cannot recognize input near '<=' '2' '<EOF>' in expression specification (state=42000,code=40000)
      0: jdbc:hive2://localhost:10000/default> explain select * from mysource where id >= 1 and id <=2;
      INFO  : Compiling command(queryId=hive_20180828063123_7a5bf6dd-5999-44d7-a2cf-3b2e634371ca): explain select * from mysource where id >= 1 and id <=2
      INFO  : Semantic Analysis Completed (retrial = false)
      INFO  : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:Explain, type:string, comment:null)], properties:null)
      INFO  : Completed compiling command(queryId=hive_20180828063123_7a5bf6dd-5999-44d7-a2cf-3b2e634371ca); Time taken: 0.226 seconds
      INFO  : Executing command(queryId=hive_20180828063123_7a5bf6dd-5999-44d7-a2cf-3b2e634371ca): explain select * from mysource where id >= 1 and id <=2
      INFO  : Starting task [Stage-1:EXPLAIN] in serial mode
      INFO  : Completed executing command(queryId=hive_20180828063123_7a5bf6dd-5999-44d7-a2cf-3b2e634371ca); Time taken: 0.005 seconds
      INFO  : OK
      +----------------------------------------------------+
      |                      Explain                       |
      +----------------------------------------------------+
      | STAGE DEPENDENCIES:                                |
      |   Stage-0 is a root stage                          |
      |                                                    |
      | STAGE PLANS:                                       |
      |   Stage: Stage-0                                   |
      |     Fetch Operator                                 |
      |       limit: -1                                    |
      |       Processor Tree:                              |
      |         TableScan                                  |
      |           alias: mysource                          |
      |           filterExpr: ((id >= 1) and (id <= 2)) (type: boolean) |
      |           Filter Operator                          |
      |             predicate: ((id >= 1) and (id <= 2)) (type: boolean) |
      |             Select Operator                        |
      |               expressions: id (type: int), name (type: string), start_date (type: date) |
      |               outputColumnNames: _col0, _col1, _col2 |
      |               ListSink                             |
      |                                                    |
      +----------------------------------------------------+
      

      Attachments

        Issue Links

          Activity

            People

              Unassigned Unassigned
              shyamsunderrai@gmail.com Shyam Rai
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: