Hive Query Performance Tuning in Hadoop


Hive Query Language (HiveQL) Performance Tuning

HiveQL is a declarative language, Means users fire a declarative sentence (Query) on hive shell using hive syntax. Now Hive Internal mechanism comes into action and It converts Hive Query to MapReduce Job. The process of Query Parsing, Planning, Optimization and Execution is handled by Hive Internal Configurations.

However if we concerned about the performance optimization on hive queries, Then knowing about the internal mechanism of Apache hive and Low-Level Implementation of hive language would be more helpful for query optimization.

Tuning hive performance includes a number of factors from Configuration changes, Enabling and disabling hive features to correct and optimized hive query making.

Below are some of the factors which will let us know how to optimize hive query execution time

  1. Parallel Execution of hive query
  2. Local Mode
  3. Strict Mode
  4. Limit Tuning
  5. Using EXPLAIN
  6. Optimised Joins
  7. Number of Mappers and Reduces
  8. JVM Reuse
  9. Indexes
  10. Dynamic Partition Tuning

Parallel Execution of hive query for performance Tuning

As we all know the power of Parallel Processing in execution Framework. Hive converts a particular query into multiple stages (MapReduce, Sampling, Merge etc.) and by default, All those stage runs one after another in a sequential manner. But if we one or more stages in hive job are not dependent on each other, Than we can run those stages simultaneously, Thus improving overall hive query performance. The hive job may become much faster now.

To enable parallel execution in hive, We have to set parallel execution properties to TRUE in hive-site.xml Configuration file. Also if you don’t want to set it for all queries, You can use SET Hive Command for a specific query to be run on parallel.

Changing Parallel Execution Configuration for Performance Tuning

Local Mode uses for Hive Query Performance Tuning

Running hive query for small Dataset can be dramatically Tuned by using Local Mode hive versions. Because it can reduce the overhead of launching hive query on a cluster, Thus Entire Job is performed on a single machine.

You can enable Local Mode in hive by setting below properties on hive-site.xml Configuration file.

Hive will automatically apply above optimization If you are using local mode for query running. Local Mode can be activated through directly Set Hive Commands or Changing the hive-site.xml Configuration file with below text.

Strict Mode uses for Hive Query Performance Tuning

Strict Mode is the powerful Configuration in hive to restrict users to fire queries that may worsen effects on Hadoop cluster.

Strict Mode in a hive can be enabled by setting Strict Mode Property to TRUE.

Some of the restrictions STRICT MODE Puts on hive queries are –

Restriction to fire queries on Partitioned Tables, If not using Partition filter on where clause. The limitation is put because Partitioned Tables are huge tables and Hive puts restriction to scan all partitions to avoid uses of high resources.

Order By Clause having no LIMIT Clause can be restricted in Strict mode to avoid high utilization of resources.

In strict mode, Cartesian product is not allowed. If you really want to perform the operation, +set hive.mapred.mode=nonstrict+

“If previously you are working on SQL Queries, The common optimization techniques in SQL for joins is to include JOIN Condition in where predicates to exclude results, This will improve query execution time in SQL, But hive does not support this type of optimization. You have to have includes JOIN Condition in ON Clause. Otherwise, It will be converted in Cartesian Product and consume more Resources. ”

LIMIT Clause uses for Hive Query Performance Tuning

Limit Clause with queries is mostly used by the programmer’s, However for many types of queries, LIMIT Clause execute entire data set and then present the require amount of data.

Enabling Sampling with Limit Clause would improve those types of query performance. But you may get incorrect data for Aggregates Functions because sampling includes only the part of the data set.

To enable sampling with Limit clause you have to change following configuration to hive-site.xml file.

Using EXPLAIN Plan for Query Optimization in Hive

One of the most powerful features of Apache hive, when we take performance into consideration for HiveQL  is EXPLAIN Feature. It will reveal how Hive translates your query into MapReduce Job. Basically, It will give low-level implementation details for Hive query.

Below if the uses of EXPLAIN and DESCRIBE Feature of HiveQL

Using EXPLAIN will give query information and It will not execute Query. See Below Example-

It will print the abstract syntax key