Apache Hive
Hive interview questions
What is Apache Hive, and how does it differ from traditional databases? Apache Hive is a data warehouse infrastructure built on top of Hadoop for querying and analyzing large datasets stored in HDFS using a SQL-like language called HiveQL. Key differences from traditional databases:
- Schema-on-read: Hive applies the table schema when data is read at query time, unlike traditional databases, which enforce schema-on-write.
- Batch processing: Hive is designed for batch workloads, making it slower than OLTP (online transaction processing) databases.
- Not real-time: Queries are converted into MapReduce or Spark jobs, which are not suitable for real-time querying.
- Storage: Hive stores data in HDFS and operates on massive-scale datasets.
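A minimal HiveQL sketch of schema-on-read (the table, columns, and file path here are hypothetical): loading a file never validates it against the column types; malformed values only surface as NULLs when queried.

```sql
-- Schema is declared up front but only applied at read time.
CREATE TABLE page_views (
  user_id BIGINT,
  url     STRING,
  view_ts TIMESTAMP
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

-- LOAD DATA just moves the file into the table's HDFS directory;
-- no rows are parsed or rejected at load time.
LOAD DATA INPATH '/data/raw/page_views.tsv' INTO TABLE page_views;

-- Parsing happens here, at query time (schema-on-read).
SELECT url, COUNT(*) FROM page_views GROUP BY url;
```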
How does Hive support partitioning and bucketing, and when would you use each?
- Partitioning: Divides data into sub-directories based on column values, improving query performance by scanning only the relevant partitions. Use case: filtering on specific column values, such as year=2023.
- Bucketing: Divides the data within each partition into a fixed number of buckets based on a hash of a column value. Use case: evenly distributing data, or speeding up joins on a specific column.
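The two techniques above can be sketched in one DDL statement (table and column names are hypothetical, as is the bucket count):

```sql
-- Partitioned by year, bucketed by user_id into 32 buckets.
CREATE TABLE sales (
  order_id BIGINT,
  user_id  BIGINT,
  amount   DECIMAL(10, 2)
)
PARTITIONED BY (year INT)
CLUSTERED BY (user_id) INTO 32 BUCKETS
STORED AS ORC;

-- A filter on the partition column prunes whole directories,
-- so only the year=2023 partition is scanned:
SELECT SUM(amount) FROM sales WHERE year = 2023;
```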
What are the different file formats supported in Hive, and which would you choose for optimal performance? Hive supports file formats such as:
- TextFile: The default plain-text format; least efficient.
- SequenceFile: A binary format; supports compression.
- ORC (Optimized Row Columnar): Columnar storage with compression and indexing. Use case: best for analytical queries on large datasets.
- Parquet: A columnar format similar to ORC, widely used with Spark. Use case: when interoperability with other tools is needed.
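The file format is chosen at table creation with STORED AS; a sketch, with hypothetical table names and an assumed ORC compression setting:

```sql
-- ORC table with Snappy compression for analytical queries.
CREATE TABLE events_orc (
  event_id BIGINT,
  payload  STRING
)
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'SNAPPY');

-- Existing text-format data can be converted with an insert-select:
-- INSERT INTO events_orc SELECT event_id, payload FROM events_text;
```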
What is the difference between an internal table and an external table in Hive?
- Internal (managed) table: Hive manages both the metadata and the data; dropping the table deletes the data. Use case: temporary or fully Hive-managed data.
- External table: Hive stores only the metadata, while the data resides externally (e.g., in HDFS or S3); dropping the table does not delete the data. Use case: sharing data between multiple tools.
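A sketch of an external table pointing at data Hive does not own (the table name and path are hypothetical):

```sql
-- EXTERNAL plus LOCATION tells Hive the data lives elsewhere.
CREATE EXTERNAL TABLE access_logs (
  ip      STRING,
  ts      STRING,
  request STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
LOCATION '/data/shared/access_logs/';

-- DROP TABLE access_logs removes only the metastore entry;
-- the files under /data/shared/access_logs/ remain untouched.
```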
Explain how Hive handles joins. What are the challenges with large datasets? Hive supports inner, left outer, right outer, and full outer joins. Challenges at scale:
- Skewed data: Can lead to unbalanced processing. Solution: use a MAPJOIN for small tables, or bucketing.
- Memory usage: Large tables may exceed available memory. Solution: enable hive.auto.convert.join for map-side joins.
- Slow performance: Join operations generate large intermediate data. Solution: optimize with partitioning and bucketing.
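The map-side join mitigation above can be sketched two ways (table names are hypothetical):

```sql
-- 1. Let Hive convert the join automatically when one side is
--    small enough to broadcast to every mapper:
SET hive.auto.convert.join = true;

SELECT o.order_id, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id;

-- 2. Or request it explicitly with the older MAPJOIN hint,
--    naming the small table to be loaded into memory:
SELECT /*+ MAPJOIN(c) */ o.order_id, c.name
FROM orders o
JOIN customers c ON o.customer_id = c.id;
```

A map-side join avoids the shuffle phase entirely, which is why it helps with both skew and intermediate data volume.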
How does Hive support ACID properties? ACID support is enabled through transactional tables: the table must be stored in ORC format and created with the transactional property set (TBLPROPERTIES ('transactional'='true')).
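A sketch of a transactional table (hypothetical table name; assumes a Hive deployment configured with transactions enabled):

```sql
-- ACID tables must be ORC and flagged as transactional.
CREATE TABLE accounts (
  id      BIGINT,
  balance DECIMAL(12, 2)
)
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

-- With ACID enabled, row-level DML becomes available:
UPDATE accounts SET balance = balance - 100 WHERE id = 42;
DELETE FROM accounts WHERE balance < 0;
```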
What is a Hive Metastore, and why is it important? The Hive Metastore is a central repository, typically backed by a relational database such as MySQL or PostgreSQL, that stores metadata about databases, tables, columns, partitions, and their storage locations. It is important because the query engine depends on it to resolve table definitions, and because other tools (e.g., Spark or Presto/Trino) can share the same metastore to query the same tables.
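Metadata served from the metastore can be inspected directly from HiveQL; a sketch (the table name is hypothetical):

```sql
-- List databases and tables registered in the metastore.
SHOW DATABASES;
SHOW TABLES;

-- DESCRIBE FORMATTED returns metastore details for a table:
-- owner, HDFS location, storage format, table type
-- (MANAGED_TABLE vs EXTERNAL_TABLE), and table properties.
DESCRIBE FORMATTED sales;
```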