- 5 min read
Apache Iceberg
Apache Iceberg is a high-performance table format. It is useful where traditional table formats like Hive fall short on scalability, performance, or compatibility. Typical scenarios include:
- Large scale data management
- Evolving schema requirements
- Need for data consistency and atomic operations
- Improved query performance:
  - Partition pruning: identifies and scans only the relevant partitions, reducing query time
  - Hidden partitioning: abstracts partition logic, removing the need to specify partitions manually
  - Vectorized reads: supports columnar formats like Parquet and ORC
- Integration with modern tools: Apache Spark, Presto, Flink, and data lake storage such as AWS S3, Azure Blob Storage, and GCS
- Allows multiple processing engines to read and write the same table simultaneously
- Migration from HDFS or Hive: where existing deployments suffer performance issues, Apache Iceberg provides a replacement
How does Apache Iceberg solve issues that Hive has?

| Feature | Hive | Iceberg |
|---|---|---|
| Schema evolution | Limited and risky | Safe, without rewriting data |
| Partitioning | Static and manual | Hidden and dynamic |
| Data integrity | Prone to inconsistency | Atomic operations |
| Metadata management | Centralized and slow | Distributed and efficient |
| Performance | Limited | Advanced pruning and compaction |
| ACID transactions | Basic and slow | Native and performant |
| Engine support | HiveQL only | Multiple engines supported |
| Time travel & rollback | Not supported | Fully supported |
How does Iceberg handle an evolving schema? It assigns a unique ID to each new column. When it writes new data (for example, to a Parquet file), it includes that column; for rows written before the column existed, the value is read back as null.
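The idea above can be sketched in plain Python. This is an illustrative model only, not Iceberg's actual implementation: the schemas, field IDs, and row values are made up to show how ID-based column tracking makes old files readable after a column is added.

```python
# Hypothetical schemas: column name -> unique field ID.
schema_v1 = {"id": 1, "name": 2}
schema_v2 = {"id": 1, "name": 2, "email": 3}  # new column gets a NEW id (3)

# A data file written under schema v1 only contains field IDs 1 and 2.
old_file_row = {1: 42, 2: "alice"}

def read_row(row, schema):
    # Resolve each column by its field ID; IDs missing from the file read as None (null).
    return {col: row.get(fid) for col, fid in schema.items()}

print(read_row(old_file_row, schema_v2))
# -> {'id': 42, 'name': 'alice', 'email': None}
```

Because columns are matched by ID rather than by name or position, renaming a column or dropping and re-adding one with the same name cannot silently resurrect old data.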
Explain Iceberg's approach to partitioning and how it resolves issues with traditional static partitioning.
Traditional partitioning issues:
- Data imbalance
- Over-partitioning: too many small partitions
- Static partitions can be hard to change
- Suboptimal queries: may scan irrelevant partitions
Iceberg's dynamic partitioning:
- Partition evolution: Iceberg allows the partitioning strategy to change as data grows. For example, a table might initially be partitioned by date and later be partitioned by region.
- Intelligent partitioning: Iceberg partitions data based on value distributions.
- Hidden partitioning: Iceberg handles partitioning internally, so the user doesn't have to manage partition keys or partition explicitly as in Hive.
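Hidden partitioning can be sketched as a transform applied by the writer. This is a simplified illustration, assuming a hypothetical `day_transform` comparable in spirit to Iceberg's `day()` partition transform; the rows and column names are invented.

```python
from datetime import datetime

def day_transform(ts: datetime) -> str:
    # Derive a partition value from a source column (illustrative only).
    return ts.strftime("%Y-%m-%d")

rows = [
    {"event_ts": datetime(2024, 1, 1, 9, 30), "value": 10},
    {"event_ts": datetime(2024, 1, 1, 17, 5), "value": 20},
    {"event_ts": datetime(2024, 1, 2, 8, 0), "value": 30},
]

# The writer groups rows into partitions automatically; the user filters on
# event_ts itself and never references a partition column.
partitions = {}
for row in rows:
    partitions.setdefault(day_transform(row["event_ts"]), []).append(row)

print(sorted(partitions))  # -> ['2024-01-01', '2024-01-02']
```

Because the transform lives in table metadata, a query like `WHERE event_ts >= '2024-01-02'` can be mapped to the matching partitions without the user ever writing a partition predicate.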
Describe how Apache Iceberg ensures data consistency and supports concurrent writes and reads.
- Maintains a snapshot for each insert and update on the table; once created, a snapshot is immutable
- Keeps a full snapshot history
- Consistent view: Iceberg reads from a snapshot of the table, so readers always see a consistent state
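The consistent-view behavior can be modeled in a few lines. This is a toy single-process sketch, not Iceberg's real metadata layout: a reader pins the snapshot that is current when its query starts, so a concurrent commit never changes what it sees.

```python
# Immutable snapshot history (each commit appends; nothing is modified in place).
snapshots = [{"id": 1, "files": ["a.parquet"]}]

def start_read():
    # A reader pins the latest snapshot at query start.
    return snapshots[-1]

reader_view = start_read()

# A concurrent writer commits a new snapshot with an extended file list.
snapshots.append({"id": 2, "files": ["a.parquet", "b.parquet"]})

print(reader_view["files"])   # -> ['a.parquet']  (reader's view is unchanged)
print(start_read()["files"])  # -> ['a.parquet', 'b.parquet']  (new readers see the commit)
```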
How does Iceberg's table versioning work, and what benefits does it provide for data management? Each snapshot records:
- Data files: the files that hold the actual data.
- Metadata: the table schema, partitioning rules, and a reference to the manifest lists.
- Changes: the operation type (append, overwrite, delete) that created the snapshot.
- Snapshot ID: a unique identifier for each snapshot.
- Timestamp: the exact time when the snapshot was created. In a time-travel query, that specific snapshot is the one served.
What are the key components of Iceberg's metadata layer, and how does it improve query performance?
- Schema: column definitions and types.
- Partition specification: rules for data partitioning.
- Snapshot list: all snapshots of the table, including timestamps and IDs.
- Table properties: configuration settings such as write optimization or compaction.
How it improves performance: predicate pushdown, schema evolution without rewrites, and columnar formats.
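Predicate pushdown against metadata can be sketched as min/max pruning. This is an illustrative model, similar in spirit to Iceberg's manifest-level file pruning; the file names and statistics are invented.

```python
# Per-file column statistics as a query planner might read them from metadata.
files = [
    {"path": "f1.parquet", "min_id": 0,   "max_id": 99},
    {"path": "f2.parquet", "min_id": 100, "max_id": 199},
    {"path": "f3.parquet", "min_id": 200, "max_id": 299},
]

def prune(files, value):
    # Keep only files whose [min, max] range could contain `id == value`;
    # every other file is skipped without being opened.
    return [f["path"] for f in files if f["min_id"] <= value <= f["max_id"]]

print(prune(files, 150))  # -> ['f2.parquet']: two of the three files are never scanned
```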
How does Apache Iceberg manage ACID properties? Through atomic, snapshot-based commits: each write produces a new immutable snapshot that is committed atomically, and readers are isolated from in-flight writes via MVCC.
How does Apache Iceberg integrate with popular compute engines like Spark, Flink, and Presto?

```python
spark.read.format("iceberg").load("path_to_table").filter("column = value").show()
```

```sql
SELECT * FROM iceberg_table TIMESTAMP AS OF '2024-01-01T00:00:00';
```
How does Iceberg ensure efficient data storage and retrieval for massive datasets?
- Separation of metadata and data: Iceberg uses metadata files to track table structure and data locations.
- Hidden partitioning: automatically applies partitioning logic (e.g., by date or range) without requiring users to manage partition keys explicitly. Why efficient: avoids over-partitioning and data skew, leading to balanced data distribution and reduced query latency.
- File pruning: Iceberg uses statistics stored in metadata to prune irrelevant files during queries, eliminating the need to scan unnecessary files.
- Compaction: Iceberg combines small files, reducing the overhead caused by excessive metadata and file I/O.
- Partition evolution: Iceberg allows partitions to be restructured without rewriting historical data.
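The small-file compaction idea can be sketched as a simple bin-merge. The thresholds and file sizes below are hypothetical, chosen only to illustrate why merging many tiny files cuts metadata and open-file overhead.

```python
SMALL = 32    # files below this size (MB) are compaction candidates (assumed)
TARGET = 128  # target compacted file size in MB (assumed)

file_sizes = [10, 12, 8, 128, 20, 130, 5]

def compact(sizes):
    small = [s for s in sizes if s < SMALL]
    kept = [s for s in sizes if s >= SMALL]  # already large enough; untouched
    merged, current = [], 0
    for s in small:
        # Start a new output file once the target size would be exceeded.
        if current + s > TARGET:
            merged.append(current)
            current = 0
        current += s
    if current:
        merged.append(current)
    return kept + merged

print(compact(file_sizes))  # 7 files in, 3 files out: [128, 130, 55]
```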
Explain Apache Iceberg’s approach to handling time-travel queries. Provide an example use case where this feature is critical. Apache Iceberg supports time-travel queries by leveraging its snapshot-based architecture. Each write operation (insert, update, delete) creates a new snapshot of the table’s metadata and data files. These snapshots allow users to query the state of the table at any point in its history.
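An AS OF lookup over the snapshot history can be sketched as follows. The structures are illustrative (real Iceberg keeps this log in table metadata files): the query picks the latest snapshot committed at or before the requested timestamp.

```python
from datetime import datetime

# Snapshot log, ordered by commit time (values invented for illustration).
snapshot_log = [
    (datetime(2024, 1, 1), {"id": 1, "rows": 100}),
    (datetime(2024, 2, 1), {"id": 2, "rows": 250}),
    (datetime(2024, 3, 1), {"id": 3, "rows": 180}),  # e.g., after a delete
]

def snapshot_as_of(ts):
    # Latest snapshot committed at or before `ts`; None if the table
    # did not exist yet at that time.
    candidates = [snap for t, snap in snapshot_log if t <= ts]
    return candidates[-1] if candidates else None

print(snapshot_as_of(datetime(2024, 2, 15))["id"])  # -> 2
```

A use case where this is critical: an analyst reproducing a month-end report after later deletes changed the table can query the snapshot that was current when the report first ran.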
What challenges does Apache Iceberg address in modern data lakes, and how does it support features like ACID transactions?
- Schema evolution. Challenge: traditional data lake formats (e.g., Hive) struggle with non-breaking schema changes such as adding, removing, or renaming columns. Iceberg solution: supports schema evolution without rewriting data, enabling seamless updates with robust column tracking.
- Partitioning limitations. Challenge: static partitioning leads to over-partitioning or skewed partitions, which can degrade query performance. Iceberg solution: hidden partitioning abstracts partition keys, enabling dynamic and efficient partition management while reducing query complexity.
- Data consistency. Challenge: concurrent write/read operations in traditional data lakes can lead to inconsistencies. Iceberg solution: implements Multi-Version Concurrency Control (MVCC) to isolate operations and ensure consistency.
- Query performance. Challenge: scanning entire datasets due to lack of metadata optimization in traditional formats. Iceberg solution: uses metadata pruning (e.g., manifest files, partition stats) to reduce query scope and improve performance.
- Data versioning. Challenge: no native versioning, making debugging or reproducing historical queries difficult. Iceberg solution: supports time travel and snapshot-based versioning, allowing users to query or roll back to historical states.
- Deletes and updates. Challenge: traditional data lakes require complex and costly mechanisms for row-level deletes or updates. Iceberg solution: supports row-level operations through metadata layers, making them efficient and ACID-compliant.
- Scalability. Challenge: handling petabyte-scale datasets with consistent performance is difficult with older formats. Iceberg solution: designed to manage massive datasets by separating metadata storage and query operations.
- ACID compliance. Challenge: lack of atomicity, consistency, isolation, and durability (ACID) in traditional data lakes. Iceberg solution: provides full ACID support:
  - Atomicity: operations either fully succeed or fail without partial updates.
  - Consistency: metadata ensures that queries always see a consistent view of data.
  - Isolation: concurrent operations do not interfere with each other via MVCC.
  - Durability: data is reliably committed through snapshots.
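The isolation and atomicity points can be sketched as optimistic concurrency: a commit succeeds only if the table's current snapshot is still the one the writer based its work on; otherwise the writer must re-read and retry. This is a simplified single-process model, not Iceberg's catalog implementation.

```python
table = {"current_snapshot": 1}

def commit(based_on, new_snapshot):
    # Atomic compare-and-swap on the table's snapshot pointer: the whole
    # write becomes visible at once, or not at all.
    if table["current_snapshot"] != based_on:
        return False  # conflict: another writer committed first
    table["current_snapshot"] = new_snapshot
    return True

# Two writers both start from snapshot 1.
print(commit(based_on=1, new_snapshot=2))  # -> True  (writer A wins)
print(commit(based_on=1, new_snapshot=3))  # -> False (writer B conflicts)
print(commit(based_on=2, new_snapshot=3))  # -> True  (writer B retries from snapshot 2)
```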