- 5 min read
Apache Iceberg
Apache Iceberg is a high-performance table format. It is useful where traditional table formats like Hive fall short on scalability, performance, or compatibility. Typical scenarios include:
- Large scale data management
- Evolving schema requirements
- Need for data consistency and atomic operations
- Improved query performance:
  - Partition pruning: identifies and scans only the relevant partitions, reducing query time
  - Hidden partitioning: abstracts partition logic, removing the need to specify partitions manually
  - Vectorized reads: supports columnar formats like Parquet and ORC
- Integration with modern tools: Apache Spark, Presto, Flink, and data lake storage such as AWS S3, Azure Blob Storage, and GCS
- Allows multiple processing engines to read and write the same table simultaneously
- Migration from HDFS or Hive: where existing deployments suffer performance issues, Apache Iceberg provides a replacement
How does Apache Iceberg solve issues that Hive has?

| Feature | Hive | Iceberg |
|---|---|---|
| Schema evolution | Limited and risky | Safe, without rewriting data |
| Partitioning | Static and manual | Hidden and dynamic |
| Data integrity | Prone to inconsistency | Atomic operations |
| Metadata management | Centralized and slow | Distributed and efficient |
| Performance | Limited | Advanced pruning and compaction |
| ACID transactions | Basic and slow | Native and performant |
| Engine support | HiveQL only | Multiple engines supported |
| Time travel & rollback | Not supported | Fully supported |
How does Iceberg handle an evolving schema? It assigns a unique ID to each new column. When it writes new data (for example, to a Parquet file), it includes that column; for rows written before the column existed, the value is read back as null.
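The idea above can be sketched in plain Python. This is an illustrative model only, not Iceberg's actual implementation: the schemas, field IDs, and row values are made up to show how ID-based column tracking makes old files readable after a column is added.

```python
# Hypothetical schemas: column name -> unique field ID.
schema_v1 = {"id": 1, "name": 2}
schema_v2 = {"id": 1, "name": 2, "email": 3}  # new column gets a NEW id (3)

# A data file written under schema v1 only contains field IDs 1 and 2.
old_file_row = {1: 42, 2: "alice"}

def read_row(row, schema):
    # Resolve each column by its field ID; IDs missing from the file read as None (null).
    return {col: row.get(fid) for col, fid in schema.items()}

print(read_row(old_file_row, schema_v2))
# -> {'id': 42, 'name': 'alice', 'email': None}
```

Because columns are matched by ID rather than by name or position, renaming a column or dropping and re-adding one with the same name cannot silently resurrect old data.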
Explain Iceberg's approach to partitioning and how it resolves issues with traditional static partitioning.
Traditional partitioning issues:
- Data imbalance
- Over-partitioning: too many small partitions
- Static partitions can be hard to change
- Suboptimal queries: may scan irrelevant partitions
Iceberg's dynamic partitioning:
- Partition evolution: Iceberg allows the partitioning strategy to change as data grows. For example, a table might initially be partitioned by date and later be partitioned by region.
- Intelligent partitioning: Iceberg partitions data based on value distributions.
- Hidden partitioning: Iceberg handles partitioning internally, so the user doesn't have to manage partition keys or partition explicitly as in Hive.
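Hidden partitioning can be sketched as a transform applied by the writer. This is a simplified illustration, assuming a hypothetical `day_transform` comparable in spirit to Iceberg's `day()` partition transform; the rows and column names are invented.

```python
from datetime import datetime

def day_transform(ts: datetime) -> str:
    # Derive a partition value from a source column (illustrative only).
    return ts.strftime("%Y-%m-%d")

rows = [
    {"event_ts": datetime(2024, 1, 1, 9, 30), "value": 10},
    {"event_ts": datetime(2024, 1, 1, 17, 5), "value": 20},
    {"event_ts": datetime(2024, 1, 2, 8, 0), "value": 30},
]

# The writer groups rows into partitions automatically; the user filters on
# event_ts itself and never references a partition column.
partitions = {}
for row in rows:
    partitions.setdefault(day_transform(row["event_ts"]), []).append(row)

print(sorted(partitions))  # -> ['2024-01-01', '2024-01-02']
```

Because the transform lives in table metadata, a query like `WHERE event_ts >= '2024-01-02'` can be mapped to the matching partitions without the user ever writing a partition predicate.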
Describe how Apache Iceberg ensures data consistency and supports concurrent writes and reads.
- Maintains a snapshot for each insert and update on the table; once created, a snapshot is immutable
- Keeps a full snapshot history
- Consistent view: Iceberg reads from a snapshot of the table, so readers always see a consistent state
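The consistent-view behavior can be modeled in a few lines. This is a toy single-process sketch, not Iceberg's real metadata layout: a reader pins the snapshot that is current when its query starts, so a concurrent commit never changes what it sees.

```python
# Immutable snapshot history (each commit appends; nothing is modified in place).
snapshots = [{"id": 1, "files": ["a.parquet"]}]

def start_read():
    # A reader pins the latest snapshot at query start.
    return snapshots[-1]

reader_view = start_read()

# A concurrent writer commits a new snapshot with an extended file list.
snapshots.append({"id": 2, "files": ["a.parquet", "b.parquet"]})

print(reader_view["files"])   # -> ['a.parquet']  (reader's view is unchanged)
print(start_read()["files"])  # -> ['a.parquet', 'b.parquet']  (new readers see the commit)
```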
How does Iceberg's table versioning work, and what benefits does it provide for data management? Each snapshot records:
- Data files: the files that hold the actual data.
- Metadata: the table schema, partitioning rules, and a reference to the manifest lists.
- Changes: the operation type (append, overwrite, delete) that created the snapshot.
- Snapshot ID: a unique identifier for each snapshot.
- Timestamp: the exact time when the snapshot was created. In a time-travel query, that specific snapshot is the one served.
What are the key components of Iceberg's metadata layer, and how does it improve query performance?
- Schema: column definitions and types.
- Partition specification: rules for data partitioning.
- Snapshot list: all snapshots of the table, including timestamps and IDs.
- Table properties: configuration settings such as write optimization or compaction.
How it improves performance: predicate pushdown, schema evolution without rewrites, and columnar formats.
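Predicate pushdown against metadata can be sketched as min/max pruning. This is an illustrative model, similar in spirit to Iceberg's manifest-level file pruning; the file names and statistics are invented.

```python
# Per-file column statistics as a query planner might read them from metadata.
files = [
    {"path": "f1.parquet", "min_id": 0,   "max_id": 99},
    {"path": "f2.parquet", "min_id": 100, "max_id": 199},
    {"path": "f3.parquet", "min_id": 200, "max_id": 299},
]

def prune(files, value):
    # Keep only files whose [min, max] range could contain `id == value`;
    # every other file is skipped without being opened.
    return [f["path"] for f in files if f["min_id"] <= value <= f["max_id"]]

print(prune(files, 150))  # -> ['f2.parquet']: two of the three files are never scanned
```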
How does Apache Iceberg manage ACID properties? Through atomic, snapshot-based commits: each write produces a new immutable snapshot that is committed atomically, and readers are isolated from in-flight writes via MVCC.
How does Apache Iceberg integrate with popular compute engines like Spark, Flink, and Presto?

```python
spark.read.format("iceberg").load("path_to_table").filter("column = value").show()
```

```sql
SELECT * FROM iceberg_table TIMESTAMP AS OF '2024-01-01T00:00:00';
```
How does Iceberg ensure efficient data storage and retrieval for massive datasets?
- Separation of metadata and data: Iceberg uses metadata files to track table structure and data locations.
- Hidden partitioning: automatically applies partitioning logic (e.g., by date or range) without requiring users to manage partition keys explicitly. Why efficient: avoids over-partitioning and data skew, leading to balanced data distribution and reduced query latency.
- File pruning: Iceberg uses statistics stored in metadata to prune irrelevant files during queries, eliminating the need to scan unnecessary files.
- Compaction: Iceberg combines small files, reducing the overhead caused by excessive metadata and file I/O.
- Partition evolution: Iceberg allows partitions to be restructured without rewriting historical data.
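The small-file compaction idea can be sketched as a simple bin-merge. The thresholds and file sizes below are hypothetical, chosen only to illustrate why merging many tiny files cuts metadata and open-file overhead.

```python
SMALL = 32    # files below this size (MB) are compaction candidates (assumed)
TARGET = 128  # target compacted file size in MB (assumed)

file_sizes = [10, 12, 8, 128, 20, 130, 5]

def compact(sizes):
    small = [s for s in sizes if s < SMALL]
    kept = [s for s in sizes if s >= SMALL]  # already large enough; untouched
    merged, current = [], 0
    for s in small:
        # Start a new output file once the target size would be exceeded.
        if current + s > TARGET:
            merged.append(current)
            current = 0
        current += s
    if current:
        merged.append(current)
    return kept + merged

print(compact(file_sizes))  # 7 files in, 3 files out: [128, 130, 55]
```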
Explain Apache Iceberg’s approach to handling time-travel queries. Provide an example use case where this feature is critical. Apache Iceberg supports time-travel queries by leveraging its snapshot-based architecture. Each write operation (insert, update, delete) creates a new snapshot of the table’s metadata and data files. These snapshots allow users to query the state of the table at any point in its history.
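An AS OF lookup over the snapshot history can be sketched as follows. The structures are illustrative (real Iceberg keeps this log in table metadata files): the query picks the latest snapshot committed at or before the requested timestamp.

```python
from datetime import datetime

# Snapshot log, ordered by commit time (values invented for illustration).
snapshot_log = [
    (datetime(2024, 1, 1), {"id": 1, "rows": 100}),
    (datetime(2024, 2, 1), {"id": 2, "rows": 250}),
    (datetime(2024, 3, 1), {"id": 3, "rows": 180}),  # e.g., after a delete
]

def snapshot_as_of(ts):
    # Latest snapshot committed at or before `ts`; None if the table
    # did not exist yet at that time.
    candidates = [snap for t, snap in snapshot_log if t <= ts]
    return candidates[-1] if candidates else None

print(snapshot_as_of(datetime(2024, 2, 15))["id"])  # -> 2
```

A use case where this is critical: an analyst reproducing a month-end report after later deletes changed the table can query the snapshot that was current when the report first ran.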
What challenges does Apache Iceberg address in modern data lakes, and how does it support features like ACID transactions?
- Schema evolution. Challenge: traditional data lake formats (e.g., Hive) struggle with non-breaking schema changes such as adding, removing, or renaming columns. Iceberg solution: supports schema evolution without rewriting data, enabling seamless updates with robust column tracking.
- Partitioning limitations. Challenge: static partitioning leads to over-partitioning or skewed partitions, which can degrade query performance. Iceberg solution: hidden partitioning abstracts partition keys, enabling dynamic and efficient partition management while reducing query complexity.
- Data consistency. Challenge: concurrent write/read operations in traditional data lakes can lead to inconsistencies. Iceberg solution: implements Multi-Version Concurrency Control (MVCC) to isolate operations and ensure consistency.
- Query performance. Challenge: scanning entire datasets due to lack of metadata optimization in traditional formats. Iceberg solution: uses metadata pruning (e.g., manifest files, partition stats) to reduce query scope and improve performance.
- Data versioning. Challenge: no native versioning, making debugging or reproducing historical queries difficult. Iceberg solution: supports time travel and snapshot-based versioning, allowing users to query or roll back to historical states.
- Deletes and updates. Challenge: traditional data lakes require complex and costly mechanisms for row-level deletes or updates. Iceberg solution: supports row-level operations through metadata layers, making them efficient and ACID-compliant.
- Scalability. Challenge: handling petabyte-scale datasets with consistent performance is difficult with older formats. Iceberg solution: designed to manage massive datasets by separating metadata storage and query operations.
- ACID compliance. Challenge: lack of atomicity, consistency, isolation, and durability (ACID) in traditional data lakes. Iceberg solution: provides full ACID support:
  - Atomicity: operations either fully succeed or fail without partial updates.
  - Consistency: metadata ensures that queries always see a consistent view of data.
  - Isolation: concurrent operations do not interfere with each other via MVCC.
  - Durability: data is reliably committed through snapshots.
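The isolation and atomicity points can be sketched as optimistic concurrency: a commit succeeds only if the table's current snapshot is still the one the writer based its work on; otherwise the writer must re-read and retry. This is a simplified single-process model, not Iceberg's catalog implementation.

```python
table = {"current_snapshot": 1}

def commit(based_on, new_snapshot):
    # Atomic compare-and-swap on the table's snapshot pointer: the whole
    # write becomes visible at once, or not at all.
    if table["current_snapshot"] != based_on:
        return False  # conflict: another writer committed first
    table["current_snapshot"] = new_snapshot
    return True

# Two writers both start from snapshot 1.
print(commit(based_on=1, new_snapshot=2))  # -> True  (writer A wins)
print(commit(based_on=1, new_snapshot=3))  # -> False (writer B conflicts)
print(commit(based_on=2, new_snapshot=3))  # -> True  (writer B retries from snapshot 2)
```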